[PYTHON] Try using Elasticsearch as the foundation of a question answering system

This article is the Day 24 entry in the Elastic Stack Advent Calendar 2016.

Self-introduction

- I am an engineer at a company called Acroquest Technology.
- As a student, I worked on natural language processing and information retrieval.

Overview

I want to create a question answering system based on Elasticsearch.

- The idea is something that returns answers to questions written in Japanese.

- There are many things in the world that can serve as knowledge sources. Ideally, you could choose among them flexibly, and the information would be kept up to date.

- I think Elasticsearch, which scales easily, could be a good fit once the knowledge source becomes huge. (Of course, I have no intention of building a large cluster on my own.)

- I will write this up across multiple articles. → This article covers what I tried as preliminary preparation.

Environment

Flow of this article

  1. Decide on the policy
  2. Put the knowledge-source data into Elasticsearch for now
  3. Make it possible to retrieve related documents from Python

Policy

First of all, the definition of "question answering" is vague. The difficulty varies greatly depending on the type of question, so as a first step I will focus on the "true/false judgment problem," which seems to be the simplest.

For example, given a sentence that states a specific fact, such as "Tokugawa Ieyasu founded the Edo shogunate," the system judges whether the statement is true or false.

For this kind of problem, if the following conditions are met:

  1. The knowledge-source data is stored correctly
  2. The question can be interpreted
  3. The correct information can be retrieved from the knowledge source

then the system should be able to answer correctly 100 times out of 100.

In theory.

Put the knowledge-source data into Elasticsearch for now

This time, I created some sample data and put it into Elasticsearch. For now, I inserted both the raw text and its word-segmented form. (The word list is there because I like being able to see the keywords visually, and I am keeping the raw text because I expect to want to parse it later.)

I think the data flow should be: Data source → Python → Elasticsearch → Python → Output.
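As a rough illustration of the "Data source → Python → Elasticsearch" half of this flow, the sketch below builds one document per sentence, holding both the raw text and its word list. The field name `text`, the helper names, and the whitespace tokenizer are assumptions made for this example; `words` and the index name `test` are taken from the search code further down. Real Japanese text would need a morphological analyzer in place of the placeholder tokenizer.

```python
def tokenize(text):
    """Placeholder tokenizer (assumption): split on whitespace.

    Japanese text has no spaces between words, so a real setup would
    swap in a morphological analyzer here.
    """
    return text.split()


def build_document(text, tokenizer=tokenize):
    """Build one document holding the raw text and its word list."""
    return {"text": text, "words": tokenizer(text)}


def build_bulk_actions(texts, index_name="test"):
    """Wrap each document in a bulk-style action dict.

    Depending on the Elasticsearch version, a "_type" entry may also
    be required in each action.
    """
    return [
        {"_index": index_name, "_source": build_document(text)}
        for text in texts
    ]
```

A list built this way could then be handed to `elasticsearch.helpers.bulk(es, actions)`, where `es` is a connected client. Keeping the document construction separate from the client call makes it easy to inspect what will be indexed before anything is sent.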

The indexed data looks like this:

[Screenshot: 2016-12-24 19.25.41.png]

This is not directly related to the goal of this article, but if you index the words as an array, it is fun to explore them with Graph.

[Screenshot: 2016-12-24 21.08.13.png]

I assumed Graph would be pointless this time, but it turned out to be rather important. Looking at it, you can see at a glance that "19" and "century" appear as separate tokens, and that a mysterious word "ka" has been extracted. (What is "ka"?) This is not wrong as processing, but I would prefer "○○ century" to be treated as a single unit. It looks like the word segmentation needs improvement. I will review the dictionary separately.
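To make the "19" / "century" issue concrete, here is a small sketch of what treating them as one unit could look like. This is not the dictionary fix mentioned above; it is a swapped-in post-processing step, and the function name, the unit list, and the sample tokens are all assumptions made for illustration.

```python
def merge_number_units(tokens, units=("世紀",)):
    """Join a digit-only token with a following unit token.

    Example intent: ["19", "世紀"] becomes ["19世紀"].
    The default unit list only contains 世紀 ("century").
    """
    merged = []
    for token in tokens:
        # Merge only when the previous token is purely numeric
        # and the current token is one of the known units.
        if merged and token in units and merged[-1].isdigit():
            merged[-1] = merged[-1] + token
        else:
            merged.append(token)
    return merged


# Hypothetical tokenizer output for "19世紀の日本" (19th-century Japan)
sample_tokens = ["19", "世紀", "の", "日本"]
merged_tokens = merge_number_units(sample_tokens)
```

A dictionary entry has the advantage of fixing the segmentation at the source, while a merge step like this only patches the token list afterwards, so it is best read as a way to preview the effect.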

For now, let's confirm that a search with a suitable keyword works from the Python side.

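python::


from elasticsearch import Elasticsearch
import json

# Connect to a local Elasticsearch node (replace USER and PASSWORD with real credentials)
es = Elasticsearch(['http://USER:PASSWORD@localhost:9200'])

# Term query: return up to 10 documents whose "words" array contains exactly "Japan"
request_body = {"size": 10, "query": {"term": {"words.keyword": "Japan"}}}

# Search the "test" index and save the response as readable JSON
response = es.search(index="test", body=request_body)
with open("search_result.json", "w", encoding="utf-8") as output:
    json.dump(response, output, ensure_ascii=False, indent=4, sort_keys=True, separators=(',', ': '))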

Written like this, the query retrieves documents that contain the word "Japan". The result comes back as follows:

[Screenshot: 2016-12-24 21.51.06.png]
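For later processing, only the document bodies are usually needed rather than the whole response. A minimal sketch, assuming the standard search response layout in which each hit sits under `hits.hits` and carries its body in `_source`; the sample response below is hand-written for illustration and is not actual output from the index.

```python
def extract_sources(response):
    """Return the `_source` body of every hit in a search response."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Hand-written stand-in for a search response (assumption, not real output)
sample_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "test",
                "_source": {
                    "text": "Japan is an island country",
                    "words": ["Japan", "is", "an", "island", "country"],
                },
            }
        ],
    }
}

documents = extract_sources(sample_response)
```

Each element of `documents` is then a plain dict with the stored fields, which is a convenient shape for any analysis on the Python side.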

As for the true/false judgment itself, it looks like it can be done by analyzing the question text together with the text of the returned documents. For now, the preparation stops here. Please look forward to the continuation.
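To give a feel for what such a judgment might look like, here is a deliberately naive sketch: a statement counts as true if a single retrieved document contains every keyword of the statement. This is only an assumption for illustration and not the method the follow-up article will use; the function name and sample data are hypothetical, and the heuristic ignores word order, negation, and paraphrase.

```python
def judge_statement(statement_keywords, documents):
    """Naive check: True if any one document contains all keywords.

    `documents` is a list of dicts with a "words" list, matching the
    shape of the documents indexed in this article.
    """
    required = set(statement_keywords)
    if not required:
        # Nothing to verify, so do not claim the statement is true.
        return False
    return any(required <= set(doc["words"]) for doc in documents)


# Hypothetical retrieved document
retrieved = [
    {"words": ["Tokugawa", "Ieyasu", "opened", "Edo", "Shogunate"]},
]

supported = judge_statement(["Ieyasu", "Edo", "Shogunate"], retrieved)
unsupported = judge_statement(["Nobunaga", "Edo", "Shogunate"], retrieved)
```

A check like this only works when condition 3 above holds, that is, when the search actually returns the document carrying the relevant fact.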

Summary

For now, I have laid the groundwork for building a question answering system. (Maybe all I did was prepare ...)

In the next article, I want to build something that can actually answer questions properly.
