[PYTHON] From Elasticsearch installation to data entry

Recently, I started using Elasticsearch for work. As a memo, I will write down what I used and learned from data entry to scoring.

Writing everything in one post would get quite long, so I will split it into two parts. The first part covers installation through data input; in the second part, I will cover search and scoring.

Installation

First, install what is needed. The development environment is CentOS 7.

Elasticsearch requires at least Java 8. Specifically as of this writing, it is recommended that you use the Oracle JDK version 1.8.0_73.

In short, Java 8 or higher is required.

Java installation

First, install Java 8. I proceeded by referring to an existing guide. If an older Java 1.7 is already installed, that guide also explains how to switch the active Java VM.

Here, I simply removed Java 7 and installed Java 8.

$ sudo yum remove -y java-1.7.0-openjdk
$ sudo yum install -y java-1.8.0-openjdk-devel
$ sudo yum install -y java-1.8.0-openjdk-debuginfo --enablerepo=*debug*

Version confirmation

$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Elasticsearch installation

Older 2.x versions can be installed directly with yum. Since Elasticsearch 5.0 is used here, install 5.0. (It seems that version 6 was released recently.) The installation should follow the steps in the Elasticsearch docs, but in my case it failed to start that way. (; ∀ ;)

As an alternative, install from the RPM repository. (At the time of writing, version 6 could not be installed this way.)

# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-5.x]
name=Elasticsearch repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

# yum install elasticsearch
# systemctl enable elasticsearch
# systemctl start elasticsearch

Check that it started:

# curl localhost:9200
{
  "name" : "3Y-W_H1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "cYenb8Q8S22EHcxJPL7k2Q",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

It started successfully.

Install Kibana

Refer to the Elastic docs.

# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/kibana.repo
[kibana-5.x]
name=Kibana repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

# yum install kibana
# systemctl enable kibana
# systemctl start kibana

Set up the connection.

# vim /etc/kibana/kibana.yml
network.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
# systemctl restart kibana

Let's open it in a browser.

http://192.168.216.128:5601

(screenshot: Kibana opened in the browser)

It connected without any problems. ((´∀ `))

Installation of Kuromoji

Let's also install the plugin for Japanese analysis.

# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-kuromoji
# systemctl restart elasticsearch
# curl -X GET 'http://localhost:9200/_nodes/plugins?pretty'
…
"plugins" : [
        {
          "name" : "analysis-kuromoji",
          "version" : "5.0.0",
          "description" : "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
          "classname" : "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin"
        }
      ],
…

bonus

There is also a hosted version, Elastic Cloud (https://cloud.elastic.co/), with a 14-day free trial. It runs on AWS rather than on your own server. Operating it from the browser GUI is very convenient, but it is inconvenient if you also need to work on the server side from the command line (CUI).

Input data using Python

First, since I want to use kuromoji, I will collect Twitter data and analyze the Japanese text. Second, since I want to use Kibana's map, I will input earthquake information.

1. Analysis of Twitter information

I am not very familiar with collecting tweets, so I proceeded by referring to an existing article.

# python -V
Python 3.5.2 :: Anaconda 4.1.1 (64-bit)

First, install the required packages.

# pip install twitter
# pip install elasticsearch

mapping

Like SQL, Elasticsearch requires you to decide on the data structure (mapping) before inputting data. Elasticsearch organizes data as index → type → id. When inputting data, the index and type must be specified, but the id can be omitted, in which case Elasticsearch generates a UUID-based ID automatically (see the short Python sketch after the table below).

SQL      Mongo        Elastic
DB       DB           index
table    collection   type
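
As a quick illustration of the id being optional, here is a minimal Python sketch (assuming the elasticsearch package installed above and the twi_index mapping defined below): when no id is passed, Elasticsearch generates one and returns it.

from datetime import datetime
from elasticsearch import Elasticsearch

# connect to the local node started earlier
es = Elasticsearch(["http://localhost:9200"])

# index and type are specified; the id is omitted, so Elasticsearch assigns one
doc = {"created_at": datetime.utcnow(), "text": "テスト", "track": "test"}
res = es.index(index="twi_index", doc_type="twi_type", body=doc)
print(res["_id"])  # the automatically generated document id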

You can create the mapping with curl, but I find Kibana's Dev Tools more convenient.

PUT /twi_index
{
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "kuromoji": {
                        "type": "kuromoji_tokenizer",
                        "mode": "search"
                    }
                },
                "analyzer": {
                    "japanese": {
                        "type": "custom",
                        "tokenizer": "kuromoji",
                        "filter": ["pos_filter"]
                    }
                },
                "filter": {
                    "pos_filter": {
                        "type": "kuromoji_part_of_speech",
                        "stoptags": ["接続詞", "助詞", "助詞-格助詞", "助詞-格助詞-一般", "助詞-格助詞-引用", "助詞-格助詞-連語", "助詞-接続助詞", "助詞-係助詞", "助詞-副助詞", "助詞-間投助詞", "助詞-並立助詞", "助詞-終助詞", "助詞-副助詞／並立助詞／終助詞", "助詞-連体化", "助詞-副詞化", "助詞-特殊", "助動詞", "記号", "記号-一般", "記号-読点", "記号-句点", "記号-空白", "記号-括弧開", "記号-括弧閉", "その他-間投", "フィラー", "非言語音"]
                    }
                }
            }
        }
    },
    "mappings": {
        "twi_type": {
            "properties": {
                "created_at": {
                    "type": "date"
                },
                "text": {
                    "type": "text",
                    "analyzer": "japanese",
                    "fielddata": true
                },
                "track": {
                    "type": "keyword"
                }
            }
        }
    }
}

Define the analyzer in settings. Since the text field that stores the tweet contents needs Japanese analysis, specify the analyzer for it. We do not want to analyze track, so its type is set to keyword.

For how to configure kuromoji, refer to the plugin documentation. Here, the kuromoji_part_of_speech filter is used to exclude specific parts of speech (particles, case particles, symbols, and so on).
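
To see whether the analyzer behaves as intended, the _analyze API is handy. A minimal sketch with the Python client (the sample sentence is just an illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# run the custom "japanese" analyzer defined above on a sample sentence
res = es.indices.analyze(index="twi_index",
                         body={"analyzer": "japanese", "text": "今日はいい天気ですね"})
print([t["token"] for t in res["tokens"]])
# particles such as は and the auxiliary verb です should be removed by pos_filter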

Data input

Almost all of [yoppe's script](http://qiita.com/yoppe/items/3e61fd567ae1d4c40a96#%E3%83%84%E3%82%A4%E3%83%BC%E3%83%88%E3%81%AE%E5%8F%8E%E9%9B%86) is used as-is. I put the slightly modified script on GitHub.
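
For reference, the core of the flow is roughly: pull tweets from the Streaming API with the twitter package and index each one into twi_index. This is a heavily simplified sketch, not the actual script; the credentials are placeholders and only the three mapped fields are handled.

from datetime import datetime
from elasticsearch import Elasticsearch
from twitter import OAuth, TwitterStream

# placeholder credentials: replace with your own Twitter API keys
oauth = OAuth("ACCESS_TOKEN", "ACCESS_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET")
es = Elasticsearch(["http://localhost:9200"])

track = "紅の豚"  # keyword to follow (illustrative)
stream = TwitterStream(auth=oauth)

for tweet in stream.statuses.filter(track=track):
    if "text" not in tweet:  # skip keep-alives and delete notices
        continue
    doc = {
        "created_at": datetime.strptime(tweet["created_at"],
                                        "%a %b %d %H:%M:%S %z %Y"),
        "text": tweet["text"],
        "track": track,
    }
    es.index(index="twi_index", doc_type="twi_type", body=doc)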

result

After letting it run for a while, let's check the data in Kibana. Register the index in Kibana so the data can be explored. (screenshot: adding the index in Kibana)

Select Discover → track, then Visualize → pie chart → twi_index.

(screenshot: Kibana pie chart)

On this day (2016/11/11), "Porco Rosso" was broadcast on TV, so it was a hot topic on Twitter. (´∀ `)

From the Elasticsearch docs on the keyword type: keyword fields are only searchable by their exact value; if you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should rather use a keyword field.

- In short, text is used when analysis is required, and keyword is used when an exact match is required when searching.
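
As a rough illustration of the difference (search itself is the topic of the second part, so this is only a sketch): an exact term query fits the keyword field track, while the analyzed text field is queried with a full-text match.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# keyword field: only the exact stored value matches
r1 = es.search(index="twi_index",
               body={"query": {"term": {"track": "紅の豚"}}})

# text field: the query string is analyzed, so word-level matches are found
r2 = es.search(index="twi_index",
               body={"query": {"match": {"text": "豚"}}})

print(r1["hits"]["total"], r2["hits"]["total"])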

Finally, I used kuromoji, so let's check the effect.

Without kuromoji: (screenshot)

With kuromoji: (screenshot)

The difference is subtle, but it is a little better. Tweets contain many proper nouns, so it is hard to analyze them well without defining a user dictionary.

2. Earthquake information

The data source is the JSON API provided by P2P Earthquake Information.

mapping

PUT /earthquakes
{
    "mappings": {
        "earthquake": {
            "properties": {
                "time": {
                    "type": "date",
                    "format":"yyyy/MM/dd HH:mm:ssZ"
                },
                "place": {
                    "type": "text" 
                },
                "location": {
                    "type": "geo_point"
                },
                "magnitude": {
                    "type": "float"
                },
                "depth": {
                    "type": "float"
                }
            }
        }
    }
}

Data input

I put the script on GitHub. It is actually code I wrote half a year ago; it works, but some parts are a bit odd. If I have time, I will clean it up.
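
A simplified sketch of what the script does, using the requests package (bundled with Anaconda). The API endpoint and JSON field names here are my assumptions about the P2P Earthquake Information API, so treat them as placeholders and check the script on GitHub and the API documentation for the real layout.

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# hypothetical endpoint and field names -- verify against the actual API
API_URL = "https://api.p2pquake.net/v2/history?codes=551&limit=100"

for entry in requests.get(API_URL).json():
    hypo = entry["earthquake"]["hypocenter"]
    doc = {
        # the mapping's date format expects an offset, so assume JST (+0900)
        "time": entry["earthquake"]["time"] + "+0900",
        "place": hypo["name"],
        "location": {"lat": hypo["latitude"], "lon": hypo["longitude"]},
        "magnitude": hypo["magnitude"],
        "depth": hypo["depth"],
    }
    es.index(index="earthquakes", doc_type="earthquake", body=doc)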

result

Check it out in Kibana.

(screenshot: earthquake data plotted on the Kibana map)

This is the end of the first part. I am still thinking about what kind of data to analyze for the search and scoring in the second part.
