[PYTHON] From Elasticsearch installation to data entry

Recently, I started using Elasticsearch for work. As a memo, I will write down what I used and learned from data entry to scoring.

Writing everything in one post would get quite long, so I will split it into two parts. The first part covers installation through data input; in the second part, I will cover search and scoring.

Installation

First, install what is needed. The development environment is CentOS 7.

Elasticsearch requires at least Java 8. Specifically as of this writing, it is recommended that you use the Oracle JDK version 1.8.0_73.

In short, Java 8 or higher is required.

Java installation

First, install Java 8. I proceeded by referring to an existing guide. If an older Java 1.7 is already installed, that guide also explains how to switch the active Java VM.

Here, I simply removed Java 7 and installed Java 8.

$ sudo yum remove -y java-1.7.0-openjdk
$ sudo yum install -y java-1.8.0-openjdk-devel
$ sudo yum install -y java-1.8.0-openjdk-debuginfo --enablerepo=*debug*

Version confirmation

$ java -version
java version "1.8.0_111"
Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)

Elasticsearch installation

Older 2.x versions can be installed directly with yum. Since Elasticsearch 5.0 is used here, install 5.0. (It seems that version 6 was released recently.) The installation should follow the steps in the Elasticsearch docs, but in my case it failed to start that way. (; ∀ ;)

As an alternative, install from the RPM repository. (At the time of writing, version 6 could not be installed this way.)

# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/elasticsearch.repo
[elasticsearch-5.x]
name=Elasticsearch repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

# yum install elasticsearch
# systemctl enable elasticsearch
# systemctl start elasticsearch

Check that it started:

# curl localhost:9200
{
  "name" : "3Y-W_H1",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "cYenb8Q8S22EHcxJPL7k2Q",
  "version" : {
    "number" : "5.0.0",
    "build_hash" : "253032b",
    "build_date" : "2016-10-26T04:37:51.531Z",
    "build_snapshot" : false,
    "lucene_version" : "6.2.0"
  },
  "tagline" : "You Know, for Search"
}

It started successfully.

Install Kibana

Refer to the Elastic docs.

# rpm --import https://artifacts.elastic.co/GPG-KEY-elasticsearch
# vim /etc/yum.repos.d/kibana.repo
[kibana-5.x]
name=Kibana repository for 5.x packages
baseurl=https://artifacts.elastic.co/packages/5.x/yum
gpgcheck=1
gpgkey=https://artifacts.elastic.co/GPG-KEY-elasticsearch
enabled=1
autorefresh=1
type=rpm-md

# yum install kibana
# systemctl enable kibana
# systemctl start kibana

Set up the connection.

# vim /etc/kibana/kibana.yml
network.host: "0.0.0.0"
elasticsearch.url: "http://localhost:9200"
# systemctl restart kibana

Let's open it in a browser.

http://192.168.216.128:5601

(screenshot: Kibana opened in the browser)

It connected without any problems. ((´∀ `))

Installation of Kuromoji

Let's also install the plugin for Japanese analysis.

# /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-kuromoji
# systemctl restart elasticsearch
# curl -X GET 'http://localhost:9200/_nodes/plugins?pretty'
…
"plugins" : [
        {
          "name" : "analysis-kuromoji",
          "version" : "5.0.0",
          "description" : "The Japanese (kuromoji) Analysis plugin integrates Lucene kuromoji analysis module into elasticsearch.",
          "classname" : "org.elasticsearch.plugin.analysis.kuromoji.AnalysisKuromojiPlugin"
        }
      ],
…

bonus

There is also a hosted version, Elastic Cloud (https://cloud.elastic.co/), with a 14-day free trial. It runs on AWS rather than on your own server. Operating it from the browser GUI is very convenient, but it is inconvenient if you also need to work on the server side from the command line (CUI).

Input data using Python

First, since I want to use kuromoji, I will collect Twitter data and analyze the Japanese text. Second, since I want to use Kibana's map, I will input earthquake information.

1. Analysis of Twitter information

I am not very familiar with collecting tweets, so I proceeded by referring to an existing article.

# python -V
Python 3.5.2 :: Anaconda 4.1.1 (64-bit)

First, install the required packages.

# pip install twitter
# pip install elasticsearch

mapping

Like SQL, Elasticsearch requires you to decide on the data structure (mapping) before inputting data. Elasticsearch organizes data as index → type → id. When inputting data, the index and type must be specified, but the id can be omitted, in which case Elasticsearch generates a UUID-based ID automatically (see the short Python sketch after the table below).

SQL      Mongo        Elastic
DB       DB           index
table    collection   type
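
As a quick illustration of the id being optional, here is a minimal Python sketch (assuming the elasticsearch package installed above and the twi_index mapping defined below): when no id is passed, Elasticsearch generates one and returns it.

from datetime import datetime
from elasticsearch import Elasticsearch

# connect to the local node started earlier
es = Elasticsearch(["http://localhost:9200"])

# index and type are specified; the id is omitted, so Elasticsearch assigns one
doc = {"created_at": datetime.utcnow(), "text": "テスト", "track": "test"}
res = es.index(index="twi_index", doc_type="twi_type", body=doc)
print(res["_id"])  # the automatically generated document id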

You can create the mapping with curl, but I find Kibana's Dev Tools more convenient.

PUT /twi_index
{
    "settings": {
        "index": {
            "analysis": {
                "tokenizer": {
                    "kuromoji": {
                        "type": "kuromoji_tokenizer",
                        "mode": "search"
                    }
                },
                "analyzer": {
                    "japanese": {
                        "type": "custom",
                        "tokenizer": "kuromoji",
                        "filter": ["pos_filter"]
                    }
                },
                "filter": {
                    "pos_filter": {
                        "type": "kuromoji_part_of_speech",
                        "stoptags": ["接続詞", "助詞", "助詞-格助詞", "助詞-格助詞-一般", "助詞-格助詞-引用", "助詞-格助詞-連語", "助詞-接続助詞", "助詞-係助詞", "助詞-副助詞", "助詞-間投助詞", "助詞-並立助詞", "助詞-終助詞", "助詞-副助詞／並立助詞／終助詞", "助詞-連体化", "助詞-副詞化", "助詞-特殊", "助動詞", "記号", "記号-一般", "記号-読点", "記号-句点", "記号-空白", "記号-括弧開", "記号-括弧閉", "その他-間投", "フィラー", "非言語音"]
                    }
                }
            }
        }
    },
    "mappings": {
        "twi_type": {
            "properties": {
                "created_at": {
                    "type": "date"
                },
                "text": {
                    "type": "text",
                    "analyzer": "japanese",
                    "fielddata": true
                },
                "track": {
                    "type": "keyword"
                }
            }
        }
    }
}

Define the analyzer in settings. Since the text field that stores the tweet contents needs Japanese analysis, specify the analyzer for it. We do not want to analyze track, so its type is set to keyword.

For how to configure kuromoji, refer to the plugin documentation. Here, the kuromoji_part_of_speech filter is used to exclude specific parts of speech (particles, case particles, symbols, and so on).
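
To see whether the analyzer behaves as intended, the _analyze API is handy. A minimal sketch with the Python client (the sample sentence is just an illustration):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# run the custom "japanese" analyzer defined above on a sample sentence
res = es.indices.analyze(index="twi_index",
                         body={"analyzer": "japanese", "text": "今日はいい天気ですね"})
print([t["token"] for t in res["tokens"]])
# particles such as は and the auxiliary verb です should be removed by pos_filter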

Data input

Almost all of [yoppe's script](http://qiita.com/yoppe/items/3e61fd567ae1d4c40a96#%E3%83%84%E3%82%A4%E3%83%BC%E3%83%88%E3%81%AE%E5%8F%8E%E9%9B%86) is used as-is. I put the slightly modified script on GitHub.
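
For reference, the core of the flow is roughly: pull tweets from the Streaming API with the twitter package and index each one into twi_index. This is a heavily simplified sketch, not the actual script; the credentials are placeholders and only the three mapped fields are handled.

from datetime import datetime
from elasticsearch import Elasticsearch
from twitter import OAuth, TwitterStream

# placeholder credentials: replace with your own Twitter API keys
oauth = OAuth("ACCESS_TOKEN", "ACCESS_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET")
es = Elasticsearch(["http://localhost:9200"])

track = "紅の豚"  # keyword to follow (illustrative)
stream = TwitterStream(auth=oauth)

for tweet in stream.statuses.filter(track=track):
    if "text" not in tweet:  # skip keep-alives and delete notices
        continue
    doc = {
        "created_at": datetime.strptime(tweet["created_at"],
                                        "%a %b %d %H:%M:%S %z %Y"),
        "text": tweet["text"],
        "track": track,
    }
    es.index(index="twi_index", doc_type="twi_type", body=doc)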

result

After letting it run for a while, let's check the data in Kibana. Register the index in Kibana so the data can be explored. (screenshot: adding the index in Kibana)

Select Discover → track, then Visualize → pie chart → twi_index.

(screenshot: Kibana pie chart)

On this day (2016/11/11), "Porco Rosso" was broadcast on TV, so it was a hot topic on Twitter. (´∀ `)

From the Elasticsearch docs on the keyword type: keyword fields are only searchable by their exact value; if you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should rather use a keyword field.

- In short, text is used when analysis is required, and keyword is used when an exact match is required when searching.
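
As a rough illustration of the difference (search itself is the topic of the second part, so this is only a sketch): an exact term query fits the keyword field track, while the analyzed text field is queried with a full-text match.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# keyword field: only the exact stored value matches
r1 = es.search(index="twi_index",
               body={"query": {"term": {"track": "紅の豚"}}})

# text field: the query string is analyzed, so word-level matches are found
r2 = es.search(index="twi_index",
               body={"query": {"match": {"text": "豚"}}})

print(r1["hits"]["total"], r2["hits"]["total"])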

Finally, I used kuromoji, so let's check the effect.

Without kuromoji: (screenshot)

With kuromoji: (screenshot)

The difference is subtle, but it is a little better. Tweets contain many proper nouns, so it is hard to analyze them well without defining a user dictionary.

2. Earthquake information

The data source is the JSON API provided by P2P Earthquake Information.

mapping

PUT /earthquakes
{
    "mappings": {
        "earthquake": {
            "properties": {
                "time": {
                    "type": "date",
                    "format":"yyyy/MM/dd HH:mm:ssZ"
                },
                "place": {
                    "type": "text" 
                },
                "location": {
                    "type": "geo_point"
                },
                "magnitude": {
                    "type": "float"
                },
                "depth": {
                    "type": "float"
                }
            }
        }
    }
}

Data input

I put the script on GitHub. It is actually code I wrote half a year ago; it works, but some parts are a bit odd. If I have time, I will clean it up.
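
A simplified sketch of what the script does, using the requests package (bundled with Anaconda). The API endpoint and JSON field names here are my assumptions about the P2P Earthquake Information API, so treat them as placeholders and check the script on GitHub and the API documentation for the real layout.

import requests
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# hypothetical endpoint and field names -- verify against the actual API
API_URL = "https://api.p2pquake.net/v2/history?codes=551&limit=100"

for entry in requests.get(API_URL).json():
    hypo = entry["earthquake"]["hypocenter"]
    doc = {
        # the mapping's date format expects an offset, so assume JST (+0900)
        "time": entry["earthquake"]["time"] + "+0900",
        "place": hypo["name"],
        "location": {"lat": hypo["latitude"], "lon": hypo["longitude"]},
        "magnitude": hypo["magnitude"],
        "depth": hypo["depth"],
    }
    es.index(index="earthquakes", doc_type="earthquake", body=doc)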

result

Check it out in Kibana.

(screenshot: earthquake data plotted on the Kibana map)

This is the end of the first part. I am still thinking about what kind of data to analyze for the search and scoring in the second part.
