This article is the Day 21 entry of the Just a Group Advent Calendar 2019.
The day before my slot, I was asked, "I want to try learning-to-rank with Elasticsearch, so please summarize how to set up the environment and how to use it. Thanks!" So in this article I will walk through building an Elasticsearch environment with the learning-to-rank plugin and show how to use it.
The repository I created for this article is available here.
Learning-to-rank in search engines is a technique that applies machine learning to search data in order to improve the ranking order of search results. It is also called rank learning or ranking learning.
This time I will use the Elasticsearch learning-to-rank plugin and try out ranking improvement using the demo included in the learning-to-rank repository.
To try the demo, prepare the following environment in advance:
- A Java runtime
- A Python 3 runtime
- Elasticsearch with the learning-to-rank plugin installed
An Elasticsearch environment with the plugin installed is easy to set up from a Docker image. Below is a sample Dockerfile for an Elasticsearch image with the Learning-to-Rank plugin:
FROM elasticsearch:7.4.1

# Install the Learning-to-Rank plugin build that matches Elasticsearch 7.4.1
RUN bin/elasticsearch-plugin install -b http://es-learn-to-rank.labs.o19s.com/ltr-1.1.2-es7.4.1.zip
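With a Dockerfile like this, the image can be built and started as follows (the image name es-ltr and the single-node setting are just examples for local testing):

$ docker build -t es-ltr .
$ docker run -d -p 9200:9200 -e "discovery.type=single-node" es-ltr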
After building the environment, clone the Learning-to-Rank repository and move to the demo directory.
$ git clone https://github.com/o19s/elasticsearch-learning-to-rank.git
$ cd elasticsearch-learning-to-rank/demo
After moving to the demo directory, run the demo scripts in sequence to try out Learning-to-Rank!
Run prepare.py to download the search data and the library (RankLib) used to train the ranking models.
$ python prepare.py
When executed, the movie data (tmdb.json) and the ranking-learning library (RankLibPlus-0.1.0.jar) are downloaded. (Note that tmdb.json is large and takes a while to download!) Once the download is complete, prepare the Elasticsearch environment.
Launch Elasticsearch with the Learning-to-Rank plugin and populate it with data. After Elasticsearch has started, run index_ml_tmdb.py to create the index and insert tmdb.json:
$ python index_ml_tmdb.py
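For reference, the core of this step looks roughly like the following (a minimal sketch assuming the elasticsearch Python client, an index named tmdb, and that tmdb.json is a JSON object keyed by movie ID; the actual script also sets up mappings):

import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Create the index (the real script also defines mappings for fields such as title and overview).
es.indices.create(index="tmdb")

# Bulk-insert every movie document from tmdb.json.
with open("tmdb.json") as f:
    movies = json.load(f)

actions = (
    {"_index": "tmdb", "_id": movie_id, "_source": doc}
    for movie_id, doc in movies.items()
)
helpers.bulk(es, actions)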
After creating the index and inserting the data, the next step is to register the feature queries used for training. Run the following script to set up the fields used for learning. (See demo/1.json and demo/2.json for the features being registered; in the demo, the search scores of the title and overview fields are used as features.)
$ python load_features.py
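Under the hood, this step registers a feature set with the LTR plugin. Based on demo/1.json and demo/2.json, the request looks roughly like the following (a sketch using the requests library; the feature names here are illustrative, while the feature set name movie_features is the one referenced later when uploading the model):

import requests

ES = "http://localhost:9200"

# Initialize the default feature store (this may return an error if it already exists).
requests.put(f"{ES}/_ltr")

# Each feature is a templated query; here, the match scores of title and overview.
featureset = {
    "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"title": "{{keywords}}"}},
            },
            {
                "name": "overview_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"overview": "{{keywords}}"}},
            },
        ]
    }
}

requests.put(f"{ES}/_ltr/_featureset/movie_features", json=featureset)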
Once the features for converting documents into feature values are registered, it is time to build a model.
Run train.py to create a model and deploy it to Elasticsearch.
$ python train.py
train.py performs the following processing: it converts the judgment list into RankLib-format training data using the registered features, trains models with RankLib, and uploads the resulting models to Elasticsearch.
The demo learns from sample_judgements.txt to improve the ranking of search results. sample_judgements.txt lists the search results (#7555 Rambo, #1370 Rambo III, ...) returned for each search query (qid), and an evaluation value (grade) is assigned to each query/result pair. The demo uses three queries:
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
A grade is assigned to each search result of each query (the higher the number, the more relevant):
# grade (0-4) queryid docId title
4 qid:1 # 7555 Rambo
3 qid:1 # 1370 Rambo III
3 qid:1 # 1369 Rambo: First Blood Part II
3 qid:1 # 1368 First Blood
0 qid:1 # 136278 Blood
0 qid:1 # 102947 First Daughter
0 qid:1 # 13969 First Daughter
0 qid:1 # 61645 First Love
0 qid:1 # 14423 First Sunday
0 qid:1 # 54156 First Desires
4 qid:2 # 1366 Rocky
3 qid:2 # 1246 Rocky Balboa
3 qid:2 # 60375 Rocky VI
3 qid:2 # 1371 Rocky III
3 qid:2 # 1375 Rocky V
3 qid:2 # 1374 Rocky IV
0 qid:2 # 110123 Incredible Rocky Mountain Race
0 qid:2 # 17711 The Adventures of Rocky & Bullwinkle
0 qid:2 # 36685 The Rocky Horror Picture Show
4 qid:3 # 17711 The Adventures of Rocky & Bullwinkle
0 qid:3 # 1246 Rocky Balboa
0 qid:3 # 60375 Rocky VI
0 qid:3 # 1371 Rocky III
0 qid:3 # 1375 Rocky V
0 qid:3 # 1374 Rocky IV
Converting this data to RankLib format produces the following data.
4 qid:1 1:12.318474 2:10.573917 # 7555 rambo
3 qid:1 1:10.357875 2:11.950391 # 1370 rambo
3 qid:1 1:7.010513 2:11.220095 # 1369 rambo
3 qid:1 1:0.0 2:11.220095 # 1368 rambo
0 qid:1 1:0.0 2:0.0 # 136278 rambo
0 qid:1 1:0.0 2:0.0 # 102947 rambo
0 qid:1 1:0.0 2:0.0 # 13969 rambo
0 qid:1 1:0.0 2:0.0 # 61645 rambo
0 qid:1 1:0.0 2:0.0 # 14423 rambo
0 qid:1 1:0.0 2:0.0 # 54156 rambo
4 qid:2 1:10.686391 2:8.814846 # 1366 rocky
3 qid:2 1:8.985554 2:9.984511 # 1246 rocky
3 qid:2 1:8.985554 2:8.067703 # 60375 rocky
3 qid:2 1:8.985554 2:5.660549 # 1371 rocky
3 qid:2 1:8.985554 2:7.300772 # 1375 rocky
3 qid:2 1:8.985554 2:8.814846 # 1374 rocky
0 qid:2 1:6.815921 2:0.0 # 110123 rocky
0 qid:2 1:6.0816855 2:8.725066 # 17711 rocky
0 qid:2 1:6.0816855 2:5.9764795 # 36685 rocky
4 qid:3 1:7.6720834 2:12.722421 # 17711 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1246 bullwinkle
0 qid:3 1:0.0 2:0.0 # 60375 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1371 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1375 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1374 bullwinkle
We will use this data to create a model.
Next, create the models. The demo trains and generates several models (details of each model are omitted this time).
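For reference, training a model with RankLib from the command line looks roughly like this (the flags are standard RankLib options and the file names are hypothetical; the exact invocation inside train.py may differ):

$ java -jar RankLibPlus-0.1.0.jar -ranker 6 -metric2t NDCG@10 -train training_data.txt -save model.txt

Here -ranker 6 selects LambdaMART; other values select other model types.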
Upload the generated model to Elasticsearch with a POST request. Below is a sample:
POST _ltr/_featureset/movie_features/_createmodel
{
  "model": {
    "name": "test_9",
    "model": {
      "type": "model/ranklib",
      "definition": "## Linear Regression\n## Lambda = 1.0E-10\n0:0.2943936467995844 1:0.2943936467995844 2:0.12167703031808977"
    }
  }
}
When searching using the uploaded model, specify model.name.
Now let's actually search and see how Learning-to-Rank improves the search results. Running search.py produces the following output (a sketch of search.py itself follows the output):
$ python search.py Rambo
{"query": {"multi_match": {"query": "Rambo", "fields": ["title", "overview"]}}, "rescore": {"query": {"rescore_query": {"sltr": {"params": {"keywords": "Rambo"}, "model": "test_6"}}}}}
Rambo
Rambo III
Rambo: First Blood Part II
First Blood
In the Line of Duty: The F.B.I. Murders
Son of Rambow
Spud
$
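For reference, the core of search.py looks roughly like the following (a minimal sketch using the elasticsearch Python client; the index name tmdb is an assumption, and the query body mirrors the one printed above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

keywords = "Rambo"
body = {
    # Ordinary full-text query over title and overview.
    "query": {"multi_match": {"query": keywords, "fields": ["title", "overview"]}},
    # Rescore the top hits with the uploaded Learning-to-Rank model.
    "rescore": {
        "query": {
            "rescore_query": {
                "sltr": {"params": {"keywords": keywords}, "model": "test_6"}
            }
        }
    },
}

res = es.search(index="tmdb", body=body)
for hit in res["hits"]["hits"]:
    print(hit["_source"]["title"])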
Since the improvement is hard to see from this alone, here is a comparison of the search results with and without Learning-to-Rank:
## search with learning-to-rank
1 Rambo
2 Rambo III
3 Rambo: First Blood Part II
4 First Blood
5 In the Line of Duty: The F.B.I. Murders
6 Son of Rambow
7 Spud
## search without learning-to-rank
1 Rambo
2 Rambo III
3 First Blood
4 Rambo: First Blood Part II
5 In the Line of Duty: The F.B.I. Murders
6 Son of Rambow
7 Spud
The learning results are reflected in the ranking!
In this article, I walked through the procedure for trying Learning-to-Rank with Elasticsearch and applied it to a search engine. I have put together a repository so that you can easily try the libraries used here, and I hope you will find it useful. This article focused on the execution steps, but if I get the chance, I would like to cover the logic behind Learning-to-Rank and the details of how these libraries work.