This article is the Day 21 entry of the Just a Group Advent Calendar 2019.
The day before my slot, I was asked, "I want to try learning-to-rank with Elasticsearch, so please summarize how to set up the environment and how to use it. Thanks!" So in this article I will walk through building an Elasticsearch environment with the learning-to-rank plugin and show how to use it.
The repository I created for this article is available here.
Learning-to-rank in search engines is a technique that applies machine learning to search data in order to improve the ranking order of search results. It is also called rank learning or ranking learning.
This time I will use the Elasticsearch learning-to-rank plugin and try out ranking improvement using the demo included in the learning-to-rank repository.
To try the demo, prepare the following environment in advance:
- A Java runtime
- A Python 3 runtime
- Elasticsearch with the learning-to-rank plugin installed
An Elasticsearch environment with the plugin installed is easy to set up from a Docker image. Below is a sample Dockerfile for an Elasticsearch image with the Learning-to-Rank plugin:
FROM elasticsearch:7.4.1

# Install the Learning-to-Rank plugin build that matches Elasticsearch 7.4.1
RUN bin/elasticsearch-plugin install -b http://es-learn-to-rank.labs.o19s.com/ltr-1.1.2-es7.4.1.zip
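With a Dockerfile like this, the image can be built and started as follows (the image name es-ltr and the single-node setting are just examples for local testing):

$ docker build -t es-ltr .
$ docker run -d -p 9200:9200 -e "discovery.type=single-node" es-ltr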
After building the environment, clone the Learning-to-Rank repository and move to the demo directory.
$ git clone https://github.com/o19s/elasticsearch-learning-to-rank.git
$ cd elasticsearch-learning-to-rank/demo
After moving to the demo directory, run the demo scripts in sequence to try out Learning-to-Rank!
Run prepare.py to download the search data and the library (RankLib) used to train the ranking models.
$ python prepare.py
When executed, the movie data (tmdb.json) and the ranking-learning library (RankLibPlus-0.1.0.jar) are downloaded. (Note that tmdb.json is large and takes a while to download!) Once the download is complete, prepare the Elasticsearch environment.
Launch Elasticsearch with the Learning-to-Rank plugin and populate it with data. After Elasticsearch has started, run index_ml_tmdb.py to create the index and insert tmdb.json:
$ python index_ml_tmdb.py
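For reference, the core of this step looks roughly like the following (a minimal sketch assuming the elasticsearch Python client, an index named tmdb, and that tmdb.json is a JSON object keyed by movie ID; the actual script also sets up mappings):

import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# Create the index (the real script also defines mappings for fields such as title and overview).
es.indices.create(index="tmdb")

# Bulk-insert every movie document from tmdb.json.
with open("tmdb.json") as f:
    movies = json.load(f)

actions = (
    {"_index": "tmdb", "_id": movie_id, "_source": doc}
    for movie_id, doc in movies.items()
)
helpers.bulk(es, actions)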
After creating the index and inserting the data, the next step is to register the feature queries used for training. Run the following script to set up the fields used for learning. (See demo/1.json and demo/2.json for the features being registered; in the demo, the search scores of the title and overview fields are used as features.)
$ python load_features.py
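Under the hood, this step registers a feature set with the LTR plugin. Based on demo/1.json and demo/2.json, the request looks roughly like the following (a sketch using the requests library; the feature names here are illustrative, while the feature set name movie_features is the one referenced later when uploading the model):

import requests

ES = "http://localhost:9200"

# Initialize the default feature store (this may return an error if it already exists).
requests.put(f"{ES}/_ltr")

# Each feature is a templated query; here, the match scores of title and overview.
featureset = {
    "featureset": {
        "features": [
            {
                "name": "title_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"title": "{{keywords}}"}},
            },
            {
                "name": "overview_query",
                "params": ["keywords"],
                "template_language": "mustache",
                "template": {"match": {"overview": "{{keywords}}"}},
            },
        ]
    }
}

requests.put(f"{ES}/_ltr/_featureset/movie_features", json=featureset)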
Once the features for converting documents into feature values are registered, it is time to build a model.
Run train.py to create a model and deploy it to Elasticsearch.
$ python train.py
train.py performs the following processing: it converts the judgment list into RankLib-format training data using the registered features, trains models with RankLib, and uploads the resulting models to Elasticsearch.
The demo learns from sample_judgements.txt to improve the ranking of search results. sample_judgements.txt lists the search results (#7555 Rambo, #1370 Rambo III, ...) returned for each search query (qid), and an evaluation value (grade) is assigned to each query/result pair. The demo uses three queries:
# qid:1: rambo
# qid:2: rocky
# qid:3: bullwinkle
A grade is assigned to each search result of each query (the higher the number, the more relevant):
# grade (0-4) queryid docId title
4 qid:1 # 7555 Rambo
3 qid:1 # 1370 Rambo III
3 qid:1 # 1369 Rambo: First Blood Part II
3 qid:1 # 1368 First Blood
0 qid:1 # 136278 Blood
0 qid:1 # 102947 First Daughter
0 qid:1 # 13969 First Daughter
0 qid:1 # 61645 First Love
0 qid:1 # 14423 First Sunday
0 qid:1 # 54156 First Desires
4 qid:2 # 1366 Rocky
3 qid:2 # 1246 Rocky Balboa
3 qid:2 # 60375 Rocky VI
3 qid:2 # 1371 Rocky III
3 qid:2 # 1375 Rocky V
3 qid:2 # 1374 Rocky IV
0 qid:2 # 110123 Incredible Rocky Mountain Race
0 qid:2 # 17711 The Adventures of Rocky & Bullwinkle
0 qid:2 # 36685 The Rocky Horror Picture Show
4 qid:3 # 17711 The Adventures of Rocky & Bullwinkle
0 qid:3 # 1246 Rocky Balboa
0 qid:3 # 60375 Rocky VI
0 qid:3 # 1371 Rocky III
0 qid:3 # 1375 Rocky V
0 qid:3 # 1374 Rocky IV
Converting this data to RankLib format produces the following data.
4 qid:1 1:12.318474 2:10.573917 # 7555 rambo
3 qid:1 1:10.357875 2:11.950391 # 1370 rambo
3 qid:1 1:7.010513 2:11.220095 # 1369 rambo
3 qid:1 1:0.0 2:11.220095 # 1368 rambo
0 qid:1 1:0.0 2:0.0 # 136278 rambo
0 qid:1 1:0.0 2:0.0 # 102947 rambo
0 qid:1 1:0.0 2:0.0 # 13969 rambo
0 qid:1 1:0.0 2:0.0 # 61645 rambo
0 qid:1 1:0.0 2:0.0 # 14423 rambo
0 qid:1 1:0.0 2:0.0 # 54156 rambo
4 qid:2 1:10.686391 2:8.814846 # 1366 rocky
3 qid:2 1:8.985554 2:9.984511 # 1246 rocky
3 qid:2 1:8.985554 2:8.067703 # 60375 rocky
3 qid:2 1:8.985554 2:5.660549 # 1371 rocky
3 qid:2 1:8.985554 2:7.300772 # 1375 rocky
3 qid:2 1:8.985554 2:8.814846 # 1374 rocky
0 qid:2 1:6.815921 2:0.0 # 110123 rocky
0 qid:2 1:6.0816855 2:8.725066 # 17711 rocky
0 qid:2 1:6.0816855 2:5.9764795 # 36685 rocky
4 qid:3 1:7.6720834 2:12.722421 # 17711 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1246 bullwinkle
0 qid:3 1:0.0 2:0.0 # 60375 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1371 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1375 bullwinkle
0 qid:3 1:0.0 2:0.0 # 1374 bullwinkle
We will use this data to create a model.
Next, create the models. The demo trains and generates several models (details of each model are omitted this time).
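For reference, training a model with RankLib from the command line looks roughly like this (the flags are standard RankLib options and the file names are hypothetical; the exact invocation inside train.py may differ):

$ java -jar RankLibPlus-0.1.0.jar -ranker 6 -metric2t NDCG@10 -train training_data.txt -save model.txt

Here -ranker 6 selects LambdaMART; other values select other model types.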
Upload the generated model to Elasticsearch with a POST request. Below is a sample:
POST _ltr/_featureset/movie_features/_createmodel
{
  "model": {
    "name": "test_9",
    "model": {
      "type": "model/ranklib",
      "definition": "## Linear Regression\n## Lambda = 1.0E-10\n0:0.2943936467995844 1:0.2943936467995844 2:0.12167703031808977"
    }
  }
}
When searching using the uploaded model, specify model.name.
Now let's actually search and see how Learning-to-Rank improves the search results. Running search.py produces the following output (a sketch of search.py itself follows the output):
$ python search.py Rambo
{"query": {"multi_match": {"query": "Rambo", "fields": ["title", "overview"]}}, "rescore": {"query": {"rescore_query": {"sltr": {"params": {"keywords": "Rambo"}, "model": "test_6"}}}}}
Rambo
Rambo III
Rambo: First Blood Part II
First Blood
In the Line of Duty: The F.B.I. Murders
Son of Rambow
Spud
$
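For reference, the core of search.py looks roughly like the following (a minimal sketch using the elasticsearch Python client; the index name tmdb is an assumption, and the query body mirrors the one printed above):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

keywords = "Rambo"
body = {
    # Ordinary full-text query over title and overview.
    "query": {"multi_match": {"query": keywords, "fields": ["title", "overview"]}},
    # Rescore the top hits with the uploaded Learning-to-Rank model.
    "rescore": {
        "query": {
            "rescore_query": {
                "sltr": {"params": {"keywords": keywords}, "model": "test_6"}
            }
        }
    },
}

res = es.search(index="tmdb", body=body)
for hit in res["hits"]["hits"]:
    print(hit["_source"]["title"])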
Since the improvement is hard to see from this alone, here is a comparison of the search results with and without Learning-to-Rank:
## search with learning-to-rank
1 Rambo
2 Rambo III
3 Rambo: First Blood Part II
4 First Blood
5 In the Line of Duty: The F.B.I. Murders
6 Son of Rambow
7 Spud
## search without learning-to-rank
1 Rambo
2 Rambo III
3 First Blood
4 Rambo: First Blood Part II
5 In the Line of Duty: The F.B.I. Murders
6 Son of Rambow
7 Spud
The learning results are reflected in the ranking!
In this article, I walked through the procedure for trying Learning-to-Rank with Elasticsearch and applied it to a search engine. I have put together a repository so that you can easily try the libraries used here, and I hope you will find it useful. This article focused on the execution steps, but if I get the chance, I would like to cover the logic behind Learning-to-Rank and the details of how these libraries work.