[PYTHON] Vectorization of horse racing pedigree using fastText

Overview

fastText is a natural language processing tool published by Facebook. Its main appeal is that it can learn word representations very quickly. See: GitHub fastText

For how it works, please refer to this explanatory article: "What can you do with fastText, published by Facebook, which learns 1 billion words in minutes"

This time, instead of natural language, I use fastText to vectorize horse racing pedigrees. The idea of applying fastText to something other than natural language was inspired by the following article: "Use fastText to get distributed representations of non-words"

Execution result

I have uploaded the vector file and a Jupyter notebook showing the procedure to GitHub: keiba_ketto_vec

How to make

Data set

The pedigree of each horse over the past three generations is converted into the fastText input format.

Below is the pedigree table for the racehorse Satono Diamond.

Relation                    Horse name
Horse                       Satono Diamond
Father                      Deep Impact
Mother                      Malpensa
Father's father             Sunday Silence
Father's mother             Wind in Her Hair
Mother's father             Orpen
Mother's mother             Marsella
Father's father's father    Halo
Father's father's mother    Wishing Well
Father's mother's father    Alzao
Father's mother's mother    Burghclere
Mother's father's father    Lure
Mother's father's mother    Bonita Francita
Mother's mother's father    Southern Halo
Mother's mother's mother    Riviere

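Write the horse names in the above table on a single line, separated by single-byte (ASCII) spaces. fastText treats every whitespace-separated token as a word, so each horse name must itself be a single token with no internal spaces. Do the same for every other racehorse, one horse per line.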

input.csv


Satono Diamond Deep Impact Malpensa Sunday Silence Wind in Her Hair Orpen Marsella Halo WishingWell Alzao Burghclere Lure Bonita Francita Southern Halo Riviere
Simon Trunale Gold Allure Humoresque Sunday Silence Nikiya Afleet Allie Win Halo WishingWell Nureyev ReluctantGuest Mr.Prospector PoliteLady Alydar FleetVictress
Water Lourdes Water League Water Henin Dehere Solo BostonHarbor Scrape DeputyMinister SisterDot Halo MineOnly Capote HarborSprings Mr.Prospector File
...
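Since fastText splits each line on whitespace, a name such as "Wind in Her Hair" has to be collapsed into one token before it is written out. The following is a minimal sketch of that conversion. The helper names `to_token` and `pedigree_line` are hypothetical, and removing internal spaces is only one assumed way of normalizing names; it is not necessarily how the original data set was built.

```python
def to_token(name):
    """Collapse a horse name into a single whitespace-free token."""
    return "".join(name.split())


def pedigree_line(names):
    """Join a horse and its 14 ancestors (3 generations) into one input line."""
    if len(names) != 15:
        raise ValueError(f"expected 15 names, got {len(names)}")
    return " ".join(to_token(n) for n in names)


# Names taken from the Satono Diamond pedigree table.
satono_diamond = [
    "Satono Diamond", "Deep Impact", "Malpensa",
    "Sunday Silence", "Wind in Her Hair", "Orpen", "Marsella",
    "Halo", "Wishing Well", "Alzao", "Burghclere",
    "Lure", "Bonita Francita", "Southern Halo", "Riviere",
]

line = pedigree_line(satono_diamond)
print(line)
```

Writing one such line per horse to input.csv gives a file in which every line has exactly 15 tokens.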

Vectorization

Use the fasttext skipgram command for vectorization. Running it generates two files, ketto_model.bin and ketto_model.vec.

$ fasttext skipgram -input input.csv -output ketto_model -minn 50

About the minn option: in addition to the space-separated words themselves, fastText appears to break each word down further into character n-grams and learn from those as well. See: "Looked at the implementation of fastText"

With this feature, for example, the horses "Gold Allure" and "Gold Ship" would be pulled together by the shared "Gold". Here the spelling of a name carries no meaning, so I set minn to 50. Because that is larger than the default maxn of 6, no character n-grams are generated, which effectively disables the feature and keeps names from being broken down to the character level.
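To make the effect of minn concrete, here is a simplified sketch of character n-gram extraction. It is not fastText's actual implementation (which also hashes n-grams into buckets and works on UTF-8 characters); the function `char_ngrams` is hypothetical, and the single-token spellings "GoldAllure" and "GoldShip" are assumed for illustration.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of '<word>' with minn <= n <= maxn."""
    wrapped = f"<{word}>"  # fastText marks word boundaries with < and >
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


# With the default range (3 to 6), the two names share the n-grams around "Gold".
shared_default = char_ngrams("GoldAllure") & char_ngrams("GoldShip")
print(sorted(shared_default))

# With minn=50 and the default maxn=6 the range is empty, so no n-grams exist.
no_subwords = char_ngrams("GoldAllure", minn=50, maxn=6)
print(no_subwords)
```

With no shared n-grams, each name is learned only as a whole token, so two names are related purely through the pedigrees they appear in.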

Check the result

Use gensim to read the vectorized file and perform vector operations.

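Check whether Linate, Satono Diamond's younger half-sister by a different father, can be derived from Satono Diamond using vector operations. Satono Diamond's father is Deep Impact and Linate's father is Stay Gold, so if the vectors capture the pedigree,

Satono Diamond + Stay Gold - Deep Impact = Linate

should hold.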
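The arithmetic can be illustrated with a small pure-Python sketch that ranks candidates by cosine similarity to the combined vector, which is roughly what gensim's most_similar does. The vectors below are made up for illustration and are not taken from the trained model, and `toy_most_similar` is a hypothetical helper.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def toy_most_similar(vectors, positive, negative):
    """Rank names by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for name in positive:
        query = [q + x for q, x in zip(query, normalize(vectors[name]))]
    for name in negative:
        query = [q - x for q, x in zip(query, normalize(vectors[name]))]
    query = normalize(query)
    exclude = set(positive) | set(negative)  # input names are not candidates
    scores = [
        (name, sum(q * x for q, x in zip(query, normalize(vec))))
        for name, vec in vectors.items()
        if name not in exclude
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Made-up 3-dimensional vectors, NOT from the trained model.
# Assumed axes: [Malpensa family, Deep Impact line, Stay Gold line]
toy_vectors = {
    "Satono Diamond": [1.0, 1.0, 0.0],
    "Deep Impact": [0.0, 1.0, 0.0],
    "Stay Gold": [0.0, 0.0, 1.0],
    "Linate": [1.0, 0.0, 1.0],
    "Malpensa": [1.0, 0.0, 0.0],
}

ranking = toy_most_similar(
    toy_vectors, ["Stay Gold", "Satono Diamond"], ["Deep Impact"]
)
print(ranking[0][0])
```

In this toy setup Linate ranks first because swapping the father component leaves the shared mother component intact. Real learned vectors are much noisier.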

howto.py


import gensim

# Read the vector data (text .vec format) using gensim
model = gensim.models.KeyedVectors.load_word2vec_format('ketto_model.vec', binary=False)

# Perform the vector arithmetic with the most_similar method.
# Pass the names to add as a list in positive, and the names to subtract as a list in negative.
# The names must match the tokens in input.csv exactly.

model.most_similar(
    positive=["Stay Gold", "Satono Diamond"],
    negative=["Deep Impact"]
)

See the following for how to use gensim: gensim models.word2vec

Result


[('Paulen', 0.8220623731613159),
 ('Marquessa', 0.8190209865570068),
 ('Malpensa', 0.814713716506958),
 ('Linate', 0.80884850025177),
 ('Shapira', 0.8080180287361145),
 ('Moonlight knight', 0.8041872382164001),
 ('Semplice', 0.7995823621749878),
 ('OnAir', 0.7940067648887634),
 ('Fusion lock', 0.7933699488639832),
 ('Orpen', 0.7927322387695312)]

Paulen, from the line of the mother's father Orpen, came out as the most similar. Even so, right after the half-sister Marquessa (father Orfevre, whose father is Stay Gold) and the mother Malpensa, Linate also appears clearly in the results. Therefore, I think it can be said that the vectorization has been successful.
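To check where the expected horse lands in the ranking, a small helper can look up its position. This is only a sketch: `rank_of` is a hypothetical helper, and the list is the most_similar output copied from the result shown earlier.

```python
# Output of most_similar, copied from the result listing.
results = [
    ('Paulen', 0.8220623731613159),
    ('Marquessa', 0.8190209865570068),
    ('Malpensa', 0.814713716506958),
    ('Linate', 0.80884850025177),
    ('Shapira', 0.8080180287361145),
    ('Moonlight knight', 0.8041872382164001),
    ('Semplice', 0.7995823621749878),
    ('OnAir', 0.7940067648887634),
    ('Fusion lock', 0.7933699488639832),
    ('Orpen', 0.7927322387695312),
]


def rank_of(name, results):
    """Return the 1-based position of name in the ranking, or None if absent."""
    for position, (candidate, _score) in enumerate(results, start=1):
        if candidate == name:
            return position
    return None


print(rank_of("Linate", results))  # 4
```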

That's all.
