[PYTHON] Get distributed representations of words fast with Facebook's fastText

"France"-"Paris" + "Tokyo" = "Japan"

Word2Vec, announced by Google, became a hot topic because it made this kind of word arithmetic possible. In short, it is a technique for expressing words numerically, which makes it possible to measure the "closeness" of words and to perform operations like the one above. This numerical representation of a word is called a distributed representation. fastText, recently announced by Facebook, is an extension of Word2Vec that learns more accurate representations at higher speed. In this article, I will explain how it works and how to apply it to Japanese documents.

How fastText works

fastText is a model that can handle inflected forms, which Word2Vec and the models derived from it did not take into account. Specifically, go, goes, and going all share the stem "go", but because their surface forms differ, conventional methods treat them as completely separate words. fastText instead decomposes each word into components (for goes, the pieces "go" and "es" in the figure) so that closely related word forms end up with coherent, nearby meanings. This is called the subword model.

(Figure: a word decomposed into subword components, e.g. "goes" → "go" + "es")

For the word components, the paper uses character n-grams of 3 to 6 letters; pieces shorter than 3 letters are treated only as prefixes and suffixes. The more components you use, the more combinations you can express, which improves expressiveness but increases computation time. As a trade-off, the paper limits the number of n-grams to 2 million. If you want to know more, please refer to the paper.
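To make the subword idea concrete, here is a minimal sketch (my own illustration, not fastText's actual code) of extracting a word's character n-grams, with the boundary markers "<" and ">" described in the paper:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Collect the character n-grams fastText would use for a word.

    The word is wrapped in boundary markers '<' and '>' as in the paper,
    so prefixes and suffixes become distinguishable (e.g. '<go' vs 'goe').
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # the full word (with markers) is also kept as one unit
    return grams

print(sorted(char_ngrams("goes")))
# ['<go', '<goe', '<goes', '<goes>', 'es>', 'goe', 'goes', 'goes>', 'oes', 'oes>']
```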

In addition, fastText also implements a document classification feature that uses the distributed representations obtained by this method (described in a separate paper).

That is how fastText works. From here, I will walk through the procedure for actually using it.

How to use fastText

Using fastText itself is very simple.

./fasttext skipgram -input data.txt -output model

As shown on the official page, if you pass it the document data data.txt, it builds a model; it is as easy as it gets. However, unlike English, Japanese does not separate words with spaces, so we first need to segment the text into words. The procedure for that is explained below.

I have prepared a repository for this work below. If you clone it and follow the steps, everything should work, so I hope you will make use of it.

icoxfog417/fastTextJapaneseTutorial

(A Star would be encouraging! m(_ _)m)

Advance preparation

This time we will use Python for the processing, so a Python environment is required. We also use MeCab to segment the Japanese text, so MeCab needs to be installed. On Windows, MeCab can be troublesome to set up, so on Windows 10 it is easier to work in an Ubuntu environment using Bash on Windows.

1. Prepare documents to be used for learning

First, prepare the documents to train on (in the natural language processing community, this is called a corpus). The most familiar source is Wikipedia. This time, I used the dump data of the Japanese Wikipedia.

[Wikipedia: Database Download](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)

From "Wikipedia Japanese version dump" on the page above, go to the dump storage location and get the data with the latest date. There are many file types, but the only ones we need are the abstracts (jawiki-xxxxxxxx-abstract.xml) and the full articles (jawiki-xxxxxxxx-pages-articles.xml.bz2).

Of course, other documents such as newspaper or blog articles are also fine. It is best to prepare a corpus suited to the purpose of the distributed representations. Wikipedia is, after all, an encyclopedia: the entry for "Disneyland", for example, describes its history in detail, but words like "fun" or the names of attractions hardly appear. Since a word's distributed representation is determined by the words that appear around it, you should ask whether "Disneyland" ought to be associated with elements like "fun" and "famous", or with elements like "Los Angeles" and "the 1950s". Which corpus to train on depends on the answer.

2. Extract the text

When using Wikipedia, the dump data is XML, so pure text data has to be extracted from it. Various tools can do this; this time I used Wikipedia Extractor, which is written in Python.

Wikipedia Extractor

It can be run as follows. The -b option splits the output into files of 500 MB each.

python wikiextractor/WikiExtractor.py -b 500M -o (Output folder) jawiki-xxxxxxxx-pages-articles-multistream.xml.bz2

Note that Wikipedia Extractor expects bz2-compressed input and does not handle the abstract files. If you want to try an abstract file, use parser.py in my repository, which can process it.

Finally, combine the extracted text data into a single text file. This completes the text extraction.
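As a minimal sketch (the directory layout is an assumption; WikiExtractor typically writes files such as AA/wiki_00 under the output folder), the files can be combined like this:

```python
import glob
import os

# Concatenate every file WikiExtractor produced (e.g. extracted/AA/wiki_00, ...)
# into a single corpus file. The paths here are assumptions; adjust as needed.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob(os.path.join("extracted", "*", "wiki_*"))):
        with open(path, encoding="utf-8") as f:
            out.write(f.read())
```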

3. Split the text into words (word segmentation)

Now we get to the main subject. Unlike English, Japanese does not separate words with spaces, so each word has to be cut out by word segmentation (wakati-gaki). MeCab is used for this work. Since we only need the segmentation, not the morphological information, the following command is enough.

mecab (Target text file) -O wakati -o (Output file)

You now have a file with words separated by spaces.

MeCab relies on a word dictionary for segmentation. The richer this dictionary's vocabulary, the more accurate the segmentation; with mecab-neologd you can also recognize more recent words. The dictionary can be specified separately, so use it as needed.
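If you prefer to run the segmentation from Python instead of the command above, a minimal sketch using the mecab-python3 bindings looks like this (the file names and the mecab-neologd path are assumptions; they vary by installation):

```python
import MeCab

# "-Owakati" outputs the text with words separated by single spaces.
# To use mecab-neologd, pass its dictionary directory with -d, e.g.:
#   MeCab.Tagger("-Owakati -d /usr/lib/mecab/dic/mecab-ipadic-neologd")
tagger = MeCab.Tagger("-Owakati")

with open("corpus.txt", encoding="utf-8") as f, \
     open("corpus_wakati.txt", "w", encoding="utf-8") as out:
    for line in f:
        out.write(tagger.parse(line).strip() + "\n")
```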

4. Train with fastText

Now that you have a word-segmented file, the rest works just as it does for English: simply run fastText. Clone the fastText repository and build it with make, as documented.

There are various parameters you can set. Following the paper, the size of a word's numerical representation (the vector dimension) is chosen according to the size of the dataset, measured in tokens (* the paper does not say exactly what a token is, but it is probably the word count). The point is that smaller datasets call for fewer dimensions. The full Japanese Wikipedia is large enough to warrant the full 300 dimensions, so run the following.

./fasttext skipgram -input (Divided file) -output model -dim 300

If you want to use the same parameters as for Word2Vec training, the command is as follows (parameter settings are explained in detail here).

./fasttext skipgram -input (Divided file) -output model -dim 200 -neg 25 -ws 8

([As also noted in this issue](https://github.com/facebookresearch/fastText/issues/5), the results seem to change considerably depending on the parameters.)

When training completes, two files, .bin and .vec, are created with the file name specified by -output. These files contain the learned distributed representations. The .vec file in particular is a simple text file pairing each word with its distributed representation, so it can be read and used from languages other than Python as well.
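As a minimal sketch (assuming the standard .vec layout: a header line with the vocabulary size and dimension, then one word and its vector per line), you could read it like this:

```python
import numpy as np

def load_vectors(path):
    """Read a fastText .vec file into a {word: vector} dictionary."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())  # header: vocab size, dimension
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) == dim:  # skip malformed lines
                vectors[word] = np.array(values, dtype=np.float32)
    return vectors

def cosine_similarity(a, b):
    """1.0 means identical direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

vectors = load_vectors("model.vec")
```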

However, with the full Wikipedia the data is so large that reading the model file sometimes blows up with a MemoryError, and encoding problems can occur (or rather, they did occur for me). In such a case, first build a word dictionary (a dictionary that converts words to IDs, e.g. "morning" -> 11) and convert the text file into columns of word IDs.
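A minimal sketch of that conversion (file names are assumptions) might look like this:

```python
# Build a word -> ID dictionary from the segmented corpus,
# then rewrite the corpus as space-separated word IDs.
vocab = {}
with open("corpus_wakati.txt", encoding="utf-8") as f, \
     open("corpus_ids.txt", "w", encoding="utf-8") as out:
    for line in f:
        ids = []
        for word in line.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # assign the next free ID
            ids.append(str(vocab[word]))
        out.write(" ".join(ids) + "\n")
```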

5. Take advantage of fastText

Now, let's actually use fastText. There is a Python interface, but as mentioned above the file format is simple, so you don't strictly need it. The repository includes eval.py, which you can use to search for similar words.

python eval.py EXILE

The result is...

EXILE, 0.9999999999999999
Exile, 0.8503456049215405
ATSUSHI, 0.8344220054003253

The similar words line up nicely: ATSUSHI is EXILE's vocalist (the numbers indicate cosine similarity; the closer to 1, the more similar, and the closer to 0, the less similar). Conversely, let's look at words that are not similar.

python eval.py EXILE --negative

The result is...

sovereignty, 0.011989817895453175
Great, 0.03867233333573319
Hospital, 0.10808885165592982
Pressure, 0.11396957694584102
Electronic bulletin board, 0.12102514551120924
Shiite, 0.13388425615685776
Filipino, 0.134102069272474
Connect, 0.13871080016061785
Cannes, 0.1560228702600865
Iseki, 0.16740051927385632
SaaS, 0.1938341440200136
Kaisei Junior High School / High School, 0.19798593808666984
Natural history illustration, 0.23079469060502433
butterfly, 0.23615273153248512
P5, 0.2795962625371914
AH, 0.2919494095090802

EXILE and the Shiites are certainly not very similar. Sensible enough, I suppose (?).
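Finally, building on the load_vectors sketch above, you can also try the kind of word arithmetic from the beginning of the article (a sketch, not eval.py's implementation; it assumes these words actually appear in the training vocabulary):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "France" - "Paris" + "Tokyo" should land near "Japan".
# `vectors` is the dictionary returned by load_vectors("model.vec") above.
query = vectors["フランス"] - vectors["パリ"] + vectors["東京"]

ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
print(ranked[:5])  # "日本" should rank near the top
```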

As you can see, it is quite easy to use. Please give it a try.
