"France"-"Paris" + "Tokyo" = "Japan"
It was Word2Vec, announced by Google, that made this kind of word arithmetic a hot topic. In short, it is a technique for representing words numerically, which makes it possible to measure the "closeness" of words and to perform calculations like the one above. This numerical representation of a word is called a distributed representation. fastText, announced by Facebook and introduced here, is an extension of Word2Vec that learns more accurate representations at higher speed. In this article, I will explain how it works and how to apply it to Japanese documents.
fastText is a model that can handle inflected forms, something Word2Vec and similar models did not take into account. Specifically, go, goes, and going all mean "go", but because their surface forms differ, conventional methods treat them as entirely separate words. fastText therefore decomposes each word into components (for goes, roughly "go" and "es") and takes those components into account, so that such closely related words end up with coherent, similar meanings (the subword model).
As for the word components, the paper uses character sequences of 3 to 6 characters, and shorter pieces effectively act as prefixes and suffixes. The more components you use, the more combinations you can form and the more expressive the model becomes, but the computation takes longer. As a trade-off, the paper caps the number of components at 2 million. If you want more detail, please refer to the paper.
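To make the subword idea concrete, here is a minimal sketch (my own illustration, not fastText's actual code) of decomposing a word into character n-grams of length 3 to 6, with `<` and `>` as boundary markers around the word:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Decompose a word into character n-grams (an illustration of the subword idea)."""
    wrapped = "<" + word + ">"      # boundary markers distinguish prefixes and suffixes
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.add(wrapped[i:i + n])
    ngrams.add(wrapped)             # the whole word is also kept as its own component
    return ngrams

print(sorted(char_ngrams("goes")))
# ['<go', '<goe', '<goes', '<goes>', 'es>', 'goe', 'goes', 'goes>', 'oes', 'oes>']
```

A word's vector is then built from the vectors of these components, which is how goes and going end up close to go.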
In addition, fastText also implements a document classification feature that uses the distributed representations obtained with this method (the paper is below).
The above is the mechanism of fastText. From here, I will introduce the procedure for actually using fastText.
Using fastText itself is very simple.
./fasttext skipgram -input data.txt -output model
As the official page shows, you pass it the document data (data.txt) and a model is created; it is as easy as it gets. However, unlike English, Japanese does not separate words with spaces, so you first need to segment the text into words (wakati-gaki). The procedure for that is explained below.
The repository for this work is available below. If you clone it and follow the steps, you will be all set, so I hope you will make use of it.
icoxfog417/fastTextJapaneseTutorial
(A Star would be encouraging m(_ _)m)
This time we will use Python for the processing, so a Python environment is required. MeCab is used to segment the Japanese text, so MeCab also needs to be installed. On Windows, MeCab can be troublesome to set up, so on Windows 10 it is easier to work in an Ubuntu environment via Bash on Windows.
First, prepare the documents to train on (in the natural language processing community, this is called a corpus). The most readily available one is Wikipedia, and this time I used the dump data of the Japanese Wikipedia.
[Wikipedia: Database Download](https://en.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)
From "Wikipedia Japanese version dump" on that page, go to where the dumps are stored and get the data for the most recent date. There are many kinds of files, but only the abstracts (jawiki-xxxxxxxx-abstract.xml) and the full articles (jawiki-xxxxxxxx-pages-articles.xml.bz2) are needed.
Of course, other documents such as newspaper or blog articles are fine too; it is better to prepare a corpus suited to the purpose of the distributed representations. Wikipedia is essentially an encyclopedia: if you look at the "Disneyland" entry, for example, its history is described in detail, but keywords such as "fun" or the names of attractions hardly appear. Since the nature of a distributed representation is determined by the words that appear around a word, you have to decide whether "Disneyland" should carry elements like "fun" and "famous", or elements like "Los Angeles" and "the 1950s"; which corpus you should train on depends on that.
When using Wikipedia, the dump data is XML, so the plain text has to be extracted from it. There are various tools for this; this time I used the Python-based Wikipedia Extractor.
It can be executed as follows; the -b option splits the output into files of 500 MB each.
python wikiextractor/WikiExtractor.py -b 500M -o (Output folder) jawiki-xxxxxxxx-pages-articles-multistream.xml.bz2
Note that Wikipedia Extractor expects a bz2 file and does not support the abstract files. If you want to try the abstract file, use parser.py in the repository, which can process it.
Finally, combine the extracted text data into a single text file. That completes the text extraction.
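For reference, here is a minimal sketch of that combining step, assuming the layout Wikipedia Extractor produces (sub-folders of files made up of `<doc ...>` ... `</doc>` blocks); the folder and file names are placeholders:

```python
import os

# Combine the files produced by Wikipedia Extractor into one text file,
# skipping the <doc ...> / </doc> marker lines. Paths are placeholders.
extracted_dir = "extracted"                 # the output folder passed to WikiExtractor.py
with open("corpus.txt", "w", encoding="utf-8") as out:
    for root, _dirs, files in os.walk(extracted_dir):
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    if line.startswith("<doc") or line.startswith("</doc>"):
                        continue
                    out.write(line)
```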
Now for the main part. Unlike English, Japanese does not separate words with spaces, so the text has to be segmented into individual words (wakati-gaki). MeCab is used for this. Since we only need the segmentation and not the morphological information, the following command is enough.
mecab (Target text file) -O wakati -o (Output file)
You now have a file with words separated by spaces.
MeCab uses a dictionary of words to do the segmentation. The richer this dictionary's vocabulary, the more accurate the segmentation, and with mecab-neologd it can also segment more recent words properly, so use it as needed.
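As an aside, if you prefer to run the segmentation from Python instead of the command line, the mecab-python3 binding can produce the same output; a minimal sketch, with placeholder file names:

```python
import MeCab  # pip install mecab-python3 (plus a dictionary)

# -Owakati returns the surface forms separated by spaces,
# the same output as the mecab command above.
tagger = MeCab.Tagger("-Owakati")

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_wakati.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(tagger.parse(line).strip() + "\n")
```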
Now that you have a segmented file, all that is left is to run fastText, just as you would for English. Clone the fastText repository and build it with make as documented.
There are various training parameters. Following the paper, the size of a word's numerical representation (the vector dimension) is chosen according to the size of the dataset being handled (* the paper does not say exactly what unit a token is, but it is probably a word count).
The point is that the smaller the dataset, the smaller the dimension. The full Japanese Wikipedia corresponds to the largest setting, 300 dimensions, so train it as follows.
./fasttext skipgram -input (Divided file) -output model -dim 300
If you want to use the same parameters as a Word2Vec training run, it looks like the following (parameter settings and so on are explained in detail here).
./fasttext skipgram -input (Divided file) -output model -dim 200 -neg 25 -ws 8
(As also noted in [this issue](https://github.com/facebookresearch/fastText/issues/5), the results seem to change considerably depending on the parameters.)
When training is complete, two files, .bin and .vec, are created under the file name specified by -output. These are the files that contain the learned distributed representations.
In particular, .vec is a simple text file in which each word is paired with its distributed representation, so it should be usable from languages other than Python as well.
However, with the full Wikipedia corpus the data is so large that reading the model file can blow up with a MemoryError, and encoding problems can also occur (or rather, they did occur for me). In that case, build a word dictionary first (a dictionary that maps each word to an ID, e.g. "morning" -> 11) and convert the text file into a sequence of word IDs.
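As a reference, here is a minimal sketch of reading the .vec file directly and building such a word-to-ID dictionary. The first line of the file holds the vocabulary size and the dimension, and each following line holds a word and its vector; the file name is a placeholder.

```python
import numpy as np

def load_vectors(path):
    """Load a fastText .vec file into a {word: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        n_words, dim = map(int, f.readline().split())   # header: vocabulary size, dimension
        for line in f:
            parts = line.rstrip().split(" ")
            word, values = parts[0], parts[1:]
            if len(values) != dim:                       # skip malformed lines
                continue
            vectors[word] = np.array(values, dtype=np.float32)
    return vectors

vectors = load_vectors("model.vec")
word_to_id = {w: i for i, w in enumerate(vectors)}       # e.g. "morning" -> 11
```

For the full-Wikipedia model this dict itself is large, so on smaller machines it may be better to keep only the words you actually need while reading.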
Now, let's actually use fastText. There is a Python interface, but as mentioned above the file format is simple, so there is no particular need for it. The repository includes eval.py, which you can use to search for similar words.
python eval.py EXILE
Result is···
EXILE, 0.9999999999999999
Exile, 0.8503456049215405
ATSUSHI, 0.8344220054003253
The similar words come out nicely (the numbers are cosine similarities: the closer to 1, the more similar; the closer to 0, the less similar). Conversely, let's look at the words that are least similar.
python eval.py EXILE --negative
Result is····
sovereignty, 0.011989817895453175
Great, 0.03867233333573319
Hospital, 0.10808885165592982
Pressure, 0.11396957694584102
Electronic bulletin board, 0.12102514551120924
Shiite, 0.13388425615685776
Filipino, 0.134102069272474
Connect, 0.13871080016061785
Cannes, 0.1560228702600865
Iseki, 0.16740051927385632
SaaS, 0.1938341440200136
Kaisei Junior High School / High School, 0.19798593808666984
Natural history illustration, 0.23079469060502433
butterfly, 0.23615273153248512
P5, 0.2795962625371914
AH, 0.2919494095090802
EXILE and Shiite are indeed not similar at all, which seems like a very sensible judgment (probably?).
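For reference, the lookup itself boils down to cosine similarity over the loaded vectors. Here is a minimal sketch (my own illustration, not the eval.py in the repository), reusing the vectors dict from the earlier loading sketch:

```python
import numpy as np

def most_similar(query, vectors, top_n=10):
    """Rank all words by cosine similarity to the query word.
    `vectors` is a {word: vector} dict such as the one loaded from the .vec file."""
    q = vectors[query]
    q = q / np.linalg.norm(q)
    scored = [(w, float(np.dot(q, v / np.linalg.norm(v)))) for w, v in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]        # for the least similar words, take scored[-top_n:] instead
```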
It is that easy to use, so please give it a try.