[PYTHON] Create Word2Vec data from Japanese Wikipedia data (Mac compatible)

I want to try vectorizing text data on my own machine.

What to do

I tried out the word2vec demo last time, so this time I'd like to try it with Japanese. Ready-made Japanese word2vec data has also been published, but I want to do the preprocessing for training myself so that I can create vectors from various data in the future. Here, we create vector data from the public dump of the Japanese edition of Wikipedia.

Environment

- A Linux-like system
- Verified on a MacBook Air (2015)
- A small fan (the notebook tends to get hot during heavy file processing)

Procedure

Install word2vec

(Reference) Previous article

shell


$ cd word2vec  # Please move to any working directory
$ git clone https://github.com/svn2github/word2vec.git
$ cd word2vec
$ make
$ chmod +x *.sh

If word2vec is already installed, just move into the word2vec directory. The previous article covered everything from installation to playing with the demo.

Download the Japanese Wikipedia data

shell


$ curl -O https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
# Or: $ wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2

The latest dump of the Japanese Wikipedia is published under https://dumps.wikimedia.org/jawiki/latest/ on https://dumps.wikimedia.org/, so download the latest articles file. The file size is about 3 GB.
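
If you prefer to script the download, the same thing can be done with Python's standard library. A minimal sketch (the dump is about 3 GB, so it takes a while):

python


import urllib.request

# Download the latest Japanese Wikipedia articles dump (about 3 GB).
url = "https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(url, "jawiki-latest-pages-articles.xml.bz2")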

Check the contents of the Japanese Wikipedia data
Here we decompress the file and look inside to get a better feel for it.
You can skip this decompress-and-inspect step; the preprocessing works without decompressing.

shell


$ bunzip2 jawiki-latest-pages-articles.xml.bz2

A .bz2 compressed file can be decompressed with the bunzip2 command. If your PC gets hot, cool it with a fan. After decompression the file is about 13 GB.

shell


$ head -n 50 jawiki-latest-pages-articles.xml 

Take a peek at the first 50 lines of the decompressed file. You can see that the XML tags alone make the file heavy.
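
Incidentally, the same peek is possible without decompressing the file at all; a small Python sketch using the standard bz2 module:

python


import bz2
import itertools

# Stream the first 50 lines straight out of the .bz2 archive.
with bz2.open("jawiki-latest-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
    for line in itertools.islice(f, 50):
        print(line, end="")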

Clean up the XML with WikiExtractor

Thankfully, a dedicated tool called WikiExtractor has been released.
Reference) http://kzkohashi.hatenablog.com/entry/2018/07/22/212913

shell


$ git clone https://github.com/attardi/wikiextractor.git

The above git URL is subject to change, so if you can't download it, search for it and replace it with a valid one.

Once downloaded, run WikiExtractor on jawiki-latest-pages-articles.xml.bz2.

shell


$ python ./wikiextractor/WikiExtractor.py jawiki-latest-pages-articles.xml.bz2 -q -b 10M -o wiki_texts

WikiExtractor works on the .bz2 compressed file as-is. The options are:

- -q: suppress the progress report (speeds things up)
- -b: size at which to split the output files, here every 10 MB
- -o: output directory for the cleaned-up files, here wiki_texts

Some warnings appear while it runs, but it outputs everything to the end.

shell


# Example of a warning during processing
WARNING: Template errors in article '1996 Australian Open' (744789):

This process takes some time. If you watch the wiki_texts directory (from Finder, for example), you can see files being generated moment by moment, which lets you track progress. With the split setting above, everything fits into three directories, AA, AB, and AC, with up to 100 files (wiki_00 to wiki_99) per directory. On my machine it took about an hour, and files were created up to ./wiki_texts/AC/wiki_94. The notebook gets hot, so I worked while cooling it with a fan.

shell


$ head -n 5 ./wiki_texts/AA/wiki_00

<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="Ampersand">
Ampersand

The ampersand (&) is a symbol meaning the coordinating particle "... and ...". It is a Latin ligature, and in the Trebuchet MS font it is displayed in a way that makes it easy to see it is a ligature of "et". "Ampersand" comes from "and per se and", meaning "and [the symbol which] by itself [is] and".

Checking the generated file with head, you can see that some tags and line breaks still remain. We will continue cleaning these up below.

Concatenate files

shell


$ find wiki_texts/ | grep wiki | awk '{system("cat "$0" >> wiki_all.txt")}'

Concatenate the files using a command borrowed from the reference site. This collects every file under ./wiki_texts whose name contains "wiki" into wiki_all.txt (about 3.09 GB).

If that doesn't work, concatenating the directories one at a time may succeed.
In my environment it sometimes failed (the resulting data contained noise), so I ran it separately as follows.

shell


$ find ./wiki_texts/AA/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAA.txt")}'
$ find ./wiki_texts/AB/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAB.txt")}'
$ find ./wiki_texts/AC/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAC.txt")}'
$ cat wiki_allAA.txt wiki_allAB.txt wiki_allAC.txt > wiki_all.txt
$ rm wiki_allAA.txt wiki_allAB.txt wiki_allAC.txt 
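
As another fallback, the concatenation can also be done from Python. A minimal sketch, assuming the wiki_texts layout produced above:

python


from pathlib import Path

# Concatenate every extracted file (wiki_texts/AA/wiki_00, ...) into wiki_all.txt.
with open("wiki_all.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("wiki_texts").glob("*/wiki_*")):
        out.write(path.read_text(encoding="utf-8"))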

Just in case, also run the following to normalize the character encoding (it overwrites the file in place). It takes some time.

shell


$ brew install nkf
$ nkf -w --overwrite wiki_all.txt 

Do a finishing cleanup

Remove the remaining tags

shell


$ sed -e 's/<[^>]*>//g' ./wiki_all.txt > ./wiki_notag.txt

This deletes the tags attached to the beginning of each article. At this point the file is about 2.96 GB.

Remove the contents of the parentheses

shell


# Convert full-width parentheses to half-width
$ sed -i -e 's/（/(/g' ./wiki_notag.txt && sed -i -e 's/）/)/g' ./wiki_notag.txt

# Delete text inside parentheses
$ sed -i -e 's/([^)]*)//g' ./wiki_notag.txt

The contents of the parentheses may well be important, but this time I deleted them. The file is now about 2.77 GB.

Finally, remove whitespace and blank lines.

shell


# Remove all spaces, then delete blank lines
$ sed -i -e 's/ //g' ./wiki_notag.txt && sed -i -e '/^$/d' ./wiki_notag.txt

The result is a much cleaner file of about 2.75 GB.

To check the contents of the file, run:

shell


$ less wiki_notag.txt

Each article ends up right next to unrelated articles before and after it, but if that kind of processing becomes necessary, it can be done later.
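
For reference, the whole finishing cleanup could also be written as a single Python script instead of separate sed passes. This is only a sketch under the same file names; adjust the characters being stripped (for example full-width spaces) to your data:

python


import re

# Read wiki_all.txt, apply the same cleanup as the sed commands above,
# and write the result to wiki_notag.txt.
with open("wiki_all.txt", encoding="utf-8") as fin, \
     open("wiki_notag.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = re.sub(r"<[^>]*>", "", line)                # remove remaining tags
        line = line.replace("（", "(").replace("）", ")")  # full-width to half-width parentheses
        line = re.sub(r"\([^)]*\)", "", line)              # delete text inside parentheses
        line = line.replace(" ", "").replace("　", "")     # remove half/full-width spaces
        if line.strip():                                   # drop blank lines
            fout.write(line)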

Install MeCab and perform word-separation

Now that the text is in good shape, we split it into words separated by spaces (wakati-gaki).
Reference) https://qiita.com/paulxll/items/72a2bea9b1d1486ca751
Reference) http://kzkohashi.hatenablog.com/entry/2018/07/22/212913
Reference) https://akamist.com/blog/archives/2815

Install MeCab

Install the MeCab-related packages needed for word-separation.

shell


$ brew install mecab
$ brew install mecab-ipadic
$ brew install xz

On platforms other than Mac, use apt instead of brew. Also, if the console tells you to, reinstall with brew reinstall mecab or brew reinstall mecab-ipadic.

We also install a dictionary that covers new words.

shell


$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n -a

Adding -a to the options on the third line installs all the new-word entries. It takes some time. At the end you will be asked Do you want to install mecab-ipadic-NEologd ?, so enter yes. When the installation completes, Finish .. is displayed.

Also install the MeCab bindings for Python.

shell


$ cd ../  # Return to the working directory
$ pip install mecab-python3

Execute word-separation

shell


$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati wiki_notag.txt -o wiki_wakati.txt -b 163840

If the current directory and the mecab-ipadic-neologd path are correct, the word-separation runs. The number after -b is the input buffer size; if it is too small, the run fails.
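
Since mecab-python3 was installed earlier, the same word-separation could also be scripted from Python. A rough sketch, assuming the same dictionary path as in the command above (it may differ in your environment):

python


import MeCab

# Tokenizer in wakati (word-separated) output mode, using the neologd dictionary.
tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")

with open("wiki_notag.txt", encoding="utf-8") as fin, \
     open("wiki_wakati.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(tagger.parse(line.strip()))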

shell


$ nkf -w --overwrite wiki_wakati.txt

It may be good to normalize the encoding once more, just in case.

shell


$ less wiki_wakati.txt

The word-separated file is about 3.24 GB. You can inspect it with the command above. Now everything is ready.

Perform vectorization

Perform vectorization with word2vec.

shell


$ ./word2vec -train wiki_wakati.txt -output wiki_wakati_w2v.bin -size 200 -window 5 -sample 1e-3 -negative 5 -hs 0 -threads 1 -binary 1 -iter 1

Reference) https://qiita.com/dskst/items/a9571bdd74a30a5e8d55

The options mean:

- -train: file used for training
- -output: file name for the learning result
- -size: vector dimensionality
- -window: maximum number of context words
- -sample: threshold for down-sampling frequent words
- -negative: number of words used for negative sampling
- -hs: whether to use hierarchical softmax for training
- -threads: number of threads used for training
- -binary: whether to output in binary format
- -iter: number of training iterations

There are various other options, such as outputting a vocabulary list. You can trade off size against accuracy in various ways, for example by reducing the number of dimensions or increasing the number of iterations.

Wait a while and the process finishes. It took about 30 minutes on my machine. wiki_wakati_w2v.bin is the vector data generated this time.

The training output reported Vocab size: 1290027 and Words in train file: 1004450612, so this is vector data for about 1.29 million words. (Incidentally, the file size was 1.06 GB on the notebook and 0.73 GB on the desktop. I followed almost the same procedure on both, but the desktop result was more accurate, so it is quite likely that data creation on my notebook partially failed.)
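
Although this article uses the original C tools below, the generated binary can also be loaded from Python with gensim, if it is installed (pip install gensim; the sketch assumes the gensim 4.x API):

python


from gensim.models import KeyedVectors

# Load the binary vectors produced by word2vec above.
wv = KeyedVectors.load_word2vec_format("wiki_wakati_w2v.bin", binary=True)
print(len(wv))  # vocabulary size; about 1.29 million words here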

Try out the created vector data

shell


$ ./distance wiki_wakati_w2v.bin

Look for highly relevant words.

shell


Enter word or sentence (EXIT to break):Amuro

Word:Amuro Position in vocabulary: 30293

                                              Word       Cosine distance
------------------------------------------------------------------------
Char 0.837395
Aslan 0.772420
Amuro Ray 0.766061
Shinji 0.742949
Gohan 0.739751
Camille 0.725921
Kitaro 0.724508

That is the kind of result I got. As mentioned above, the desktop PC gave better results; on the notebook there were many tokens in which two or more words were joined by a "?" symbol, so some processing apparently failed.
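
Continuing the gensim sketch above, the same nearest-neighbour search as ./distance can be done with most_similar. Note that the query must be the tokenized surface form, e.g. アムロ rather than the romanized "Amuro" shown in the output above:

python


# Nearest neighbours by cosine similarity, the gensim counterpart of ./distance.
for word, score in wv.most_similar("アムロ", topn=7):
    print(word, score)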

I also tried analogy problems such as "Japan is to Tokyo as France is to what?".

shell


$ ./word-analogy wiki_wakati_w2v.bin

If Luke Skywalker corresponds to Darth Vader, what does Nobita correspond to?

shell


Enter three words (EXIT to break):Luke Skywalker Darth Vader Nobita

Word:Luke Skywalker Position in vocabulary: 95245
Word:Darth Vader Position in vocabulary: 68735
Word:Nobita Position in vocabulary: 11432

                                              Word              Distance
------------------------------------------------------------------------
Gian 0.652843
Suneo 0.645669
Shizuka 0.614481
Doraemon 0.609560
Keroro 0.608829
Kitaro 0.607345

It feels right that Gian and Suneo are at the top. For some reason Kitaro shows up again.
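
The same analogy can be asked through gensim's most_similar with positive and negative word lists (continuing the sketch above; the exact surface forms depend on how the tokenizer split the names):

python


# Darth Vader - Luke Skywalker + Nobita, the gensim counterpart of ./word-analogy.
results = wv.most_similar(positive=["ダース・ベイダー", "のび太"],
                          negative=["ルーク・スカイウォーカー"],
                          topn=5)
for word, score in results:
    print(word, score)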

In conclusion

By following similar steps, you should be able to make use of text corpora provided by various organizations. I am also interested in creating small, high-performance vector data.
