I want to try vectorizing text data on my own machine.
I tried the word2vec demo last time, so this time I'd like to try it with Japanese. Ready-made Japanese word2vec data is already publicly available, but I want to do the preprocessing and training myself so that I can build vectors from other data in the future. Here, we will create vector data from the public dump of the Japanese edition of Wikipedia.
- Linux-like environment (verified on a MacBook Air 2015)
- Small fan (the notebook gets hot during the heavy file-processing steps)
(Reference) Previous article
python
$ cd word2vec  # move to a working directory of your choice
$ git clone https://github.com/svn2github/word2vec.git
$ cd word2vec
$ make
$ chmod +x *.sh
If word2vec is already installed, just move into the word2vec directory. The previous article covered everything from installation to playing with the demo.
python
$ curl -O https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
# or: $ wget https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2
The latest Japanese Wikipedia dump is published at https://dumps.wikimedia.org/jawiki/latest/ (part of https://dumps.wikimedia.org/), so download the latest articles file. The download is about 3 GB.
python
$ bunzip2 jawiki-latest-pages-articles.xml.bz2
The .bz2 file can be decompressed with the bunzip2 command. If your PC gets hot, cool it with the fan. After decompression the file is about 13 GB.
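If you prefer to do this step from Python, the standard-library bz2 module can stream-decompress the dump so the roughly 13 GB result never has to sit in memory all at once. This is only a sketch of an alternative to the bunzip2 command above.
python
# Sketch: stream-decompress the .bz2 dump with the standard library.
import bz2
import shutil

src = "jawiki-latest-pages-articles.xml.bz2"   # downloaded dump
dst = "jawiki-latest-pages-articles.xml"       # decompressed output

with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
    shutil.copyfileobj(fin, fout, length=1024 * 1024)  # copy in 1 MB chunks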
python
$ head -n 50 jawiki-latest-pages-articles.xml
Take a peek at the first 50 lines of the decompressed file. You can see that the XML tags alone make the file heavy.
Thankfully, a dedicated tool called WikiExtractor has been released. Reference) http://kzkohashi.hatenablog.com/entry/2018/07/22/212913
python
$ git clone https://github.com/attardi/wikiextractor.git
The above git URL is subject to change, so if you can't download it, search for it and replace it with a valid one.
Once downloaded, run WikiExtractor against jawiki-latest-pages-articles.xml.bz2.
python
$ python ./wikiextractor/WikiExtractor.py jawiki-latest-pages-articles.xml.bz2 -q -b 10M -o wiki_texts
WikiExtractor works on the .bz2 compressed file as-is. The options used here are:
- -q: suppress the progress report (speeds things up)
- -b: size at which to split the output files (here, every 10 MB)
- -o: output directory for the cleaned-up files (here, wiki_texts)
Some errors appear while it runs, but it still produces output all the way to the end.
python
#Error example
WARNING: Template errors in article '1996 Australian Open' (744789):
This process takes some time. If you watch the wiki_texts directory (from the desktop or a file browser), you can see files being generated moment by moment, so progress is easy to check. With the split settings above, the output fits into three directories, AA, AB, and AC, each holding files wiki_00 through wiki_99. On my machine it took about an hour, and files were created up to ./wiki_texts/AC/wiki_94. Since the notebook gets hot, I worked while cooling it with a fan.
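If you would rather check progress from a script than from the file browser, a couple of lines of Python will do; this is just a throwaway helper, not part of the original steps, and it assumes the wiki_texts output directory used above.
python
# Sketch: count the files WikiExtractor has produced so far.
from pathlib import Path

done = sorted(Path("wiki_texts").rglob("wiki_*"))
print(len(done), "files extracted so far, latest:", done[-1] if done else "none")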
python
$ head -n 5 ./wiki_texts/AA/wiki_00
<doc id="5" url="https://ja.wikipedia.org/wiki?curid=5" title="Ampersand">
Ampersand
Ampersand (&, English name: ) is a symbol that means the coordinating particle "... and ...". It is a ligature of the Latin , displayed as in the Trebuchet MS font, so it is easy to see that it is a ligature of "et". ampersa, i.e. "and per se and", means "and [the symbol which] by itself [is] and".
Checking the generated files with head, you can see that some tags and extra line breaks remain, so we will keep cleaning up from here.
python
$ find wiki_texts/ | grep wiki | awk '{system("cat "$0" >> wiki_all.txt")}'
Concatenate the files using a command borrowed from the reference site. In this example, every file under ./wiki_texts whose name contains "wiki" is appended to wiki_all.txt (about 3.09 GB). You can also concatenate each subdirectory separately and then join the results, as in the block below.
python
$ find ./wiki_texts/AA/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAA.txt")}'
$ find ./wiki_texts/AB/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAB.txt")}'
$ find ./wiki_texts/AC/ | grep wiki_ | awk '{system("cat "$0" >> wiki_allAC.txt")}'
$ cat wiki_allAA.txt wiki_allAB.txt wiki_allAC.txt > wiki_all.txt
$ rm wiki_allAA.txt wiki_allAB.txt wiki_allAC.txt
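For reference, the same concatenation can also be done with a few lines of Python instead of find/awk. This is just a sketch, assuming the extracted files live under wiki_texts and that wiki_all.txt is written to the current directory.
python
# Sketch: concatenate every extracted wiki_* file into wiki_all.txt.
from pathlib import Path

with open("wiki_all.txt", "wb") as out:
    for path in sorted(Path("wiki_texts").rglob("wiki_*")):
        out.write(path.read_bytes())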
python
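# Install nkf, then convert the concatenated file to UTF-8 (-w) in place (--overwrite)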
$ brew install nkf
$ nkf -w --overwrite wiki_all.txt
python
$ sed -e 's/<[^>]*>//g' ./wiki_all.txt > ./wiki_notag.txt
Delete the <doc> tags attached to each article. At this point the file is about 2.96 GB.
python
# Convert full-width parentheses to half-width
$ sed -i -e 's/（/(/g' ./wiki_notag.txt && sed -i -e 's/）/)/g' ./wiki_notag.txt
# Delete text inside parentheses
$ sed -i -e 's/([^)]*)//g' ./wiki_notag.txt
The text inside parentheses often carries useful information, but this time I simply deleted it. The file is now about 2.77 GB.
Finally, remove the remaining spaces and blank lines.
python
# Remove all spaces, then delete any blank lines
$ sed -i -e 's/ //g' ./wiki_notag.txt && sed -i -e '/^$/d' ./wiki_notag.txt
The file is now quite tidy, at about 2.75 GB.
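For reference, the whole sed-based cleanup (tag removal, parenthesis handling, whitespace removal) can also be written as a short Python script that streams the file line by line. This is only a sketch of the equivalent steps, not what was actually run above.
python
# Sketch: the same cleanup as the sed commands, processed line by line.
import re

with open("wiki_all.txt", encoding="utf-8") as fin, \
     open("wiki_notag.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        line = re.sub(r"<[^>]*>", "", line)                 # drop <doc ...> / </doc> tags
        line = line.replace("（", "(").replace("）", ")")    # full-width parens to half-width
        line = re.sub(r"\([^)]*\)", "", line)               # drop text inside parentheses
        line = re.sub(r"\s+", "", line)                     # remove whitespace (Japanese text)
        if line:                                            # skip lines that became empty
            fout.write(line + "\n")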
To check the contents of the file
python
$ less wiki_notag.txt
will do. Each article now ends up right next to unrelated articles before and after it; if processing for that turns out to be necessary, it can be done later.
Now that the text is in good shape, the next step is to split it into words separated by spaces (word segmentation).
Reference) https://qiita.com/paulxll/items/72a2bea9b1d1486ca751
Reference) http://kzkohashi.hatenablog.com/entry/2018/07/22/212913
Reference) https://akamist.com/blog/archives/2815
Install the MeCab-related packages needed for the word splitting.
python
$ brew install mecab
$ brew install mecab-ipadic
$ brew install xz
On platforms other than macOS, use apt instead of brew. Also, if the console tells you to, run brew reinstall mecab or brew reinstall mecab-ipadic.
We will also install a dictionary that covers new words.
python
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
$ ./bin/install-mecab-ipadic-neologd -n -a
Adding -a to the options on the third line installs all of the extra entries, which takes a while. Near the end you will be asked "Do you want to install mecab-ipadic-NEologd?", so enter yes. When the installation is complete, "Finish.." is displayed.
Also install the MeCab bindings for Python.
python
$ cd ../ #Return to working directory
$ pip install mecab-python3
python
$ mecab -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati wiki_notag.txt -o wiki_wakati.txt -b 163840
If the current directory and the path to mecab-ipadic-neologd are correct, the word splitting will run. The number after -b is MeCab's input buffer size; with a smaller value the run failed on long lines, so it is set fairly large here.
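Since mecab-python3 is installed, the same word splitting can also be driven from Python. This is only a sketch: the dictionary path is the one assumed in the command above and may differ on your machine.
python
# Sketch: word-split the cleaned file with mecab-python3 and the NEologd dictionary.
# The dictionary path is an assumption; adjust it to your environment.
import MeCab

tagger = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")

with open("wiki_notag.txt", encoding="utf-8") as fin, \
     open("wiki_wakati.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(tagger.parse(line.rstrip("\n")).strip() + "\n")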
python
$ nkf -w --overwrite wiki_wakati.txt
It may be good to normalize the encoding once more, just in case.
python
$ less wiki_wakati.txt
The word-split file is about 3.24 GB. You can inspect it with the command above. Now everything is ready.
Perform vectorization with word2vec.
python
$ ./word2vec -train wiki_wakati.txt -output wiki_wakati_w2v.bin -size 200 -window 5 -sample 1e-3 -negative 5 -hs 0 -threads 1 -binary 1 -iter 1
Reference) https://qiita.com/dskst/items/a9571bdd74a30a5e8d55
The options mean:
- -train: file used for training
- -output: file name for the learning result
- -size: number of vector dimensions
- -window: maximum number of context words
- -sample: threshold frequency for down-sampling (ignoring) frequent words
- -negative: number of words used for negative sampling
- -hs: whether to use hierarchical softmax for training
- -threads: number of threads used for training
- -binary: whether to output in binary format
- -iter: number of training iterations
There are various other options, such as outputting the vocabulary list. You can also trade off size and accuracy in various ways, for example by reducing the number of dimensions or increasing the number of training iterations.
Wait a while and the process will finish; it took about 30 minutes on my machine. wiki_wakati_w2v.bin is the vector data generated this time.
The run reported Vocab size: 1290027 and Words in train file: 1004450612, so this is vector data for about 1.29 million words. (Incidentally, the output file was 1.06 GB on the notebook and 0.73 GB on the desktop. I followed almost the same procedure on both, but the desktop results were more accurate, so there is a good chance that the data created on the notebook is partly broken.)
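For reference, roughly the same training can also be run from Python with gensim instead of the C binary. This is only a sketch, assuming gensim 4.x has been installed separately (pip install gensim); the parameter names map onto the options above, but the defaults are not identical, so the results will differ slightly.
python
# Sketch: train word2vec on the tokenized file with gensim (assumed installed).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec(
    sentences=LineSentence("wiki_wakati.txt"),  # space-separated tokens, one article per line
    vector_size=200,   # -size 200
    window=5,          # -window 5
    sample=1e-3,       # -sample 1e-3
    negative=5,        # -negative 5
    hs=0,              # -hs 0
    workers=1,         # -threads 1
    epochs=1,          # -iter 1
)
model.wv.save_word2vec_format("wiki_wakati_w2v.bin", binary=True)  # -binary 1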
python
$ ./distance wiki_wakati_w2v.bin
Let's look for closely related words.
python
Enter word or sentence (EXIT to break):Amuro
Word:Amuro Position in vocabulary: 30293
Word Cosine distance
------------------------------------------------------------------------
Char 0.837395
Aslan 0.772420
Amuro Ray 0.766061
Shinji 0.742949
Gohan 0.739751
Camille 0.725921
Kitaro 0.724508
That is the kind of result I got. As mentioned above, the desktop PC gave better results; in the notebook's vectors there were many entries where two or more words were joined by a ? symbol, so some processing step apparently failed there.
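The same nearest-neighbour search can also be done from Python by loading the binary file with gensim's KeyedVectors. Again just a sketch, assuming gensim is installed; the query has to use a token exactly as MeCab/NEologd produced it, so the Japanese form below is an assumption.
python
# Sketch: query the words closest to a given word in the trained vectors.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki_wakati_w2v.bin", binary=True)
for word, score in wv.most_similar("アムロ", topn=7):  # "Amuro"; token form is an assumption
    print(f"{word}\t{score:.6f}")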
I also tried analogy questions such as: Tokyo is to Japan as what is to France?
python
$ ./word-analogy wiki_wakati_w2v.bin
Darth Vader is to Luke Skywalker as what is to Nobita?
python
Enter three words (EXIT to break):Luke Skywalker Darth Vader Nobita
Word:Luke Skywalker Position in vocabulary: 95245
Word:Darth Vader Position in vocabulary: 68735
Word:Nobita Position in vocabulary: 11432
Word Distance
------------------------------------------------------------------------
Gian 0.652843
Suneo 0.645669
Shizuka 0.614481
Doraemon 0.609560
Keroro 0.608829
Kitaro 0.607345
It is satisfying that Gian and Suneo came out on top. For some reason, Kitaro shows up again.
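For completeness, the word-analogy query corresponds to gensim's most_similar with positive and negative word lists (Darth Vader - Luke Skywalker + Nobita). This is only a sketch, and the Japanese token forms are assumptions that depend on how the corpus was tokenized.
python
# Sketch: "Luke Skywalker is to Darth Vader as Nobita is to ?" via vector arithmetic.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("wiki_wakati_w2v.bin", binary=True)
results = wv.most_similar(
    positive=["ダース・ベイダー", "のび太"],   # Darth Vader + Nobita
    negative=["ルーク・スカイウォーカー"],     # minus Luke Skywalker
    topn=6,
)
for word, score in results:
    print(f"{word}\t{score:.6f}")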
Following similar steps, it should be possible to make use of corpora published by various organizations as plain text files. I have also become interested in building small but high-performance vector data.