I will write an article about natural language processing for the first time in a while.
This time, I downloaded a Wikipedia dump as an .xml file, tokenized the text, trained Word2Vec, and extracted similar words, with an eye toward applying it to sentiment analysis later. Also, starting this time, I bought a USB drive for machine learning and created my working directory there, so I will also keep a record of how to connect to an external disk and launch Jupyter Notebook!
The article I referred to this time was [Python] How to use Word2Vec.
I will update this at a later date so that the same thing can be done in Google Colaboratory, which I used before, but first I will cover the local method!
I think word2vec is a term you hear often once you've touched natural language processing a little: it converts a word into a vector representation. This vectorized form is also called a distributed representation.
Basically, words are vectorized based on the idea that words with similar meanings and usages appear in similar contexts. For this reason, words with similar meanings and usages end up close to each other in the vector space, that is, they have a high cosine similarity.
Cosine similarity is one of the topics I want to cover in a later article, so to explain it briefly: it is a measure that ignores the length (magnitude) of the vectors and looks only at how close their directions are. The measure that does take magnitude into account is the Euclidean distance familiar from high school mathematics. In natural language processing, the length of a vector is not that relevant and we are more interested in whether words are similar, so cosine similarity is often adopted.
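As a rough sketch of the difference (the vectors here are made-up values purely for illustration, using NumPy):

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # made-up word vector
b = np.array([2.0, 4.0, 6.5])  # another made-up vector pointing in almost the same direction

# Cosine similarity: compares direction only, length is ignored
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Euclidean distance: differences in length also count
euclid = np.linalg.norm(a - b)

print(cos_sim, euclid)  # cosine comes out close to 1 even though the distance is not small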
Since Word2Vec captures meaning, it is interesting that you can add and subtract word meanings. For example:
- "King" - "Man" + "Woman" = "Princess"
- "Japan" - "Tokyo" + "Seoul" = "Korea"
I think it's interesting that you can say: if you remove the "man" element from "king" and add the "woman" element, you get a "princess" or "queen".
This time, create a new folder on the external drive in the location where you want to keep the Word2Vec files and code. You need to access this folder from your terminal.
$ cd /Volumes/~USB or external hardware name~
You can specify it with /Volumes/~~ like this. It's easy!! As a bonus, when you want to train on very heavy data, I think this is very useful: by working on external hardware you can train while avoiding pressure on the PC's own storage.
First, you need to prepare the corpus needed to create the model. This time, we will use Wikipedia data that is available to everyone as shown in the reference article.
Dumped data of Japanese Wikipedia is available at https://dumps.wikimedia.org/jawiki/latest/. If you look at this site, you can download only the Wikipedia titles, or pick different kinds of data depending on your purpose. This time I want the article text of Wikipedia, so:
$ curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2
This .bz2 file is about 3 GB and is very heavy data, so I used the external hardware. Once it has downloaded, you can decompress it by clicking the file. Decompression took a long time.
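If you prefer to decompress from the terminal instead of clicking the file, something like the following should also work (-d decompresses, -k keeps the original .bz2):

$ bzip2 -dk jawiki-latest-pages-articles.xml.bz2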
As I mentioned last time, I work in a virtual environment. We will do this in a virtual environment called wiki. If the commands succeed, the terminal prompt will change to (wiki) $.
$ conda create -n wiki python=3.7
$ conda activate wiki
Type the following command.
$ pip install wikiextractor
$ python -m wikiextractor.WikiExtractor jawiki-latest-pages-articles.xml
When you run this command, a folder called text is created, folders "AA", "AB", ... are created inside it, and text files "wiki_00" to "wiki_99" are created under those. (This also took a long time.)
When the above is finished, run the following command to concatenate everything into a single file.
$ find text/ | grep wiki | awk '{system("cat "$0" >> wiki.txt")}'
wiki.txt came to 3.6 GB on my machine. (It's a big file... you can see how big Wikipedia is.)
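If you want to check the size yourself, a quick way is the following (du reports the size in human-readable units):

$ du -h wiki.txt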
$ sed -i '' '/^<[^>]*>$/d' wiki.txt
$ sed -i '' '/^$/d' wiki.txt
As you can see by checking the first 100 lines of the data in the terminal, there are blank lines and <doc> tags left over, so I deleted them with the commands above to clean up the data. (This isn't mentioned in the reference material, but I felt it was necessary, so I looked into it myself.)
$ head -n100 wiki.txt
Typing a command like this outputs the first 100 lines so you can check them!
I was also curious about the size, so I looked up the command for counting lines and characters. (On my machine it was about 14 million lines and 1 billion characters...)
$ wc -ml wiki.txt
This completes the preparation before creating the model.
I used a library called gensim.
$ pip install --upgrade gensim
It seems the text needs to be tokenized (split into words) in order to use gensim, so I used MeCab for word segmentation. Since I used MeCab in a previous article, I will omit the installation part; you can install it from the link above!
$ mecab -Owakati wiki.txt -o wiki_wakati.txt
The word segmentation took a lot of time. I named the result wiki_wakati.txt, but that data is also very heavy...
In addition, I converted the file to UTF-8 with nkf, because leftover binary/garbled data would interfere with later analysis. You can do it with the following command.
$ nkf -w --overwrite wiki_wakati.txt
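If you want to confirm the result, nkf can also guess a file's encoding:

$ nkf --guess wiki_wakati.txt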
That is the part performed in the terminal. From here on I will use Jupyter Notebook. I was able to run Jupyter Notebook on the external hardware as well. I think it's really convenient!!
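For the record, launching it from the USB drive is just a matter of moving to the drive first (the project folder name here is a placeholder; use whatever folder you created):

$ cd /Volumes/~USB or external hardware name~/your-project-folder
$ jupyter notebook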
from gensim.models import word2vec
import logging

# Show training progress in the log
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Load the tokenized (space-separated) Wikipedia text
sentences = word2vec.Text8Corpus('./wiki_wakati.txt')
# Train 200-dimensional vectors; ignore words seen fewer than 20 times; context window of 15; 3 iterations
model = word2vec.Word2Vec(sentences, size=200, min_count=20, window=15, iter=3)
# Save the word vectors in binary word2vec format
model.wv.save_word2vec_format("./wiki.vec.pt", binary=True)
The output will be stored in a file called wiki.vec.pt.
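One caveat: the parameter names above are those of gensim 3.x. If you happen to be on gensim 4.0 or later, size and iter were renamed, so the call would look roughly like this (same settings, just renamed keyword arguments):

model = word2vec.Word2Vec(sentences, vector_size=200, min_count=20, window=15, epochs=3)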
I tried the program listed in the reference material.
from gensim.models import KeyedVectors
wv = KeyedVectors.load_word2vec_format('./wiki.vec.pt', binary=True)
results = wv.most_similar(positive=['lecture'])
for result in results:
    print(result)
This program outputs words whose meanings are similar to 'lecture'. By swapping the word I wanted to look up into the place of 'lecture', I could check anything. As a hobby I like HIP HOP, so I put in the HIP HOP artist "ZEEBRA" as the keyword. I thought it was funny that the results had names like 'Mummy-D' and 'KREVA' lined up next to each other lol (sorry if that doesn't mean anything to you...)
Also, since the day I wrote this article was 12/24... not only Christmas but also Halloween and Valentine's Day were lined up in the results lol
Please try it!!
Below are the results of doing two of the examples given earlier.
sim_do = wv.most_similar(positive=["King", "Female"], negative=["male"], topn=5)
print(*[" ".join([v, str("{:.5f}".format(s))]) for v, s in sim_do], sep="\n")
I was able to output like this.
Having tried Word2Vec, I found that it can recognize common words like these quite accurately. "ZEEBRA" came up because he is famous, but if you want to handle more niche proper nouns, such as rappers who are mainly active underground or anime characters, it seems you can still do it with Word2Vec by newly registering them in a user dictionary! I will try that someday.
Thank you for reading this far. If you liked it, please follow me and give it an LGTM!! Also, since I am still studying, please point out any mistakes or give me advice; it would be very helpful. If you have time, please read my other articles too!!
References:
- [Python] How to use Word2Vec
- Euclidean distance vs cosine similarity
- Sentiment analysis with Python (word2vec)