[Python] I introduced Word2Vec and played with it.


This is my first natural language processing article in a while. This time I trained Word2Vec on Wikipedia: I downloaded the .xml dump, split it into words, and extracted similar words, with the aim of applying the result to sentiment analysis. I also bought a USB drive for machine learning and created a working directory on it, so I will record how to work on an external disk and launch Jupyter Notebook from there as well!

This article is based on the reference article "[Python] How to use Word2Vec".

I will update it at a later date so that it also works with Google Colaboratory, which I have used before, but first I will cover the local method!

What is word2vec?

If you have touched natural language processing even a little, you have probably heard the term. word2vec means converting a word into a vector representation. This vector representation is also called a distributed representation.

Basically, words are vectorized based on the idea that words with similar meanings and usages appear in similar contexts. For this reason, words that have similar meanings and usages are close to each other in the vector space, that is, they have a high cosine similarity.

Cosine similarity is something I want to cover in a later article, so briefly: it is a measure that ignores the magnitude (length) of the vectors and looks only at how closely their directions agree. The measure that does take magnitude into account is the Euclidean distance familiar from high school mathematics. In natural language processing the length of a vector is not very relevant and we mostly care whether words are similar, so cosine similarity is often used.
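To make the difference concrete, here is a minimal sketch in plain Python. The two vectors are made-up 2-D examples (not real word vectors), and the function names are my own: one vector points in the same direction as the other but is ten times longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: depends on direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b: affected by vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vectors: v2 has the same direction as v1 but 10x the length.
v1 = [1.0, 2.0]
v2 = [10.0, 20.0]

cos = cosine_similarity(v1, v2)
dist = euclidean_distance(v1, v2)
print(round(cos, 6))   # 1.0 (same direction)
print(round(dist, 3))  # 20.125 (far apart in Euclidean terms)
```

The cosine similarity treats the two vectors as identical, while the Euclidean distance reports them as far apart, which is the distinction described above.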

Because Word2Vec captures meaning, it is interesting that you can add and subtract the meanings of words. For example:

- "King" - "Man" + "Woman" = "Princess"
- "Japan" - "Tokyo" + "Seoul" = "Korea"

In other words, if you remove the "male" element from "king" and add the "female" element, you get "princess" or "queen".
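The arithmetic can be sketched with a toy example. The vectors below are hand-made 2-D values that I chose for illustration ([royalty, gender]); they are not real Word2Vec output, and the `analogy` helper is hypothetical. Real libraries normalize the vectors and search a large vocabulary, but the idea is the same: build the target vector, then pick the nearest remaining word by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hand-made toy vectors (assumption, for illustration only): [royalty, gender]
toy_vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def analogy(positive, negative, vectors):
    """Return the word closest (by cosine) to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Exclude the input words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

answer = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(answer)  # queen
```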

I actually tried it

Create a directory on USB

First, create a new folder in the location where you want to keep the files and run the Word2Vec code. You need to access this folder from your terminal.

$ cd /Volumes/~USB or external hardware name~

You can specify it as /Volumes/~~ like this. It's easy!! This is also practical: when I want to train on very heavy data, working on external hardware avoids putting pressure on the PC's own storage.

Download necessary data (prepare a corpus)

First, you need to prepare the corpus needed to create the model. As in the reference article, I use Wikipedia data, which is available to everyone.

Dump data of Japanese Wikipedia is available at https://dumps.wikimedia.org/jawiki/latest/. The site offers several kinds of dumps, for example titles only, so you can pick the data that suits your purpose. This time I want the article text of Wikipedia, so

$ curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2

This .bz2 file is about 3 GB, which is very heavy, so I saved it to the external drive. Once the download finishes, you can decompress it by double-clicking the file. Decompression took a long time.
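If you would rather script the decompression than click the file, a minimal sketch using Python's standard bz2 module could look like the following. The helper name `decompress_bz2` is my own, and the small sample file exists only so the snippet can be run without the real dump.

```python
import bz2
import os
import shutil
import tempfile

def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file so it never has to fit in memory."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

# Self-check on a throwaway file. For the real dump you would call:
# decompress_bz2("jawiki-latest-pages-articles.xml.bz2",
#                "jawiki-latest-pages-articles.xml")
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "sample.xml.bz2")
    dst = os.path.join(tmp, "sample.xml")
    with bz2.open(src, "wb") as f:
        f.write(b"<page>test</page>\n")
    decompress_bz2(src, dst)
    with open(dst, "rb") as f:
        restored = f.read()

print(restored)
```

Copying in chunks matters here because the decompressed dump is far larger than the 3 GB archive.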

Create a virtual environment

As I mentioned last time, I work in a virtual environment. This time it is called wiki. If the commands succeed, the terminal prompt changes to (wiki) $.

$ conda create -n wiki python=3.7
$ conda activate wiki

Install and run WikiExtractor.

Type the following command.

$ pip install wikiextractor
$ python -m wikiextractor.WikiExtractor jawiki-latest-pages-articles.xml

Running this command creates a folder called text, with subfolders "AA, AB, ..." inside it and text files "wiki_00 to wiki_99" under each of them. (It took a long time.)

When the extraction is finished, combine the text files into wiki.txt with the following command.

$ find text/ | grep wiki | awk '{system("cat "$0" >> wiki.txt")}'

wiki.txt came to 3.6 GB on my machine. (It's a big file... you can see how big Wikipedia is.)
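The same concatenation can be sketched in Python with the standard library. This is an assumption-level equivalent of the shell one-liner, not something from the reference article: the function name `combine_wiki_files` is hypothetical, and the tiny directory tree below only imitates the text/AA/wiki_00 layout.

```python
import shutil
import tempfile
from pathlib import Path

def combine_wiki_files(text_dir, out_path):
    """Concatenate every file named wiki_* under text_dir; return the file count."""
    files = sorted(p for p in Path(text_dir).rglob("wiki_*") if p.is_file())
    with open(out_path, "wb") as out:
        for path in files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(files)

# Throwaway tree that imitates the extractor output layout.
with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "text" / "AA"
    sub.mkdir(parents=True)
    (sub / "wiki_00").write_text("first\n", encoding="utf-8")
    (sub / "wiki_01").write_text("second\n", encoding="utf-8")
    out_file = Path(tmp) / "wiki.txt"
    n_files = combine_wiki_files(Path(tmp) / "text", out_file)
    combined = out_file.read_text(encoding="utf-8")

print(n_files, repr(combined))
```

Unlike the shell version, which appends with >>, this sketch overwrites the output file each time, so running it twice does not duplicate the corpus.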

Delete the doc tag and blank lines with the following command. (Data shaping)

$ sed -i '' '/^<[^>]*>$/d' wiki.txt
$ sed -i '' '/^$/d' wiki.txt 

If you check the first 100 lines of the data in the terminal, you can see blank lines and doc tags, so I deleted them to clean up the data. (This step is not in the reference article, but I felt it was necessary and looked into it myself.) Note that sed -i '' is the macOS (BSD) form; GNU sed on Linux takes -i without the empty argument.

$ head -n100 wiki.txt

By typing the command like this, you can output the first 100 lines and check it!
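For readers who want to see what the two sed commands do, here is a minimal Python sketch of the same filtering. The regular expression is the one from the first sed command; the sample lines are made up to resemble the extractor output, and the attribute values in the doc tag are placeholders.

```python
import re

# Same pattern as the first sed command: a line consisting of a single tag.
TAG_LINE = re.compile(r"^<[^>]*>$")

def clean_lines(lines):
    """Yield lines that are neither a lone tag (<doc ...>, </doc>) nor empty."""
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "" or TAG_LINE.match(stripped):
            continue
        yield line

# Made-up sample in the shape of the extracted text (assumption).
sample = [
    '<doc id="1" url="..." title="言語">\n',
    "言語\n",
    "\n",
    "言語はコミュニケーションの手段である。\n",
    "</doc>\n",
]
cleaned = list(clean_lines(sample))
print(cleaned)
```

For the real wiki.txt you would pass an open file object to `clean_lines` and write the result to a new file, since the whole file does not fit comfortably in memory.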

Check the number of lines and characters in wiki.txt

I was interested, so I looked it up and found the command. (It was about 14 million lines and 1 billion characters at hand ...)

$ wc -ml wiki.txt
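A rough Python counterpart of this count is sketched below, streaming the file so that a multi-gigabyte wiki.txt is never loaded at once. The function name is my own, and I assume the file is UTF-8; the small sample file only exists to make the snippet runnable.

```python
import os
import tempfile

def count_lines_and_chars(path, encoding="utf-8"):
    """Return (newline count, character count), reading line by line."""
    n_lines = 0
    n_chars = 0
    # newline="\n" keeps line endings untranslated so characters are counted as stored.
    with open(path, encoding=encoding, newline="\n") as f:
        for line in f:
            if line.endswith("\n"):
                n_lines += 1
            n_chars += len(line)
    return n_lines, n_chars

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8", newline="\n") as f:
        f.write("こんにちは\n世界\n")
    counts = count_lines_and_chars(sample_path)

print(counts)  # (2, 9)
```

Characters are counted after decoding, so each Japanese character counts as one, which is the point of the -m option as opposed to counting bytes.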

This completes the preparation before creating the model.


I used a library called gensim.

$ pip install --upgrade gensim

gensim needs the text split into words (wakati-gaki), so I used MeCab for the word segmentation. I covered MeCab in a previous article, so I will omit the installation here.

$ mecab -Owakati wiki.txt -o wiki_wakati.txt

The word segmentation took a long time. I named the output wiki_wakati.txt, and that file is also very heavy... In addition, I converted the file to UTF-8, because any binary data left in it would interfere with the later analysis. You can do that with the following command.

$ nkf -w --overwrite wiki_wakati.txt
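To show why the space-separated format matters, here is a small sketch of how such text becomes the token lists that Word2Vec training consumes. The sample sentences are made up to imitate `mecab -Owakati` output, and `iter_sentences` is a hypothetical helper; gensim ships its own readers for this, so this is only an illustration of the idea.

```python
def iter_sentences(lines):
    """Yield each non-empty line as a list of whitespace-separated tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens

# Made-up sample imitating word-segmented output (assumption).
wakati_sample = [
    "私 は 自然 言語 処理 を 勉強 し て い ます \n",
    "\n",
    "今日 は 良い 天気 です \n",
]
sentences = list(iter_sentences(wakati_sample))
print(sentences[1])
```

Because Japanese has no spaces between words, the training step has nothing to split on until a tokenizer such as MeCab inserts them.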

That is the part performed in the terminal. From here on I use Jupyter Notebook. I was able to run Jupyter Notebook on the external hardware as well. I think it's really convenient!!

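from gensim.models import word2vec
import logging

# Show training progress in the notebook
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# Read the space-separated (wakati) corpus
sentences = word2vec.Text8Corpus('./wiki_wakati.txt')

# Note: gensim 4.x renamed size -> vector_size and iter -> epochs
model = word2vec.Word2Vec(sentences, size=200, min_count=20, window=15, iter=3)
# Save only the word vectors in the binary word2vec format
model.wv.save_word2vec_format("./wiki.vec.pt", binary=True)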

The output is stored in a file called wiki.vec.pt.

Case Study

I tried the program listed in the reference material.

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('./wiki.vec.pt', binary=True)
results = wv.most_similar(positive=['lecture'])
for result in results:
    print(result)

This code outputs words whose meaning is similar to 'lecture'. You can look up any word by putting it in place of 'lecture'. As a hobby I like hip-hop, so I used the hip-hop artist "ZEEBRA" as the keyword. I found it interesting that the results lined up names such as 'Mummy-D' and 'KREVA' lol (sorry if that doesn't come across...)

Also, since I wrote this article on 12/24... the results listed not only Christmas but also Halloween and Valentine's Day lol

Please try it!!

I tried adding and subtracting the meaning of words

Below is how I ran the examples given earlier (the code shows the first one).

sim_do = wv.most_similar(positive=["King", "Female"], negative=["male"], topn=5)
# Print each word with its similarity score to five decimal places
print(*[" ".join([v, "{:.5f}".format(s)]) for v, s in sim_do], sep="\n")

I was able to output like this.

Trying Word2Vec, I found that it handles common words like these quite accurately. "ZEEBRA" appeared because he is famous, but for less common proper nouns, such as rappers who are mostly active underground or anime characters, it seems you can handle them by registering them as new entries in a dictionary! I will try that someday.

Thank you for reading this far. If you like, please follow me and give an LGTM!! I am still studying, so please point out any mistakes or give me advice. It will be very helpful. If you have the time, please read my other articles too!!


- How to use Word2Vec
- Euclidean distance vs cosine similarity
- Sentiment analysis with Python (word2vec)
