Natural language processing with Word2Vec, developed by a researcher at Google in the US (with original data)


"Word2Vec" is a method proposed by Google researcher Thomas Mikolov and others, and is a natural language processing method that has made it possible to dramatically improve the accuracy of some problems compared to conventional algorithms.

Word2Vec, as the name implies, is a method that quantifies words by expressing them as vectors. For example, the everyday vocabulary of a Japanese speaker is said to run from tens of thousands to hundreds of thousands of words, and Word2Vec expresses each of those words as a vector in a space of roughly 200 dimensions.

As a result, it became possible to compute similarities between words that were previously hard to quantify, and even to add and subtract words, which lets us capture something of their "meaning".
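As a quick illustration of these two operations, here is a minimal sketch using the gensim library (my own choice of tool; the rest of this article uses Google's original C implementation):

# -*- coding: utf-8 -*-
# Minimal sketch with gensim (assumes gensim >= 4.0); the toy corpus only
# demonstrates the API -- meaningful vectors require a large corpus.
from gensim.models import Word2Vec

sentences = [
    ["cat", "meows"],
    ["dog", "barks"],
    ["cat", "and", "dog", "are", "pets"],
]
# sg=1 selects skip-gram; vector_size=200 matches the dimensionality above
model = Word2Vec(sentences, vector_size=200, window=5, min_count=1, sg=1)

# Similarity between two words (cosine similarity of their vectors)
print(model.wv.similarity("cat", "dog"))

# Word arithmetic: words most similar to vector("cat") + vector("dog")
print(model.wv.most_similar(positive=["cat", "dog"], topn=3))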

That description alone may not convey how interesting it is, so let's put it into practice right away.

1. Environment construction

Check out the Word2Vec source code using Subversion:

mkdir ~/word2vec_test
cd ~/word2vec_test
svn checkout http://word2vec.googlecode.com/svn/trunk/
cd trunk
make
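If the build succeeds, the trunk directory should contain the binaries used in the rest of this article, including word2vec (training), distance (similar-word search), and word-analogy (word arithmetic).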

# word2vec installation is complete

2. Try it with test data for the time being

cd trunk
./demo-word.sh

This will start training on the test data.

2-1. After training, try using it

When training finishes, an input prompt appears, so try entering words such as "cat" and "dog".

The output for the input "cat" is as follows (each number is the cosine similarity to the input word):

cats		0.603425
feline		0.583455
kitten		0.569622
meow		0.565481
purebred	0.558347
dog			0.545779

3. Try it on your own Facebook post data from hundreds of thousands of people (this is the main event!)

Export the Facebook post data to CSV (fb_post_for_word2vec.csv), with the post message in the first column and the link title in the second column.

Parse the generated file with MeCab to create a space-separated word file (fb_word_for_word2vec.txt).

3-1. Generate data using MeCab

# -*- coding: utf-8 -*-
import csv
import re

import MeCab

# -Owakati makes MeCab output the text as space-separated words
tagger = MeCab.Tagger('-Owakati')

with open('fb_word_for_word2vec.txt', 'w') as fo:
    for line in csv.reader(open('fb_post_for_word2vec.csv', newline='')):
        if len(line) == 2:
            line = line[0] + line[1]  # post message + link title
        elif len(line) == 1:
            line = line[0]            # post message only
        else:
            continue
        line = re.sub(r'https?://.*', '', line)  # strip URLs to the end of the line
        fo.write(tagger.parse(line))
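For reference, the -Owakati option simply separates words with spaces. For example (a hypothetical input):

>>> tagger.parse('今日はいい天気です')
'今日 は いい 天気 です \n'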

3-2. Read the generated txt file and train it

#terminal
time ./word2vec -train fb_word_for_word2vec.txt -output fb_post.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -binary 1
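Briefly, the options mean: -cbow 0 selects the skip-gram model instead of CBOW, -size 200 sets the vector dimensionality to 200, -window 5 sets the context window to 5 words, -negative 0 -hs 1 uses hierarchical softmax rather than negative sampling, -sample 1e-3 subsamples very frequent words, and -binary 1 writes the model in binary format.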

3-3. Try using it after training

#terminal

# To list words in descending order of similarity to a given word
./distance fb_post.bin

# To evaluate equations over word vectors (analogies)
./word-analogy fb_post.bin

When you run either command, an input prompt appears.

3-3-1. Extract words and degrees similar to a certain word (distance)

Enter a word, and the most similar words are displayed from the top.

ruby
⇒
rails		0.726545
js			0.719732
rbenv		0.715303
javascript	0.685051
gem			0.684497
python		0.677852
scala		0.672012

# Rails is at the top, followed by object-oriented languages and gems.

docker
⇒
apache		0.672672
jenkins		0.668232
ruby		0.661645
redis		0.653154
Vagrant		0.645885
rbenv		0.643476

# Infrastructure and development-environment tools line up nicely.

Hanzawa
⇒
Managing Director Ohwada	0.794253
Reconciliation		0.655206
Sandwiched		0.634274
Naoki Hanzawa		0.632742
Nonomura		0.630198
Passion			0.604290
Parody			0.490672
Prefectural Assembly	0.472910

3-3-2. Add and subtract words and play around (this is the interesting part: word-analogy)

If you enter three words "A B C", it answers the question "if A ⇒ B, then C ⇒ ?". Since each word is a vector, internally it computes B - A + C and returns the words closest to the result.
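As a rough sketch of what that computation looks like (my own reimplementation with numpy, assuming wv is a dict mapping each word to its vector; the real tool reads the binary model directly):

import numpy as np

def analogy(wv, a, b, c, topn=5):
    # "A is to B as C is to ?": rank words by cosine similarity to B - A + C
    query = wv[b] - wv[a] + wv[c]
    query /= np.linalg.norm(query)
    scores = {w: float(np.dot(v / np.linalg.norm(v), query))
              for w, v in wv.items() if w not in (a, b, c)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]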

It's quite amusing when you actually try it (lol).

Ichiro - Baseball + Honda
⇒
Soccer			0.612238
First match		0.588327
Basketball		0.562973
Rugby			0.543752
College baseball	0.537109
Yokohama High School	0.536245
Practice match		0.535091

Japan - Tokyo + France
⇒
Sapporo		0.569258
Paris		0.566437
Milan		0.560036
London		0.552840
Osaka		0.541102
Venice		0.540721

Rice - Egg + Buckwheat
⇒
Chicken		0.686967
Leek		0.670782
Salt		0.663107
Miso		0.654149
Fish meal	0.648807
Shrimp		0.648329

4. Other things you can do

- Calculate the similarity between phrases as well as words
- Word clustering (see the sketch below)
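For clustering, the original C package supports k-means directly via the -classes option. Alternatively, here is a minimal sketch that loads the fb_post.bin trained above using the gensim and scikit-learn libraries (my own assumptions; neither is used elsewhere in this article):

from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

# Load the binary model written by the C word2vec tool
wv = KeyedVectors.load_word2vec_format('fb_post.bin', binary=True)

# Cluster all word vectors into 50 groups (50 is an arbitrary choice)
kmeans = KMeans(n_clusters=50, n_init=10).fit(wv.vectors)

# Show a few words assigned to cluster 0
print([w for w, c in zip(wv.index_to_key, kmeans.labels_) if c == 0][:10])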

Summary

Being able to express words as vectors comes close to being able to express the relationships between all kinds of things.

Almost any relationship that can be described in words, whether between people, between images, between regions, between things, or between people and things, can be plotted in a space of about 200 dimensions.

Accuracy still needs work, but if you think about how to apply this, it has the smell of something that could turn out to be genuinely valuable!
