"Word2Vec" is a method proposed by Google researcher Thomas Mikolov and others, and is a natural language processing method that has made it possible to dramatically improve the accuracy of some problems compared to conventional algorithms.
Word2Vec, as the name implies, is a quantification method that expresses words as vectors. For example, the vocabulary that Japanese speakers use day to day is said to run from tens of thousands to hundreds of thousands of words, but Word2Vec expresses each of these words as a vector in a space of about 200 dimensions.
As a result, operations that were previously impossible or hard to do accurately, such as measuring the similarity between words or adding and subtracting words, become possible, and the "meaning" of a word can be captured.
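As a rough sketch of what this buys you: once words are vectors, "similarity" is just the cosine of the angle between two vectors, and "addition / subtraction" is ordinary vector arithmetic. The toy numbers below are made up for illustration, not real Word2Vec output.

# -*- coding: utf-8 -*-
import numpy as np

# Toy 3-dimensional "word vectors" (made-up values; real Word2Vec
# vectors would have ~200 learned dimensions)
vec = {
    'cat': np.array([0.9, 0.1, 0.2]),
    'dog': np.array([0.8, 0.2, 0.3]),
    'car': np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 = similar, close to 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec['cat'], vec['dog']))  # high: similar words
print(cosine(vec['cat'], vec['car']))  # low: dissimilar words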
This seems quite interesting, so let's put it into practice right away.
Check out the Word2Vec source code using Subversion:
mkdir ~/word2vec_test
cd ~/word2vec_test
svn checkout http://word2vec.googlecode.com/svn/trunk/
cd trunk
make
# word2vec installation is complete
cd trunk
./demo-word.sh
Training on the test data will start (demo-word.sh downloads the text8 corpus and trains on it).
When training finishes, an input prompt appears, so try entering words such as "cat" or "dog".
The output when you actually enter "cat" is as follows.
cats 0.603425
feline 0.583455
kitten 0.569622
meow 0.565481
purebred 0.558347
dog 0.545779
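Incidentally, the same lookup can be scripted instead of typed into the interactive prompt. If gensim is installed (pip install gensim), it can read the C binary format that demo-word.sh writes out (vectors.bin); a minimal sketch:

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors

# Load the binary vectors produced by demo-word.sh
model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# Equivalent of typing "cat" into ./distance
for word, similarity in model.most_similar('cat', topn=6):
    print(word, similarity)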
Export the Facebook post data to CSV (fb_post_for_word2vec.csv), putting the post message in the first column and the link title in the second column.
Run the generated file through MeCab to create a space-separated word file (fb_word_for_word2vec.txt).
# -*- coding: utf-8 -*-
import csv
import re

import MeCab

tagger = MeCab.Tagger('-Owakati')  # wakati mode: output space-separated tokens

fo = open('fb_word_for_word2vec.txt', 'w')
for row in csv.reader(open('fb_post_for_word2vec.csv', 'rU')):
    # Join the post message (column 1) and the link title (column 2)
    if len(row) == 2:
        line = row[0] + row[1]
    elif len(row) == 1:
        line = row[0]
    else:
        continue
    # Strip URLs before tokenizing
    line = re.sub(r'https?://\S+', '', line)
    fo.write(tagger.parse(line))
fo.close()
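For reference, wakati mode simply inserts spaces between tokens; word2vec expects whitespace-delimited words, which Japanese text does not naturally have. Something like this (output shown as a comment) is what the script is writing out:

# -*- coding: utf-8 -*-
import MeCab

tagger = MeCab.Tagger('-Owakati')
# Wakati ("word-splitting") mode returns the tokens joined by spaces,
# roughly: 今日 は いい 天気
print(tagger.parse('今日はいい天気'))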
#terminal
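# Train a skip-gram model (-cbow 0): 200-dimensional vectors, 5-word window,
# hierarchical softmax instead of negative sampling (-hs 1 -negative 0),
# binary output format (-binary 1)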
time ./word2vec -train fb_word_for_word2vec.txt -output fb_post.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -binary 1
#terminal
# To list words in descending order of similarity to a given word:
./distance fb_post.bin
# To do addition and subtraction on word vectors (analogies):
./word-analogy fb_post.bin
Running either tool brings up an input prompt.
Enter a word, and the most similar words are listed from the top.
ruby
⇒
rails 0.726545
js 0.719732
rbenv 0.715303
javascript 0.685051
gem 0.684497
python 0.677852
scala 0.672012
# Rails tops the list, followed by other object-oriented languages and gem-related terms.
docker
⇒
apache 0.672672
jenkins 0.668232
ruby 0.661645
redis 0.653154
Vagrant 0.645885
rbenv 0.643476
# Infrastructure and development-environment tools line up, just as you'd expect.
Hanzawa
⇒
Managing Director Ohwada 0.794253
Reconciliation 0.655206
Sandwiched 0.634274
Naoki Hanzawa 0.632742
Nonomura 0.630198
Passion 0.604290
Parody 0.490672
Prefectural Assembly 0.472910
If you enter three words "A B C", it returns the answer to "A is to B as C is to ?". Since each word is a vector, it appears to be computing B - A + C and returning the nearest words.
It's quite amusing when you actually try it (lol).
Ichiro - Baseball + Honda
⇒
Soccer 0.612238
First match 0.588327
Basketball 0.562973
Rugby 0.543752
College baseball 0.537109
Yokohama High School 0.536245
Practice match 0.535091

Japan - Tokyo + France
⇒
Sapporo 0.569258
Paris 0.566437
Milan 0.560036
London 0.552840
Osaka 0.541102
Venice 0.540721

Rice - Egg + Buckwheat
⇒
Chicken 0.686967
Leek 0.670782
Salt 0.663107
Miso 0.654149
Fish meal 0.648807
Shrimp 0.648329
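The same arithmetic can be scripted with gensim as well: most_similar takes positive and negative word lists, so "A is to B as C is to ?" becomes roughly the sketch below. The words must exist in fb_post.bin's vocabulary, and with this corpus they would be the original Japanese tokens; English names are used here only for readability.

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('fb_post.bin', binary=True)

# "Japan is to Tokyo as France is to ?"  ->  Tokyo - Japan + France
for word, similarity in model.most_similar(positive=['Tokyo', 'France'],
                                           negative=['Japan'], topn=5):
    print(word, similarity)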
Things I'd like to try next:
- Calculating the similarity between phrases, not just words
- Word clustering (a rough sketch follows below)
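Word clustering, at least, is easy to prototype by running k-means over the learned vectors. A minimal sketch with scikit-learn and a recent gensim (the cluster count of 50 is an arbitrary choice):

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

model = KeyedVectors.load_word2vec_format('fb_post.bin', binary=True)

# Stack every word vector into a matrix and cluster it
words = list(model.index_to_key)
kmeans = KMeans(n_clusters=50).fit(model[words])

# Peek at the cluster assignments of the first few words
for word, label in zip(words[:10], kmeans.labels_[:10]):
    print(word, label)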
Being able to express a word as a vector comes close to being able to express the relationships between all sorts of things.
Almost anything that can be described in words, whether people and people, images and images, regions and regions, things and things, or people and things, can be plotted in a 200-dimensional space.
Accuracy still needs work, but think about how to use this and it smells like something genuinely valuable could come out of it!