"Word2Vec" is a method proposed by Google researcher Thomas Mikolov and others, and is a natural language processing method that has made it possible to dramatically improve the accuracy of some problems compared to conventional algorithms.
Word2Vec, as the name implies, is a quantification method that expresses words as vectors. For example, the vocabulary that Japanese speakers use day to day is said to run from tens of thousands to hundreds of thousands of words, but Word2Vec expresses each of these words as a vector in a space of about 200 dimensions.
As a result, operations that were previously impossible or hard to do accurately, such as measuring the similarity between words or adding and subtracting words, become possible, and the "meaning" of a word can be captured.
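As a rough sketch of what this buys you: once words are vectors, "similarity" is just the cosine of the angle between two vectors, and "addition / subtraction" is ordinary vector arithmetic. The toy numbers below are made up for illustration, not real Word2Vec output.

# -*- coding: utf-8 -*-
import numpy as np

# Toy 3-dimensional "word vectors" (made-up values; real Word2Vec
# vectors would have ~200 learned dimensions)
vec = {
    'cat': np.array([0.9, 0.1, 0.2]),
    'dog': np.array([0.8, 0.2, 0.3]),
    'car': np.array([0.1, 0.9, 0.7]),
}

def cosine(a, b):
    # Cosine similarity: close to 1.0 = similar, close to 0.0 = unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vec['cat'], vec['dog']))  # high: similar words
print(cosine(vec['cat'], vec['car']))  # low: dissimilar words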
This seems quite interesting, so let's put it into practice right away.
Check out the Word2Vec source code using Subversion:
mkdir ~/word2vec_test
cd ~/word2vec_test
svn checkout http://word2vec.googlecode.com/svn/trunk/
cd trunk
make
# word2vec installation is complete
cd trunk
./demo-word.sh
Training on the test data will start (demo-word.sh downloads the text8 corpus and trains on it).
When training finishes, an input prompt appears, so try entering words such as "cat" or "dog".
The output when you actually enter "cat" is as follows.
cats 0.603425
feline 0.583455
kitten 0.569622
meow 0.565481
purebred 0.558347
dog 0.545779
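Incidentally, the same lookup can be scripted instead of typed into the interactive prompt. If gensim is installed (pip install gensim), it can read the C binary format that demo-word.sh writes out (vectors.bin); a minimal sketch:

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors

# Load the binary vectors produced by demo-word.sh
model = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# Equivalent of typing "cat" into ./distance
for word, similarity in model.most_similar('cat', topn=6):
    print(word, similarity)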
Export the Facebook post data to CSV (fb_post_for_word2vec.csv), putting the post message in the first column and the link title in the second column.
Run the generated file through MeCab to create a space-separated word file (fb_word_for_word2vec.txt).
# -*- coding: utf-8 -*-
import csv
import re

import MeCab

tagger = MeCab.Tagger('-Owakati')  # wakati mode: output space-separated tokens

fo = open('fb_word_for_word2vec.txt', 'w')
for row in csv.reader(open('fb_post_for_word2vec.csv', 'rU')):
    # Join the post message (column 1) and the link title (column 2)
    if len(row) == 2:
        line = row[0] + row[1]
    elif len(row) == 1:
        line = row[0]
    else:
        continue
    # Strip URLs before tokenizing
    line = re.sub(r'https?://\S+', '', line)
    fo.write(tagger.parse(line))
fo.close()
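For reference, wakati mode simply inserts spaces between tokens; word2vec expects whitespace-delimited words, which Japanese text does not naturally have. Something like this (output shown as a comment) is what the script is writing out:

# -*- coding: utf-8 -*-
import MeCab

tagger = MeCab.Tagger('-Owakati')
# Wakati ("word-splitting") mode returns the tokens joined by spaces,
# roughly: 今日 は いい 天気
print(tagger.parse('今日はいい天気'))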
#terminal
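# Train a skip-gram model (-cbow 0): 200-dimensional vectors, 5-word window,
# hierarchical softmax instead of negative sampling (-hs 1 -negative 0),
# binary output format (-binary 1)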
time ./word2vec -train fb_word_for_word2vec.txt -output fb_post.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -binary 1
#terminal
# To list words in descending order of similarity to a given word:
./distance fb_post.bin
# To do addition and subtraction on word vectors (analogies):
./word-analogy fb_post.bin
Running either tool brings up an input prompt.
Enter a word, and the most similar words are listed from the top.
ruby
⇒
rails 0.726545
js 0.719732
rbenv 0.715303
javascript 0.685051
gem 0.684497
python 0.677852
scala 0.672012
# Rails tops the list, followed by other object-oriented languages and gem-related terms.
docker
⇒
apache 0.672672
jenkins 0.668232
ruby 0.661645
redis 0.653154
Vagrant 0.645885
rbenv 0.643476
# Infrastructure and development-environment tools line up, just as you'd expect.
Hanzawa
⇒
Managing Director Ohwada 0.794253
Reconciliation 0.655206
Sandwiched 0.634274
Naoki Hanzawa 0.632742
Nonomura 0.630198
Passion 0.604290
Parody 0.490672
Prefectural Assembly 0.472910
If you enter three words "A B C", it returns the answer to "A is to B as C is to ?". Since each word is a vector, it appears to be computing B - A + C and returning the nearest words.
It's quite amusing when you actually try it (lol).
Ichiro - Baseball + Honda
⇒
Soccer 0.612238
First match 0.588327
Basketball 0.562973
Rugby 0.543752
College baseball 0.537109
Yokohama High School 0.536245
Practice match 0.535091

Japan - Tokyo + France
⇒
Sapporo 0.569258
Paris 0.566437
Milan 0.560036
London 0.552840
Osaka 0.541102
Venice 0.540721

Rice - Egg + Buckwheat
⇒
Chicken 0.686967
Leek 0.670782
Salt 0.663107
Miso 0.654149
Fish meal 0.648807
Shrimp 0.648329
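The same arithmetic can be scripted with gensim as well: most_similar takes positive and negative word lists, so "A is to B as C is to ?" becomes roughly the sketch below. The words must exist in fb_post.bin's vocabulary, and with this corpus they would be the original Japanese tokens; English names are used here only for readability.

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('fb_post.bin', binary=True)

# "Japan is to Tokyo as France is to ?"  ->  Tokyo - Japan + France
for word, similarity in model.most_similar(positive=['Tokyo', 'France'],
                                           negative=['Japan'], topn=5):
    print(word, similarity)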
Things I'd like to try next:
- Calculating the similarity between phrases, not just words
- Word clustering (a rough sketch follows below)
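Word clustering, at least, is easy to prototype by running k-means over the learned vectors. A minimal sketch with scikit-learn and a recent gensim (the cluster count of 50 is an arbitrary choice):

# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors
from sklearn.cluster import KMeans

model = KeyedVectors.load_word2vec_format('fb_post.bin', binary=True)

# Stack every word vector into a matrix and cluster it
words = list(model.index_to_key)
kmeans = KMeans(n_clusters=50).fit(model[words])

# Peek at the cluster assignments of the first few words
for word, label in zip(words[:10], kmeans.labels_[:10]):
    print(word, label)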
Being able to express a word as a vector comes close to being able to express the relationships between all sorts of things.
Almost anything that can be described in words, whether people and people, images and images, regions and regions, things and things, or people and things, can be plotted in a 200-dimensional space.
Accuracy still needs work, but think about how to use this and it smells like something genuinely valuable could come out of it!