[PYTHON] Calculation of similarity between sentences using Word2Vec (simplified version)

There are ways to calculate the similarity between sentences with Doc2Vec and the like, but they are somewhat troublesome because you have to build a dedicated model from scratch. If you only need a certain degree of accuracy, it can be more versatile and easier to use a Word2Vec model as-is.

So I calculated the similarity between sentences by averaging the feature vectors of the words contained in each sentence and taking the cosine similarity between those average vectors.

Environment

# OS
macOS Sierra

# Python (using Anaconda)
Python : Python 3.5.3 :: Anaconda custom (x86_64)
pip : 9.0.1 from /Users/username/anaconda/lib/python3.5/site-packages (python 3.5)

It didn't work well with Python 3.6, so I'm using the Python 3.5 version of Anaconda ([Anaconda3-4.2.0-MacOSX-x86_64.pkg](https://repo.continuum.io/archive/Anaconda3-4.2.0-MacOSX-x86_64.pkg)).

Get & load trained model

Generating a model from a corpus took too long on my MacBook Air, so I used a published pre-trained model instead (see "The trained model of fastText has been released" in the references). This time, I will use a model (model_neologd.vec) in which Wikipedia text, tokenized with MeCab's NEologd dictionary, was trained with fastText. (Number of dimensions: 300)

Loading trained model

import gensim
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('model/model_neologd.vec', binary=False)

(Since the file is close to 1 GB, loading takes tens of seconds.)
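As an aside (this tip is not in the original article), gensim can re-save the loaded vectors in its native format, which loads much faster than re-parsing the text-format .vec file every session. A minimal sketch, with a hypothetical file name:

# One-time conversion to gensim's native format (optional speed-up; path is hypothetical)
word2vec_model.save('model/model_neologd.kv')

# Later sessions can reload it quickly, optionally memory-mapped
word2vec_model = gensim.models.KeyedVectors.load('model/model_neologd.kv', mmap='r')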

By using this model, you can perform semantic calculations on words using their feature vectors.

Word2Vec calculation (example: woman + King − Man)

import pprint
pprint.pprint(word2vec_model.most_similar(positive=['woman', 'King'], negative=['Man']))

# => [('Queen', 0.7062159180641174),
# ('Royal family', 0.6530475616455078),
# ('Royal', 0.6122198104858398),
# ('Crown prince', 0.6098779439926147),
# ('Royal family', 0.6084121465682983),
# ('princess', 0.6005773544311523),
# ('Queen', 0.5964134335517883),
# ('king', 0.593998908996582),
# ('Monarch', 0.5929002165794373),
# ('Royal palace', 0.5772185325622559)]

Similarity calculation between words by Word2Vec

# Similarity between individual words can be calculated with model.similarity
pprint.pprint(word2vec_model.similarity('King', 'Queen'))
# => 0.74155587641044496
pprint.pprint(word2vec_model.similarity('King', 'ramen'))
# => 0.036460763469822188

The results look about right.
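For reference, model.similarity is simply the cosine similarity of the two word vectors, so it can be verified by hand. A quick check with numpy (this verification snippet is my addition):

import numpy as np
v1 = word2vec_model['King']
v2 = word2vec_model['Queen']
# Cosine similarity = dot product divided by the product of the vector norms
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
# => should match word2vec_model.similarity('King', 'Queen')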

Load MeCab

Use MeCab to break natural-language text into space-separated words. Specify mecab-ipadic-neologd, the same dictionary used to build the trained model, and use the -Owakati option to get word-separated output.

import MeCab
mecab = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")

mecab.parse("He got hungry yesterday")
# => 'He got hungry yesterday \n'

The tokenized text is separated by spaces. A line break is appended at the end, so it needs to be removed in the implementation. (By the way, MeCab was installed via mecab-python3. As of May 2017 it did not seem to work properly with Python 3.6, which is another reason I had to use Python 3.5.)

Calculate the average of the feature vectors of the words used in the sentence

In this method, the average of the feature vectors of the words used in a sentence serves as the feature vector of the sentence itself, so we define a function for that.

import numpy as np
def avg_feature_vector(sentence, model, num_features):
    words = mecab.parse(sentence).replace(' \n', '').split()  # MeCab appends ' \n' to its output, so strip it before splitting
    feature_vec = np.zeros((num_features,), dtype="float32")  # Initialize the feature-vector accumulator
    for word in words:
        feature_vec = np.add(feature_vec, model[word])
    if len(words) > 0:
        feature_vec = np.divide(feature_vec, len(words))
    return feature_vec

It simply averages the feature vectors of the individual words. (Since the trained model has 300 dimensions, specify 300 for num_features.)

avg_feature_vector("He got hungry yesterday", word2vec_model, 300)
# => array([  6.39975071e-03,  -6.38077855e-02,  -1.41418248e-01,
#       -2.01289997e-01,   1.76049918e-01,   1.99666247e-02,
#             :                 :                 :
#       -7.54096806e-02,  -5.46530560e-02,  -9.14395228e-02,
#       -2.21335635e-01,   3.34903784e-02,   1.81226760e-01], dtype=float32)

Running this outputs a 300-dimensional feature vector.

Calculate the similarity between two sentences

Next, the above function is used to calculate the cosine similarity between the average vectors of the two sentences.

from scipy import spatial
def sentence_similarity(sentence_1, sentence_2):
    # The Word2Vec model used here has 300-dimensional feature vectors, so num_features is also 300
    num_features = 300
    sentence_1_avg_vector = avg_feature_vector(sentence_1, word2vec_model, num_features)
    sentence_2_avg_vector = avg_feature_vector(sentence_2, word2vec_model, num_features)
    # scipy's spatial.distance.cosine returns cosine *distance*, so subtract it from 1 to get cosine similarity
    return 1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

Using this function, you can easily calculate the similarity between sentences. (Cosine similarity ranges from -1 to 1, though for these vectors it usually lands between 0 and 1; the closer it is to 1, the more similar the sentences.)

result = sentence_similarity(
    "He ate a spicy ramen yesterday and got hungry",
    "Yesterday, I ate a spicy Chinese food and got hungry"
)
print(result)
# =>  0.973996032475

result = sentence_similarity(
    "It's no good ... I have to do something quickly ...",
    "We will deliver carefully selected job information"
)
print(result)
# => 0.608137464334

I was able to calculate plausible-looking numbers!
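Incidentally (not in the original article), gensim's KeyedVectors also provides n_similarity, which computes the same mean-vector cosine similarity directly from two token lists, so it can serve as a cross-check:

result = word2vec_model.n_similarity(
    mecab.parse("He ate a spicy ramen yesterday and got hungry").split(),
    mecab.parse("Yesterday, I ate a spicy Chinese food and got hungry").split()
)
print(result)
# => should be close to the sentence_similarity result above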

Problem with this method ①

**Long sentences score high similarity.** Because the word vectors are averaged before comparison, the averages of two long sentences are hard to tell apart, so even unrelated sentences end up with high similarity.

result = sentence_similarity(
    "It's finally in the story of this story. At last, other educators would come to this point where they shouldn't push forward, but I'm sure they'll misunderstand it, and I'm content with it to some extent.",
    "Even if I'm sick, it's like a good day. Thinking to Gauche as a mouse, your face squeezed the Doremifa's late breath and the next raccoon dog cello, and the difference between them is quite different."
)
print(result)
# => 0.878950984671

In practice, comparisons between sentences of up to about 10 words seem to be the limit.

Problem with this method ②

**Unknown words cannot be handled.** No feature vector can be produced for a word that is not in the trained model, so some workaround is needed, such as substituting the average feature vector of the other words for the missing one. (In that case, however, unknown words often carry distinctive semantic content, so the accuracy of the similarity drops.)

>>> result = sentence_similarity(
...     "Referral adoption has become popular in recent years",
...     "The era of new graduate recruitment is over"
... )
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "<stdin>", line 5, in sentence_similarity
  File "<stdin>", line 6, in avg_feature_vector
  File "/Users/username/anaconda/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 574, in __getitem__
    return self.word_vec(words)
  File "/Users/username/anaconda/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 273, in word_vec
    raise KeyError("word '%s' not in vocabulary" % word)
KeyError: "word 'Referral' not in vocabulary"

In this case, the word 'Referral' cannot be found in the vocabulary, so the lookup fails with a KeyError.
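As a minimal workaround (my sketch, not the article's code), one option is to simply skip words that are missing from the vocabulary, at the cost of ignoring whatever meaning they carry:

# Variant of avg_feature_vector that skips out-of-vocabulary words
def avg_feature_vector_skip_oov(sentence, model, num_features):
    words = mecab.parse(sentence).replace(' \n', '').split()
    known_words = [w for w in words if w in model.vocab]  # keep only words the model knows
    feature_vec = np.zeros((num_features,), dtype="float32")
    for word in known_words:
        feature_vec = np.add(feature_vec, model[word])
    if len(known_words) > 0:
        feature_vec = np.divide(feature_vec, len(known_words))
    return feature_vec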

Summary

Since the method itself is simple, the cases where it can be used are fairly limited. Conversely, if you only need to handle short sentences, it seems this method can deliver reasonable accuracy. For finding similarity between sentences in earnest, the more straightforward approach is probably to use a method like Doc2Vec and prepare a corpus for the model that suits the purpose.

Code introduced in this article (complete listing)

import gensim
import MeCab
import numpy as np
from scipy import spatial

word2vec_model = gensim.models.KeyedVectors.load_word2vec_format('model/model_neologd.vec', binary=False)
mecab = MeCab.Tagger("-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd -Owakati")

#Calculate the average of the feature vectors of the words used in the sentence
def avg_feature_vector(sentence, model, num_features):
    words = mecab.parse(sentence).replace(' \n', '').split()  # MeCab appends ' \n' to its output, so strip it before splitting
    feature_vec = np.zeros((num_features,), dtype="float32")  # Initialize the feature-vector accumulator
    for word in words:
        feature_vec = np.add(feature_vec, model[word])
    if len(words) > 0:
        feature_vec = np.divide(feature_vec, len(words))
    return feature_vec

#Calculate the similarity between two sentences
def sentence_similarity(sentence_1, sentence_2):
    # The Word2Vec model used here has 300-dimensional feature vectors, so num_features is also 300
    num_features = 300
    sentence_1_avg_vector = avg_feature_vector(sentence_1, word2vec_model, num_features)
    sentence_2_avg_vector = avg_feature_vector(sentence_2, word2vec_model, num_features)
    # scipy's spatial.distance.cosine returns cosine *distance*, so subtract it from 1 to get cosine similarity
    return 1 - spatial.distance.cosine(sentence_1_avg_vector, sentence_2_avg_vector)

result = sentence_similarity(
    "He ate a spicy ramen yesterday and got hungry",
    "Yesterday, I ate a spicy Chinese food and got hungry"
)
print(result)
# => 0.973996032475

References

- The trained model of fastText has been released
- Which is better, cosine similarity or Doc2Vec?
