[PYTHON] Understand Word2Vec

Introduction

This is a summary of what I have learned about Word2Vec, which has become a standard method in natural language processing. I first organize the outline of the algorithm and then create a model using a library.

References

In studying Word2Vec, I referred to the following resources.

- Deep Learning from Scratch ❷ ― Natural Language Processing, Yasuki Saito (Author)
- How Word2vec works with pictures

Word2Vec overview

The following describes the natural language processing concepts that Word2Vec builds on.

Distributed representation of words

Representing a word as a fixed-length vector is called a "distributed word representation". If a word can be expressed as a vector, its meaning can be grasped quantitatively, which makes it applicable to various kinds of processing. **Word2Vec is also a method for acquiring distributed representations of words.**
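As a minimal sketch (the three-dimensional vectors below are made up purely for illustration and not learned from data), once words are expressed as fixed-length vectors, their closeness can be measured numerically, for example with cosine similarity:

```python
import numpy as np

# Toy example: each word is a fixed-length (here 3-dimensional) vector.
# These vectors are invented for illustration, not learned from data.
vectors = {
    "coffee": np.array([0.8, 0.1, 0.3]),
    "tea":    np.array([0.7, 0.2, 0.4]),
    "car":    np.array([0.1, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Measure how close two word vectors are (1.0 = pointing the same way)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["coffee"], vectors["tea"]))  # relatively high
print(cosine_similarity(vectors["coffee"], vectors["car"]))  # relatively low
```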

Distributional hypothesis

Various vectorization methods are being studied in the world of natural language processing, but the main approach is based on the idea that **"the meaning of a word is formed by the words surrounding it"**, which is called the **distributional hypothesis**. **Word2Vec, introduced in this article, is also based on the distributional hypothesis.**

Count-based and inference-based

There are two main approaches for acquiring distributed representations of words: the **count-based method** and the **inference-based method**. The count-based method expresses words by the frequency of their surrounding words and obtains distributed representations from **statistical data of the entire corpus**. The inference-based method, on the other hand, uses a neural network and repeatedly updates weights **while looking at small batches of training samples**. **Word2Vec falls under the latter.**
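To make the count-based idea concrete, here is a minimal sketch (a toy English corpus and a window of one word on each side, both chosen only for illustration) that builds a co-occurrence matrix; each row of the matrix can then serve as a count-based word vector:

```python
import numpy as np

# Toy corpus and vocabulary for illustration only
corpus = "you say goodbye and i say hello".split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}

# Count, for every word, how often each other word appears
# within a window of one word on each side
co_matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for idx, word in enumerate(corpus):
    for offset in (-1, 1):
        ctx = idx + offset
        if 0 <= ctx < len(corpus):
            co_matrix[word_to_id[word], word_to_id[corpus[ctx]]] += 1

print(vocab)
print(co_matrix)  # each row is a count-based "vector" for one word
```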

Word2vec algorithm

Below, we will explain the contents of the Word2Vec algorithm.

Neural network models used in Word2Vec

Word2Vec uses one of the following two models:

- CBOW (continuous bag-of-words) model
- skip-gram model

I will explain the mechanism of each model below.

CBOW model

Overview

The CBOW model is a neural network whose purpose is to **infer a target word from its context**. By training this CBOW model to make its inferences as accurate as possible, we can obtain distributed representations of words.

How many words before and after to use as the context is decided when the model is created. If, for example, one word on each side is used as the context, then in the following sentence the word "?" is inferred from "every morning (毎朝)" and the particle "o (を)".

私 は 毎朝 ［？］ を 飲み ます ("I drink [ ? ] every morning")
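As a minimal sketch of how training examples for CBOW are formed (the token list below is a hypothetical tokenization of the sentence above), each inner word becomes a target and its immediate neighbours become the context:

```python
# Hypothetical tokenization of the example sentence above
tokens = ["I", "wa", "every_morning", "coffee", "o", "drink", "masu"]

# With a window of one word on each side, every inner word becomes a target
# and its two neighbours become the context
pairs = []
for i in range(1, len(tokens) - 1):
    context = [tokens[i - 1], tokens[i + 1]]
    target = tokens[i]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# e.g. ['every_morning', 'o'] -> coffee
```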

The structure of the CBOW model is shown below. There are two input layers, and the signal reaches the output layer via a middle (intermediate) layer.

(Figure: CBOW model structure, with two input layers, a middle layer, and an output layer. Original image: NN.png)

The middle layer in the figure above holds the "averaged" value of the fully connected transformations of the two input layers. If the first input layer is transformed to $h_1$ and the second to $h_2$, the middle-layer neurons hold $\frac{1}{2}(h_1 + h_2)$.

The transformation from the input layer to the middle layer is done by a fully connected layer (with weights $W_{in}$). Here the weight matrix $W_{in}$ has the shape $8 \times 3$, and **this weight is the distributed representation of the words created with CBOW**.
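A minimal NumPy sketch of this forward pass (vocabulary size 8 and hidden size 3 to match the figure; the word indices and random weights are arbitrary) looks like this:

```python
import numpy as np

np.random.seed(0)
vocab_size, hidden_size = 8, 3          # matches the 8 x 3 W_in above

W_in = np.random.randn(vocab_size, hidden_size)   # rows become the word vectors
W_out = np.random.randn(hidden_size, vocab_size)

# one-hot vectors for the two context words (indices chosen arbitrarily)
c0 = np.zeros(vocab_size); c0[2] = 1.0   # e.g. "every morning"
c1 = np.zeros(vocab_size); c1[4] = 1.0   # e.g. "o"

h = 0.5 * (c0 @ W_in + c1 @ W_in)        # middle layer: average of the two projections
scores = h @ W_out                       # one score per vocabulary word
print(scores)
```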

Learning the CBOW model

The CBOW model outputs a score for each word in the output layer, and a "probability" can be obtained by applying the Softmax function to those scores. This probability indicates which word appears in the middle when the preceding and following words are given.

(Figure: the output-layer scores are converted to probabilities with the Softmax function. Original image: nn3.png)

In the example above, the context is "every morning (毎朝)" and "o (を)", and the word the neural network should predict is "coffee". With appropriately trained weights, the neuron corresponding to the correct answer can be expected to show a high "probability". In CBOW training, the cross-entropy error between the correct label and the probability output by the neural network is used as the loss, and learning proceeds in the direction that reduces this loss.
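The following is a minimal sketch of that step: converting the output-layer scores to probabilities with Softmax and computing the cross-entropy loss for the correct word (the scores and the index of the correct word are made up for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())              # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(probs, correct_id):
    # loss for one example: minus the log probability of the correct word
    return -np.log(probs[correct_id] + 1e-7)

scores = np.array([0.2, -1.0, 0.5, 2.0, 0.1, -0.3, 0.0, 0.4])  # toy output scores
probs = softmax(scores)
loss = cross_entropy(probs, correct_id=3)   # suppose index 3 is "coffee"
print(probs[3], loss)
```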

The loss function of the CBOW model is expressed as follows (when the context used is one word before and after):


L = -\frac{1}{T}\sum_{t=1}^{T}\log P(w_{t} \mid w_{t-1}, w_{t+1})

By learning so as to make this loss function as small as possible, the weights at that point can be obtained as distributed representations of words.

skip-gram model

The skip-gram model reverses the context and target handled by CBOW: as shown below, it predicts the multiple context words before and after from the central word.

私 は ［？］ コーヒー ［？］ 飲み ます (predict the two [ ? ] words surrounding "coffee")

An image of the skip-gram model is shown below.

(Figure: skip-gram model structure, with one input layer and one output layer per context word. Original image: nn4.png)

The skip-gram model has only one input layer, and there are as many output layers as there are context words. The loss is calculated individually for each output layer, and their sum is the final loss.

The loss function of the skip-gram model is expressed by the following formula (when the context used is one word before and after):


L = -\frac{1}{T}\sum_{t=1}^{T}\left(\log P(w_{t-1} \mid w_{t}) + \log P(w_{t+1} \mid w_{t})\right)

Since the skip-gram model makes a prediction for each context word, its loss function is the sum of the losses obtained for each context.
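A minimal sketch of this, reusing the same toy sizes and arbitrary word indices as before: the centre word's vector is looked up in $W_{in}$, a probability distribution over the vocabulary is produced, and the losses for the two context words are summed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

np.random.seed(0)
vocab_size, hidden_size = 8, 3
W_in = np.random.randn(vocab_size, hidden_size)
W_out = np.random.randn(hidden_size, vocab_size)

center_id = 3                 # e.g. "coffee" (index chosen arbitrarily)
context_ids = [2, 4]          # e.g. "every morning" and "o"

h = W_in[center_id]           # single input layer: the centre word's vector
probs = softmax(h @ W_out)    # probability over the vocabulary

# the skip-gram loss is the sum of the losses for each context word
loss = sum(-np.log(probs[c] + 1e-7) for c in context_ids)
print(loss)
```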

CBOW and skip-gram

Comparing CBOW and skip-gram, the skip-gram model is generally said to give better results, and the larger the corpus, the better skip-gram performs, particularly for infrequent words and analogy problems. On the other hand, skip-gram has a higher training cost because the loss must be computed for each context word, so CBOW is faster to train.

Creating a Word2vec model using the library

In the following, we will actually create a Word2Vec model using the library.

Dataset

A Word2Vec model can be created easily using gensim, a Python library. This time, the "livedoor news corpus" is used as the dataset. For details of the dataset and the morphological analysis method, please refer to the previously posted article.

In the case of Japanese, preprocessing that decomposes each sentence into morphemes is required beforehand, so after decomposing all the sentences into morphemes, they are loaded into the following DataFrame.

(Screenshot: the DataFrame after morphological analysis; the rightmost column holds each document as space-separated morphemes. Original image: スクリーンショット 2020-01-13 21.07.38.png)

The rightmost column contains each document after morphological analysis, with the morphemes separated by half-width spaces. This column is used to create the Word2Vec model.
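For reference, one way to produce such a space-separated column is sketched below. It assumes the janome morphological analyzer is installed (`pip install janome`); the previously posted article may well use a different tool, so treat this only as an illustration:

```python
# Minimal sketch using the janome morphological analyzer (pip install janome).
# The previously posted article may use a different analyzer; this is illustrative only.
from janome.tokenizer import Tokenizer

tokenizer = Tokenizer()

def to_wakati(text):
    """Split a Japanese sentence into morphemes separated by half-width spaces."""
    return ' '.join(token.surface for token in tokenizer.tokenize(text))

print(to_wakati('私は毎朝コーヒーを飲みます'))
# -> something like: 私 は 毎朝 コーヒー を 飲み ます
```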

Model learning

Create a Word2vec model using gensim. Below are the main parameters for creating a model.

| Parameter | Meaning |
|:---|:---|
| sg | 1 trains a skip-gram model; 0 trains a CBOW model |
| size | Number of dimensions of the distributed representation to learn |
| window | Number of words before and after the target to treat as context |
| min_count | Words appearing fewer than this many times are ignored |

Below is the code to create a Word2Vec model. Once the input text has been prepared, the model itself can be created in a single line.


# df is the DataFrame shown above; column 3 holds the space-separated morphemes
sentences = []
for text in df[3]:
    text_list = text.split(' ')   # split each document into a list of words
    sentences.append(text_list)

from gensim.models import Word2Vec
# sg=1 -> skip-gram, size -> dimensionality of the word vectors
# (note: in gensim >= 4.0 the `size` argument is called `vector_size`)
model = Word2Vec(sentences, sg=1, size=100, window=5, min_count=1)
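After training, the model can be saved and reloaded, and the vector for any word in the vocabulary can be retrieved. Here is a minimal sketch using the standard gensim API (the file name is arbitrary, and the word must appear in the training corpus):

```python
# Persist the trained model and reload it later (file name is arbitrary)
model.save('word2vec.gensim.model')
# model = Word2Vec.load('word2vec.gensim.model')

# Retrieve the 100-dimensional vector of a word that appears in the corpus
vector = model.wv['family']   # any word present in the training vocabulary
print(vector.shape)           # (100,)
```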

What you can do with Word2Vec

We have now obtained distributed representations of words with the Word2Vec model. Using them, the semantic distance between words can be expressed quantitatively, and word meanings can be added to and subtracted from one another.

Let's check the words close to "family" using the model created earlier.

for i in model.most_similar('family'):
    print(i)
('Parent and child', 0.7739133834838867)
('Lover', 0.7615703344345093)
('bonds', 0.7321233749389648)
('friend', 0.7270181179046631)
('Danran', 0.724891185760498)
('friend', 0.7237613201141357)
('Two people', 0.7198089361190796)
('couple', 0.6997368931770325)
('To each other', 0.6886075735092163)
('deepen', 0.6761922240257263)

Words that seem semantically close to "family", such as "parent and child" and "lover", appear at the top. Next, let's do arithmetic between words. The following computes "life" - "happiness".

for i in model.most_similar(positive='life',negative='happiness'):
    print(i)
('cash', 0.31968846917152405)
('Ora', 0.29543358087539673)
('Repair', 0.29313164949417114)
('Donation', 0.2858077883720398)
('user', 0.2797638177871704)
('frequency', 0.27897265553474426)
('Appropriate', 0.2780274450778961)
('tax', 0.27565300464630127)
('From', 0.273759663105011)
('budget', 0.2734326720237732)

The corpus used for training this time is not very large, so the result is somewhat debatable, but the word "cash" comes out on top. Since the distributed representations of words depend on the input corpus, I think it is necessary to consider what kind of corpus to use depending on the situation in which Word2Vec is applied.

Next

This article gave a rough overview of Word2Vec. From the next article onwards, I would like to summarize Doc2Vec, which is a development of Word2Vec. Thank you for reading to the end.
