This is a continuation of Word2Vec using the Microsoft Cognitive Toolkit (CNTK).
In Part 2, Word2Vec is trained with CNTK on the Japanese corpus prepared in Part 1. It is assumed that CNTK and an NVIDIA GPU with CUDA are installed.
The Japanese corpus was prepared in Natural Language: Word2Vec Part1 - Japanese Corpus.
In Part 2, we will create and train a Skip-gram model, which is well known as a neural language model.
Word2Vec [1] proposes two models: Continuous Bag-of-Words (CBOW) and Skip-gram.

The CBOW model takes the surrounding words as input and predicts the center word. The Skip-gram model, conversely, takes a word as input and predicts the words that appear around it. The number of words considered before and after is called the window size, and values of 2 to 5 are commonly used.
The embedding layer has 100 dimensions, and the output layer has no bias term.
This time I train a Skip-gram model with a window size of 5 to obtain distributed representations of words.
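To make the role of the window size concrete, here is a minimal sketch of how (center word, context word) training pairs could be generated for Skip-gram. The function name `skipgram_pairs` and the English token list are hypothetical and chosen only for illustration; the actual training script may build its data differently.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the beginning and end of the sequence.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, not taken from the corpus in Part 1.
tokens = ["the", "quick", "brown", "fox", "jumps"]
pairs = skipgram_pairs(tokens, window_size=2)
for center, context in pairs[:4]:
    print(center, "->", context)
```

Each pair becomes one training example: the center word is the input and the context word is the label to predict. A larger window produces more pairs per word and captures broader context.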
Each parameter is initialized with the CNTK default setting, which in most cases is the Glorot uniform distribution [2].
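As a rough illustration of Glorot uniform initialization, the sketch below samples weights from a uniform distribution whose range depends on the input and output sizes. It uses the commonly cited formula limit = sqrt(6 / (fan_in + fan_out)); whether CNTK applies exactly this scaling internally is an assumption here, and the helper `glorot_uniform` is hypothetical.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
# 3,369 words in the Part 1 corpus, 100-dimensional embedding layer.
vocab_size, embed_dim = 3369, 100
embedding = glorot_uniform(vocab_size, embed_dim, rng)
limit = math.sqrt(6.0 / (vocab_size + embed_dim))
print("weights lie in [-%.4f, %.4f]" % (limit, limit))
```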
Word2Vec can be viewed as a classification problem that predicts which word appears for a given input word, so the first idea that comes to mind is to apply the Softmax function to the output layer and use the cross entropy error as the loss function. However, when the number of words is very large, computing the Softmax function takes time. Various methods [3] that approximate the Softmax function have therefore been devised to speed up the output layer. This time, I chose Sampled Softmax [4] from among them.
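The idea behind Sampled Softmax can be sketched in plain Python as follows: instead of normalizing over every word, the loss is computed over the true word plus a few randomly sampled words. This is a simplified sketch, not the CNTK implementation. It assumes uniform sampling, in which case the usual log-probability correction term is the same constant for every class and can be dropped. All function names are hypothetical.

```python
import math
import random


def _logit(hidden, weights):
    # Output layer without a bias term, as in the model described above.
    return sum(h * w for h, w in zip(hidden, weights))


def _cross_entropy(logits, target_pos):
    # Numerically stable log-sum-exp minus the target logit.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target_pos]


def full_softmax_loss(hidden, out_weights, target):
    """Cross entropy normalized over the whole vocabulary."""
    logits = [_logit(hidden, w) for w in out_weights]
    return _cross_entropy(logits, target)


def sampled_softmax_loss(hidden, out_weights, target, num_samples, rng):
    """Cross entropy over the target plus uniformly sampled other words."""
    candidates = [c for c in range(len(out_weights)) if c != target]
    negatives = rng.sample(candidates, num_samples)
    classes = [target] + negatives  # the target sits at position 0
    logits = [_logit(hidden, out_weights[c]) for c in classes]
    return _cross_entropy(logits, 0)


rng = random.Random(0)
vocab_size, embed_dim = 50, 8  # toy sizes for illustration only
out_weights = [[rng.uniform(-1, 1) for _ in range(embed_dim)]
               for _ in range(vocab_size)]
hidden = [rng.uniform(-1, 1) for _ in range(embed_dim)]
full = full_softmax_loss(hidden, out_weights, target=3)
sampled = sampled_softmax_loss(hidden, out_weights, target=3,
                               num_samples=5, rng=rng)
print("full: %.3f  sampled: %.3f" % (full, sampled))
```

The sampled loss only needs `num_samples + 1` dot products per example instead of one per vocabulary word, which is where the speedup comes from. Its value is a training surrogate and is not directly comparable to the full Softmax loss.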
Adam [5] was used as the optimization algorithm. The learning rate is 0.01, the hyperparameter $ β_1 $ is 0.9, and $ β_2 $ is left at the CNTK default value.
The model was trained for 100 epochs with a mini-batch size of 128.
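For reference, a single Adam update for one scalar parameter could look like the sketch below. The learning rate and $ β_1 $ follow the settings above. The values `beta2=0.999` and `eps=1e-8` are the ones suggested in the Adam paper and are assumptions here; they are not necessarily the CNTK defaults. The function `adam_step` is hypothetical.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update at step t (starting from 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=0.5, m=m, v=v, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.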
・ CPU Intel(R) Core(TM) i7-6700K 4.00GHz
・ GPU NVIDIA GeForce GTX 1060 6GB
・ Windows 10 Pro 1909
・ CUDA 10.0
・ cuDNN 7.6
・ Python 3.6.6
・ cntk-gpu 2.7
・ pandas 0.25.0
The training program is available on GitHub.
word2vec_training.py
I ran several checks using the distributed representations of words obtained from the Skip-gram training.
[similarity]magic
Yu:0.80
Hiding:0.79
Produced:0.77
beneficial:0.77
New:0.76
The five words most similar to "magic" are displayed. Here "magic" is a term used within the work, so its meaning differs from the general one.
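A similarity ranking like the one above can be sketched as follows, assuming the scores are cosine similarities between embedding vectors (the article does not state the measure explicitly). The toy embeddings and the helper names are hypothetical; real vectors would come from the trained embedding layer.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, embeddings, top_k=5):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(embeddings[query], vec))
              for word, vec in embeddings.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical 3-dimensional embeddings for illustration only.
embeddings = {
    "magic": [1.0, 0.0, 0.0],
    "spell": [0.9, 0.1, 0.0],
    "sword": [0.0, 1.0, 0.0],
    "river": [0.0, 0.0, 1.0],
}
result = most_similar("magic", embeddings, top_k=2)
for word, score in result:
    print("%s:%.2f" % (word, score))
```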
[analogy]Hazuki-lotus+Jin= ?
directed by:0.27
Confluence:0.25
Role:0.25
building:0.24
You:0.23
This is the result of inferring a word by analogy from the relationships between the characters. Subtracting "lotus" from the main character Hazuki and adding Jin, who is hostile to them, gives "directed by", that is, the director, as the top result. This is a reasonable result.
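The analogy query amounts to vector arithmetic followed by a nearest-neighbor search. The sketch below assumes cosine similarity as the ranking measure and uses the classic king/man/woman toy example with hand-made vectors; neither the vectors nor the helper `analogy` come from the trained model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def analogy(a, b, c, embeddings, top_k=5):
    """Rank words by similarity to the vector a - b + c."""
    query = [x - y + z
             for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    scores = [(word, cosine(query, vec))
              for word, vec in embeddings.items() if word not in (a, b, c)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical axes: (royal, male, female, fruit).
embeddings = {
    "king":  [1.0, 1.0, 0.0, 0.0],
    "man":   [0.0, 1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}
ranking = analogy("king", "man", "woman", embeddings, top_k=2)
print(ranking)
```

The three query words are excluded from the candidates, since the input words themselves usually sit closest to the query vector.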
The word embedding layer obtained by the Skip-gram model is high-dimensional data and is difficult to grasp intuitively. t-distributed Stochastic Neighbor Embedding (t-SNE) [6] is a well-known method for mapping high-dimensional data into 2D or 3D space for visualization.
This time, I set perplexity, a t-SNE parameter that controls how large a neighborhood is considered, to 5, 10, 20, 30, and 50, and visualized the results in two-dimensional space.
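To give some intuition for what perplexity measures, the sketch below computes it for a single point as 2 raised to the entropy of a Gaussian neighbor distribution, which is the definition used in the t-SNE paper. t-SNE itself searches for a bandwidth per point that matches the requested perplexity; here the bandwidth `sigma` is simply set by hand, and the distances are hypothetical.

```python
import math


def neighbor_perplexity(sq_distances, sigma):
    """Perplexity 2**H of the Gaussian neighbor distribution of one point."""
    weights = [math.exp(-d / (2.0 * sigma ** 2)) for d in sq_distances]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0.0)
    return 2.0 ** entropy


# Hypothetical squared distances from one word to five other words.
sq_distances = [1.0, 4.0, 9.0, 16.0, 25.0]
narrow = neighbor_perplexity(sq_distances, sigma=0.5)
wide = neighbor_perplexity(sq_distances, sigma=100.0)
print("narrow: %.2f  wide: %.2f" % (narrow, wide))
```

A narrow bandwidth concentrates almost all probability on the closest neighbor, giving a perplexity near 1, while a wide bandwidth treats all five neighbors almost equally, giving a perplexity near 5. Perplexity can therefore be read as an effective number of neighbors.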

This time I used Sampled Softmax as an approximation of the Softmax function. The corpus prepared in Part 1 has 3,369 words, so I also checked how much faster Sampled Softmax is when the number of words is larger.
The table below shows the average execution time per epoch, excluding the first and last epochs, when running 10 epochs on a corpus with 500,000 words. Sampled Softmax uses 5 samples.
| | mean time per epoch |
|---|---|
| full softmax | 17.5s |
| sampled softmax | 8.3s |
Sampled Softmax seems to be about twice as fast.
CNTK 207: Sampled Softmax
Deep learning library that builds on and extends Microsoft CNTK
Natural Language : Word2Vec Part1 - Japanese Corpus