[PYTHON] Use the vector learned by word2vec in the Embedding layer of LSTM

Introduction

I wanted to use word2vec vectors in the Embedding layer of models such as seq2seq, so I have summarized the word2vec functions that are useful for this. The examples use the pretrained Japanese entity vectors published by Tohoku University (http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/).

Model loading

First, load the model. Set model_dir to the path of the downloaded file (entity_vector.model.bin).

from gensim.models import KeyedVectors

# Path to the downloaded entity_vector.model.bin
model_dir = 'path/to/entity_vector.model.bin'
model = KeyedVectors.load_word2vec_format(model_dir, binary=True)
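
As a quick sanity check right after loading, you can print the vector dimensionality and the vocabulary size (the numbers match the outputs shown later in this article). Note that len(model.vocab) is the gensim 3.x API used throughout this article.

print(model.vector_size)  # 200
print(len(model.vocab))   # 1015474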

Vector output

The word vector can be obtained as follows.

print(model['I'])

output


[-0.9154563   0.97780323 -0.43780354 -0.6441212  -1.6350892   0.8619687
  0.41775486 -1.094953    0.74489385 -1.6742945  -0.34408432  0.5552686
 -3.9321985   0.3369842   1.5550056   1.3112832  -0.64914817  0.5996928
  1.6174266   0.8126016  -0.75741744  1.7818885   2.1305397   1.8607832
  3.0353768  -0.8547809  -0.87280065 -0.54103154  0.752979    3.8159864
 -1.4066186   0.78604376  1.2102902   3.9960158   2.9654515  -2.6870391
 -1.3829899   0.993459    0.86303824  0.29373714  4.0691266  -1.4974884
 -1.5601915   1.4936608   0.550254    2.678553    0.53790915 -1.7294457
 -0.46390963 -0.34776476 -1.2361934  -2.433244   -0.21348757  0.0488718
  0.8197853  -0.59188586  1.7276062   0.9713122  -0.06519659  2.4763248
 -0.93890053  0.36217824  1.887851   -0.0175399  -0.21866432 -0.81253886
 -3.9667509   2.5340643   0.02985824  0.338091   -1.3745826  -2.3509336
 -1.5615206   0.8273324  -1.263886   -1.2259561   0.9079302   2.0258994
 -0.8576754  -2.5492477  -2.45557    -0.5216256  -1.3474834   2.3590422
  1.0459667   2.0919168   1.6904455   1.7064931   0.7376105   0.2567448
 -0.8194208   0.8788849  -0.89287275 -0.22960001  1.8320689  -1.7200342
  0.8977642   1.5119879  -0.3325551   0.7429934  -1.2087826   0.5350336
 -0.03887295 -1.9642036   1.0406445  -0.80972534  0.49987233  2.419521
 -0.30317742  0.96494234  0.6184119   1.2633535   2.688754   -0.7226699
 -2.8695397  -0.8986926   0.1258761  -0.75310475  1.099076    0.90656924
  0.24586082  0.44014114  0.85891217  0.34273988  0.07071286 -0.71412176
  1.4705397   3.6965442  -2.5951867  -2.4142458   1.2733719  -0.22638321
  0.15742263 -0.717717    2.2888887   3.3045793  -0.8173686   1.368556
  0.34260234  1.1644434   2.2652006  -0.47847173  1.5130697   3.481819
 -1.5247481   2.166555    0.7633031   0.61121356 -0.11627229  1.0461875
  1.4994645  -2.8477156  -2.9415505  -0.86640745 -1.1220155   0.10772963
 -1.6050811  -2.519997   -0.13945188 -0.06943721  0.83996797  0.29909992
  0.7927955  -1.1932545  -0.375592    0.4437512  -1.4635806  -0.16438413
  0.93455386 -0.4142645  -0.92249537 -1.0754105   0.07403489  1.0781559
  1.7206618  -0.69100255 -2.6112185   1.4985414  -1.8344582  -0.75036854
  1.6177907  -0.47727013  0.88055164 -1.057859   -2.0196638  -3.5305111
  1.1221203   3.3149185   0.859528    2.3817215  -1.1856595  -0.03347144
 -0.84533554  2.201596   -2.1573794  -0.6228852   0.12370715  3.030279
 -1.9215534   0.09835044]

As you can see, the word is represented as a 200-dimensional vector.

Get all word2vec vectors as a numpy array

You can get the vectors of all the words with the following code.

w2v_vector = model.vectors  # formerly model.wv.syn0; syn0 is deprecated and was removed in gensim 4.x

Check the shape of the array.

print(w2v_vector.shape)

output


(1015474, 200)

This output shows that 1015474 words are each stored as a 200-dimensional vector.
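
Since the goal is to use these vectors in an Embedding layer, it is often convenient to reserve row 0 for a padding token. The sketch below is my own addition and assumes that index 0 will be used as the padding ID downstream; with this shift, word2vec index i maps to row i + 1.

import numpy as np

# Reserve row 0 for padding; copy the word2vec rows into rows 1 and onward.
embedding_matrix = np.zeros((w2v_vector.shape[0] + 1, w2v_vector.shape[1]), dtype=np.float32)
embedding_matrix[1:] = w2v_vector
print(embedding_matrix.shape)  # (1015475, 200)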

Get the index of a learned word

We now have the vectors for all the words, but we do not yet know which row corresponds to which word. So let's look up a word's ID (its row index) in word2vec.

print(model.vocab['I'].index)

output


1027
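
As an aside, the vocab attribute used here was removed in gensim 4.0. If you are on gensim 4.x, the equivalent lookup should be:

print(model.key_to_index['I'])  # gensim >= 4.0 replacement for model.vocab['I'].index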

This shows that the word "I" corresponds to index 1027. Now let's output the vector at index 1027.

print(w2v_vector[model.vocab['I'].index])

output


[-0.9154563   0.97780323 -0.43780354 -0.6441212  -1.6350892   0.8619687
  0.41775486 -1.094953    0.74489385 -1.6742945  -0.34408432  0.5552686
 -3.9321985   0.3369842   1.5550056   1.3112832  -0.64914817  0.5996928
  1.6174266   0.8126016  -0.75741744  1.7818885   2.1305397   1.8607832
  3.0353768  -0.8547809  -0.87280065 -0.54103154  0.752979    3.8159864
 -1.4066186   0.78604376  1.2102902   3.9960158   2.9654515  -2.6870391
 -1.3829899   0.993459    0.86303824  0.29373714  4.0691266  -1.4974884
 -1.5601915   1.4936608   0.550254    2.678553    0.53790915 -1.7294457
 -0.46390963 -0.34776476 -1.2361934  -2.433244   -0.21348757  0.0488718
  0.8197853  -0.59188586  1.7276062   0.9713122  -0.06519659  2.4763248
 -0.93890053  0.36217824  1.887851   -0.0175399  -0.21866432 -0.81253886
 -3.9667509   2.5340643   0.02985824  0.338091   -1.3745826  -2.3509336
 -1.5615206   0.8273324  -1.263886   -1.2259561   0.9079302   2.0258994
 -0.8576754  -2.5492477  -2.45557    -0.5216256  -1.3474834   2.3590422
  1.0459667   2.0919168   1.6904455   1.7064931   0.7376105   0.2567448
 -0.8194208   0.8788849  -0.89287275 -0.22960001  1.8320689  -1.7200342
  0.8977642   1.5119879  -0.3325551   0.7429934  -1.2087826   0.5350336
 -0.03887295 -1.9642036   1.0406445  -0.80972534  0.49987233  2.419521
 -0.30317742  0.96494234  0.6184119   1.2633535   2.688754   -0.7226699
 -2.8695397  -0.8986926   0.1258761  -0.75310475  1.099076    0.90656924
  0.24586082  0.44014114  0.85891217  0.34273988  0.07071286 -0.71412176
  1.4705397   3.6965442  -2.5951867  -2.4142458   1.2733719  -0.22638321
  0.15742263 -0.717717    2.2888887   3.3045793  -0.8173686   1.368556
  0.34260234  1.1644434   2.2652006  -0.47847173  1.5130697   3.481819
 -1.5247481   2.166555    0.7633031   0.61121356 -0.11627229  1.0461875
  1.4994645  -2.8477156  -2.9415505  -0.86640745 -1.1220155   0.10772963
 -1.6050811  -2.519997   -0.13945188 -0.06943721  0.83996797  0.29909992
  0.7927955  -1.1932545  -0.375592    0.4437512  -1.4635806  -0.16438413
  0.93455386 -0.4142645  -0.92249537 -1.0754105   0.07403489  1.0781559
  1.7206618  -0.69100255 -2.6112185   1.4985414  -1.8344582  -0.75036854
  1.6177907  -0.47727013  0.88055164 -1.057859   -2.0196638  -3.5305111
  1.1221203   3.3149185   0.859528    2.3817215  -1.1856595  -0.03347144
 -0.84533554  2.201596   -2.1573794  -0.6228852   0.12370715  3.030279
 -1.9215534   0.09835044]

Comparing this with the first vector we printed, you can see that the two outputs are identical.
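
Rather than comparing the two long printouts by eye, you can verify the equality programmatically:

import numpy as np

# True if the direct lookup and the row taken from the full matrix are identical
print(np.array_equal(model['I'], w2v_vector[model.vocab['I'].index]))  # True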

Finally

Now that we are ready to use the word2vec vectors, next time I will try training a seq2seq encoder model that uses them in its Embedding layer.
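
As a preview of that next step, here is a minimal sketch of how the matrix could be handed to a Keras Embedding layer. The use of Keras and the specific parameters (for example trainable=False to freeze the pretrained vectors) are my assumptions, not code from the original article.

import tensorflow as tf

# Initialize the Embedding layer with the pretrained word2vec matrix.
embedding_layer = tf.keras.layers.Embedding(
    input_dim=w2v_vector.shape[0],    # vocabulary size (1015474)
    output_dim=w2v_vector.shape[1],   # vector dimensionality (200)
    embeddings_initializer=tf.keras.initializers.Constant(w2v_vector),
    trainable=False,  # freeze the vectors; set True to fine-tune them
)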
