[PYTHON] After all, who is Embedding?

Introduction

When you start dabbling in natural language processing with deep learning, you run into an unfamiliar guy called Embedding.

Translated literally into Japanese, it is **埋め込み** ("embedding").

~~I have no idea what that means~~ I wasn't sure, so I looked it up.

What kind of operation?

Converting natural language into a form a computer can work with is called embedding. In most cases, it refers to **the operation of converting words, sentences, and so on into vector representations**.

What is it for?

There are two main reasons.

1. So that a computer can process it

Current machine learning algorithms are basically not designed to handle string types, so text first needs to be converted into a computable form.

2. Because a good conversion method can improve accuracy

Beyond simply making text computable, a well-designed vector representation lets the vectors themselves express the characteristics of words and sentences.

For example, by **converting words with similar meanings into nearby vectors**, you can express (something like) the meaning of a word through the distance and similarity between vectors; a toy sketch of both points follows below.
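To make both points concrete, here is a minimal sketch with a toy vocabulary and made-up numbers (nothing here comes from a real model): one-hot encoding satisfies point 1 but treats every pair of distinct words as equally dissimilar, whereas a dense embedding can place similar words close together.

```python
import numpy as np

# --- Point 1: make strings computable (one-hot encoding) ---
vocab = ["Good morning", "Good evening", "Hijiki"]  # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_id[word]] = 1.0
    return vec

# Every pair of distinct one-hot vectors is equally far apart,
# so the encoding carries no information about meaning.
print(np.dot(one_hot("Good morning"), one_hot("Good evening")))  # 0.0
print(np.dot(one_hot("Good morning"), one_hot("Hijiki")))        # 0.0

# --- Point 2: a good embedding places similar words close together ---
# Made-up 2-D vectors, purely for illustration
embedding = {
    "Good morning": np.array([0.9, 0.8]),
    "Good evening": np.array([0.8, 0.9]),
    "Hijiki":       np.array([-0.7, 0.1]),
}
print(np.linalg.norm(embedding["Good morning"] - embedding["Good evening"]))  # small
print(np.linalg.norm(embedding["Good morning"] - embedding["Hijiki"]))        # large
```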

Let's try running it

That's all well and good on paper, but it doesn't really sink in until you run something, so let's write some code.

Technology used

We'll do the embedding with **Word2Vec**, which is easy to use through a library called gensim, together with an existing pre-trained model used as-is.
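Loading the pre-trained model with gensim looks like this (the full script is at the end of the article; the path assumes the downloaded model file sits at `entity_vector/entity_vector.model.bin`):

```python
import gensim

# Load the pre-trained vectors (word2vec binary format)
model_path = "entity_vector/entity_vector.model.bin"
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)
```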

Embedding various things

For starters, let's embed "Good morning" and "Good evening".

print(model["Good morning"])

# [ 0.36222297 -0.5308175   0.97112703 -0.50114137 -0.41576928  1.7538059
#  -0.17550747 -0.95748925 -0.9604152  -0.0804095  -1.160322    0.22136442
# ...

print(model["Good evening"])

# [-0.13505702 -0.11360763  0.00522657 -0.01382224  0.03126004  0.14911242
#   0.02867801 -0.02347831 -0.06687803 -0.13018233 -0.01413341  0.07728481
# ...

You can see that the string has been converted to a vector.
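Incidentally, each embedding is a fixed-length numpy array; you can check the dimensionality like this (the exact number depends on the pre-trained model you downloaded):

```python
# Each word is mapped to a fixed-length numpy array
print(model["Good morning"].shape)

# The model itself also reports its dimensionality
print(model.vector_size)
```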

...Or at least that's what it seems to be telling me, so let's check whether the vectors actually express meaning.

Checking the similarity

Let's use cosine similarity, which is often used to measure document similarity. Cosine similarity ranges from -1 to 1, and the closer it is to **1, the more similar the two vectors are**.
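For reference, the cosine similarity of two vectors $a$ and $b$ is their dot product normalized by both vector norms, which is exactly what the cos_similarity function in the code at the end computes:

```math
\mathrm{cos\_similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
```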

First, let's look at the similarity between **"Good morning" and "Good evening"**.

```python
print(cos_similarity(model["Good morning"], model["Good evening"]))
# 0.8513177
```

The score comes out to about 0.85... they seem pretty close.

Now let's look at the similarity between words that should be semantically far apart.


```python
print(cos_similarity(model["Good morning"], model["Hijiki"]))
# 0.17866151
```

The score is about 0.17... **"Good morning" and "Hijiki"** can indeed be said to be far apart.

So, empirically at least, the vectors do seem to carry meaning.
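By the way, gensim's KeyedVectors can also compute this kind of similarity for you, so you don't strictly need a hand-rolled cos_similarity. A quick sketch using the same model:

```python
# similarity() returns the cosine similarity between two words
print(model.similarity("Good morning", "Good evening"))

# most_similar() lists the words whose vectors are closest to the given word
print(model.most_similar("Good morning", topn=5))
```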

In conclusion

I feel like I've finally gotten a picture of what Embedding is. I've heard that the Embedding produced by BERT, which was all the buzz a while ago, is really good, so I'll give that a try next.

Code

```python
import numpy as np
import gensim

# Load the pre-trained model (word2vec binary format)
model_path = "entity_vector/entity_vector.model.bin"
model = gensim.models.KeyedVectors.load_word2vec_format(model_path, binary=True)

# Cosine similarity: dot product normalized by both vector norms
def cos_similarity(a, b):
    return np.dot(a, b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b)))

print(model["Good morning"])
print(model["Good evening"])

print(cos_similarity(model["Good morning"], model["Good evening"]))
print(cos_similarity(model["Good morning"], model["Hijiki"]))
```
