[Python] [Japanese model included] A sentence vector model recommended for people doing natural language processing in 2020

The point

(Figure: UMAP visualization of the sentence vector space, UMAP.png)

Introduction

This article is about building a high-quality Japanese sentence vector model, that is, one in which the closer two sentences are in meaning (judged by context rather than by surface form), the closer their vectors become.

I imagine a fair number of people, whether as a hobby or in practical work, create sentence vectors in order to search for sentences with similar meanings.

Averaging word vectors is not a bad baseline, but the moment you notice a case where it performs poorly, you can find yourself lost in a maze of heuristics (try treating specific words or parts of speech as stop words, try adding a mysterious weighting with no clear justification, and so on). In addition, averaged word vectors struggle with polysemous words, and because context is ignored, similar-sentence search can go wrong on negations such as "not there". On the other hand, training a sentence vector model with an existing deep neural network such as the Universal Sentence Encoder is computationally expensive, and it can be hard to train one yourself.
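For contrast, here is a minimal sketch (not from this article) of the naive word-vector-averaging baseline criticized above. It assumes pretrained Japanese word vectors in word2vec format are available at the placeholder path below, and that the input has already been tokenized.

# Minimal sketch of the naive baseline: averaging pretrained word vectors.
# Assumption: the word-vector file path below is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("path/to/japanese_word_vectors.bin", binary=True)

def average_word_vector(tokens):
    # Averages the vectors of known tokens; ignores word order, context, and negation,
    # and gives every occurrence of a polysemous word the same vector.
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)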

For comrades with such troubles, I have created and published a high-quality Japanese sentence vector model using the method proposed in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Prerequisite knowledge

- Basic knowledge of modern natural language processing
  - What BERT is
  - What a sentence vector (distributed representation of a sentence) is
- How to use Google Colaboratory
- Experience using sentence vectors

Example of using Japanese sentence vector model

Before the technical explanation of Sentence-BERT, I will show how to use it. It is simple. The examples do two things (you can try them in Google Colaboratory):

  1. Search for sentences that have similar meanings to the given query statement.
  2. Visualize the latent semantic space of the title sentence with UMAP.

Setup

Install the required libraries and download the Japanese model. No detailed explanation should be needed. For the convenience of trying things out in Colaboratory, this time I simply move into the directory containing the source code instead of installing the Japanese sentence-transformers package; for production use, it is better to install it with setup.py, as in the commented-out In [2].

In[1]


!git clone https://github.com/sonoisa/sentence-transformers
!cd sentence-transformers; pip install -r requirements.txt

In[2]


#!cd sentence-transformers; python setup.py install

In[3]


!wget -O sonobe-datasets-sentence-transformers-model.tar "https://www.floydhub.com/api/v1/resources/JLTtbaaK5dprnxoJtUbBbi?content=true&download=true&rename=sonobe-datasets-sentence-transformers-model-2"
!tar -xvf sonobe-datasets-sentence-transformers-model.tar

Unpacking the tar creates the directory training_bert_japanese, which contains the model.

In[4]


%cd sentence-transformers

Loading Japanese model

Create a SentenceTransformer instance with the path to the directory containing the Japanese model (/content/training_bert_japanese), and the model is loaded.

In[5]


%tensorflow_version 2.x
from sentence_transformers import SentenceTransformer
import numpy as np

model_path = "/content/training_bert_japanese"
model = SentenceTransformer(model_path, show_progress_bar=False)

Sentence vector calculation

Compute the sentence vectors. Just call model.encode(a list of sentences). In this example, the sentences are (slightly modified) image titles from "Irasutoya" published in another article. (Sentences several times longer would show the effect better, but I could not prepare such data right away, so I use the titles for now. I will add an explanation once more appropriate sentences are ready.)

In[6]


# Source: excerpt from the "Irasutoya" image titles published at https://qiita.com/sonoisa/items/775ac4c7871ced6ed4c3 (the words "illustration", "mark", and "character" have been removed).
sentences = ["Male office worker bowing", "Laughter bag", "Technical evangelist (female)", "Fighting AI", "Laughing man (5 levels)", 
...
"A man staring at money and grinning", "People saying "Thank you"", "Retirement age (female)", "Technical evangelist (male)", "Standing ovation"]

In[7]


sentence_vectors = model.encode(sentences)

That is all there is to computing the sentence vectors.
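As a quick sanity check (not part of the original article), you can inspect the shape of the result. The expected dimension is the BERT hidden size, 768, though that is an assumption about this particular model.

# Sanity check (assumption: the vector dimension equals BERT's hidden size, 768).
print(len(sentence_vectors))     # number of input sentences
print(len(sentence_vectors[0]))  # dimensionality of each sentence vector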

Search for sentences with similar meanings

Using the computed sentence vectors, let's search for sentences (Irasutoya titles) with similar meanings: we look for sentence vectors with a small cosine distance to the query.

In[9]


import scipy.spatial

queries = ['Runaway AI', 'Runaway artificial intelligence', 'Thanks to Mr. Irasutoya', 'to be continued']
query_embeddings = model.encode(queries)

closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_vectors, metric="cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(sentences[idx].strip(), "(Score: %.4f)" % (distance / 2))

The output is shown below. Runaway ≒ fighting, and can suggest weapons, so I think these are natural results. And if an AI has a heart, you can certainly imagine it running out of control.

Out[9]


======================
Query:Runaway AI

Top 5 most similar sentences in corpus:
Fighting AI(Score: 0.1521)
AI with a heart(Score: 0.1666)
AI with weapons(Score: 0.1994)
Artificial intelligence / AI(Score: 0.2130)
AI for image recognition(Score: 0.2306)

Here "AI" is paraphrased as "artificial intelligence", and the results change: sentences with similar surface forms tend to rank higher.

Out[9]


======================
Query:Runaway artificial intelligence

Top 5 most similar sentences in corpus:
Artificial intelligence that robs jobs(Score: 0.1210)
People who quarrel with artificial intelligence(Score: 0.1389)
Artificial intelligence(Score: 0.1411)
Growing artificial intelligence(Score: 0.1482)
Artificial intelligence / AI(Score: 0.1629)

You can see that the model knows that thanks ≒ "Thank you".

Out[9]


======================
Query:Thanks to Mr. Irasutoya

Top 5 most similar sentences in corpus:
People saying "Thank you"(Score: 0.1381)
Fukuwarai mumps(Score: 0.1693)
Fukuwarai (Hyottoko)(Score: 0.1715)
Fukuwarai (Okame)(Score: 0.1743)
A person who holds back laughter (male)(Score: 0.1789)

You can search for similar sentences with a single word instead of a sentence.

Out[9]


======================
Query:to be continued

Top 5 most similar sentences in corpus:
"Continued" of various movies(Score: 0.1878)
Singularity(Score: 0.2703)
Fake smile(Score: 0.2811)
Thank you(Score: 0.2881)
Harisen(Score: 0.2931)

Visualize latent semantic space with TensorBoard

Run the following code to map the sentence vector space into a low-dimensional space and visualize it with Colaboratory's TensorBoard extension.

In[10]


%load_ext tensorboard
import os
logs_base_dir = "runs"
os.makedirs(logs_base_dir, exist_ok=True)

In[11]


import torch
from torch.utils.tensorboard import SummaryWriter

import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

summary_writer = SummaryWriter()
summary_writer.add_embedding(mat=np.array(sentence_vectors), metadata=sentences)

In[12]


%tensorboard --logdir {logs_base_dir}

- When TensorBoard starts, select PROJECTOR from the menu at the upper right.
- The visualization is easier to read if you set the algorithm (lower-left pane of TensorBoard) to 2D UMAP and neighbors (right pane of TensorBoard) to 10.
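If you prefer to stay in Python rather than use the TensorBoard UI, here is a minimal alternative sketch (not part of the original article) that computes a 2D UMAP embedding with the umap-learn package and plots it with matplotlib. Rendering the Japanese titles correctly also assumes a Japanese-capable font is installed.

# Alternative sketch: 2D UMAP of the sentence vectors with umap-learn + matplotlib.
# !pip install umap-learn
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_neighbors=10, n_components=2, metric="cosine", random_state=42)
embedding_2d = reducer.fit_transform(np.array(sentence_vectors))

plt.figure(figsize=(10, 10))
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], s=10)
for (x, y), title in zip(embedding_2d, sentences):
    plt.annotate(title, (x, y), fontsize=8)  # labels each point with its title
plt.show()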

If all goes well, the sentence vector space is visualized as follows: there is an "artificial intelligence" cluster in the upper left, with a cluster of "AI" titles just to its right. A "bowing" cluster appears in the lower left, various miscellaneous titles in the center, a "female" cluster on the right, and a "male" cluster toward the bottom.

TensorBoard.png

Overview of Sentence-BERT

(It's a very easy-to-read paper, so I don't think it's necessary to explain it.)

- Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Original implementation: UKPLab / sentence-transformers (note that the Japanese model does not work with the original implementation)

In a word: a sentence vector is built by mean pooling BERT's token embeddings, and this is fine-tuned with a Siamese network. It is simple. Figures 1 and 2 of the paper, quoted below, give a rough picture.

SBERT_architecture.png

Various other methods are experimented with in the paper, but for the Japanese model I adopted this network structure, which gave the highest performance. A minimal sketch of the mean-pooling step is shown below.
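To make the idea concrete, here is a minimal sketch of the pooling step only; it is not the actual training code, and fine-tuning with the Siamese network is omitted. It assumes a recent transformers version and the cl-tohoku Japanese BERT checkpoint (whose tokenizer also needs the fugashi and ipadic packages installed).

# Sketch: mean pooling of BERT token embeddings, masked by the attention mask.
# Assumptions: recent transformers version, cl-tohoku Japanese BERT checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

def mean_pooled_sentence_vector(sentence):
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = bert(**encoded).last_hidden_state  # (1, seq_len, hidden_size)
    mask = encoded["attention_mask"].unsqueeze(-1).float()    # zero out padding positions
    return ((token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)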

As shown in Table 2, quoted below, the English-language accuracy evaluation (STSbenchmark) measures Spearman's rank correlation coefficient between cosine similarity and the gold labels (the closer to 1, the better): it is 0.58 for the simple average of word vectors (GloVe), and around 0.85 for a model of the scale equivalent to the one created this time.

Performance.png

Also, according to Table 1 of the paper, the accuracy is 0.17 when using the CLS vector of plain BERT and 0.46 when using the average of plain BERT's embeddings, both worse than the word vector average (0.58). As stated in the paper, using plain BERT in this way as a sentence vector is not appropriate (it is worse than you might think).
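For reference, here is a minimal sketch (not the author's evaluation script) of this kind of STS evaluation: compute the cosine similarity of each sentence pair and measure Spearman's rank correlation against the gold similarity labels.

# Sketch of an STS-style evaluation: Spearman correlation between the model's cosine
# similarities and human-annotated gold scores (closer to 1 is better).
import scipy.spatial.distance
import scipy.stats

def sts_spearman(model, sentence_pairs, gold_scores):
    vectors_a = model.encode([a for a, _ in sentence_pairs])
    vectors_b = model.encode([b for _, b in sentence_pairs])
    cosine_similarities = [1.0 - scipy.spatial.distance.cosine(va, vb)
                           for va, vb in zip(vectors_a, vectors_b)]
    return scipy.stats.spearmanr(cosine_similarities, gold_scores).correlation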

The Japanese version of the model was created using the Japanese BERT model from huggingface/transformers (the pretrained model is cl-tohoku/bert-japanese from the Inui-Suzuki Laboratory at Tohoku University).

As for how this Japanese model was trained (the training dataset and the accuracy evaluation method), I am sorry, but it must remain secret due to various circumstances. The result, however, appears to be context-aware sentence vectors of quality comparable to the English version.

Download Japanese source code and model

The Japanese version of the source code and model used in the above example can be downloaded from the following.

Disclaimer

The author pays close attention to the content and functionality described in this article, but does not guarantee that the content is accurate or safe. Should the user suffer any inconvenience or damage from using the contents of this article, the author and the organization to which the author belongs (NSSOL, formerly NS Solutions Corporation) accept no responsibility.

Summary

I have created Japanese versions of the Sentence-BERT code and model. Now anyone can easily create high-quality sentence vectors. Please make use of them.
