[Python] [Japanese model included] A sentence vector model recommended for people doing natural language processing in 2020

The point

(Figure: UMAP visualization of the sentence vector space, UMAP.png)

Introduction

This article is about building a high-quality Japanese sentence vector model, that is, one in which the closer two sentences are in meaning (judged by context rather than by surface form), the closer their vectors become.

I imagine a fair number of people, whether as a hobby or in practical work, create sentence vectors in order to search for sentences with similar meanings.

Averaging word vectors is not a bad baseline, but the moment you notice a case where it performs poorly, you can find yourself lost in a maze of heuristics (try treating specific words or parts of speech as stop words, try adding a mysterious weighting with no clear justification, and so on). In addition, averaged word vectors struggle with polysemous words, and because context is ignored, similar-sentence search can go wrong on negations such as "not there". On the other hand, training a sentence vector model with an existing deep neural network such as the Universal Sentence Encoder is computationally expensive, and it can be hard to train one yourself.
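For contrast, here is a minimal sketch (not from this article) of the naive word-vector-averaging baseline criticized above. It assumes pretrained Japanese word vectors in word2vec format are available at the placeholder path below, and that the input has already been tokenized.

# Minimal sketch of the naive baseline: averaging pretrained word vectors.
# Assumption: the word-vector file path below is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

word_vectors = KeyedVectors.load_word2vec_format("path/to/japanese_word_vectors.bin", binary=True)

def average_word_vector(tokens):
    # Averages the vectors of known tokens; ignores word order, context, and negation,
    # and gives every occurrence of a polysemous word the same vector.
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return np.zeros(word_vectors.vector_size)
    return np.mean(vectors, axis=0)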

For comrades with such troubles, I have created and published a high-quality Japanese sentence vector model using the method proposed in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.

Prerequisite knowledge

- Basic knowledge of modern natural language processing
  - What BERT is
  - What a sentence vector (distributed representation of a sentence) is
- How to use Google Colaboratory
- Experience using sentence vectors

Example of using Japanese sentence vector model

Before the technical explanation of Sentence-BERT, I will show how to use it. It is simple. The examples do two things (you can try them in Google Colaboratory):

  1. Search for sentences that have similar meanings to the given query statement.
  2. Visualize the latent semantic space of the title sentence with UMAP.

Setup

Install the required libraries and download the Japanese model. No detailed explanation should be needed. For the convenience of trying things out in Colaboratory, this time I simply move into the directory containing the source code instead of installing the Japanese sentence-transformers package; for production use, it is better to install it with setup.py, as in the commented-out In [2].

In[1]


!git clone https://github.com/sonoisa/sentence-transformers
!cd sentence-transformers; pip install -r requirements.txt

In[2]


#!cd sentence-transformers; python setup.py install

In[3]


!wget -O sonobe-datasets-sentence-transformers-model.tar "https://www.floydhub.com/api/v1/resources/JLTtbaaK5dprnxoJtUbBbi?content=true&download=true&rename=sonobe-datasets-sentence-transformers-model-2"
!tar -xvf sonobe-datasets-sentence-transformers-model.tar

Unpacking the tar creates the directory training_bert_japanese, which contains the model.

In[4]


%cd sentence-transformers

Loading Japanese model

Create a SentenceTransformer instance with the path to the directory containing the Japanese model (/content/training_bert_japanese), and the model is loaded.

In[5]


%tensorflow_version 2.x
from sentence_transformers import SentenceTransformer
import numpy as np

model_path = "/content/training_bert_japanese"
model = SentenceTransformer(model_path, show_progress_bar=False)

Sentence vector calculation

Compute the sentence vectors. Just call model.encode(a list of sentences). In this example, the sentences are (slightly modified) image titles from "Irasutoya" published in another article. (Sentences several times longer would show the effect better, but I could not prepare such data right away, so I use the titles for now. I will add an explanation once more appropriate sentences are ready.)

In[6]


# Source: excerpt from the "Irasutoya" image titles published at https://qiita.com/sonoisa/items/775ac4c7871ced6ed4c3 (the words "illustration", "mark", and "character" have been removed).
sentences = ["Male office worker bowing", "Laughter bag", "Technical evangelist (female)", "Fighting AI", "Laughing man (5 levels)", 
...
"A man staring at money and grinning", "People saying "Thank you"", "Retirement age (female)", "Technical evangelist (male)", "Standing ovation"]

In[7]


sentence_vectors = model.encode(sentences)

That is all there is to computing the sentence vectors.
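As a quick sanity check (not part of the original article), you can inspect the shape of the result. The expected dimension is the BERT hidden size, 768, though that is an assumption about this particular model.

# Sanity check (assumption: the vector dimension equals BERT's hidden size, 768).
print(len(sentence_vectors))     # number of input sentences
print(len(sentence_vectors[0]))  # dimensionality of each sentence vector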

Search for sentences with similar meanings

Using the computed sentence vectors, let's search for sentences (Irasutoya titles) with similar meanings: we look for sentence vectors with a small cosine distance to the query.

In[9]


import scipy.spatial

queries = ['Runaway AI', 'Runaway artificial intelligence', 'Thanks to Mr. Irasutoya', 'to be continued']
query_embeddings = model.encode(queries)

closest_n = 5
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], sentence_vectors, metric="cosine")[0]

    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n\n======================\n\n")
    print("Query:", query)
    print("\nTop 5 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(sentences[idx].strip(), "(Score: %.4f)" % (distance / 2))

The output is shown below. Runaway ≒ fighting, and can suggest weapons, so I think these are natural results. And if an AI has a heart, you can certainly imagine it running out of control.

Out[9]


======================
Query:Runaway AI

Top 5 most similar sentences in corpus:
Fighting AI(Score: 0.1521)
AI with a heart(Score: 0.1666)
AI with weapons(Score: 0.1994)
Artificial intelligence / AI(Score: 0.2130)
AI for image recognition(Score: 0.2306)

Here "AI" is paraphrased as "artificial intelligence", and the results change: sentences with similar surface forms tend to rank higher.

Out[9]


======================
Query:Runaway artificial intelligence

Top 5 most similar sentences in corpus:
Artificial intelligence that robs jobs(Score: 0.1210)
People who quarrel with artificial intelligence(Score: 0.1389)
Artificial intelligence(Score: 0.1411)
Growing artificial intelligence(Score: 0.1482)
Artificial intelligence / AI(Score: 0.1629)

You can see that the model knows that thanks ≒ "Thank you".

Out[9]


======================
Query:Thanks to Mr. Irasutoya

Top 5 most similar sentences in corpus:
People saying "Thank you"(Score: 0.1381)
Fukuwarai mumps(Score: 0.1693)
Fukuwarai (Hyottoko)(Score: 0.1715)
Fukuwarai (Okame)(Score: 0.1743)
A person who holds back laughter (male)(Score: 0.1789)

You can search for similar sentences with a single word instead of a sentence.

Out[9]


======================
Query:to be continued

Top 5 most similar sentences in corpus:
"Continued" of various movies(Score: 0.1878)
Singularity(Score: 0.2703)
Fake smile(Score: 0.2811)
Thank you(Score: 0.2881)
Harisen(Score: 0.2931)

Visualize latent semantic space with TensorBoard

Run the following code to map the sentence vector space into a low-dimensional space and visualize it with Colaboratory's TensorBoard extension.

In[10]


%load_ext tensorboard
import os
logs_base_dir = "runs"
os.makedirs(logs_base_dir, exist_ok=True)

In[11]


import torch
from torch.utils.tensorboard import SummaryWriter

import tensorflow as tf
import tensorboard as tb
tf.io.gfile = tb.compat.tensorflow_stub.io.gfile

summary_writer = SummaryWriter()
summary_writer.add_embedding(mat=np.array(sentence_vectors), metadata=sentences)

In[12]


%tensorboard --logdir {logs_base_dir}

- When TensorBoard starts, select PROJECTOR from the menu at the upper right.
- The visualization is easier to read if you set the algorithm (lower-left pane of TensorBoard) to 2D UMAP and neighbors (right pane of TensorBoard) to 10.
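If you prefer to stay in Python rather than use the TensorBoard UI, here is a minimal alternative sketch (not part of the original article) that computes a 2D UMAP embedding with the umap-learn package and plots it with matplotlib. Rendering the Japanese titles correctly also assumes a Japanese-capable font is installed.

# Alternative sketch: 2D UMAP of the sentence vectors with umap-learn + matplotlib.
# !pip install umap-learn
import umap
import matplotlib.pyplot as plt

reducer = umap.UMAP(n_neighbors=10, n_components=2, metric="cosine", random_state=42)
embedding_2d = reducer.fit_transform(np.array(sentence_vectors))

plt.figure(figsize=(10, 10))
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], s=10)
for (x, y), title in zip(embedding_2d, sentences):
    plt.annotate(title, (x, y), fontsize=8)  # labels each point with its title
plt.show()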

If all goes well, the sentence vector space is visualized as follows: there is an "artificial intelligence" cluster in the upper left, with a cluster of "AI" titles just to its right. A "bowing" cluster appears in the lower left, various miscellaneous titles in the center, a "female" cluster on the right, and a "male" cluster toward the bottom.

TensorBoard.png

Overview of Sentence-BERT

(It's a very easy-to-read paper, so I don't think it's necessary to explain it.)

- Paper: Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Original implementation: UKPLab / sentence-transformers (note that the Japanese model does not work with the original implementation)

In a word: a sentence vector is built by mean pooling BERT's token embeddings, and this is fine-tuned with a Siamese network. It is simple. Figures 1 and 2 of the paper, quoted below, give a rough picture.

SBERT_architecture.png

Various other methods are experimented with in the paper, but for the Japanese model I adopted this network structure, which gave the highest performance. A minimal sketch of the mean-pooling step is shown below.
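To make the idea concrete, here is a minimal sketch of the pooling step only; it is not the actual training code, and fine-tuning with the Siamese network is omitted. It assumes a recent transformers version and the cl-tohoku Japanese BERT checkpoint (whose tokenizer also needs the fugashi and ipadic packages installed).

# Sketch: mean pooling of BERT token embeddings, masked by the attention mask.
# Assumptions: recent transformers version, cl-tohoku Japanese BERT checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

name = "cl-tohoku/bert-base-japanese-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(name)
bert = AutoModel.from_pretrained(name)

def mean_pooled_sentence_vector(sentence):
    encoded = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        token_embeddings = bert(**encoded).last_hidden_state  # (1, seq_len, hidden_size)
    mask = encoded["attention_mask"].unsqueeze(-1).float()    # zero out padding positions
    return ((token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)).squeeze(0)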

As shown in Table 2, quoted below, the English-language accuracy evaluation (STSbenchmark) measures Spearman's rank correlation coefficient between cosine similarity and the gold labels (the closer to 1, the better): it is 0.58 for the simple average of word vectors (GloVe), and around 0.85 for a model of the scale equivalent to the one created this time.

Performance.png

Also, according to Table 1 of the paper, the accuracy is 0.17 when using the CLS vector of plain BERT and 0.46 when using the average of plain BERT's embeddings, both worse than the word vector average (0.58). As stated in the paper, using plain BERT in this way as a sentence vector is not appropriate (it is worse than you might think).
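For reference, here is a minimal sketch (not the author's evaluation script) of this kind of STS evaluation: compute the cosine similarity of each sentence pair and measure Spearman's rank correlation against the gold similarity labels.

# Sketch of an STS-style evaluation: Spearman correlation between the model's cosine
# similarities and human-annotated gold scores (closer to 1 is better).
import scipy.spatial.distance
import scipy.stats

def sts_spearman(model, sentence_pairs, gold_scores):
    vectors_a = model.encode([a for a, _ in sentence_pairs])
    vectors_b = model.encode([b for _, b in sentence_pairs])
    cosine_similarities = [1.0 - scipy.spatial.distance.cosine(va, vb)
                           for va, vb in zip(vectors_a, vectors_b)]
    return scipy.stats.spearmanr(cosine_similarities, gold_scores).correlation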

The Japanese version of the model was created using the Japanese BERT model from huggingface/transformers (the pretrained model is cl-tohoku/bert-japanese from the Inui-Suzuki Laboratory at Tohoku University).

As for how this Japanese model was trained (the training dataset and the accuracy evaluation method), I am sorry, but it must remain secret due to various circumstances. The result, however, appears to be context-aware sentence vectors of quality comparable to the English version.

Download Japanese source code and model

The Japanese version of the source code and model used in the above example can be downloaded from the following.

Disclaimer

The author pays close attention to the content and functionality described in this article, but does not guarantee that the content is accurate or safe. Should the user suffer any inconvenience or damage from using the contents of this article, the author and the organization to which the author belongs (NSSOL, formerly NS Solutions Corporation) accept no responsibility.

Summary

I have created Japanese versions of the Sentence-BERT code and model. Now anyone can easily create high-quality sentence vectors. Please make use of them.
