[PYTHON] Sentence vector creation using BERT (Keras BERT)

I tried creating Japanese sentence vectors using a pretrained Japanese BERT model. Various sites explain how to create sentence vectors with BERT, but I did not know where to place the model files, and I could not get started from scratch. So in this article I use Google Colaboratory, which makes this easy to do without installing anything on my own machine.

What is Google Colaboratory?

Google Colaboratory lets you run Python programs in a cloud runtime from a browser, using only a Google account, without installing anything on your machine. Think of it as a Jupyter notebook environment in the cloud: you write Python code and run it immediately. It is free to use because Google provides it for machine learning education and research. GPUs and TPUs are also available, so there is no reason not to use it for experimentation and study.

Prepare a BERT model

First, prepare a pretrained BERT model. Thankfully, someone has trained a model on Japanese Wikipedia and published it, so I will use that model.

I learned BERT with SentencePiece on Japanese Wikipedia and published the model

That page has a **Google Drive** link, so download the files from there. You only need the following files (you do not need the largest bz2 file!):

After downloading the files, create a **bert** folder under **My Drive** on your Google Drive, create a **bert-wiki-ja** folder inside it, and upload all of the above files into that folder.

In addition, download the following file from here and upload it to the **bert-wiki-ja** folder as well. This file contains the settings required to use the model files uploaded earlier.

When you are done, Google Drive should contain the following files.

(Screenshot: googledrive.PNG)
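Before opening the notebook, it can help to confirm that the upload is complete. The following is a minimal sketch and is not part of the original notebook: `missing_files` is a hypothetical helper, and the folder path and file names in the usage section are placeholders (assumptions) that you would replace with your own Drive folder and the files you actually uploaded.

```python
from pathlib import Path


def missing_files(folder, expected_names):
    """Return the names in expected_names that are not regular files in folder."""
    folder = Path(folder)
    return [name for name in expected_names if not (folder / name).is_file()]


if __name__ == "__main__":
    # Placeholder values (assumptions): adjust to where Drive is mounted
    # and to the file names you uploaded.
    model_dir = "/content/drive/My Drive/bert/bert-wiki-ja"
    expected = ["example_model_file", "example_config_file"]
    print("Missing:", missing_files(model_dir, expected))
```

An empty list means every expected file was found. If the folder itself does not exist, every name is reported as missing.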

Run a program that creates sentence vectors

Download the following ipynb file, which is the main body of the program, from here and upload it to any location on Google Drive.

Set up Google Drive so that it can open files with Google Colaboratory, then open this file in Google Colaboratory. Once it is open, run it with Run all ([Ctrl] + [F9]).

The cells run in order, and partway through, a message about mounting Google Drive at /content/drive is displayed, as shown below. Because you need to authenticate to access the files in Google Drive, click the displayed link, grant permission by following the on-screen instructions, and paste the code shown at the end into the "Enter your authorization code:" field. Execution then continues.

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=...

Enter your authorization code:
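The prompt above appears when the notebook mounts Google Drive. In Colaboratory, a mount step is typically written with `google.colab.drive.mount`. The sketch below is an assumption about what such a cell looks like, not a copy of the notebook's code. `mount_drive` is a hypothetical wrapper name, and the import is guarded so that the function simply returns False outside Colaboratory.

```python
def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colaboratory; return True if mounted."""
    try:
        # google.colab is only importable inside the Colaboratory runtime.
        from google.colab import drive
    except ImportError:
        return False
    # Starts the authorization flow and mounts Drive at mount_point.
    drive.mount(mount_point)
    return True


if __name__ == "__main__":
    print("Drive mounted:", mount_drive())
```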

The sentence assigned to text in the last code cell is the sentence for which the sentence vector is created.

text = 'A terminal station that can be called the front door of Tokyo. In particular, it is the starting point of the Tokaido Shinkansen and Tohoku Shinkansen, and is the largest base in the nationwide Shinkansen network.'
texts2matrix([text])

The created sentence vector looks like this.

array([[ 7.48805702e-01,  6.90443218e-01, -2.08694339e-01,
         2.60837108e-01, -6.57196045e-01,  2.21781164e-01,
         2.99572378e-01, -5.03947437e-02,  2.57107586e-01,
        -3.71909142e-02,  4.70012784e-01, -4.32350069e-01,
        ...
        -2.44613029e-02, -5.86998463e-02,  3.70831758e-01,
        -2.27520689e-01,  3.76363575e-01,  2.21934259e-01,
         7.50128254e-02,  1.20648248e-02, -2.35060215e-01]], dtype=float32)

Did it run successfully? If you edit text and run this code cell again, the sentence vector is recalculated and displayed.
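One common way to use the resulting vectors is to compare two sentences by cosine similarity. The sketch below is an illustration under assumptions rather than part of the notebook: `cosine_similarity` is a hypothetical helper written with only the standard library, and the short lists stand in for real sentence vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (lists or 1-D arrays)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors made up for illustration; these are not real BERT output.
    a = [1.0, 2.0, 3.0]
    b = [2.0, 4.0, 6.0]
    c = [-1.0, -2.0, -3.0]
    print(cosine_similarity(a, b))  # same direction, so close to 1
    print(cosine_similarity(a, c))  # opposite direction, so close to -1
```

In the notebook, assuming texts2matrix returns one row per input sentence (as the output above suggests), you could pass two sentences at once and compare the two resulting rows with this function.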

References

I referred to the following articles.

- Introduction of environment construction for obtaining vector representation of sentences with BERT
- I learned BERT with SentencePiece on Japanese Wikipedia and published the model

In conclusion

I have introduced a way to create sentence vectors as easily as possible and in as few steps as possible; if something does not work, please leave a comment. Once sentences can be turned into vectors, they can be used in many ways. I am still experimenting with BERT, but I would like to try whether vectors can also be created with a model fine-tuned to a specific field. (Information is welcome!)

I usually work on natural language processing at this company. We also use technologies other than BERT, so please take a look if you are interested. → Ifocus Network Co., Ltd.
