This article is the day-13 entry of the NTT Communications Advent Calendar 2019. Yesterday's post was @nitky's "We are dealing with threat intelligence in an atmosphere."
Hello, this is Yuki Uchida from the NTT Communications SkyWay team.
In this article, I will try to visualize my own tweets using BERT, a language model for natural language processing that was recently applied to Google Search.
I will write it in a hands-on format so that as many people as possible can try it, so if you are interested, please follow along.
Google's search engine "the biggest leap in the last 5 years"
BERT is a natural language model announced by Google in 2018 that achieved breakthrough accuracy on many natural language tasks. The accuracy was so striking that it attracted a great deal of attention in the natural language processing community. (Incidentally, the natural language model Word2Vec, announced in 2013, also caused quite a stir, and that too was a paper from Google.) Many people have already explained BERT in detail, so I will just post some links.
- Running the general-purpose language representation model BERT in Japanese (PyTorch)
- Making sense of the inner workings of the general-purpose language representation model BERT
So what can you actually do with BERT? To get an intuitive feel for it, let's open the following site.
https://transformer.huggingface.co/
This site lets you try the `transformers` library (formerly `pytorch-pretrained-bert`) provided by Hugging Face online.
This time, let's select `gpt`, a language model like BERT, and try the demo.
When you select it, the screen switches to a Word-like editor where you can enter text, as shown below.
Type some text there and press the Tab key. Three candidate continuations are then displayed, and selecting one appends it to your text.
This demo is a trial of **sentence generation**.
**Sentence generation** is a technology that has long been studied in natural language processing. By using a **natural language model** to give a computer a good grasp of **textual information**, it has become possible to generate natural-sounding **sentences** like this.
If you keep generating sentences like this a few times, you can get quite interesting results.
> I have never met anyone who did not find it useful or useful for others. It was originally released as an open source project
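If you would like to try this kind of generation locally rather than on the demo site, here is a minimal sketch using the publicly available `gpt2` checkpoint via `transformers` (this assumes a reasonably recent version of the library and is not part of the demo itself):

```python
# Minimal sketch of local text generation with GPT-2 (assumes a recent transformers version)
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()

input_ids = tokenizer.encode("I have never met anyone who", return_tensors='pt')
with torch.no_grad():
    # Sample a continuation; top-k sampling keeps the output varied but still coherent
    output_ids = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```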
In this demo I chose the model called GPT, but its successor, GPT-2, was kept private for a while because OpenAI, its creator, was concerned about misuse. (It has since been published and can now be tried on the demo site.)
TechCrunch: OpenAI has built a very good text generator, but thinks it's too dangerous to release as is
In this way, **a better natural language model improves the accuracy of many natural language processing tasks.**
We will use Hugging Face's library, which was renamed from `pytorch-pretrained-bert` to `transformers`.
The demo site used earlier is the online demo of this `transformers` library.
Using this library, you can easily call BERT in combination with PyTorch (or TensorFlow).
https://github.com/huggingface/transformers
For this hands-on, we'll call BERT through `transformers` using PyTorch rather than TensorFlow.
If you haven't installed `pytorch` and `transformers` yet, install them with the following commands.
pip install torch torchvision
pip install transformers
This time, we will use a BERT model that supports Japanese.
If English is enough for you, no special preparation is required; you can load a pretrained model with the following code.
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
**The Japanese pretrained model takes a little time to set up, so before using it, let's first try BERT's English pretrained model, which can be loaded easily.**
First, import the required libraries. Create `test.py` and write the following.
import torch
from transformers import BertTokenizer, BertForMaskedLM
import numpy as np
Next, let's set a simple English sentence.
text = "How many lakes are there in Japan."
Now, let's split it into words using `BertTokenizer`, BERT's dedicated tokenizer.
Paste the code below.
Tokenization (word segmentation) is the process of deciding where word boundaries are so that a computer can handle the text.
(In English, you can basically split on spaces to get individual words.)
**(Note: enclose the beginning and end of the token sequence with [CLS] and [SEP].)**
test.py
##BertTokenizer was already imported when we imported the libraries above
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_text = tokenizer.tokenize(text)
tokenized_text.insert(0, "[CLS]")
tokenized_text.append("[SEP]")
# ['[CLS]', 'how', 'many', 'lakes', 'are', 'there', 'in', 'japan', '.', '[SEP]']
This allowed us to split each word.
**Next, choose the word you want to hide. BERT will then try to guess it.**
This time I'd like to hide **are**. Let's write the code that replaces this word with `[MASK]`.
test.py
masked_index = 4
tokenized_text[masked_index] = '[MASK]'
# ['[CLS]', 'how', 'many', 'lakes', '[MASK]', 'there', 'in', 'japan', '.', '[SEP]']
This replaces **are** with **[MASK]**, so the word at that position is now unknown.
Now let's give BERT this text and have it predict the word hidden behind **[MASK]**!
test.py
##Instead of passing the tokens to BERT as they are, convert them to IDs with the tokenizer's vocabulary
tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([tokens])
##Load BERT. This may take some time
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]
##Take out the prediction for the masked_index position and print the top 5 candidates
_, predict_indexes = torch.topk(predictions[0, masked_index], k=5)
predict_tokens = tokenizer.convert_ids_to_tokens(predict_indexes.tolist())
print(predict_tokens)
# ['are', 'were', 'lie', 'out', 'is']
As a result, the top 5 predictions for the hidden word were `['are', 'were', 'lie', 'out', 'is']`. As you can see, **BERT was able to guess the hidden word.** Amazing!
This time we used **BertForMaskedLM** to predict a hidden word, but other task-specific classes can be tried just as easily; see the sketch below for one example.
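For example, here is a rough sketch of trying `BertForNextSentencePrediction`, which judges whether one sentence plausibly follows another (a minimal example, assuming the same `bert-base-uncased` checkpoint; the second sentence is just an illustration):

```python
# Rough sketch: next-sentence prediction with the bert-base-uncased checkpoint
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
nsp_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
nsp_model.eval()

sentence_a = "How many lakes are there in Japan."
sentence_b = "Most of them were formed by volcanic activity."
ids = tokenizer.encode(sentence_a, sentence_b)  # [CLS] A [SEP] B [SEP] is added automatically
with torch.no_grad():
    logits = nsp_model(torch.tensor([ids]))[0]
# Index 0 is the score that sentence_b follows sentence_a, index 1 that it does not
print(torch.softmax(logits, dim=-1))
```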
Now, here comes the main part: **using BERT to visualize what kinds of tweets you post.**

First, prepare a BERT model that supports Japanese. Ideally I would pretrain BERT myself, but BERT takes a very long time to train, so instead we will use the pretrained models that the Kurohashi-Kawahara Laboratory at Kyoto University distributes on its website (the "Japanese Pretrained model" page). Visit that page and download the model.
It apparently takes about 30 days to train BERT, so I'm really grateful that a pretrained model is published like this for anyone to try...
> 30 epochs (with 1 GPU (GeForce GTX 1080 Ti), 1 epoch takes about 1 day, so pretraining takes about 30 days)
Unzip the downloaded file and place it alongside your Python file. (Create a `bert` folder as shown below and store the model there.)
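For reference, after unzipping, the layout is expected to look roughly like this (the inner file names are an assumption based on the standard `transformers` format, so check the archive you actually downloaded):

```
.
├── test.py
└── bert/
    └── Japanese_L-12_H-768_A-12_E-30_BPE_transformers/
        ├── config.json
        ├── pytorch_model.bin
        └── vocab.txt
```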
test.py
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM
import numpy as np
model = BertModel.from_pretrained('bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers')
bert_tokenizer = BertTokenizer("bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers/vocab.txt",
                               do_lower_case=False, do_basic_tokenize=False)
Now, let's do a quick check to make sure this BERT model can be used.
I'd like to reuse the code from before as it is, but this time we are dealing with Japanese rather than English, so we can't split the text with `BertTokenizer`. (We still use it as the vocabulary that maps words to IDs, which is why it is loaded above.)
Instead, let's use `Juman` to split the Japanese text into words.
Handling Juman from Python requires a pip install, so run the following command. (Note that `pyknp` is only the Python binding; the Juman++/JUMAN morphological analyzer itself also needs to be installed separately.)
pip install pyknp
Now that it's installed, let's use `Juman` to tokenize the text.
The target sentence is 「私は友達とサッカーをすることが好き」 ("I like playing soccer with my friends"), and we'll mask a word just as we did with the English example.
test.py
from pyknp import Juman
jumanpp = Juman()
text = "私は友達とサッカーをすることが好き"  # "I like playing soccer with my friends"
result = jumanpp.analysis(text)
tokenized_text = [mrph.midasi for mrph in result.mrph_list()]
# ['私', 'は', '友達', 'と', 'サッカー', 'を', 'する', 'こと', 'が', '好き']
From here, the prediction is made in the same way as the English version.
test.py
tokenized_text.insert(0, '[CLS]')
tokenized_text.append('[SEP]')
masked_index = 5
tokenized_text[masked_index] = '[MASK]'
print(tokenized_text)
# ['[CLS]', '私', 'は', '友達', 'と', '[MASK]', 'を', 'する', 'こと', 'が', '好き', '[SEP]']
tokens = bert_tokenizer.convert_tokens_to_ids(tokenized_text)
tokens_tensor = torch.tensor([tokens])
##For masked-word prediction we need the masked-LM head, not the plain BertModel loaded above
mlm_model = BertForMaskedLM.from_pretrained('bert/Japanese_L-12_H-768_A-12_E-30_BPE_transformers')
mlm_model.eval()
with torch.no_grad():
    outputs = mlm_model(tokens_tensor)
    predictions = outputs[0]
_, predict_indexes = torch.topk(predictions[0, masked_index], k=5)
predict_tokens = bert_tokenizer.convert_ids_to_tokens(predict_indexes.tolist())
print(predict_tokens)
# ['話', '仕事', 'キス', 'ゲーム', 'サッカー']  # "talk", "work", "kiss", "game", "soccer"
With the Japanese BERT, I masked サッカー ("soccer") in 「私は友達とサッカーをすることが好き」 and had the model predict it. The top 5 candidates were "talk", "work", "kiss", "game", and "soccer".
**Soccer wasn't number one, but you can feel that the other answers are plausible too.**
I was able to confirm that the Japanese version of BERT works here.
Now, let's prepare the tweets to visualize.
Go to the following page and click `Twitter data`.
https://twitter.com/settings/account
When the screen below appears, enter your password and press `Request Archive`.
When the archive is ready and you receive an email, the button changes to `Download Archive` and you can download it.
After the download completes, unzip it and check the contents. You should see something like the following: a lot of JavaScript files plus `tweet.js`.
Your entire tweet history is stored in `tweet.js`, which is a little awkward to handle directly, so let's convert it to CSV. You can convert `tweet.js` to CSV at the following site.
**However, as noted on that site, if you are uneasy about using the tool, please convert the file to CSV by some other method (a rough self-made sketch is shown a little further down).** Reference: "I made a tool, tweet.js loader, that reads tweet.js from Twitter data and displays your full tweet history"
Click the `CSV output` button to download the CSV.
Once the conversion finishes and you can download the CSV, you're done. (Let's name it `tweets.csv`.)
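If you would rather not rely on an external tool, a rough self-made conversion might look like the sketch below. This is only a sketch: it assumes `tweet.js` starts with a `window.YTD.tweet.part0 = [...]` assignment and that each entry has a `full_text` field; the archive format has changed over time, so check your own file.

```python
# Rough sketch: convert tweet.js to tweets.csv without an external tool.
# Assumes the file starts with "window.YTD.tweet.part0 = [ ... ]" and each entry
# has a full_text field -- check your own archive, the format has changed over time.
import json
import pandas as pd

with open('tweet.js', encoding='utf-8') as f:
    raw = f.read()
data = json.loads(raw[raw.index('['):])  # drop the JavaScript assignment before the JSON array

texts = []
for entry in data:
    tweet = entry.get('tweet', entry)        # some archives nest each tweet under a "tweet" key
    texts.append(tweet.get('full_text', ''))
pd.DataFrame({'text': texts}).to_csv('./tweets.csv', index=False)
```

The code later in this article reads a `text` column, so the sketch writes one.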
Now, let's use BERT to convert the tweets in this `tweets.csv` into sentence vectors. **A sentence vector is a vector representation of a sentence.** By feeding the sentence vectors produced by BERT into a visualization tool, we can see what kinds of things we tweet about.
test.py
import pandas as pd
import re
from pyknp import Juman

juman_tokenizer = Juman()  #Juman tokenizer passed to compute_vector below
tweets_df = pd.read_csv("./tweets.csv")
tweets_df["text"] = tweets_df["text"].astype(str)  #Make sure every tweet is a string
##Arrays to store the sentence vectors and the corresponding original tweets
vectors = []
tweets = []
for tweet in tweets_df["text"]:
    tweet = re.sub('\n', " ", tweet)  #Remove newline characters
    strip_tweet = re.sub(r'[︰-＠]', "", tweet)  #Remove full-width symbols
    try:
        if len(strip_tweet) > 3:  #Very short texts may not give a meaningful vector
            vector = compute_vector(
                strip_tweet, model, bert_tokenizer, juman_tokenizer)
            vectors.append(vector)
            tweets.append(tweet)
    except Exception as e:
        continue
##Save the results as tsv (the visualization tool expects tsv)
pd.DataFrame(tweets).to_csv('./tweets_text.tsv', index=False, header=None)
pd.DataFrame(vectors).to_csv('./tweets_vector.tsv', sep='\t', index=False, header=None)
The `compute_vector` function that appears here converts text into a sentence vector using the BERT model.
test.py
def compute_vector(text, model, bert_tokenizer, juman_tokenizer):
    use_model = model
    ##Split the text into words with Juman, then into subwords with the BERT vocabulary
    result = juman_tokenizer.analysis(text)
    tokens = [mrph.midasi for mrph in result.mrph_list()]
    bert_tokens = bert_tokenizer.tokenize(" ".join(tokens))
    ids = bert_tokenizer.convert_tokens_to_ids(
        ["[CLS]"] + bert_tokens[:126] + ["[SEP]"])  #Truncate so the sequence stays within 128 tokens
    tokens_tensor = torch.tensor(ids).reshape(1, -1)
    use_model.eval()
    with torch.no_grad():
        all_encoder_layers, _ = use_model(tokens_tensor)
    ##Use the hidden vector at the second-to-last position of the last layer as the sentence vector
    pooling_layer = -2
    embedding = all_encoder_layers[0][pooling_layer].numpy()
    # embedding = all_encoder_layers[0].numpy()
    # return np.mean(embedding, axis=0)
    return embedding
If you check the files saved by this process, you should find `tweets_vector.tsv`, which contains the sentence vectors as tab-separated values, and `tweets_text.tsv`, which contains the original tweets.
Now we're ready for visualization. Let's visualize these sentence vectors using Embedding Projector!
http://projector.tensorflow.org/
When you open it, you should see a Word2Vec word distribution like the one shown below.
This time we want to see the distribution of our own tweets, so let's use the `tweets_vector.tsv` and `tweets_text.tsv` generated earlier as the data source.
Press the Load button and select `tweets_vector.tsv` for the first field and `tweets_text.tsv` for the second.
**You should now be able to visualize your tweets.**
The sentence vectors produced by BERT are 768-dimensional, so they cannot be shown in three dimensions as-is; however, PCA (one dimensionality reduction method) is applied automatically, and the result below is displayed. Besides PCA, t-SNE and UMAP can also be selected for dimensionality reduction.
If you select a point, the sentences most similar to that sentence are displayed. Here, the top 10 most similar ones are shown.
**Sentences related to studying and to my thesis were picked as similar sentences! It seems the similarity between sentences is being computed reasonably well.**
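You can check this outside the browser as well. Below is a small sketch that approximates what the projector shows, finding the tweets most similar to a given tweet by cosine similarity over the two `.tsv` files generated above:

```python
# Sketch: top-10 most similar tweets by cosine similarity (assumes the two .tsv files above)
import numpy as np
import pandas as pd

vectors = pd.read_csv('./tweets_vector.tsv', sep='\t', header=None).values
texts = pd.read_csv('./tweets_text.tsv', header=None)[0].tolist()

query = 0  # index of the tweet to compare against
sims = vectors @ vectors[query] / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(vectors[query]))
for i in np.argsort(-sims)[1:11]:  # top 10, skipping the query tweet itself
    print(round(float(sims[i]), 3), texts[i])
```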
This is the distribution when projected down to two dimensions.
**Since the original 768 dimensions are forcibly squeezed into two or three, the clusters are not cleanly separated.** **With PCA, the cumulative explained variance was about 25% for the top 2 components, 30% for the top 3, and 50% for the top 10, which makes this an understandable result.**
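If you want to check similar numbers locally rather than in the browser, a small sketch with scikit-learn's PCA (scikit-learn is an extra dependency, not used elsewhere in this article) could look like this:

```python
# Sketch: cumulative explained variance of the top principal components of the tweet vectors
import pandas as pd
from sklearn.decomposition import PCA

vectors = pd.read_csv('./tweets_vector.tsv', sep='\t', header=None).values
pca = PCA(n_components=10).fit(vectors)
# The 2nd, 3rd and 10th cumulative values correspond to the percentages mentioned above
print(pca.explained_variance_ratio_.cumsum())
```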
This was a hands-on article in which I visualized my own tweets while getting a feel for BERT. Visualizing the distribution of your tweets lets you understand, to some extent, what kinds of things you tweet about. In my case, I was quite biased toward programming-related and negative tweets, and I realized, "so this is what I look like from an objective point of view...". I regret that I didn't have time to use SentencePiece, so I will write another article that uses SentencePiece.
Natural language processing often doesn't produce easy-to-understand results, but visualization can lead to new discoveries. If you found this article interesting, **please try it on your own tweets**.
Also, if you follow me on Twitter, you'll find me tweeting about natural language processing and recommendation systems, and posting articles.
Well then, this is the end of my article. Tomorrow is an article by @Mahito. Looking forward to it!