In this article, we'll look at how to quickly get distributed representations of words with fastText. I wrote it hoping it could tie in with the previous day's article.
fastText is a method for obtaining distributed representations of words (words expressed as numeric vectors) announced by Facebook. It builds on the familiar Word2Vec (CBOW / skip-gram), which is well known by now and probably needs no explanation.
Paper: Enriching Word Vectors with Subword Information
The difference between Word2Vec and fastText is how the vectors are built. By incorporating a mechanism called subwords, related words such as inflected forms end up close to each other.
In Word2Vec, go and goes were treated as completely different words. fastText takes the shared pieces into account, so go and goes end up sharing components. As a bonus, this also makes it robust against unknown words!
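To get a feel for why this works, here is a minimal sketch (not from the article) of how fastText-style character n-grams can be extracted. The n-gram range 3-6 and the boundary markers < and > follow the paper; the function name is my own.

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, in the style of the fastText paper."""
    token = f"<{word}>"
    return {token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)}

# 'go' and 'goes' share the subword '<go', so their vectors end up related
print(char_ngrams("go") & char_ngrams("goes"))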
For details, we recommend the paper and the following materials.
-Get distributed representations of words in a flash with Facebook's fastText - Qiita
-How the fast and much-talked-about fastText works - Qiita
The goal this time is to "quickly play with distributed representations of words using Python and fastText".
What, you can't even set up Python quickly? Then use Google Colaboratory (hereafter colab).
colab is the ultimate free environment: anyone with a Google account can easily use Python and a GPU.
A trained model is required to use fastText. You can collect data and train one yourself, but here we will use a published pre-trained model.
Trained model: The trained model of fastText has been released --Qiita
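For reference, if you would rather train a model yourself, gensim also ships a FastText implementation. Here is a minimal sketch; the tokenized sentences are placeholders and the argument names follow gensim 4.x.

from gensim.models import FastText

# Placeholder corpus: a few tokenized sentences
sentences = [['natural', 'language', 'processing'],
             ['distributed', 'representations', 'of', 'words']]

# Train a small fastText model (gensim 4.x argument names)
ft = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)
print(ft.wv.most_similar('language'))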
Just upload the model to Google Drive. While we're at it, let's unzip it from within Drive, along with a quick look at how to use colab!
To access files in Drive from colab, you need to mount it. You can do this by running the code below; pressing the "Mount Drive" button should generate the same code, so just run it.
from google.colab import drive
drive.mount('/content/drive')
After that, open the URL, log in to your account, and an authorization code will be shown; just copy and paste it into colab. Easy.
All you have to do is unzip the model!
%cd /content/drive/My Drive/data
!unzip vector_neologd.zip
For the path, specify the location on Drive where you uploaded the trained model.
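If you are unsure whether the path is right, a quick check like the following helps. The directory /content/drive/My Drive/data and the file name model.vec are assumptions based on the steps above.

import os

# Assumes the zip was extracted under /content/drive/My Drive/data
print(os.path.exists('/content/drive/My Drive/data/model.vec'))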
Let's look for similar words.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)
model.most_similar(positive=['Natural language processing'])
result.
[('Natural language understanding', 0.7600098848342896),
('Natural language', 0.7503659725189209),
('Computational linguistics', 0.7258570194244385),
('Automatic programming', 0.6848069429397583),
('Text mining', 0.6811494827270508),
('Computer language', 0.6618390083312988),
('Metaprogramming', 0.658093273639679),
('Web programming', 0.6488876342773438),
('Morphological analysis', 0.6479052901268005),
('Corpus linguistics', 0.6465639472007751)]
It's surprisingly fun to play around.
model.most_similar(positive=['friend'], negative=['friendship'])
[('acquaintance', 0.4586910605430603),
('home', 0.35488438606262207),
('acquaintance', 0.329221248626709),
('frequenting', 0.3212822675704956),
('Relatives', 0.31865695118904114),
('Acquaintance', 0.3158203959465027),
('home', 0.31503963470458984),
('Invitation', 0.302945077419281),
('Frequently', 0.30250048637390137),
('Colleague', 0.29792869091033936)]
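If you want to keep playing, gensim's KeyedVectors has a few more handy methods. A small sketch; the words passed in are just examples and must exist in the model's vocabulary.

# Cosine similarity between two in-vocabulary words
print(model.similarity('Natural language processing', 'Text mining'))

# Pick the word that does not belong with the others
print(model.doesnt_match(['Natural language processing', 'Text mining', 'friendship']))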
Since that worked so smoothly, let's next restrict the vocabulary and do the same thing. Consider this one more option for working with the vectors.
This time we are dealing with Japanese text. Unlike English, words are not separated by spaces, so we first need preprocessing (word segmentation, or wakati-gaki).
The implementation is based on the book "Learn while making! Deep learning by PyTorch" (https://github.com/YutaroOgawa/pytorch_advanced).
Here we use janome.
!pip install janome
Let's segment the text into words.
from janome.tokenizer import Tokenizer

j_t = Tokenizer()

def tokenizer_janome(text):
    return [tok for tok in j_t.tokenize(text, wakati=True)]
janome lets you segment text quickly, but let's also add some simple preprocessing.
import re
import unicodedata
import string

def format_text(text):
    # Unicode normalization (full-width alphanumerics, etc.)
    text = unicodedata.normalize("NFKC", text)
    # Remove ASCII punctuation and common full-width punctuation
    table = str.maketrans("", "", string.punctuation + " ,. ・")
    text = text.translate(table)
    return text

def preprocessing(text):
    """
    Preprocessing function
    """
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))  # Full-width -> half-width
    text = text.lower()  # Uppercase -> lowercase
    text = re.sub('\r', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\u3000', '', text)  # Full-width space
    text = re.sub(' ', '', text)  # Half-width space
    text = re.sub(r'[0-90-9]', '', text)  # Removal of numbers
    text = re.sub(r'[!-/:-@[-`{-~]', '', text)  # Removal of half-width symbols
    text = re.sub(r'[!-/:-@[-`{-~、-~ "" ・]', '', text)  # Removal of full-width symbols
    text = format_text(text)  # Removal of remaining punctuation
    return text
def tokenizer_with_preprocessing(text):
    text = preprocessing(text)
    text = tokenizer_janome(text)
    return text
As noted in the comments, we do some basic preprocessing here. Of course, what preprocessing is needed will change depending on the task and the purpose.
Let's try feeding it an arbitrary sentence.
text = 'This time, I want to use fastText to acquire distributed expressions! !! !!?'
print(tokenizer_with_preprocessing(text))
result.
['this time', 'Is', 'fasttext', 'To', 'Use', 'hand', 'Distributed', 'Expression', 'To', 'Acquired', 'Shi', 'Want']
Looks like it's working.
Next, let's use torchtext to make things easier. For more on torchtext, see "Easy and deep natural language processing with torchtext" - Qiita.
import torchtext

max_length = 25
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer_with_preprocessing,
                            use_vocab=True, lower=True, include_lengths=True,
                            batch_first=True, fix_length=max_length)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

train_ds, val_ds, test_ds = torchtext.data.TabularDataset.splits(
    path='./tsv/', train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('Text', TEXT), ('Label', LABEL)])
We have prepared train.tsv / val.tsv / test.tsv beforehand. These are tsv files made by splitting the "What is fastText" text from this article into three parts. The link above also covers this in detail, so I recommend it.
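If you want to reproduce this without the original text, a minimal sketch for generating dummy tsv files might look like the following. The sentences and labels are placeholders; each line is the text, a tab, then a label.

import os

os.makedirs('./tsv', exist_ok=True)

# Placeholder rows: (text, label)
samples = [('fastText is a method for obtaining distributed representations of words', '0'),
           ('It is based on the familiar Word2Vec', '1')]

for name in ['train.tsv', 'val.tsv', 'test.tsv']:
    with open('./tsv/' + name, 'w', encoding='utf-8') as f:
        for text, label in samples:
            f.write(text + '\t' + label + '\n')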
from torchtext.vocab import Vectors
vectors = Vectors(name='model.vec')
#Create a vectorized version of the vocabulary
TEXT.build_vocab(train_ds, vectors=vectors, min_freq=1)
#Check the vocabulary vector
print(TEXT.vocab.vectors.shape) #52 words are represented by a 300-dimensional vector
TEXT.vocab.vectors
#Check the order of words in the vocabulary
TEXT.vocab.stoi
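As an aside (not in the original article), these vocabulary vectors can be plugged straight into a PyTorch embedding layer, which is the usual next step when building a classifier on top.

import torch.nn as nn

# Initialize an Embedding layer with the pretrained vocabulary vectors
embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=True)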
Now that we're ready, let's calculate similarities between words in the vocabulary. Let's see how similar three words are to "word".
import torch.nn.functional as F
tensor_calc = TEXT.vocab.vectors[TEXT.vocab.stoi['word']]
#Cosine similarity
print("paper", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['paper']], dim=0))
print("word", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['word']], dim=0))
print("vector", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['vector']], dim=0))
result.
Paper tensor(0.3089)
Word tensor(0.3704)
Vector tensor(0.3265)
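As a follow-up, here is a small sketch (assuming the TEXT vocabulary built above) that ranks every word in the restricted vocabulary by cosine similarity to "word".

import torch.nn.functional as F

# Rank every word in the small vocabulary by cosine similarity to 'word'
query = TEXT.vocab.vectors[TEXT.vocab.stoi['word']]
sims = F.cosine_similarity(query.expand_as(TEXT.vocab.vectors), TEXT.vocab.vectors, dim=1)
values, indices = sims.topk(5)
for score, idx in zip(values.tolist(), indices.tolist()):
    print(TEXT.vocab.itos[idx], score)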
There are various articles about fastText, though not as many as for Word2Vec, so I will close by introducing the ones I especially recommend (references).
Postscript: I have introduced various articles before, but the recently posted article below is the one I recommend most. The title mentions a comparison with Watson, but it includes fastText code that runs on Google colab, and the explanation of the training-data format is careful, so I think it's well worth looking at both the article and the code.
-Comparison of document classification between fasttext and Watson Natural Language Classifier
-List of ready-to-use word embedding vectors --Qiita
If you want to learn about more recent word embeddings, googling ELMo or BERT will turn up plenty of material. Of course, the more recent the model, the harder it gets.
Thanks to @takashi1029!
-The first step of dependency search with GiNZA + Elasticsearch --Taste of Tech Topics