[Python] Let's quickly get distributed representations of words with fastText!

In this article, we'll look at how to **quickly** get distributed representations of words using fastText. I wrote it hoping it could tie in with the previous day's article.

What is fastText

fastText is a method for obtaining distributed representations of words (words expressed as numbers), announced by Facebook. It is based on the familiar Word2Vec (CBOW / skip-gram). Word2Vec is well known by now, so it probably needs no explanation.

Paper: Enriching Word Vectors with Subword Information

The difference between Word2Vec and fastText lies in how the vectors are built. By incorporating a mechanism called subwords, related words such as inflected forms are pulled close together.

In Word2Vec, go and goes were treated as completely different words. fastText, however, takes their shared subwords into account, so go and goes share components of their representations. Naturally, this also makes it more robust against unknown words!
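To get a feel for subwords: fastText represents each word as a bag of character n-grams (the paper uses lengths 3 to 6) plus the word itself. Here is a minimal sketch of extracting those n-grams; it is only an illustration of the idea, not fastText's actual implementation (which also hashes the n-grams).

def char_ngrams(word, n_min=3, n_max=6):
    # fastText wraps the word in boundary symbols before extracting n-grams
    word = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

print(char_ngrams("go"))    # {'<go', 'go>', '<go>'}
print(char_ngrams("goes"))  # {'<go', 'goe', 'oes', 'es>', '<goe', ...}
print(char_ngrams("go") & char_ngrams("goes"))  # shared subwords: {'<go'}

Because "go" and "goes" share subwords such as "<go", their vectors end up sharing components.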

For details, we recommend the paper and the following materials.

-Get distributed representations of words quickly with Facebook's fastText --Qiita
-How fastText, rumored to be fast, works --Qiita

Let's play with fastText

The goal this time is to "quickly play around with distributed representations of words using Python and fastText".

Environment

What, you can't even set up Python quickly? Then use Google Colaboratory (hereafter, Colab).

Colab is the ultimate free environment: anyone with a Google account can easily use Python and a GPU.

fastText, quickly

A trained model is required to use fastText. You could collect data and train one yourself (a quick sketch of that route follows below), but here we'll use a published pre-trained model.
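In case you do want to train your own model, gensim also provides a FastText implementation. A minimal sketch on a toy corpus is shown below; the sentences and parameter values are placeholders, and recent gensim versions use vector_size/epochs where older ones used size/iter.

from gensim.models import FastText

# Toy corpus: a list of tokenized sentences (placeholder data)
sentences = [
    ["i", "like", "natural", "language", "processing"],
    ["fasttext", "learns", "subword", "vectors"],
    ["word", "vectors", "capture", "meaning"],
]

# Train a small fastText model (vector_size/epochs in gensim 4.x; size/iter in 3.x)
ft_model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# Thanks to subwords, even an unseen word like "process" still gets a vector
print(ft_model.wv["process"][:5])
print(ft_model.wv.most_similar("vectors", topn=3))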

Trained model: The trained model of fastText has been released --Qiita

Just upload the model to Google Drive. While we're at it, let's unzip it from Drive, along with a quick look at how to use Colab!

To access files on Drive from Colab, you need to mount it. You can do that by running the code below; pressing the "Mount Drive" button inserts the same code, so just run it.

from google.colab import drive
drive.mount('/content/drive')

After that, open the URL, log in to your account, copy the authorization code that appears, and paste it into Colab. Easy.

All that's left is to unzip the model!

%cd /content/drive/My Drive/data
!unzip vector_neologd.zip

For the path, specify the location on Drive where you uploaded the trained model.
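If your working directory is still somewhere else, you can also pass the full Drive path directly when loading; the path below simply assumes the model was extracted into the data folder used above.

import gensim

# Assumed path: adjust to wherever you unzipped model.vec on your Drive
model = gensim.models.KeyedVectors.load_word2vec_format(
    '/content/drive/My Drive/data/model.vec', binary=False)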

Let's look for similar words.


import gensim

# Load the trained fastText vectors (word2vec text format)
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)

# Words most similar to "Natural language processing"
model.most_similar(positive=['Natural language processing'])

Result:

[('Natural language understanding', 0.7600098848342896),
 ('Natural language', 0.7503659725189209),
 ('Computational linguistics', 0.7258570194244385),
 ('Automatic programming', 0.6848069429397583),
 ('Text mining', 0.6811494827270508),
 ('Computer language', 0.6618390083312988),
 ('Metaprogramming', 0.658093273639679),
 ('Web programming', 0.6488876342773438),
 ('Morphological analysis', 0.6479052901268005),
 ('Corpus linguistics', 0.6465639472007751)]

It's surprisingly fun to play around with.


model.most_similar(positive=['friend'], negative=['friendship'])
[('acquaintance', 0.4586910605430603),
 ('home', 0.35488438606262207),
 ('acquaintance', 0.329221248626709),
 ('frequenting', 0.3212822675704956),
 ('Relatives', 0.31865695118904114),
 ('Acquaintance', 0.3158203959465027),
 ('home', 0.31503963470458984),
 ('Invitation', 0.302945077419281),
 ('Frequently', 0.30250048637390137),
 ('Colleague', 0.29792869091033936)]
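Word-vector arithmetic is another classic thing to try. The example below is only a sketch: the English words stand in for the Japanese terms, matching the translated output above, so with the actual Japanese model you would query the corresponding Japanese words, and the exact results depend on the model.

# Analogy-style query: "Tokyo" - "Japan" + "France" ≈ ?
model.most_similar(positive=['Tokyo', 'France'], negative=['Japan'], topn=5)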

fastText, not so quickly

That was almost too quick, so next let's limit the vocabulary and do the same thing. Think of this as an optional, alternative approach.

This time we're dealing with Japanese text. Unlike English, words are not separated by spaces, so we first need some preprocessing (word segmentation).

The implementation is based on the book "Learn while making! Deep learning by PyTorch" (https://github.com/YutaroOgawa/pytorch_advanced).

Here we use janome.

!pip install janome

Let's do the word segmentation.

from janome.tokenizer import Tokenizer

j_t = Tokenizer()

def tokenizer_janome(text):
    # Split the text into a list of surface forms (word segmentation)
    return [tok for tok in j_t.tokenize(text, wakati=True)]

janome lets you segment words easily enough, but let's also add some simple preprocessing.

import re
import unicodedata
import string


def format_text(text):
    text = unicodedata.normalize("NFKC", text)
    # Remove ASCII punctuation plus common Japanese punctuation marks
    table = str.maketrans("", "", string.punctuation + "「」、。・")
    text = text.translate(table)

    return text


def preprocessing(text):
    """
    Preprocessing function
    """
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))  # Full-width -> half-width

    text = text.lower()  # Uppercase -> lowercase

    text = re.sub('\r', '', text)
    text = re.sub('\n', '', text)
    text = re.sub(' ', '', text)   # Half-width spaces
    text = re.sub('　', '', text)  # Full-width spaces

    text = re.sub(r'[0-9０-９]', '', text)  # Removal of numbers
    text = re.sub(r'[!-/:-@[-`{-~]', '', text)  # Removal of half-width symbols
    text = re.sub(r'[、-〜「」・]', '', text)  # Removal of full-width symbols
    text = format_text(text)  # Removal of remaining punctuation

    return text


def tokenizer_with_preprocessing(text):
    text = preprocessing(text)
    text = tokenizer_janome(text)
    
    return text

As noted in the comments, this applies a few simple preprocessing steps. Of course, what you need to do in preprocessing depends on the task and purpose.

Let's try it on a suitable sentence.


text = 'This time, I want to use fastText to acquire distributed expressions! !! !!?'
print(tokenizer_with_preprocessing(text))

Result:

['this time', 'Is', 'fasttext', 'To', 'Use', 'hand', 'Distributed', 'Expression', 'To', 'Acquired', 'Shi', 'Want']

Sounds like it's working.

Next, let's use torchtext to make things easier. For more about torchtext, see "Easy and deep natural language processing with torchtext" --Qiita.


import torchtext

max_length = 25
# TEXT: tokenized with our preprocessing, padded/truncated to max_length
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer_with_preprocessing,
                            use_vocab=True, lower=True, include_lengths=True,
                            batch_first=True, fix_length=max_length)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

# Build datasets from tab-separated files with (Text, Label) columns
train_ds, val_ds, test_ds = torchtext.data.TabularDataset.splits(
    path='./tsv/', train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('Text', TEXT), ('Label', LABEL)])

I prepared train.tsv / val.tsv / test.tsv by splitting the text of the "What is fastText" section of this article into three tsv files. The link above explains this in detail, so I recommend it. A rough sketch of how such files could be created follows below.
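For reference, here is a minimal sketch of producing such tsv files; the sentences and the 0/1 labels are placeholders, not the actual split used in the article.

import csv
import os

os.makedirs('./tsv', exist_ok=True)

# Placeholder (text, label) pairs: replace with your own sentences and labels
rows = [
    ("fastText is a method for obtaining distributed representations of words", 0),
    ("It is based on the familiar Word2Vec", 0),
    ("Subwords pull inflected forms close together", 1),
]

# Write tab-separated files with no header, matching format='tsv' above
for name, subset in [('train.tsv', rows[:2]), ('val.tsv', rows[2:]), ('test.tsv', rows[2:])]:
    with open(f'./tsv/{name}', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerows(subset)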

from torchtext.vocab import Vectors

vectors = Vectors(name='model.vec')

#Create a vectorized version of the vocabulary
TEXT.build_vocab(train_ds, vectors=vectors, min_freq=1)

#Check the vocabulary vector
print(TEXT.vocab.vectors.shape)  #52 words are represented by a 300-dimensional vector
TEXT.vocab.vectors

#Check the order of words in the vocabulary
TEXT.vocab.stoi

Now that we're ready, let's compute similarities between words in the vocabulary. Let's see how similar three words are to "word".


import torch.nn.functional as F

tensor_calc = TEXT.vocab.vectors[TEXT.vocab.stoi['word']]

#Cosine similarity
print("paper", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['paper']], dim=0))
print("word", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['word']], dim=0))
print("vector", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['vector']], dim=0))

Result:

Paper tensor(0.3089)
Word tensor(0.3704)
Vector tensor(0.3265)

For those who want to do more

There are plenty of articles about fastText, though not as many as for Word2Vec, so I'll close by introducing the ones I particularly recommend (references).

Document classification with fastText

(Added later.) I had introduced various articles here before, but the recently posted article below is now the one I recommend most. The title is about a comparison with Watson, but it includes fastText code that runs on Google Colab. The explanation of the training data format is also careful, so I suggest reading both the article and the code.

-Comparison of document classification between fasttext and Watson Natural Language Classifier

Word embeddings other than fastText

-List of ready-to-use word embedding vectors --Qiita

If you want to learn about more recent word embeddings, googling ELMo or BERT will turn up plenty. Of course, the more recent the model, the harder it gets.

Next time?

Next up is @takashi1029!

-The first step of dependency search with GiNZA + Elasticsearch --Taste of Tech Topics
