In this article, we'll look at how to quickly get distributed representations of words with fastText. I wrote it hoping it could tie in with the previous day's article.
fastText is a method for obtaining distributed representations of words (words expressed as numeric vectors) announced by Facebook. It builds on the familiar Word2Vec (CBOW / skip-gram), which is well known by now and probably needs no explanation.
Paper: Enriching Word Vectors with Subword Information
The difference between Word2Vec and fastText is how the vectors are built. By incorporating a mechanism called subwords, related words such as inflected forms end up close to each other.
In Word2Vec, go and goes were treated as completely different words. fastText takes the shared pieces into account, so go and goes end up sharing components. As a bonus, this also makes it robust against unknown words!
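To get a feel for why this works, here is a minimal sketch (not from the article) of how fastText-style character n-grams can be extracted. The n-gram range 3-6 and the boundary markers < and > follow the paper; the function name is my own.

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, in the style of the fastText paper."""
    token = f"<{word}>"
    return {token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)}

# 'go' and 'goes' share the subword '<go', so their vectors end up related
print(char_ngrams("go") & char_ngrams("goes"))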
For details, we recommend the paper and the following materials.
-Get distributed representations of words in a flash with Facebook's fastText - Qiita
-How the fast and much-talked-about fastText works - Qiita
The goal this time is to "quickly play with distributed representations of words using Python and fastText".
What, you can't even set up Python quickly? Then use Google Colaboratory (hereafter colab).
colab is the ultimate free environment: anyone with a Google account can easily use Python and a GPU.
A trained model is required to use fastText. You can collect data and train one yourself, but here we will use a published pre-trained model.
Trained model: The trained model of fastText has been released --Qiita
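For reference, if you would rather train a model yourself, gensim also ships a FastText implementation. Here is a minimal sketch; the tokenized sentences are placeholders and the argument names follow gensim 4.x.

from gensim.models import FastText

# Placeholder corpus: a few tokenized sentences
sentences = [['natural', 'language', 'processing'],
             ['distributed', 'representations', 'of', 'words']]

# Train a small fastText model (gensim 4.x argument names)
ft = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)
print(ft.wv.most_similar('language'))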
Just upload the model to Google Drive. While we're at it, let's unzip it from within Drive, along with a quick look at how to use colab!
To access files in Drive from colab, you need to mount it. You can do this by running the code below; pressing the "Mount Drive" button should generate the same code, so just run it.
from google.colab import drive
drive.mount('/content/drive')
After that, open the URL, log in to your account, and an authorization code will be shown; just copy and paste it into colab. Easy.
All you have to do is unzip the model!
%cd /content/drive/My Drive/data
!unzip vector_neologd.zip
For the path, specify the location on Drive where you uploaded the trained model.
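If you are unsure whether the path is right, a quick check like the following helps. The directory /content/drive/My Drive/data and the file name model.vec are assumptions based on the steps above.

import os

# Assumes the zip was extracted under /content/drive/My Drive/data
print(os.path.exists('/content/drive/My Drive/data/model.vec'))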
Let's look for similar words.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('model.vec', binary=False)
model.most_similar(positive=['Natural language processing'])
result.
[('Natural language understanding', 0.7600098848342896),
('Natural language', 0.7503659725189209),
('Computational linguistics', 0.7258570194244385),
('Automatic programming', 0.6848069429397583),
('Text mining', 0.6811494827270508),
('Computer language', 0.6618390083312988),
('Metaprogramming', 0.658093273639679),
('Web programming', 0.6488876342773438),
('Morphological analysis', 0.6479052901268005),
('Corpus linguistics', 0.6465639472007751)]
It's surprisingly fun to play around.
model.most_similar(positive=['friend'], negative=['friendship'])
[('acquaintance', 0.4586910605430603),
('home', 0.35488438606262207),
('acquaintance', 0.329221248626709),
('frequenting', 0.3212822675704956),
('Relatives', 0.31865695118904114),
('Acquaintance', 0.3158203959465027),
('home', 0.31503963470458984),
('Invitation', 0.302945077419281),
('Frequently', 0.30250048637390137),
('Colleague', 0.29792869091033936)]
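If you want to keep playing, gensim's KeyedVectors has a few more handy methods. A small sketch; the words passed in are just examples and must exist in the model's vocabulary.

# Cosine similarity between two in-vocabulary words
print(model.similarity('Natural language processing', 'Text mining'))

# Pick the word that does not belong with the others
print(model.doesnt_match(['Natural language processing', 'Text mining', 'friendship']))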
Since that worked so smoothly, let's next restrict the vocabulary and do the same thing. Consider this one more option for working with the vectors.
This time we are dealing with Japanese text. Unlike English, words are not separated by spaces, so we first need preprocessing (word segmentation, or wakati-gaki).
The implementation is based on the book "Learn while making! Deep learning by PyTorch" (https://github.com/YutaroOgawa/pytorch_advanced).
Here we use janome.
!pip install janome
Let's segment the text into words.
from janome.tokenizer import Tokenizer

j_t = Tokenizer()

def tokenizer_janome(text):
    return [tok for tok in j_t.tokenize(text, wakati=True)]
janome lets you segment text quickly, but let's also add some simple preprocessing.
import re
import unicodedata
import string

def format_text(text):
    # Unicode normalization (full-width alphanumerics, etc.)
    text = unicodedata.normalize("NFKC", text)
    # Remove ASCII punctuation and common full-width punctuation
    table = str.maketrans("", "", string.punctuation + " ,. ・")
    text = text.translate(table)
    return text

def preprocessing(text):
    """
    Preprocessing function
    """
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))  # Full-width -> half-width
    text = text.lower()  # Uppercase -> lowercase
    text = re.sub('\r', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\u3000', '', text)  # Full-width space
    text = re.sub(' ', '', text)  # Half-width space
    text = re.sub(r'[0-90-9]', '', text)  # Removal of numbers
    text = re.sub(r'[!-/:-@[-`{-~]', '', text)  # Removal of half-width symbols
    text = re.sub(r'[!-/:-@[-`{-~、-~ "" ・]', '', text)  # Removal of full-width symbols
    text = format_text(text)  # Removal of remaining punctuation
    return text
def tokenizer_with_preprocessing(text):
    text = preprocessing(text)
    text = tokenizer_janome(text)
    return text
As noted in the comments, we do some basic preprocessing here. Of course, what preprocessing is needed will change depending on the task and the purpose.
Let's try feeding it an arbitrary sentence.
text = 'This time, I want to use fastText to acquire distributed expressions! !! !!?'
print(tokenizer_with_preprocessing(text))
result.
['this time', 'Is', 'fasttext', 'To', 'Use', 'hand', 'Distributed', 'Expression', 'To', 'Acquired', 'Shi', 'Want']
Looks like it's working.
Next, let's use torchtext to make things easier. For more on torchtext, see "Easy and deep natural language processing with torchtext" - Qiita.
import torchtext

max_length = 25
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer_with_preprocessing,
                            use_vocab=True, lower=True, include_lengths=True,
                            batch_first=True, fix_length=max_length)
LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

train_ds, val_ds, test_ds = torchtext.data.TabularDataset.splits(
    path='./tsv/', train='train.tsv',
    validation='val.tsv', test='test.tsv', format='tsv',
    fields=[('Text', TEXT), ('Label', LABEL)])
We have prepared train.tsv / val.tsv / test.tsv beforehand. These are tsv files made by splitting the "What is fastText" text from this article into three parts. The link above also covers this in detail, so I recommend it.
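If you want to reproduce this without the original text, a minimal sketch for generating dummy tsv files might look like the following. The sentences and labels are placeholders; each line is the text, a tab, then a label.

import os

os.makedirs('./tsv', exist_ok=True)

# Placeholder rows: (text, label)
samples = [('fastText is a method for obtaining distributed representations of words', '0'),
           ('It is based on the familiar Word2Vec', '1')]

for name in ['train.tsv', 'val.tsv', 'test.tsv']:
    with open('./tsv/' + name, 'w', encoding='utf-8') as f:
        for text, label in samples:
            f.write(text + '\t' + label + '\n')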
from torchtext.vocab import Vectors
vectors = Vectors(name='model.vec')
#Create a vectorized version of the vocabulary
TEXT.build_vocab(train_ds, vectors=vectors, min_freq=1)
#Check the vocabulary vector
print(TEXT.vocab.vectors.shape) #52 words are represented by a 300-dimensional vector
TEXT.vocab.vectors
#Check the order of words in the vocabulary
TEXT.vocab.stoi
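As an aside (not in the original article), these vocabulary vectors can be plugged straight into a PyTorch embedding layer, which is the usual next step when building a classifier on top.

import torch.nn as nn

# Initialize an Embedding layer with the pretrained vocabulary vectors
embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors, freeze=True)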
Now that we're ready, let's calculate similarities between words in the vocabulary. Let's see how similar three words are to "word".
import torch.nn.functional as F
tensor_calc = TEXT.vocab.vectors[TEXT.vocab.stoi['word']]
#Cosine similarity
print("paper", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['paper']], dim=0))
print("word", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['word']], dim=0))
print("vector", F.cosine_similarity(tensor_calc, TEXT.vocab.vectors[TEXT.vocab.stoi['vector']], dim=0))
result.
Paper tensor(0.3089)
Word tensor(0.3704)
Vector tensor(0.3265)
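As a follow-up, here is a small sketch (assuming the TEXT vocabulary built above) that ranks every word in the restricted vocabulary by cosine similarity to "word".

import torch.nn.functional as F

# Rank every word in the small vocabulary by cosine similarity to 'word'
query = TEXT.vocab.vectors[TEXT.vocab.stoi['word']]
sims = F.cosine_similarity(query.expand_as(TEXT.vocab.vectors), TEXT.vocab.vectors, dim=1)
values, indices = sims.topk(5)
for score, idx in zip(values.tolist(), indices.tolist()):
    print(TEXT.vocab.itos[idx], score)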
There are various articles about fastText, though not as many as for Word2Vec, so I will close by introducing the ones I especially recommend (references).
Postscript: I have introduced various articles before, but the recently posted article below is the one I recommend most. The title mentions a comparison with Watson, but it includes fastText code that runs on Google colab, and the explanation of the training-data format is careful, so I think it's well worth looking at both the article and the code.
-Comparison of document classification between fasttext and Watson Natural Language Classifier
-List of ready-to-use word embedding vectors --Qiita
If you want to learn about more recent word embeddings, googling ELMo or BERT will turn up plenty of material. Of course, the more recent the model, the harder it gets.
Thanks to @takashi1029!
-The first step of dependency search with GiNZA + Elasticsearch --Taste of Tech Topics