[PYTHON] I made Word2Vec with PyTorch

This article is the day 11 entry of the Ibaraki University Advent Calendar 2019.

In this article I implement Word2Vec with PyTorch. When I looked into building Word2Vec, most of the articles I found used gensim and few implemented it in PyTorch, so I decided to write this post. Since many articles already explain Word2Vec in detail, I will only explain it briefly.

Skip-gram

Screen Shot 2019-12-01 at 14.25.48.png

Skip-gram maximizes the probability of the surrounding words $ \{w_{t-1}, w_{t+1}\} $ given the input word $ w_{t} $. Writing $ v_{w} $ for the input vector and $ {v'}_{w} $ for the output vector of word $ w $, the objective function to be minimized is as follows.

\begin{align}
E
&= -\log p(w_{t-1}, w_{t+1} \mid w_{t}) \\
&= -\log p(w_{t-1} \mid w_{t}) \, p(w_{t+1} \mid w_{t}) \\
&= -\log \prod_{i} \frac{\exp({v'}_{w_{i}}^{T} v_{w_{t}})}{\sum_{j} \exp({v'}_{w_{j}}^{T} v_{w_{t}})}
\end{align}

Here, the product in the numerator runs only over the words inside the window, but the sum in the denominator runs over every word in the vocabulary. Since computing that at every step is impractical, we approximate it with negative sampling.
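To make the cost of the denominator concrete, here is a minimal sketch of a full softmax over a vocabulary in plain Python. The vocabulary size, the random scores, and the function name `softmax_prob` are illustrative assumptions, not part of the implementation in this article.

```python
import math
import random


def softmax_prob(scores, target):
    """Probability of the word at index `target` under a softmax over `scores`."""
    # Subtract the largest score before exponentiating for numerical stability.
    max_score = max(scores)
    # The denominator needs one exponential per word in the vocabulary.
    denom = sum(math.exp(s - max_score) for s in scores)
    return math.exp(scores[target] - max_score) / denom


random.seed(0)
vocab_size = 10_000  # hypothetical vocabulary size
# Stand-ins for the inner products between the input word and every output word.
scores = [random.uniform(-1.0, 1.0) for _ in range(vocab_size)]

p = softmax_prob(scores, target=42)
print(f"p = {p:.6f}, computed from {vocab_size} exponentials")
```

Every training pair would repeat this sum over the whole vocabulary, which is the cost that negative sampling avoids by scoring only a handful of sampled words.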

Negative sampling code

Negative samples are drawn according to word frequency, as described in reference [1]: each word is sampled with probability proportional to its count raised to the power 0.75. The program looks like this:

import itertools
from collections import Counter

import numpy as np


def sample_negative(sample_size):
    # `corpus` is a list of tokenized sentences defined elsewhere in the script
    word2cnt = Counter(itertools.chain.from_iterable(corpus))
    words = np.array(list(word2cnt.keys()))

    # Sampling probability is proportional to count ** 0.75 (reference [1])
    weights = np.array(list(word2cnt.values()), dtype=np.float64) ** 0.75
    prob = weights / weights.sum()

    while True:
        # counts[i] is the number of times words[i] was drawn in this round
        counts = np.random.multinomial(sample_size, prob)
        yield np.repeat(words, counts).tolist()
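As a rough illustration of what the 0.75 exponent does, the sketch below compares the raw unigram distribution with the smoothed one on a toy corpus. The corpus and the helper name `unigram_power_probs` are made up for this example.

```python
from collections import Counter


def unigram_power_probs(corpus, power=0.75):
    """Return {word: probability} with probability proportional to count ** power."""
    counts = Counter(word for sentence in corpus for word in sentence)
    weights = {word: count ** power for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Hypothetical toy corpus: a list of tokenized sentences.
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

raw = unigram_power_probs(toy_corpus, power=1.0)        # plain relative frequency
smoothed = unigram_power_probs(toy_corpus, power=0.75)  # exponent used for negative sampling

for word in raw:
    print(f"{word:>4}: raw={raw[word]:.3f}  smoothed={smoothed[word]:.3f}")
```

Frequent words such as "the" lose some probability mass and rare words gain some, so rare words are chosen as negatives a little more often than their raw frequency would suggest.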

Creating a model

A word could be fed to the model as a one-hot vector, but its dimension would then be the size of the vocabulary. Instead, each word is converted to a word vector by an Embedding layer, with separate Embedding layers for the Encoder and the Decoder. The score is the inner product of two word vectors, passed through the log sigmoid function. The objective is as follows, where $ w_{I} $ is the input word, the first sum runs over the surrounding words $ w_{i} $, and the second sum runs over the negative samples $ w_{k} $.

\begin{align}
L = \sum_{i} \log \sigma({v'}_{w_{i}}^{T} v_{w_{I}}) + \sum_{k} \log \sigma(-{v'}_{w_{k}}^{T} v_{w_{I}})
\end{align}

import torch
import torch.nn as nn
import torch.nn.functional as F


class SkipGram(nn.Module):
    def __init__(self, V, H):
        # V: vocabulary size, H: dimension of the word vectors
        super().__init__()
        self.encode_embed = nn.Embedding(V, H)
        self.decode_embed = nn.Embedding(V, H)

        # Small uniform values for one table, zeros for the other
        nn.init.uniform_(self.encode_embed.weight, -0.5 / H, 0.5 / H)
        nn.init.zeros_(self.decode_embed.weight)

    def forward(self, contexts, center, neg_target):
        # Assumed shapes: contexts (B, C), center (B,), neg_target (K,)
        embed_ctx = self.encode_embed(contexts)    # (B, C, H)
        embed_center = self.decode_embed(center)   # (B, H)
        embed_neg = self.encode_embed(neg_target)  # (K, H)

        # Inner product
        ## Positive example: each surrounding word with its own center word
        score = torch.matmul(embed_ctx, embed_center.unsqueeze(2)).squeeze(2)  # (B, C)
        log_target = F.logsigmoid(score)

        ## Negative example: each negative word with every center word
        neg_score = -torch.matmul(embed_center, torch.t(embed_neg))  # (B, K)
        log_neg_target = F.logsigmoid(neg_score)

        # L is maximized, so return -L as the loss to minimize
        return -1 * (torch.mean(log_target) + torch.mean(log_neg_target))

It seems to be common to use separate Embedding layers for the Encoder and the Decoder. Because $ L $ is to be maximized, it is multiplied by -1 so that it can be minimized as a loss.
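The objective can also be checked by hand on toy vectors. The following is a minimal sketch in plain Python, not the PyTorch model itself: the vectors and the function name `negative_sampling_loss` are assumptions for illustration, and it sums the log sigmoid terms as in the formula, whereas the model code takes their mean.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(center, positives, negatives):
    """Return -L for one input vector, its surrounding words, and its negative samples."""
    pos_term = sum(log_sigmoid(dot(v, center)) for v in positives)
    neg_term = sum(log_sigmoid(-dot(v, center)) for v in negatives)
    return -(pos_term + neg_term)


center = [1.0, 0.0]  # hypothetical 2-dimensional input vector

# Surrounding word points the same way as the center, negative word the opposite way.
good = negative_sampling_loss(center, positives=[[2.0, 0.0]], negatives=[[-2.0, 0.0]])
# The reverse arrangement: the loss should be larger.
bad = negative_sampling_loss(center, positives=[[-2.0, 0.0]], negatives=[[2.0, 0.0]])

print(f"good arrangement: {good:.4f}")
print(f"bad arrangement:  {bad:.4f}")
```

The loss is small when surrounding words have a large inner product with the input word and negative samples have a small one, which is the behavior training pushes the embeddings toward.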

Result

Screen Shot 2019-12-10 at 17.02.41.png

Overall the accuracy is not good; the learning rate and its Scheduler still need to be set appropriately.

I haven't cleaned up the code yet, so I will publish the full code once it is organized.

References

[1] Distributed Representations of Words and Phrases and their Compositionality
[2] word2vec Parameter Learning Explained
