Motivation

Nice to meet you, my name is pyaNotty. This is my first post. I recently had the chance to work with MeCab and Keras, so I wanted to try my hand at natural language processing. I hear that distributed representations produced by Word2Vec are widely used in natural language processing, especially for text generation with LSTMs. In this post, I will use Word2Vec to build distributed word representations that can be fed to an LSTM model.

It turned out to be easy enough that even the author, who has roughly the intelligence of a cat, could manage it.

What is Word2Vec

A model for converting words into vectors.

When training an LSTM or a similar model on text, you cannot feed raw strings to the model, so each sentence has to be converted into some numerical representation. For example, the sentence "This is a pen" is first split into tokens such as ['this is', 'pen', 'is'] (a word-by-word rendering of the Japanese tokenization), and then each token is converted into a number.

The simplest way is to give each word a unique ID. Converting ['this is', 'pen', 'is'] → [1, 2, 3] is enough to get the data into the model for the time being. However, training will not go well in this form, because the IDs cannot express the relationships between words.
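
As a minimal sketch of the ID approach described above, the snippet below assigns IDs in order of first appearance. The variable names are chosen for illustration only, and the token list is the toy example from the text.

```python
# Assign each distinct token a unique integer ID, in order of first appearance.
tokens = ['this is', 'pen', 'is']

word_to_id = {}
for token in tokens:
    if token not in word_to_id:
        word_to_id[token] = len(word_to_id) + 1  # IDs start at 1, as in the text

ids = [word_to_id[token] for token in tokens]
print(ids)  # [1, 2, 3]

# The IDs are arbitrary labels: the numeric gap between two IDs
# says nothing about how related the two words are.
gap = abs(word_to_id['pen'] - word_to_id['is'])
print(gap)  # 1, but this number carries no meaning
```

Swapping the order of the tokens would change every ID without changing the language at all, which is one way to see why a model cannot learn word relationships from the IDs alone.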

Word2Vec is a method for obtaining a numerical representation that does capture the relationships between words. With it you can measure the distance between 'pen' and 'apple', or compute things like 'pen' + 'apple' = 'pineapple'. The author's own understanding is shaky, and if I started explaining this part the article would fill up with nothing else, so for details please refer to the book "ゼロから作るDeep Learning ―自然言語処理編" by 斎藤 康毅 (Amazon dp/4873118360).

Corpus creation

To create a Word2Vec model, you first need a corpus (roughly speaking, a list of words). A common approach is to build the corpus from a large amount of text collected from Wikipedia. For example, the Japanese entity vector is a publicly available model trained on a Wikipedia corpus.

This time, I would like to scrape text from 小説家になろう (Shōsetsuka ni Narō, "Let's Become a Novelist") and build the corpus from that. Since a dedicated API is open to the public, collecting the text is easy. Let's implement it right away. The implementation uses Python 3.7.

import datetime
import gzip
import io
import time

import pandas as pd
import requests

api_url = "https://api.syosetu.com/novelapi/api/"
df = pd.DataFrame()
endtime = int(datetime.datetime.now().timestamp())

interval = 3600 * 24  # Width of one time window: 24 hours, in seconds
cnt = len(df)
while cnt < 100000:
  # Get the titles and synopses of novels posted in the time range
  time_range = str(endtime - interval) + '-' + str(endtime)
  payload = {'out': 'json', 'gzip': 5, 'order': 'new', 'lim': 500,
             'lastup': time_range,
             'of': 't-ua-s'}
  res = requests.get(api_url, params=payload).content
  r = gzip.decompress(res).decode("utf-8")

  # The first record holds only the hit count, so drop it
  df_temp = pd.read_json(io.StringIO(r)).drop(0)
  df = pd.concat([df, df_temp])

  # Shift the time range back by one interval
  endtime -= interval
  cnt = len(df['title'].unique())
  time.sleep(2)

# Assumed step: save the result, since the next script reads ./data/titles.csv
df.to_csv('./data/titles.csv', index=False)
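
The loop above walks backwards through time in fixed-size windows and passes each window to the API as a 'start-end' string. The sketch below isolates just that windowing logic so it can be checked without any network access. The helper name `time_windows` and the example timestamp are hypothetical and are not part of the original script.

```python
def time_windows(end_time, interval, count):
    """Return `count` 'start-end' strings, stepping backwards from end_time."""
    windows = []
    for _ in range(count):
        windows.append(f"{end_time - interval}-{end_time}")
        end_time -= interval  # the next window ends where this one started
    return windows

# Three one-day windows ending at an arbitrary example Unix timestamp
windows = time_windows(end_time=1_600_000_000, interval=3600 * 24, count=3)
for w in windows:
    print(w)
```

Each window starts exactly where the next one ends, so consecutive requests cover the timeline without gaps.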

Working backwards from the current time, this collects the titles and synopses of the latest 100,000 novels. From the collected data, we will build the corpus from the synopses. First, save them to a txt file.

import pandas as pd

df = pd.read_csv('./data/titles.csv')

# Roughly concatenate all the synopses, skipping rows that have no synopsis
txt = ''.join(df['story'].dropna().astype(str))

with open('./data/story.txt', mode='w', encoding='utf-8') as f:
    f.write(txt)

Create the corpus from the saved txt file. Unnecessary symbols are removed, the text is tokenized with MeCab, and the result is written to a txt file. The code is based on this article.

import re

import MeCab

# Raw string so the backslash in the Windows dictionary path is kept literally
mecab = MeCab.Tagger(r'-Ochasen -d C:\mecab-ipadic-neologd')

class Corpus:
    def __init__(self, text):
        self.text = text
        self.formated = ""
        self.corpus = []

        self.format()
        self.split()

    def split(self):
        # Walk MeCab's linked list of nodes and collect each surface form
        node = mecab.parseToNode(str(self.formated))
        while node:
            PoS = node.feature.split(",")[0]

            # Skip the BOS/EOS marker nodes that MeCab adds at both ends
            if PoS != "BOS/EOS":
                self.corpus.append(node.surface)

            node = node.next

    def format(self):
        ret = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", self.text)  # URLs
        ret = re.sub(r"[0-9]+", "0", ret)  # Replace every number with 0
        ret = re.sub(r"[!-/]", "", ret)  # Half-width symbols
        ret = re.sub(r"[:-@]", "", ret)  # Half-width symbols
        ret = re.sub(r"[\[-`]", "", ret)  # Half-width symbols
        ret = re.sub(r"[{-~]", "", ret)  # Half-width symbols
        # Full-width symbols (this character list is a reconstruction; adjust as needed)
        ret = re.sub(r"[“”‘’（）＊［］…【】《》＝～「」『』＋、。・＿？！／◀▲：]", "", ret)
        ret = re.sub("[\n]", "", ret)  # Line breaks
        ret = re.sub("[\u3000]", "", ret)  # Full-width spaces

        self.formated = ret

    def rtn_corpus(self):
        ret = " ".join(self.corpus)
        return ret


input_file = "./data/story.txt"
output_file = "./data/corpus.txt"

with open(input_file, encoding='utf-8') as in_f:
    lines = in_f.readlines()
    with open(output_file, mode='w', encoding='utf-8') as out_f:
        for l in lines:
            text = Corpus(l).rtn_corpus()
            # Trailing space keeps the last token of one line apart from the first token of the next
            out_f.write(text + ' ')

You have successfully created a corpus.

`corpus.txt`


The prince's fiancé, Dalia, is despaired when he hears that the prince is on good terms with the common people, because his father, who plans to rule the royal power, tells him that you will definitely be a queen, but Dalia suddenly realizes that he has grown up. I wonder if it's okay to get engaged. If we weren't there, this country would be happier, and if it weren't for us, it would be scattered like a disgusting villain daughter in a novel. Please note that there are typographical errors and unclear sentences that have been reworked and changed frequently. Although the title is Reverse Harlem, the heroine does not have a reverse harem, but the villain daughter misunderstands it as a reverse harem. In addition, I wrote that not only the villain daughter but also the other characters were reincarnated in another world, but I changed the schedule and the villain daughter changed the setting of the reincarnated person. I am sorry for the frequent changes.\Chinatsu, who was transferred from a major real estate company to work for a subsidiary real estate management company, was depressed and went to work. It was a former banker's floating spirit Takamura Genki who was sitting next to me. Chinatsu, who was known to have a constitution in which ghosts can be seen by talking to her, is unfortunately selected as a ghost property manager.\And the ghost of a former banker who was possessed as a floating spirit in the office Genki and the cool handsome boss Harutaka will be in charge of the ghost property. Peeping Special Ability Activated Using that power, as they solve the ghosts' regrets, they realize the true cause of death of their energy. Such an occult love mystery Lloyd, who was a knight of a country, is a princess's guardian. One day, the two of them boarded a ghost to attend a ghost in a neighboring country, but the knight who had the knight fall from a cliff by the hands of an enemy nation called Kaburagi Shigure. 
Shigure Kaburagi, who became a high school student who had been reincarnated as a woman, met Rin Kiriyama, who was one step higher in both Bunbu and Rin, and the fate of the two began to move. He lost his mother when he was young and suffered from a serious heart disease. In the garden, I spend most of my life lonely, and I lost the meaning of living. One day, I met a girl named Amatsuka Nagisa. I was attracted to the beauty and kindness of Nagisa. The first time I fell in love with Nagisa, I repeated my days and solicited my feelings. Lotus regained the meaning of living. However, Nagisa gave me a confession that I would die soon. A ray of light that shines into the sun, like the sunlight, that was the sunlight, Mahal unnie, I want to be your knight, I'm the older Maiharu and the younger cousin, the love of the sun, and up to that. A sad love story A story of a girl with a peculiar power I will evacuate because the security has deteriorated. I will acquire skills to another world with mysterious coins and it will be quiet. Shiro Sakamoto, a detective of the Osaka Prefectural Police Department, is sentenced to a week's life by a doctor who had cancer. At that time, he wants to live in Japan.

As shown above, the tokenized words are saved to a txt file, separated by spaces.

Word2Vec model training

Finally, we will create the Word2Vec model. Using gensim's word2vec module is the convenient way to do it. Training really does finish in an instant.

from gensim.models import word2vec

# Load the corpus you just made
corpus = './data/corpus.txt'
sentences = word2vec.Text8Corpus(corpus)

# Train the model
model = word2vec.Word2Vec(sentences,  # Corpus to use
                          size=200,  # Dimensions of each word vector (named vector_size in gensim 4.0+)
                          sg=0  # 0 = CBOW, 1 = skip-gram; CBOW is used here
                          )

# Save the word vectors
model.wv.save('./narou.model')

I will briefly explain only the parameter that matters for building the model. The sg parameter selects the training algorithm: sg = 0 uses CBOW and sg = 1 uses skip-gram. CBOW infers a word from the words surrounding it. For the sentence "This is a pen", picture 'this is' and 'is' as the input and 'pen' as the output. skip-gram is the reverse of CBOW: it infers the surrounding words from a single word. In other words, the input layer and the output layer are simply swapped between CBOW and skip-gram.

(Figure: 機械学習殴り書き-20.jpg) Sorry for the rough figure. Please forgive me, I'll do anything.
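
To make the difference concrete, here is a small sketch that lists the training examples each method would see for the toy sentence. It is only a conceptual illustration with a fixed window of one word on each side; the function names are hypothetical, and gensim's actual training procedure differs in its details.

```python
def cbow_pairs(tokens, window=1):
    """(context words -> center word) examples, in the style of CBOW."""
    pairs = []
    for i, center in enumerate(tokens):
        # Up to `window` tokens on each side of the center word
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, center))
    return pairs

def skipgram_pairs(tokens, window=1):
    """(center word -> one context word) examples, in the style of skip-gram."""
    return [(center, c)
            for context, center in cbow_pairs(tokens, window)
            for c in context]

tokens = ['this is', 'pen', 'is']
cbow = cbow_pairs(tokens)
skipgram = skipgram_pairs(tokens)
print(cbow)
print(skipgram)
```

For the middle token, CBOW gets one example with both neighbours as input, while skip-gram gets two examples with 'pen' as input, which is the input/output swap described above.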

This is the end of model creation. Thank you for your hard work.

Playing with the model

Now that we have a model, let's play with it a little.

model.wv['Different world']

array([-2.11708   ,  0.48667097,  1.4323529 ,  1.2636857 ,  3.7746162 ,
        1.3120568 ,  2.2951639 , -0.8711858 ,  1.1539211 , -0.54808956,
        0.6777047 ,  0.21446693, -1.3346114 ,  3.0864553 ,  2.8941932 ,
        0.78770447,  1.4938581 , -1.7187694 , -0.58673733,  1.3345109 ,
       -0.5837457 ,  1.1400971 , -1.3413094 , -1.1784658 ,  0.5038208 ,
        0.2184668 ,  0.7903634 ,  0.99530613,  1.1820349 , -0.39339375,
        1.1770552 ,  1.1574073 ,  0.8442716 , -1.5331408 , -1.3503907 ,
       -0.22193083, -1.2109485 ,  3.1873496 ,  1.5198792 , -0.3475026 ,
        1.1639794 ,  2.1614919 ,  1.44486   ,  1.4375949 , -0.12329875,
        0.76681995,  1.0177598 ,  0.15669581,  1.1294595 ,  0.6686    ,
       -2.159141  ,  2.169207  , -0.00955578,  0.3961775 ,  0.839961  ,
        0.05453613, -0.4493284 ,  2.4686203 ,  0.35897058,  0.6430457 ,
       -0.7321106 , -0.06844574,  1.1651453 ,  1.440661  , -1.9773052 ,
       -1.0753456 , -1.3506272 ,  0.90463066, -1.5573175 ,  3.1350327 ,
        2.821969  ,  1.6074497 , -0.03897483,  0.84363884,  2.4653218 ,
        0.65267706,  0.22048295,  2.229037  ,  0.8114238 , -2.0834744 ,
        0.47891453, -1.1666266 , -0.5350998 ,  0.25257212,  2.3054895 ,
       -1.2035478 ,  2.7664409 , -2.121225  ,  1.3237966 , -0.40595815,
       -0.69292945, -0.39868835,  0.22690924,  0.3353806 , -1.3963023 ,
        0.48296794,  1.5792748 , -1.4290403 , -0.7156262 ,  2.1010907 ,
        0.4076586 , -0.47208166,  1.3889042 ,  0.9942379 , -0.3618385 ,
        0.10046659, -2.8085515 , -0.12091257,  1.33154   ,  1.196143  ,
       -1.3222407 , -2.2687335 , -0.74325466, -0.6354738 ,  1.2630842 ,
       -0.98507017, -1.5422399 ,  2.0910058 , -0.71927756,  0.3105838 ,
        1.4744896 , -0.84034425,  1.3462327 ,  0.08759955,  0.29124606,
       -1.9146007 ,  1.361217  ,  2.059756  , -0.73954767, -0.8559703 ,
        1.9385318 ,  0.44165856,  0.76255304,  0.26668853,  2.135404  ,
        0.37146965,  0.17825744,  0.73358685, -0.40393773, -0.58863884,
        2.9904902 ,  0.5401901 , -0.90699816, -0.03270415,  1.4531562 ,
       -2.6180272 ,  0.03460709, -1.028743  , -1.1348175 ,  0.9340523 ,
       -1.8640583 , -0.68235844,  1.8670527 ,  0.6017655 , -1.0030158 ,
       -1.7779472 ,  0.5410166 , -0.54911584,  1.4723094 , -1.229491  ,
        1.768442  ,  0.41630363, -2.417083  , -0.46536174,  0.26779643,
        0.6326522 ,  1.2000504 ,  1.1760272 , -0.17639238,  1.1781607 ,
       -3.0334888 ,  0.93554455,  0.52397215, -0.4301726 ,  1.3797791 ,
       -3.2156737 , -0.9460046 , -0.32353514, -0.27070895, -0.01405313,
        0.78362066, -0.41299725, -1.148895  ,  1.810671  , -1.0644491 ,
       -1.2899619 , -1.2057134 , -0.43731746, -0.5561588 ,  0.18522681,
       -0.86407244,  0.6044319 ,  0.3605701 ,  1.167799  , -1.2906225 ,
       -0.41644478,  1.3338335 , -0.2976896 ,  0.56920403,  2.6792917 ],
      dtype=float32)

In this way, a learned word vector can be retrieved as a NumPy array.

Since these are vector representations, you can naturally perform various operations between vectors. Let's find vectors similar to the 'Different world' vector. The similarity between word vectors is generally measured by cosine similarity. The cosine similarity between two word vectors $\boldsymbol{w}_1$ and $\boldsymbol{w}_2$ is given by:

$$
\mathrm{similarity}(\boldsymbol{w}_1, \boldsymbol{w}_2) = \frac{\boldsymbol{w}_1 \cdot \boldsymbol{w}_2}{|\boldsymbol{w}_1||\boldsymbol{w}_2|}
$$

Since this is $\cos\theta$ in Euclidean space, cosine similarity intuitively measures how closely two word vectors point in the same direction.
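
The formula can be checked by hand with a few two-dimensional vectors. The sketch below uses only the standard library; the function name `cosine_similarity` and the example vectors are chosen for illustration.

```python
import math

def cosine_similarity(w1, w2):
    """Dot product of w1 and w2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    return dot / (norm1 * norm2)

same = cosine_similarity([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # at right angles
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Note that the lengths of the vectors do not matter: [1, 0] and [2, 0] have a similarity of 1 because only the direction is compared.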

The most_similar method can be used to retrieve the top 10 words with the highest cosine similarity.

model.wv.most_similar('Different world')

[('Reincarnation in another world', 0.6330794095993042),
 ('Another world', 0.6327196359634399),
 ('Transfer to another world', 0.5990908145904541),
 ('real world', 0.5668200850486755),
 ('new world', 0.5559623837471008),
 ('Different', 0.5458788871765137),
 ('world', 0.5394454002380371),
 ('Modern Japan', 0.5360320210456848),
 ('Alien world', 0.5353666543960571),
 ('This world', 0.5082162618637085)]

The result was generally convincing.
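
Conceptually, a similarity search of this kind scores every other word against the query and keeps the best ones. The sketch below does this over a tiny hand-made vocabulary. The vectors and word names are invented for illustration, and this is not how gensim implements the method internally.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical 2-dimensional vectors, made up for this example
toy_vectors = {
    'world_a': [1.0, 0.0],
    'world_b': [0.9, 0.1],
    'pen': [0.0, 1.0],
    'apple': [-1.0, 0.2],
}
result = most_similar('world_a', toy_vectors, topn=2)
print(result)
```

Here 'world_b' ranks first because it points in almost the same direction as 'world_a', while 'apple' points nearly the opposite way and is left out of the top two.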

The words registered in the model can be retrieved as a list with index2word.

import copy

# index2word is the gensim 3.x attribute name (index_to_key in gensim 4.0+)
index2word = copy.copy(model.wv.index2word)
index2word

['of',
 'To',
 'To',
 'Is',
 'Ta',
 'hand',
 'But',
 'When',
 'so',
 'Shi',
 'Nana',
 'Also',
 'I',
 'Absent',
 'Re',
 'To do',
 'Masu',
 ...]

By default, the most frequently used words come first in the list. Since the list contains no duplicate words, its length equals the size of the model's vocabulary.
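
A frequency-ordered vocabulary like this can be imitated with `collections.Counter`. The sketch below is an assumption-laden illustration rather than gensim's own code: the toy corpus is invented, and the `min_count` cutoff mimics the gensim parameter of the same name (default 5), which drops rare words from the vocabulary.

```python
from collections import Counter

# Toy corpus of space-separated tokens (invented for this example)
corpus_tokens = 'a b a c b a d'.split()
counts = Counter(corpus_tokens)

min_count = 2  # keep only words that appear at least this many times
vocab = [word for word, count in counts.most_common() if count >= min_count]
print(vocab)  # ['a', 'b']
```

The most frequent word lands at index 0, and each word appears once, so the length of the list is the vocabulary size after the cutoff.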

len(index2word)

It seems that about 42,000 words are registered.

You can use the index of the list as the word ID.

index2word.index('Different world')

'Different world' coming in at 32nd place feels about right for this corpus.
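
Calling `list.index` scans the list from the start on every lookup, so when many words have to be converted it may be handier to build a dictionary once. The sketch below assumes a short stand-in list in place of the real `index2word`. The names `word2index` and `encode`, and the use of -1 for unknown words, are choices made for this example.

```python
# Stand-in for the real vocabulary list taken from the model
index2word = ['of', 'to', 'is', 'world']

# Build the reverse mapping once: word -> position in the list
word2index = {word: i for i, word in enumerate(index2word)}

def encode(tokens, word2index, unknown_id=-1):
    """Convert tokens to IDs, using unknown_id for words outside the vocabulary."""
    return [word2index.get(token, unknown_id) for token in tokens]

encoded = encode(['world', 'is', 'pen'], word2index)
print(encoded)  # [3, 2, -1]
```

How unknown words should be handled depends on the downstream model, so the -1 here is only a placeholder.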

Summary

That's all for creating a model with Word2Vec. Once the corpus was ready, building the model was easy, which I found really convenient. In the next post (if there is one), I'd like to generate sentences using the (provisional) vector model I made this time. See you then.

[PYTHON] An introduction to Word2Vec that even cats can understand