[PYTHON] I wrote the code for Japanese sentence generation with DeZero

1. Introduction

In April of this year, **"Deep Learning from Scratch 3: Framework Edition"** was released. I had read **volumes 1 and 2** of the **"from scratch" series** and learned a lot from them, so I decided to take on the framework edition this time.

So I bought the book right away, but before working through it step by step, I decided to write some code first to get a quick grasp of the **overall picture of the framework**.

While referring to **DeZero**'s **library** and **examples** on GitHub, I wrote some simple natural language processing code on Google Colab, so I'm leaving it here as a memorandum.

The **Google Colab code** I created is posted on **GitHub**, so if you like, follow **this link** (.ipynb) to open it.

2. Japanese dataset: the Neko class

When I tried to do natural language processing in Japanese, I thought it would be convenient to have something as easy to use as **MNIST** is for image processing, so I made something similar.

The **dataset class** we create downloads **"I Am a Cat"** from Aozora Bunko, deletes the unnecessary parts, tokenizes the text with **janome**, builds a dictionary and a corpus, and then produces the **time-series data** and the **next-word target data**.

As a preliminary step, install the **framework DeZero** with `!pip install dezero` and the **morphological analysis library janome** with `!pip install janome`.
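On Google Colab these can go in a single cell at the top of the notebook (assuming the PyPI package names `dezero` and `janome`, as used in the article):

!pip install dezero
!pip install janome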

The dataset class is named **Neko**. Following DeZero's conventions, it inherits from the **Dataset class**, the **preprocessing** is written in `def prepare()`, and the functions needed for that processing are written below it.

import numpy as np
import dezero
from dezero.datasets import Dataset
from dezero.utils import get_file, cache_dir
import zipfile
import re
from janome.tokenizer import Tokenizer

class Neko(Dataset):
    
    def prepare(self):
        url = 'https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip'
        file_path = get_file(url)  # download the zip file into cache_dir
        self.unzip(file_path)  # extract it into cache_dir
        self.text = self.preprocess(cache_dir + '/' + 'wagahaiwa_nekodearu.txt')  # read and clean the text
        self.wakati = self.keitaiso(self.text)  # split the text into words with janome
        self.corpus, self.word_to_id, self.id_to_word = self.process(self.wakati)  # build dictionaries and corpus
        self.data = np.array(self.corpus[:-1])   # time-series data
        self.label = np.array(self.corpus[1:])   # next-word targets (shifted by one)
    
    def unzip(self, file_path):
        with zipfile.ZipFile(file_path) as existing_zip:
            existing_zip.extractall(cache_dir)
            
    def preprocess(self, file_path):
        binarydata = open(file_path, 'rb').read()
        text = binarydata.decode('shift_jis')

        text = re.split(r'\-{5,}', text)[2]  # delete the header
        text = re.split('底本：', text)[0]  # delete the footer
        text = re.sub('｜', '', text)  # delete the ruby start marker ｜
        text = re.sub('［＃.+?］', '', text)  # delete the editor's notes
        text = re.sub(r'《.+?》', '', text)  # delete the ruby
        text = re.sub(r'\u3000', '', text)  # delete full-width spaces
        text = re.sub(r'\r\n', '', text)  # delete line breaks
        text = text[1:]  # delete the first character (adjustment)
        return text
 
    def keitaiso(self, text):
        t = Tokenizer()
        # wakati=True returns only the surface forms; list() in case newer janome returns a generator
        output = list(t.tokenize(text, wakati=True))
        return output
     
    def process(self, text):
        # word_to_id, id_to_word creation
        word_to_id, id_to_word = {}, {}
        for word in text:
            if word not in word_to_id:
                new_id = len(word_to_id)
                word_to_id[word] = new_id
                id_to_word[new_id] = word

        # create the corpus (each word replaced by its ID)
        corpus = np.array([word_to_id[w] for w in text])
        return corpus, word_to_id, id_to_word

The **constructor** (`def __init__()`) of the inherited **Dataset class** calls `self.prepare()`, so when the Neko class is **instantiated**, `def prepare()` is **executed**.

`def prepare()` uses `get_file(url)` from the DeZero library to download the file from the specified `url` and save it in `cache_dir`. On Google Colab, `cache_dir` is `/root/.dezero`.
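As a quick check, a sketch like the following shows where the file lands (the printed paths are what I would expect on Colab, not guaranteed):

from dezero.utils import get_file, cache_dir

path = get_file('https://www.aozora.gr.jp/cards/000148/files/789_ruby_5639.zip')
print(cache_dir)  # /root/.dezero on Colab (~/.dezero in general)
print(path)       # local path of the downloaded zip inside cache_dir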

After that, the four helper functions are called in sequence. Finally, following the framework's convention, the **corpus** is assigned to `self.data` (the time-series data) and `self.label` (the next-word target data).

The variables `text`, `wakati`, `corpus`, `word_to_id`, and `id_to_word` are each prefixed with `self.` so that they can be accessed as **attributes** once the Neko class is **instantiated**.

`def unzip()` is a function that unzips the downloaded **zip file**. `def preprocess()` reads the decompressed file and returns the text with **unnecessary parts such as ruby and line breaks** removed. `def keitaiso()` morphologically analyzes the text and returns the **word segmentation**. `def process()` creates the **dictionaries** and the **corpus** from the word segmentation.
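To make the dictionary and corpus construction in `def process()` concrete, here is a tiny standalone illustration with a made-up token list (not part of the article's code):

import numpy as np

tokens = ['吾輩', 'は', '猫', 'で', 'ある', '猫']  # hypothetical word segmentation
word_to_id, id_to_word = {}, {}
for w in tokens:
    if w not in word_to_id:
        new_id = len(word_to_id)  # IDs are assigned in order of first appearance
        word_to_id[w] = new_id
        id_to_word[new_id] = w
corpus = np.array([word_to_id[w] for w in tokens])
print(word_to_id)  # {'吾輩': 0, 'は': 1, '猫': 2, 'で': 3, 'ある': 4}
print(corpus)      # [0 1 2 3 4 2] -- the repeated word reuses its ID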

Let's actually run it.

3. Try running the Neko class

**Instantiate** the Neko class with `neko = Neko()` to download the file and **start the processing**. It takes a few tens of seconds to finish, because the janome tokenization takes some time. When it's done, let's use it right away.

You can display the **text** with `neko.text`, the **word segmentation** with `neko.wakati`, and the **corpus** with `neko.corpus`. The text is one continuous block of characters, the word segmentation is a word-by-word list, and the corpus is the segmentation with each word replaced by its ID (IDs are assigned in order of first appearance, with no duplicates).

Now let's look at the dictionaries. `neko.word_to_id[]` is the dictionary that **converts words to numbers**, and `neko.id_to_word[]` is the dictionary that **converts numbers to words**.

Next, the training data: you can see that `neko.data` and `neko.label` are shifted by one. Finally, looking at the length of the data and the number of words in the dictionary, the **data length** is 205,815 and the vocabulary size, **vocab_size**, is 13,616.
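A minimal usage sketch of the checks above (the exact output depends on the downloaded text; the dictionary lookups are just example keys):

neko = Neko()                      # download, clean, tokenize, build the corpus

print(neko.text[:30])              # the beginning of the cleaned text
print(neko.wakati[:10])            # the first 10 words of the segmentation
print(neko.corpus[:10])            # the first 10 word IDs
print(neko.id_to_word[0])          # ID -> word
print(neko.word_to_id['猫'])       # word -> ID ('猫' is just an example key)
print(neko.data[:5], neko.label[:5])         # shifted by one
print(len(neko.data), len(neko.word_to_id))  # data length and vocab_size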

Now, let's write the main code.

4. Main code

import numpy as np
import dezero
from dezero import Model
from dezero import SeqDataLoader
import dezero.functions as F
import dezero.layers as L
import random
from dezero import cuda 
import textwrap

max_epoch = 70
batch_size = 30
vocab_size = len(neko.word_to_id)  # 13,616
wordvec_size = 650
hidden_size = 650
bptt_length = 30  # truncation length for backpropagation through time

class Lstm_nlp(Model):
    def __init__(self, vocab_size, wordvec_size, hidden_size, out_size):
        super().__init__()
        self.embed = L.EmbedID(vocab_size, wordvec_size)
        self.rnn = L.LSTM(hidden_size)
        self.fc = L.Linear(out_size)

    def reset_state(self):  # reset the LSTM hidden state
        self.rnn.reset_state()

    def __call__(self, x):  # define how the layers are connected (forward pass)
        y = self.embed(x) 
        y = self.rnn(y)
        y = self.fc(y)
        return y

The model has a simple structure: **embedding layer + LSTM layer + linear layer**. The input to EmbedID is the word's number (an integer).

The size of the EmbedID word-embedding matrix is **vocab_size x wordvec_size**, i.e. 13,616 x 650. The LSTM's `hidden_size` is 650, the same as `wordvec_size`. And the output size of the Linear layer, `out_size`, is 13,616, the same as `vocab_size`.

The **connections between the layers** are described in `def __call__()`. What is written here can be invoked by calling the created instance like a function. For example, if you instantiate the model with `model = Lstm_nlp(....)`, you can run the `def __call__()` part with `y = model(x)`. In other words, this gives you the so-called predict for free, which is neat.
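As a quick sanity check of `__call__`, a toy sketch with made-up sizes (not the article's 650 / 13,616) should return one score per vocabulary word for each input ID:

toy = Lstm_nlp(vocab_size=100, wordvec_size=16, hidden_size=16, out_size=100)
x = np.array([1, 5, 12])   # a mini-batch of 3 word IDs
y = toy(x)                 # EmbedID -> LSTM -> Linear
print(y.shape)             # expected: (3, 100)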

model = Lstm_nlp(vocab_size, wordvec_size, hidden_size, vocab_size)  # create the model
dataloader = SeqDataLoader(neko, batch_size=batch_size)  # data loader for sequential data
seqlen = len(neko)
optimizer = dezero.optimizers.Adam().setup(model)  # the optimizer is Adam

# use the GPU if one is available
if dezero.cuda.gpu_enable:
    dataloader.to_gpu()  # send the data loader to the GPU
    model.to_gpu()  # send the model to the GPU

The data loader is `SeqDataLoader`, which is designed for time-series data. Since shuffling would destroy the order of the time series, it instead divides the series into segments at regular intervals and reads from those multiple positions in parallel to form each mini-batch.
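Conceptually, the parallel read positions look something like the sketch below (this is only an illustration of the idea; the real implementation lives in `dezero.dataloaders.SeqDataLoader`):

data_size = len(neko)                      # 205,815 (data, label) pairs
jump = data_size // batch_size             # spacing between the parallel read positions
offsets = [i * jump for i in range(batch_size)]
print(offsets[:5])  # each mini-batch takes one sample from each offset, then every offset advances by 1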

If a GPU is available, `dezero.cuda.gpu_enable` is `True`; in that case the data loader and the model are sent to the GPU.

#Learning loop
for epoch in range(max_epoch):
    model.reset_state()
    loss, count = 0, 0

    for x, t in dataloader:
        y = model(x)  #Forward propagation

        # y holds the scores for the next word (a vocab_size-dimensional vector); softmax is applied
        # and the loss is computed against the correct answer (a one-hot vector).
        # However, the input t is the index (an integer) at which the 1 of the one-hot vector stands.
        loss += F.softmax_cross_entropy_simple(y, t)
        count += 1

        if count % bptt_length == 0 or count == seqlen:
            model.cleargrads()  # clear the gradients
            loss.backward()  # backpropagation
            loss.unchain_backward()  # cut the computational graph (truncated BPTT)
            optimizer.update()  # update the weights
    avg_loss = float(loss.data) / count
    print('| epoch %d | loss %f' % (epoch + 1, avg_loss))

    # sentence generation
    model.reset_state()  # reset the LSTM state
    with dezero.no_grad():  # no gradients are needed during generation
        text = []
        x = random.randint(0, vocab_size - 1)  # pick the first word ID at random
        while len(text) < 100:  # repeat until 100 words have been generated
            x = np.array(int(x))
            y = model(x)  # y: scores for the next word (vocab_size-dimensional vector)
            p = F.softmax_simple(y, axis=0)  # softmax turns the scores into probabilities
            xp = cuda.get_array_module(p)  # xp = cp on the GPU, np on the CPU
            sampled = xp.random.choice(len(p.data), size=1, p=p.data)  # sample an index according to p
            word = neko.id_to_word[int(sampled)]  # convert the ID back to a word
            text.append(word)  # append the word to the generated text
            x = sampled  # the sampled word becomes the next input
        text = ''.join(text)
        print(textwrap.fill(text, 60))  # print with line breaks every 60 characters

This is the training loop. **Forward propagation** is done with `y = model(x)`, and the loss is computed with `loss += F.softmax_cross_entropy_simple(y, t)`.

Here, `y` is a **vector** (of dimension vocab_size) representing the **scores** of the next word; applying softmax to it gives the **appearance probabilities**, from which the loss against the **one-hot target data** is computed. However, the input `t` is the **index (integer)** at which the 1 of the one-hot vector stands.
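A hand-computed toy example of that loss for a single step (3-word vocabulary, made-up scores):

scores = np.array([2.0, 1.0, 0.1])             # hypothetical scores y over a 3-word vocabulary
probs = np.exp(scores) / np.exp(scores).sum()  # softmax -> appearance probabilities
t = 0                                          # index of the correct next word
loss = -np.log(probs[t])                       # the cross-entropy loss for this sample
print(loss)                                    # about 0.417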

When `count % bptt_length == 0 or count == seqlen` is satisfied, that is, when `count` is an integer multiple of `bptt_length` or the end of the sequence is reached, backpropagation is performed and the weights are updated (truncated BPTT).
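A toy illustration of when that condition fires (made-up lengths, not the article's 30 and 205,815):

bptt_len, seq_len = 3, 8
updates = [c for c in range(1, seq_len + 1) if c % bptt_len == 0 or c == seq_len]
print(updates)  # [3, 6, 8] -- the gradient step fires at these counts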

Next, 100 words are generated at the end of each epoch. First, the state is reset with `model.reset_state()`, and gradient computation is turned off with `with dezero.no_grad():` (the weights stay unchanged). Then `x = random.randint(0, vocab_size - 1)` picks the first word at random as an integer between 0 and vocab_size - 1, and the next word is predicted from it. Feeding each predicted word back in and predicting again, over and over, generates a sentence.

`p = F.softmax_simple(y, axis=0)` applies softmax to `y` to obtain the probability of each candidate next word, and `xp.random.choice()` picks a word at random according to those probabilities.

The reason `xp.random.choice()` starts with **xp** is that it needs to be **np** (NumPy) when running on the CPU and **cp** (CuPy) when running on the GPU. Therefore, `xp = cuda.get_array_module(p)` determines which one to use: `xp = np` on the CPU and `xp = cp` on the GPU.
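A small CPU-side sketch of that switch and the sampling step (NumPy stands in for `xp` here; with CuPy installed and a CuPy array as input, `get_array_module` would return `cupy` instead):

from dezero import cuda

probs = np.array([0.7, 0.2, 0.1])   # toy next-word probabilities
xp = cuda.get_array_module(probs)   # numpy for a NumPy array, cupy for a CuPy array
print(xp is np)                     # True on the CPU
sampled = xp.random.choice(len(probs), size=1, p=probs)
print(sampled)                      # an index drawn according to probs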

Now let's run the main code.

5. Try running the main code

When you run the main code, it learns the word order of "I Am a Cat" and generates a sentence after every epoch. Each epoch takes about 1 to 2 minutes. After it has trained for a while, the generated sentences start to look plausible, and it's fun to watch them become a little more natural with each epoch.

6. Summary

My impression after writing this code by imitating the examples is very positive: it is a **simple framework written entirely in Python**, so its internals are **easy to understand**, and it offers a lot of freedom and is easy to write in. I would like to take this opportunity to study the internals of the DeZero framework.
