[PYTHON] The story of where I stumbled when making a chatbot with Transformer

Introduction

This is the story of where I stumbled when I first tried machine learning for natural language processing. I will write down the process up to building the chatbot. At the time of posting, it was not working well, so this article can only serve as an example of what not to do. If you want to know how to do it well, please see here.

Author's specs

This article is written by that kind of apprentice engineer, an engineer on paper only. If you think it will not be very helpful, I recommend hitting the browser's back button.

Looking into machine learning for natural language processing

I had already worked with machine learning itself. However, since I had no experience in natural language processing, I decided to start by collecting information and knowledge.

The first thing that caught my attention was that Google's BERT is amazing. So I looked into its architecture and training mechanism, but I was left in a complete "?????????" state.

Anyway, BERT seems to be amazing. Then I decided to make something using it, and decided to make a chatbot.

I also considered using the later XLNet and ALBERT. However, none of them, BERT included, was something I could easily modify myself.

In particular, with the BERT repository on GitHub, which is unofficially provided by Google, tasks such as text classification seemed easy, but other tasks that it does not seem to anticipate looked like a high hurdle. Therefore, I searched for another approach.

Chapter 1 Section 2: I can't tell left from right, but let's make a chatbot for the time being

After looking into various things, I found people doing Japanese-English translation with a transformer and people making chatbots with a transformer (https://sekailab.com/wp/2019/03/27/transformer-general-responce-bot/).

So why not make a chatbot with a transformer? Having come up with that idea, I decided to put it into practice.

Click here for the materials used this time

Preparation

Let's move on to how to make it. This time everything is run on Google Colab, so the code is assumed to be in notebook format. Click here for the full code (https://github.com/NJIMAMTO/transformer-chat-bot/blob/master/transformer.ipynb)

First, install Keras Transformer.

!pip install keras-transformer

Next, install SentencePiece, which will be used as the tokenizer.

!pip install sentencepiece

Mount Google Drive here (how to do it)
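
For reference, mounting is usually a single call to the `google.colab` helper. The following is a minimal sketch, assuming the notebook runs on Colab and that `/content/drive` is the mount point (this matches the paths used later in this article). The `mount_drive` wrapper and the `MOUNT_POINT` constant are hypothetical names added for illustration.

```python
MOUNT_POINT = "/content/drive"  # assumed mount point, matching the paths used in this article


def mount_drive(mount_point=MOUNT_POINT):
    """Mount Google Drive when running on Colab; return None anywhere else."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # prompts for authorization on first use
    return mount_point


mounted = mount_drive()
print(mounted)
```

Outside Colab the import fails and the function simply returns None, so the cell can be left in a notebook that is also run locally.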

Next, download the corpus and shape it.

!git clone https://github.com/knok/make-meidai-dialogue.git

Change to the directory where the repository is located.

%cd "/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue"

Run the makefile.

!make all

Return to the original directory.

%cd "/content/drive/My Drive/Colab Notebooks"

Downloading the corpus and running the makefile generates sequence.txt. In this file, the conversational sentences are written in the format input: ~~~~ output: ~~~~, so we reshape them to make them easier to use later.

input_corpus = []
output_corpus = []
for_spm_corpus = []
with open('/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue/sequence.txt') as f:
  for s_line in f:

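    # Slice by the full prefix length so that the space after the colon is dropped too
    if s_line.startswith('input: '):
      input_corpus.append(s_line[len('input: '):])
      for_spm_corpus.append(s_line[len('input: '):])

    elif s_line.startswith('output: '):
      output_corpus.append(s_line[len('output: '):])
      for_spm_corpus.append(s_line[len('output: '):])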

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt', 'w') as f:
  f.writelines(input_corpus)

with open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt', 'w') as f:
  f.writelines(output_corpus)

with open('/content/drive/My Drive/Colab Notebooks/spm_corpus.txt', 'w') as f:
  f.writelines(for_spm_corpus)

The corpus is split into three files: a text for input to the transformer, a text for output, and a text with the input: and output: prefixes removed for training SentencePiece.
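
To make the split concrete, here is a minimal sketch that runs the same prefix handling on a few in-memory lines instead of sequence.txt. The sample sentences and the `split_corpus` helper are hypothetical and only illustrate the format described above.

```python
import io

# Hypothetical sample in the "input: ... / output: ..." format
sample = io.StringIO(
    "input: hello\n"
    "output: hi there\n"
    "input: how are you\n"
    "output: fine\n"
)


def split_corpus(lines):
    """Separate prompt lines and reply lines, dropping the prefixes."""
    inputs, outputs = [], []
    for line in lines:
        if line.startswith("input: "):
            inputs.append(line[len("input: "):])
        elif line.startswith("output: "):
            outputs.append(line[len("output: "):])
    return inputs, outputs


inputs, outputs = split_corpus(sample)
for prompt, reply in zip(inputs, outputs):
    print(repr(prompt), "->", repr(reply))
```

Each line keeps its trailing newline, which is why the article can write the lists back out with writelines without adding separators.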

Preprocessing

Next, use SentencePiece to split the conversation text into tokens. Let's train it using spm_corpus.txt.

import sentencepiece as spm

# train sentence piece
spm.SentencePieceTrainer.Train("--input=spm_corpus.txt --model_prefix=trained_model --vocab_size=8000 --bos_id=1 --eos_id=2 --pad_id=0 --unk_id=5")

Details are omitted because the method is described in the official SentencePiece repository.

Now, let's try splitting some sentences with SentencePiece.

sp = spm.SentencePieceProcessor()
sp.Load("trained_model.model")

#test
print(sp.EncodeAsPieces("Oh that's right"))
print(sp.EncodeAsPieces("I see"))
print(sp.EncodeAsPieces("So what do you mean by that?"))
print(sp.DecodeIds([0,1,2,3,4,5]))

This is the execution result.

['Oh oh', 'Such thing', 'Ne']
['I see', 'all right']
['▁', 'In other words', '、', 'That', 'you', 'of', 'say', 'Want', 'That is', 'Such thing', 'Is it', '?']
、。 ⁇ 

It turns out that it is divided as above.

The content from here is a partial modification of the content written in the README.md of Keras Transformer.

Now let's pad the corpus and shape it into a format suitable for the transformer.

import numpy as np

# Build the training data from the corpus
encoder_inputs_no_padding = []
encoder_inputs, decoder_inputs, decoder_outputs = [], [], []
max_token_size = 168  # sequences are padded to this many pieces (plus <s> and </s>)

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt') as input_file, open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt') as output_file:
  # Read line by line from the corpus
  input_lines = input_file.readlines()
  output_lines = output_file.readlines()

  for input_line, output_line in zip(input_lines, output_lines):
    if input_line or output_line:
      encode_tokens, decode_tokens = sp.EncodeAsPieces(input_line), sp.EncodeAsPieces(output_line)
      # Padding: each sequence becomes max_token_size + 2 tokens long
      # (assuming no sentence exceeds max_token_size pieces)
      encode_tokens = ['<s>'] + encode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(encode_tokens))
      # Decoder target: the reply without <s>, i.e. the decoder input shifted one step left
      output_tokens = decode_tokens + ['</s>', '<pad>'] + ['<pad>'] * (max_token_size - len(decode_tokens))
      decode_tokens = ['<s>'] + decode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(decode_tokens))

      # Convert pieces to IDs
      encode_tokens = [sp.piece_to_id(x) for x in encode_tokens]
      decode_tokens = [sp.piece_to_id(x) for x in decode_tokens]
      output_tokens = [[sp.piece_to_id(x)] for x in output_tokens]

      encoder_inputs_no_padding.append(input_line)
      encoder_inputs.append(encode_tokens)
      decoder_inputs.append(decode_tokens)
      decoder_outputs.append(output_tokens)
    else:
      break

#Convert for input to the training model
X = [np.asarray(encoder_inputs), np.asarray(decoder_inputs)]
Y = np.asarray(decoder_outputs)
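
The relationship between the three sequences is easier to see on a tiny example. The following is a minimal sketch in plain Python that works on integer IDs directly. The `build_example` helper and the sample IDs are hypothetical, and the special IDs are assumed to match the values passed to the SentencePiece trainer above (pad=0, bos=1, eos=2).

```python
PAD, BOS, EOS = 0, 1, 2  # assumed to match --pad_id, --bos_id, --eos_id above


def build_example(src_ids, tgt_ids, max_len):
    """Return (encoder_input, decoder_input, decoder_target), each max_len + 2 long."""
    if len(src_ids) > max_len or len(tgt_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    encoder_input = [BOS] + src_ids + [EOS] + [PAD] * (max_len - len(src_ids))
    decoder_input = [BOS] + tgt_ids + [EOS] + [PAD] * (max_len - len(tgt_ids))
    # The target is the decoder input shifted one step left,
    # so position t is trained to predict the token at position t + 1.
    decoder_target = tgt_ids + [EOS, PAD] + [PAD] * (max_len - len(tgt_ids))
    return encoder_input, decoder_input, decoder_target


enc, dec_in, dec_out = build_example([11, 12, 13], [21, 22], max_len=4)
print(enc)
print(dec_in)
print(dec_out)
```

Raising an error for over-long sentences is a design choice of this sketch: multiplying a list by a negative number silently yields an empty list, which would otherwise produce rows of unequal length.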

Training

Now let's train the transformer.

from keras_transformer import get_model

# Build the model
model = get_model(
    token_num=sp.GetPieceSize(),
    embed_dim=32,
    encoder_num=2,
    decoder_num=2,
    head_num=4,
    hidden_dim=128,
    dropout_rate=0.05,
    use_same_embed=True,  # Share one embedding between the encoder and the decoder
)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
model.summary()

# Train the model
model.fit(
    x=X,
    y=Y,
    epochs=10,
    batch_size=32,
)

This is the execution result.

Epoch 1/10
33361/33361 [==============================] - 68s 2ms/step - loss: 0.2818
Epoch 2/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2410
Epoch 3/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2331
Epoch 4/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2274
Epoch 5/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2230
Epoch 6/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2193
Epoch 7/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2163
Epoch 8/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2137
Epoch 9/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2114
Epoch 10/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2094

The loss doesn't look bad.

Inference

Let's run inference with the trained model.

from keras_transformer import decode

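input_text = "It's nice weather today, is not it"
# Note: the training inputs were wrapped in <s> ... </s>; the IDs here are passed without them
encode = sp.EncodeAsIds(input_text)

decoded = decode(
    model,
    encode,
    start_token=sp.bos_id(),
    end_token=sp.eos_id(),
    pad_token=sp.pad_id(),
    max_len=170
)

# Convert to plain Python ints before turning the IDs back into text
decoded = np.array(decoded, dtype=int).tolist()
print(sp.DecodeIds(decoded))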

This is the execution result.

Hey, but that, that, that, that, that, that, that, that, that, that, that

Hmm... the result is not good at all. This bot is hopeless at conversation.
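
To picture how a run of identical tokens can come out of a decoding loop, here is a minimal sketch of greedy (arg-max) decoding. It is not the keras_transformer implementation: `greedy_decode` and `toy_next_token_scores` are hypothetical names, and the toy scoring function stands in for a model that has collapsed onto a single frequent token.

```python
def toy_next_token_scores(encoder_ids, decoder_ids):
    """Hypothetical stand-in for a trained model: one score per token ID."""
    vocab_size = 6
    scores = [0.0] * vocab_size
    # This toy "model" ignores its inputs and always prefers token 4.
    scores[4] = 1.0
    return scores


def greedy_decode(encoder_ids, start_token, end_token, max_len, score_fn):
    """Append the highest-scoring token until end_token or max_len is reached."""
    decoder_ids = [start_token]
    while len(decoder_ids) < max_len:
        scores = score_fn(encoder_ids, decoder_ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        decoder_ids.append(next_id)
        if next_id == end_token:
            break
    return decoder_ids


result = greedy_decode([3, 5], start_token=1, end_token=2, max_len=8,
                       score_fn=toy_next_token_scores)
print(result)
```

Because the end token never gets the highest score, the loop stops only at the length limit and the same ID repeats throughout the output.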

Discussion so far

What was the cause of the failure?

Causes like these are the first that come to mind, I suppose.

To improve the performance of the model

So I tried tuning with Optuna, but it did not work, for the following reasons:

So I decided to try another method. The method that worked is here
