Introduction

This is the story of how I stumbled when I first tried machine learning for natural language processing. I am writing down the process I followed to build something. At the time of posting it was not working well, so this article is more of a cautionary example than a guide. If you want to know how to do it well, please see here.

Author's specs

I did research using image processing and machine learning (a multilayer perceptron) while at university (more precisely, in an advanced course)
My machine learning knowledge is almost entirely self-taught (no classes on it; I mostly studied on my own during research)
I am looking for a job because I want to work in machine learning
No experience in natural language processing, and little knowledge of it

So this article is written by an apprentice engineer whose knowledge is mostly on paper. If you think it will not be very helpful, I recommend hitting the browser's back button.

Looking into machine learning for natural language processing

I had already worked with machine learning itself. However, since I had no experience in natural language processing, I decided to start by gathering information and knowledge.

The first thing I came across was that Google's BERT was said to be amazing. So I looked into its structure and training mechanism, but I was left completely baffled ("?????????").

Anyway, BERT seemed amazing. So I decided to build something with it, and settled on a chatbot.

I also considered using the later XLNet and ALBERT. However, none of them, BERT included, looked easy for me to modify.

In particular, the BERT repository that Google provides on GitHub made text classification look easy, but doing other tasks it did not seem to anticipate looked like a high hurdle. So I searched for another approach.

Chapter 1 Section 2: I can't tell left from right, but let's make a chatbot anyway

After looking into various things, I found people doing Japanese-English translation with a transformer and [people making chatbots with a transformer](https://sekailab.com/wp/2019/03/27/transformer-general-responce-bot/).

So why not make a chatbot with a transformer? With that idea in mind, I decided to put it into practice.

The materials used this time are as follows.

Japanese Natural Conversation Transcription Corpus (formerly Meidai Conversation Corpus)
Corpus Data Formatting Tool
sentencepiece
transformer wrapper for Keras (keras-transformer)

Preparation

Let's move on to how to build it. Everything is run on Google Colab, so a notebook format is assumed. The full code is here (https://github.com/NJIMAMTO/transformer-chat-bot/blob/master/transformer.ipynb)

First, install the transformer wrapper.

!pip install keras-transformer

Next, install SentencePiece, which is used as the tokenizer.

!pip install sentencepiece

Mount Google Drive at this point (how to do it)

Next, download the corpus and format it.

!git clone https://github.com/knok/make-meidai-dialogue.git

Change to the directory where the repository is located.

cd "/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue"

Run the Makefile.

!make all

Return to the original directory.

cd "/content/drive/My Drive/Colab Notebooks"

Downloading the corpus and running the Makefile generates sequence.txt. The conversations in this file are written in the format input: ~~~~ output: ~~~~, so we reformat them to make them easier to use later.

input_corpus = []
output_corpus = []
for_spm_corpus = []
with open('/content/drive/My Drive/Colab Notebooks/make-meidai-dialogue/sequence.txt') as f:
  for s_line in f:

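    # Strip the whole prefix, including the space after the colon
    if s_line.startswith('input: '):
      input_corpus.append(s_line[len('input: '):])
      for_spm_corpus.append(s_line[len('input: '):])

    elif s_line.startswith('output: '):
      output_corpus.append(s_line[len('output: '):])
      for_spm_corpus.append(s_line[len('output: '):])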

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt', 'w') as f:
  f.writelines(input_corpus)

with open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt', 'w') as f:
  f.writelines(output_corpus)

with open('/content/drive/My Drive/Colab Notebooks/spm_corpus.txt', 'w') as f:
  f.writelines(for_spm_corpus)

With the input: and output: prefixes removed, the corpus is split into a text file for the transformer's input, a text file for its output, and a combined text file for training SentencePiece.
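As a minimal sketch of this split, the following plain-Python snippet applies the same prefix logic to a small in-memory sample instead of sequence.txt. The sample lines and the helper name split_dialogue are assumptions for illustration; they are not part of the corpus or of the formatting tool.

```python
# Hypothetical sample in the "input: ... / output: ..." layout of sequence.txt
sample_lines = [
    "input: hello\n",
    "output: hi there\n",
    "input: how are you\n",
    "output: fine thanks\n",
]

INPUT_PREFIX = "input: "
OUTPUT_PREFIX = "output: "


def split_dialogue(lines):
    """Split prefixed lines into input sentences, output sentences, and all sentences."""
    inputs, outputs, combined = [], [], []
    for line in lines:
        if line.startswith(INPUT_PREFIX):
            text = line[len(INPUT_PREFIX):]
            inputs.append(text)
            combined.append(text)
        elif line.startswith(OUTPUT_PREFIX):
            text = line[len(OUTPUT_PREFIX):]
            outputs.append(text)
            combined.append(text)
    return inputs, outputs, combined


inputs, outputs, combined = split_dialogue(sample_lines)
print(inputs)         # sentences fed to the encoder
print(outputs)        # sentences the decoder should produce
print(len(combined))  # all sentences, used to train the tokenizer
```

Because the trailing newline is kept on every sentence, the lists can be written back out with writelines without further changes.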

Preprocessing

Next, tokenize the conversations with SentencePiece. Let's train it using spm_corpus.txt.

import sentencepiece as spm

# train sentence piece
spm.SentencePieceTrainer.Train("--input=spm_corpus.txt --model_prefix=trained_model --vocab_size=8000 --bos_id=1 --eos_id=2 --pad_id=0 --unk_id=5")

Details are omitted here because the usage is described in the official SentencePiece repository.

Now, let's try tokenizing a few sentences with SentencePiece.

sp = spm.SentencePieceProcessor()
sp.Load("trained_model.model")

#test
print(sp.EncodeAsPieces("Oh that's right"))
print(sp.EncodeAsPieces("I see"))
print(sp.EncodeAsPieces("So what do you mean by that?"))
print(sp.DecodeIds([0,1,2,3,4,5]))

This is the execution result.

['Oh oh', 'Such thing', 'Ne']
['I see', 'all right']
['▁', 'In other words', '、', 'That', 'you', 'of', 'say', 'Want', 'That is', 'Such thing', 'Is it', '?']
、。 ⁇

The sentences are split into pieces as shown above.

The content from here on is a partially modified version of what is written in the README.md of Keras Transformer.

Now let's pad the corpus and shape it into a format suitable for the transformer.

import numpy as np

# Build padded ID sequences for the encoder input, decoder input, and decoder output
encoder_inputs_no_padding = []
encoder_inputs, decoder_inputs, decoder_outputs = [], [], []
max_token_size = 168

with open('/content/drive/My Drive/Colab Notebooks/input_corpus.txt') as input_file, open('/content/drive/My Drive/Colab Notebooks/output_corpus.txt') as output_file:
  # Read line by line from the corpus
  input_lines = input_file.readlines()
  output_lines = output_file.readlines()

  for input_line, output_line in zip(input_lines, output_lines):
    encode_tokens, decode_tokens = sp.EncodeAsPieces(input_line), sp.EncodeAsPieces(output_line)
    # Add <s>/</s> and pad each sequence to max_token_size + 2 tokens.
    # target_tokens is the decoder input shifted left by one position.
    encode_tokens = ['<s>'] + encode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(encode_tokens))
    target_tokens = decode_tokens + ['</s>', '<pad>'] + ['<pad>'] * (max_token_size - len(decode_tokens))
    decode_tokens = ['<s>'] + decode_tokens + ['</s>'] + ['<pad>'] * (max_token_size - len(decode_tokens))

    # Convert pieces to IDs; each target ID is wrapped in a list for sparse_categorical_crossentropy
    encode_ids = [sp.piece_to_id(piece) for piece in encode_tokens]
    decode_ids = [sp.piece_to_id(piece) for piece in decode_tokens]
    target_ids = [[sp.piece_to_id(piece)] for piece in target_tokens]

    encoder_inputs_no_padding.append(input_line)
    encoder_inputs.append(encode_ids)
    decoder_inputs.append(decode_ids)
    decoder_outputs.append(target_ids)

#Convert for input to the training model
X = [np.asarray(encoder_inputs), np.asarray(decoder_inputs)]
Y = np.asarray(decoder_outputs)
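
To make the relationship between the three arrays easier to see, here is a small plain-Python sketch using made-up token IDs and a short maximum length. The special IDs (0 for padding, 1 for <s>, 2 for </s>) follow the options passed to the SentencePiece trainer earlier; the helper name build_example and the example IDs are assumptions for illustration only.

```python
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2


def build_example(source_ids, target_ids, max_len):
    """Return (encoder_input, decoder_input, decoder_output), each max_len + 2 long."""
    if len(source_ids) > max_len or len(target_ids) > max_len:
        raise ValueError("sequence longer than max_len")
    encoder_input = [BOS_ID] + source_ids + [EOS_ID] + [PAD_ID] * (max_len - len(source_ids))
    decoder_input = [BOS_ID] + target_ids + [EOS_ID] + [PAD_ID] * (max_len - len(target_ids))
    # The decoder output is the decoder input shifted left by one position,
    # so at every step the model is asked to predict the next token.
    decoder_output = target_ids + [EOS_ID, PAD_ID] + [PAD_ID] * (max_len - len(target_ids))
    return encoder_input, decoder_input, decoder_output


enc, dec_in, dec_out = build_example([10, 11, 12], [20, 21], max_len=4)
print(enc)      # [1, 10, 11, 12, 2, 0]
print(dec_in)   # [1, 20, 21, 2, 0, 0]
print(dec_out)  # [20, 21, 2, 0, 0, 0]
```

In the notebook code, each output ID is additionally wrapped in a one-element list, so Y ends up with a trailing dimension of 1.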

Training

Now let's train the transformer.

from keras_transformer import get_model

# Build the model
model = get_model(
    token_num=sp.GetPieceSize(),
    embed_dim=32,
    encoder_num=2,
    decoder_num=2,
    head_num=4,
    hidden_dim=128,
    dropout_rate=0.05,
    use_same_embed=True,  # Share one embedding between encoder and decoder (both sides are Japanese)
)

model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
)
model.summary()

# Train the model
model.fit(
    x=X,
    y=Y,
    epochs=10,
    batch_size=32,
)

This is the execution result.

Epoch 1/10
33361/33361 [==============================] - 68s 2ms/step - loss: 0.2818
Epoch 2/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2410
Epoch 3/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2331
Epoch 4/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2274
Epoch 5/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2230
Epoch 6/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2193
Epoch 7/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2163
Epoch 8/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2137
Epoch 9/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2114
Epoch 10/10
33361/33361 [==============================] - 66s 2ms/step - loss: 0.2094

The loss doesn't look bad.

Inference

Let's run inference with the trained model.

from keras_transformer import decode

input_text = "It's nice weather today, is not it"
encode = sp.EncodeAsIds(input_text)

decoded = decode(
    model,
    encode,
    start_token=sp.bos_id(),
    end_token=sp.eos_id(),
    pad_token=sp.pad_id(),
    max_len=170
)

decoded = np.array(decoded,dtype=int)
decoded = decoded.tolist()
print(sp.decode(decoded))

This is the execution result.

Hey, but that, that, that, that, that, that, that, that, that, that, that

Hmm... the result is not good at all. This bot is hopeless at conversation.
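
For readers wondering how a reply is produced token by token, the following is a conceptual sketch of greedy autoregressive decoding in plain Python. It is not the implementation of decode in keras_transformer; the stub next_token_scores stands in for a trained model, and every name in it is hypothetical. It also illustrates how a model that keeps preferring the same token produces a repetitive reply.

```python
BOS_ID, EOS_ID = 1, 2


def next_token_scores(source_ids, generated_ids):
    """Hypothetical stand-in for a model: returns one score per vocabulary ID."""
    scores = [0.0] * 6
    if len(generated_ids) < 4:
        scores[5] = 1.0       # keeps preferring token 5 for a while
    else:
        scores[EOS_ID] = 1.0  # then finally emits the end token
    return scores


def greedy_decode(source_ids, max_len):
    """Repeatedly append the highest-scoring token until EOS or max_len."""
    generated = [BOS_ID]
    while len(generated) < max_len:
        scores = next_token_scores(source_ids, generated)
        best_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(best_id)
        if best_id == EOS_ID:
            break
    return generated


print(greedy_decode([1, 10, 11, 2], max_len=10))  # [1, 5, 5, 5, 2]
```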

Discussion so far

What caused the failure?

Are the sentences too long to learn well?
An insufficient corpus (about 30,000 sentences each for inputs and responses this time)
Poor parameter settings

These are the first candidates that come to mind.
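
One more point may be worth checking, although whether it contributed to the poor result is not established here: the training code wraps every encoder input in <s> and </s>, while the inference code passes the raw output of sp.EncodeAsIds. A minimal sketch of wrapping inference IDs the same way follows. The helper name wrap_with_special_ids is hypothetical, and the default IDs 1 and 2 are taken from the --bos_id and --eos_id options used when training SentencePiece.

```python
def wrap_with_special_ids(ids, bos_id=1, eos_id=2):
    """Add the start and end IDs around a tokenized sentence, as done for training inputs."""
    return [bos_id] + list(ids) + [eos_id]


# Made-up piece IDs standing in for the output of sp.EncodeAsIds(...)
raw_ids = [57, 302, 18]
print(wrap_with_special_ids(raw_ids))  # [1, 57, 302, 18, 2]
```

With the real tokenizer, the two IDs would come from sp.bos_id() and sp.eos_id(), which the inference code already uses.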

To improve the performance of the model

So I tried tuning with Optuna, but it did not work out for the following reasons:

Tuning takes too long (I did not run it to the end, but I estimate it would take more than a day)
It is difficult to run in Colab for that reason (it hits the 12-hour limit)
Since Colab was not an option, I tried my own PC (no GPU available), but gave up because it took even longer
I then tried doing it in the cloud on AWS, but I was not permitted to create a GPU instance (although it was permitted later, while I was trying the method in the next article...)
In the first place, the partial results were not very good (almost no change in accuracy)

So I decided to try another method. The method that worked is here

[PYTHON] A story that stumbled when I made a chatbot with Transformer
