[PYTHON] I made a chatbot with Tensor2Tensor and this time it worked

Introduction

After my previous failure, I tried building a chatbot with a different approach. This time it worked, though the result isn't very exciting, since it mostly follows the official documentation.

My previous failed attempt is described here. The full code is available here.

How to build it

This time I decided to use Tensor2Tensor (t2t), provided by the Google Brain team. The appeal of t2t is that if you train on one of the datasets it already ships with, you can run everything from the command line without writing any code. Training on your own dataset is also quite easy: as the official documentation describes, all you need is a few lines of code and a properly formatted dataset.

This time I train and run inference using input_corpus.txt and output_corpus.txt, which I extracted last time from the Nagoya University Conversation Corpus. The execution environment is Google Colab.
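
Before running anything, it's worth a quick sanity check that the two corpus files are line-aligned, since each line of input_corpus.txt is paired with the same line of output_corpus.txt. A minimal sketch (the paths assume the files sit under Colab Notebooks, as in the rest of this article):

# Check that the input and output corpora pair up line by line.
inp = '/content/drive/My Drive/Colab Notebooks/input_corpus.txt'
out = '/content/drive/My Drive/Colab Notebooks/output_corpus.txt'

with open(inp) as f_in, open(out) as f_out:
    inputs = [line.strip() for line in f_in]
    outputs = [line.strip() for line in f_out]

print(len(inputs), len(outputs))  # the two counts should match
for src, tgt in zip(inputs[:3], outputs[:3]):
    print(repr(src), '->', repr(tgt))  # eyeball a few pairs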

I proceed following the official documentation [^1], this article [^2], this article [^3], and so on.

Dataset preparation

If you follow the reference pages above, you need the following two files.

For details on how to create them, see the reference pages; for now I'll just paste the code.

myproblem.py


from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class chat_bot(text_problems.Text2TextProblem):
    """Text-to-text problem pairing each input line with an output line."""

    @property
    def approx_vocab_size(self):
        # Target size of the generated subword vocabulary (8192 tokens).
        return 2**13

    @property
    def is_generate_per_split(self):
        # False: generate one stream of samples and let t2t split it
        # into train/eval according to dataset_splits below.
        return False

    @property
    def dataset_splits(self):
        # 9 shards for training, 1 shard for evaluation.
        return [{
            "split": problem.DatasetSplit.TRAIN,
            "shards": 9,
        }, {
            "split": problem.DatasetSplit.EVAL,
            "shards": 1,
        }]

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        filename_input = '/content/drive/My Drive/Colab Notebooks/input_corpus.txt'
        filename_output = '/content/drive/My Drive/Colab Notebooks/output_corpus.txt'

        # Yield one {inputs, targets} pair per line, skipping empty lines.
        with open(filename_input) as f_in, open(filename_output) as f_out:
            for src, tgt in zip(f_in, f_out):
                src = src.strip()
                tgt = tgt.strip()
                if not src or not tgt:
                    continue
                yield {'inputs': src, 'targets': tgt}

The changes from the official documentation are the class name and the relevant parts of the generate_samples function. Class names are conventionally written in CamelCase, but for some reason it only worked for me in snake_case. It's a bit of a mystery, since CamelCase should work.
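
For reference, the documented convention would look like the sketch below; the registry is supposed to expose a CamelCase class under its snake_case name, so the command-line flag would be the same either way (again, in my environment only the snake_case class name actually worked):

from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

# The documented convention: a CamelCase class name.
@registry.register_problem
class ChatBot(text_problems.Text2TextProblem):
    pass  # same properties and generate_samples as above

# The registry derives "chat_bot" from "ChatBot", so the flag is unchanged:
#   --problem=chat_bot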

__init__.py


from . import myproblem

For this file, just write the single line above and place it in the same directory as myproblem.py.

Data preprocessing

t2t also handles data preprocessing almost automatically, which is convenient. (The code below was converted from notebook format to .py format.)

ChatBot_with_t2t.py


"""# TensorFlow version 1.x"""

# Commented out IPython magic to ensure Python compatibility.
# %tensorflow_version 1.x

"""# Install the machine learning model (Transformer)"""

!pip install tensor2tensor

"""#Google Drive mount"""

from google.colab import drive
drive.mount('/content/drive')

"""#Change working directory"""

# Commented out IPython magic to ensure Python compatibility.
# %cd /content/drive/My Drive/Colab Notebooks

"""#Preprocessing of training data"""

!t2t-datagen \
  --data_dir=. \
  --tmp_dir=./t2t \
  --problem=chat_bot \
  --t2t_usr_dir=./t2t

This time, I put myproblem.py and __init__.py in the t2t directory one level below ChatBot_with_t2t.ipynb. I also put input_corpus.txt and output_corpus.txt in the same directory as the .ipynb file, but since the run generates more files, it may be better to keep them in a separate folder.
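
For clarity, the directory layout described above looks roughly like this:

Colab Notebooks/              <- data_dir (datagen writes its output here)
├── ChatBot_with_t2t.ipynb
├── input_corpus.txt
├── output_corpus.txt
└── t2t/                      <- t2t_usr_dir, tmp_dir, and later the training output_dir
    ├── __init__.py
    └── myproblem.py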

For the --problem= command-line option, specify the class name as defined in myproblem.py (a CamelCase class name is supposed to be converted to snake_case automatically, but that didn't work for me).
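
If datagen finishes successfully, data_dir should now contain the TFRecord shards and a subword vocabulary. A quick way to check (the file names below are t2t's defaults; the exact vocabulary file name may vary by version):

import glob

data_dir = '/content/drive/My Drive/Colab Notebooks'
# 9 training shards and 1 eval shard, matching dataset_splits in myproblem.py
print(glob.glob(data_dir + '/chat_bot-train-*'))
print(glob.glob(data_dir + '/chat_bot-dev-*'))
# Subword vocabulary of roughly 2**13 entries, per approx_vocab_size
print(glob.glob(data_dir + '/vocab.chat_bot.*'))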

Training

ChatBot_with_t2t.py


"""# Run training"""

!t2t-trainer \
  --data_dir=/content/drive/My\ Drive/Colab\ Notebooks \
  --problem=chat_bot \
  --model=transformer \
  --hparams_set=transformer_base_single_gpu \
  --output_dir=/content/drive/My\ Drive/Colab\ Notebooks/t2t \
  --t2t_usr_dir=/content/drive/My\ Drive/Colab\ Notebooks/t2t

In the preprocessing step I specified directories with relative paths, but of course absolute paths work as well. Since the path contains spaces, I escaped them with backslashes. As before, the model being trained is the Transformer. Training took about 3 to 4 hours, and to get around Colab's 90-minute limit I had the page reloaded automatically by a Chrome extension.
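
The trainer writes TensorFlow event files to output_dir, so you can watch the loss while training runs by using the TensorBoard magic in a separate Colab cell (a sketch; requires a reasonably recent tensorboard):

# Commented out IPython magic to ensure Python compatibility.
# %load_ext tensorboard
# %tensorboard --logdir "/content/drive/My Drive/Colab Notebooks/t2t"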

One caveat: training creates a lot of intermediate files. In my environment the Google Drive trash filled up and I had to empty it mid-run. The training files left behind after execution also take up a lot of space, so you may want to delete them whenever you no longer need them.
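
As a sketch, one way to reclaim space after training is to keep only the newest checkpoint; the model.ckpt-* pattern is TensorFlow's default naming, but double-check what is actually in the directory before deleting anything:

import glob, os, re

output_dir = '/content/drive/My Drive/Colab Notebooks/t2t'

# Collect step numbers from checkpoint files such as model.ckpt-250000.index
steps = set()
for f in glob.glob(output_dir + '/model.ckpt-*'):
    m = re.search(r'model\.ckpt-(\d+)', f)
    if m:
        steps.add(int(m.group(1)))

# Delete every checkpoint except the newest one to free Drive space
for step in sorted(steps)[:-1]:
    for f in glob.glob('%s/model.ckpt-%d.*' % (output_dir, step)):
        os.remove(f)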

Inference

ChatBot_with_t2t.py


"""# Inference"""

!t2t-decoder \
   --data_dir=./ \
   --problem=chat_bot \
   --model=transformer \
   --hparams_set=transformer_base_single_gpu \
   --output_dir=./t2t \
   --decode_hparams="beam_size=4,alpha=0.6" \
   --decode_interactive=true \
   --t2t_usr_dir=./t2t

The above command runs the decoder interactively.
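
Instead of answering one line at a time, t2t-decoder can also read a whole file of inputs with --decode_from_file and write the replies with --decode_to_file. A sketch with hypothetical file names (questions.txt holds one utterance per line):

!t2t-decoder \
   --data_dir=./ \
   --problem=chat_bot \
   --model=transformer \
   --hparams_set=transformer_base_single_gpu \
   --output_dir=./t2t \
   --decode_hparams="beam_size=4,alpha=0.6" \
   --decode_from_file=./questions.txt \
   --decode_to_file=./answers.txt \
   --t2t_usr_dir=./t2t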

Execution result

Once the decoder is running, you can talk to the bot. Below is an excerpt of the inputs and outputs.

Input: Hello
Output: What does that mean?

Input: No, you're saying good morning
Output: All right.

Input: What's okay(Lol)
Output: Yeah.

Input: Yeah
Output: Personality?

Input: That's the personality
Output: Well, that's right.

Input: I'm convinced
Output: Telephone cards, telephone cards, usually thank you, it's natural to give them.

Input: Suddenly rampage
Output: <Laughter> If you stab, you.

Input: Suddenly scared
Output: It's scary.

Input: I'll stop you
Output: What is it for the first time?

You can exit interactive mode by typing q at the end.

The responses are still somewhat unnatural and occasionally go off the rails, but overall it seems to work reasonably well. Since the dataset consists mostly of casual, fragmented conversation, producing well-formed sentences is probably difficult for the model.

Summary

Last time was a complete failure, but this time I was able to build a chatbot using t2t. It answers questions more than it holds a conversation, but I think it works to some extent.

t2t makes it easy to create a chatbot, and since it supports other machine learning tasks as well, you may be able to build what you want with little effort.

References

[^1]: Official documentation: how to create your own dataset
[^2]: Come on! Chatbot whywaita-kun!
[^3]: Japanese-English translation with tensor2tensor
