[PYTHON] Japanese preprocessing for machine learning


These are notes from building a simple neural network that uses text as training data, written to understand how machine-learning chatbots work.


The goal is to take a rule-based chatbot originally built for English text and make it work with Japanese text. The Japanese text is preprocessed so that it can be fed to the neural network. The training data comes from web-scraped support pages for Niantic's "Pokemon GO".

Niantic Support Page

CSV file used (GitHub)

Multi-class classification

Modeled on the "rule-based" type of chatbot, which returns a response prepared in advance according to the input, this article builds up to the multi-class classification step that identifies and predicts the "Intents".

Because the model predicts the related "Frequently Asked Questions (FAQ)" entry from the input, rather than generating a reply as a "generative" chatbot would, it is built from ordinary neural network layers instead of an "RNN".
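As a rough illustration of the rule-based part, the last step after classification is a plain lookup from the predicted intent to a prepared reply. The sketch below is only an assumption about how that lookup might look: the intent names "start guide" and "shop" appear later in this article, but the `RESPONSES` table, the reply strings, and the `reply_for` helper are hypothetical.

```python
# Hypothetical table of prepared replies, keyed by intent name.
# The reply texts are placeholders, not taken from the real support page.
RESPONSES = {
    "start guide": "See the FAQ section on getting started.",
    "shop": "See the FAQ section on the shop and items.",
}

FALLBACK = "Sorry, no matching FAQ was found."


def reply_for(intent):
    """Return the prepared reply for a predicted intent, or a fallback."""
    return RESPONSES.get(intent, FALLBACK)


print(reply_for("shop"))
```

The classifier only has to pick the intent; the wording of the reply stays fully under the author's control.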


The virtual environment was built without Jupyter Notebook, on macOS Mojave 10.14.6.

Reference page for MeCab installation


Reading training data and preprocessing Japanese text


import MeCab
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

def create_tokenizer():
    # Read the CSV file. Each row is expected to hold (label, text).
    # The file is assumed to be UTF-8 encoded.
    text_list = []
    with open("pgo_train_texts.csv", "r", encoding="utf-8") as csvfile:
        texts = csv.reader(csvfile)

        for text in texts:
            text_list.append(text)

    # Use MeCab to split the Japanese text into space-separated words.
    # The tagger is created once and reused for every row.
    wakati = MeCab.Tagger("-O wakati")
    wakati_list = []
    label_list = []
    for label, text in text_list:
        text = text.lower()
        text_wakati = wakati.parse(text).strip()
        wakati_list.append(text_wakati)
        label_list.append(label)

    # Find the number of words in the longest sentence.
    # Also build the lists of text data used by the tokenizer.
    max_len = -1
    split_list = []
    sentences = []
    for text in wakati_list:
        text = text.split()
        split_list.extend(text)
        sentences.append(text)

        if len(text) > max_len:
            max_len = len(text)
    print("Max length of texts: ", max_len)
    vocab_size = len(set(split_list))
    print("Vocabulary size: ", vocab_size)
    label_size = len(set(label_list))

    # Use Tokenizer to assign numbers to words, starting from index 1.
    # This also creates the word dictionary.
    tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token="<oov>")
    tokenizer.fit_on_texts(split_list)
    word_index = tokenizer.word_index
    print("Dictionary size: ", len(word_index))
    sequences = tokenizer.texts_to_sequences(sentences)

    # The label data for supervised learning is also numbered with a Tokenizer.
    label_tokenizer = tf.keras.preprocessing.text.Tokenizer()
    label_tokenizer.fit_on_texts(label_list)
    label_index = label_tokenizer.word_index
    label_sequences = label_tokenizer.texts_to_sequences(label_list)

    # The Tokenizer numbers from 1, while class labels are indexed from 0,
    # so subtract 1.
    label_seq = []
    for label in label_sequences:
        l = label[0] - 1
        label_seq.append(l)

    # to_categorical() turns the label data into one-hot vectors for the model.
    one_hot_y = tf.keras.utils.to_categorical(label_seq)

    # To give all training samples the same size, pad shorter texts with 0
    # up to the length of the longest text.
    padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
    print("padded sequences: ", padded)

    # Reverse lookup: label number (from 1) -> intent name.
    reverse_index = dict()
    for intent, i in label_index.items():
        reverse_index[i] = intent

    return padded, one_hot_y, word_index, reverse_index, tokenizer, max_len, vocab_size
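To make the numbering, padding, and one-hot steps above easier to follow, here is a minimal pure-Python sketch of the same ideas without TensorFlow or MeCab. It is not the Keras implementation: the helper names (`build_word_index`, `to_padded_sequences`, `one_hot`) and the toy sentences are hypothetical, and the sentences are assumed to be already segmented into words.

```python
from collections import Counter


def build_word_index(sentences, oov_token="<oov>"):
    """Map words to integers starting at 1; index 1 is reserved for
    out-of-vocabulary words and frequent words get smaller numbers."""
    counts = Counter(word for sentence in sentences for word in sentence)
    word_index = {oov_token: 1}
    for word, _ in counts.most_common():
        word_index[word] = len(word_index) + 1
    return word_index


def to_padded_sequences(sentences, word_index, max_len):
    """Replace words with their indices, then pad with 0 (or cut) at the end."""
    oov = 1
    padded = []
    for sentence in sentences:
        seq = [word_index.get(word, oov) for word in sentence][:max_len]
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded


def one_hot(label_ids, num_classes):
    """Turn 0-based label ids into one-hot rows."""
    return [[1.0 if i == label else 0.0 for i in range(num_classes)]
            for label in label_ids]


# Toy, already-segmented sentences (hypothetical; real ones come from MeCab).
sentences = [["ポケモン", "を", "捕まえる"],
             ["ポケモン", "の", "アイテム", "を", "買う"]]
word_index = build_word_index(sentences)
max_len = max(len(s) for s in sentences)
padded = to_padded_sequences(sentences, word_index, max_len)

print(padded)
print(one_hot([0, 1], num_classes=2))
```

Index 0 never appears in the dictionary, which is why it is free to serve as the padding value.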

Model creation using TensorFlow


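import tensorflow as tf

def model(training, label, vocab_size):
    model = tf.keras.models.Sequential([
        # Token indices run from 0 (padding) and 1 (<oov>) up to vocab_size + 1,
        # so input_dim is set to vocab_size + 2 to cover all of them.
        # input_length is kept from the original; newer Keras releases may
        # warn that it is deprecated.
        tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=16, input_length=len(training[0])),
        # Flatten the (max_len, 16) embedding matrix for the Dense layers.
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(30, activation="relu"),
        # One output unit per intent.
        tf.keras.layers.Dense(len(label[0]), activation="softmax")
    ])

    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(x=training, y=label, epochs=100)

    return model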

- First, an Embedding layer is used so that the relationships between words can be captured as vectors.
- Flatten() sits in between to flatten the embedding matrix so that it can be passed to the fully connected Dense layer.
- Using average pooling instead reduces the number of parameters of the network and the computational cost. Note that AveragePooling1D() keeps the sequence axis, so GlobalAveragePooling1D() is the variant that can replace Flatten() directly.
- The number of "Intents" to identify equals the number of label types, so the size of the output layer is set to the number of elements of the one-hot vector at index 0.
- The output layer uses the "softmax" activation function, which supports multi-class classification.
- When compiling the model, the loss is set to the one used for multi-class classification (categorical cross-entropy).
- "Adam" is used as the optimization algorithm.
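The saving from pooling can be checked with simple arithmetic on the first Dense layer. The sketch below assumes a hypothetical sentence length of 20 words and assumes that the pooling variant is global average pooling, which collapses the sequence axis; `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Hypothetical sizes, chosen only for illustration.
max_len = 20        # words per padded sentence (assumed value)
embedding_dim = 16  # output_dim of the Embedding layer above
hidden_units = 30   # units of the first Dense layer above

# Flatten() feeds every component of every word vector into the Dense layer.
flatten_params = dense_params(max_len * embedding_dim, hidden_units)

# Global average pooling averages over the sequence axis first, so the
# Dense layer only sees one vector of size embedding_dim.
pooled_params = dense_params(embedding_dim, hidden_units)

print("Dense parameters after Flatten:", flatten_params)
print("Dense parameters after global average pooling:", pooled_params)
```

The Embedding layer itself has the same number of parameters in both cases; only the layer that follows it shrinks.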

Creating an input screen


import MeCab
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Prepare the text received from the console so that the model can process it.
def prepro_wakati(text, tokenizer, max_len):
    sentence = []

    text = text.lower()
    wakati = MeCab.Tagger("-O wakati")
    text_wakati = wakati.parse(text).strip()
    # Pass a list of words, mirroring how the training sentences were tokenized.
    sentence.append(text_wakati.split())

    seq = tokenizer.texts_to_sequences(sentence)
    seq = list(seq)
    padded = pad_sequences(seq, maxlen=max_len, padding="post", truncating="post")

    return padded

def chat(model, tokenizer, label_index, max_len):
    print("Start talking with the bot (type quit to stop): ")
    while True:
        input_text = input("You: ")
        if input_text.lower() == "quit":
            break

        x = prepro_wakati(input_text, tokenizer, max_len)
        results = model.predict(x, batch_size=1)
        print("results: ", results)
        results_index = np.argmax(results)
        print("Predicted index: ", results_index)

        # label_index is keyed from 1, while argmax is 0-based, so add 1.
        intent = label_index[results_index + 1]

        print("Type of intent: ", intent)
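The index handling in chat() can be sketched in isolation. This is a simplified stand-in that works on a plain list holding one row of probabilities instead of the array returned by model.predict(); the `predicted_intent` helper and the three-entry `reverse_index` (including the "events" intent) are hypothetical.

```python
def predicted_intent(probabilities, reverse_index):
    """Pick the most probable class and look up its intent name.

    `reverse_index` is keyed from 1 (as produced by the label Tokenizer),
    while output positions are 0-based, hence the +1.
    """
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return reverse_index[best + 1], probabilities[best]


# Hypothetical three-intent example; the article's model has nine intents.
reverse_index = {1: "start guide", 2: "shop", 3: "events"}
intent, confidence = predicted_intent([0.1, 0.7, 0.2], reverse_index)
print(intent, confidence)
```

Returning the probability alongside the intent makes it easy to inspect how sure the model was about its choice.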

This is the console screen. The entered text is passed to the trained model, which predicts which of the nine "Intents" it belongs to.

- Types of Intents


Call and run the functions defined above.


import wakatigaki
import model
import chat

padded, one_hot_y, word_index, label_index, tokenizer, max_len, vocab_size = wakatigaki.create_tokenizer()

model = model.model(padded, one_hot_y, vocab_size)

chat.chat(model, tokenizer, label_index, max_len)

Training results

[Screenshot of the console output, 2020-04-29]

The numbers under "results:" are the predicted probabilities for each category.

For the input "how to catch Pokemon", the model predicted "start guide". For "poke coins and items", it predicted "shop". Both inputs were assigned to the appropriate category.
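The per-category probabilities come from the softmax output layer. As a minimal sketch of what softmax does, the function below converts arbitrary raw scores into positive values that sum to 1; the scores are hypothetical and this is a plain-Python illustration, not the TensorFlow implementation.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for three categories.
probs = softmax([2.0, 0.5, -1.0])
print(probs)
print(sum(probs))
```

The category with the largest raw score always receives the largest probability, which is why taking the argmax of the output gives the predicted intent.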
