Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2 Step 01 Memo "Let's Make a Dialogue Agent"

Contents

This is a memo for myself as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time, these are my notes on the key points of Chapter 2, Step 01.

Preparation

- Personal Mac: macOS Mojave version 10.14.6
- docker version: 19.03.2 (both Client and Server)

Chapter overview

Experience some of the elements of natural language processing programming by building a simple dialogue agent.

- Word segmentation
- Feature vectorization
- Classification
- Evaluation

Script execution method in docker environment

# Run in the directory containing the script you want to execute (dialogue_agent.py)
# The csv files required for execution are assumed to be in the same directory

# docker run -it -v $(pwd):/usr/src/app/ <docker image>:<tag> python <Execution script>
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python dialogue_agent.py

01.1 Dialogue agent system

What should be created is "a system that predicts the class to which the sentence belongs and outputs the class ID when the sentence is input". It is a problem setting called text classification.

# Pseudocode image of the dialogue agent system

dialogue_agent = DialogueAgent()
dialogue_agent.train(training_data)
predicted_class = dialogue_agent.predict('<input sentence>')

01.2 Word segmentation

Breaking a sentence down into words is called word segmentation. Languages that put spaces between words, such as English, do not need it. ** MeCab ** is widely used as software for segmenting Japanese text. Word segmentation that also attaches part-of-speech information is called ** morphological analysis **.

If you only want the surface forms, use parseToNode. (mecab-python3 versions older than 0.996.2 seem to have a bug where this does not work properly.)

import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('<Input statement>')

# node.surface of the first and last nodes (BOS/EOS) is an empty string
while node:
  print(node.surface)
  node = node.next

If you execute MeCab.Tagger() with the ** -Owakati ** argument, you can output only the segmentation result, with words separated by spaces (' '), just like running $ mecab -Owakati on the command line. However, when a word that itself contains a half-width space appears, the delimiter cannot be distinguished from the space inside the word, so the output cannot be split back correctly. ** This implementation should be avoided. ** (Some dictionaries contain words that include spaces.)
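The ambiguity can be sketched without MeCab at all: the token containing a space below is a hypothetical stand-in for a dictionary entry that includes a half-width space.

```python
# Why space-delimited (-Owakati-style) output is lossy: if a token itself
# contains a half-width space, joining with spaces and splitting again
# does not recover the original tokens.
tokens = ['New York', 'is', 'big']   # hypothetical token containing a space
wakati_output = ' '.join(tokens)     # what space-delimited output looks like
recovered = wakati_output.split(' ')
print(recovered)  # ['New', 'York', 'is', 'big'] - 3 tokens became 4
```

The word boundary information is irreversibly lost, which is why parseToNode is preferred over parsing the -Owakati string.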

01.3 Feature vectorization

By representing one sentence as one fixed-length vector, it becomes a format that computers can process.

Bag of Words

  1. Assign an index to a word
  2. Count the number of times a word appears for each sentence
  3. Arrange the number of appearances of each word for each sentence

As shown below, each sentence is represented by a fixed-length vector (here, of length 10).

Bag of Words example


#I like you i like you i like me
bow0 = [2, 1, 0, 2, 1, 2, 1, 1, 1, 1]

#I like ramen
bow1 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
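The three steps above can be sketched in plain Python. This is a minimal sketch: the toy corpus and the whitespace tokenizer are stand-ins for the book's MeCab-based tokenize.

```python
from collections import Counter

# Stand-in corpus and tokenizer (the book tokenizes Japanese with MeCab;
# str.split is an assumption for illustration)
texts = ['i like you', 'i like ramen']
tokenized_texts = [text.split() for text in texts]

# Step 1: assign an index to each word
vocabulary = {}
for tokens in tokenized_texts:
    for token in tokens:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

# Steps 2-3: count occurrences per sentence and arrange them by word index
bow = [[0] * len(vocabulary) for _ in tokenized_texts]
for i, tokens in enumerate(tokenized_texts):
    for token, count in Counter(tokens).items():
        bow[i][vocabulary[token]] = count

print(bow)  # [[1, 1, 1, 0], [1, 1, 0, 1]]
```

Each row is one sentence; each column is the count of one vocabulary word, so all rows have the same fixed length.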

Implementation of Bag of Words

Code details omitted. Check the results of the list comprehensions.

test_bag_of_words.py


from tokenizer import tokenize
import pprint

texts = [
    'I like you i like i',
    'I like ramen',
    'Mt. Fuji is the highest mountain in Japan'
]

tokenized_texts = [tokenize(text) for text in texts]
pprint.pprint(tokenized_texts)

bow = [[0] * 14 for i in range(len(tokenized_texts))]
pprint.pprint(bow)

Execution result


$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words.py
[['I', 'Is', 'I', 'But', 'Like', 'Nana', 'あなた', 'But', 'Like', 'is'],
 ['I', 'Is', 'ramen', 'But', 'Like', 'is'],
 ['Fuji Mountain', 'Is', 'No. 1 in Japan', 'high', 'Mountain', 'is']]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]

Column: Using collections.Counter

The Bag of Words can be implemented more simply by using the Counter class from the collections module in the Python standard library.

Check the intermediate results along the way. The final vectorization result is the same as in "Implementation of Bag of Words" above.

test_bag_of_words_counter_ver.py


from collections import Counter
from tokenizer import tokenize

import pprint

texts = [
    'I like you i like you i like me',
    'I like ramen',
    'Mt. Fuji is the highest mountain in Japan'
]

tokenized_texts = [tokenize(text) for text in texts]

print('# Counter(..)')
print(Counter(tokenized_texts[0]))

counts = [Counter(tokenized_text) for tokenized_text in tokenized_texts]
print('# [Counter(..) for .. in ..]')
pprint.pprint(counts)

sum_counts = sum(counts, Counter())
print('# sum(.., Counter())')
pprint.pprint(sum_counts)

vocabulary = sum_counts.keys()
print('# sum_counts.keys')
print(vocabulary)

print('# [[count[..] for .. in .. ] for .. in ..]')
pprint.pprint([[count[word] for word in vocabulary] for count in counts])

Execution result


$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words_counter_ver.py
# Counter(..)
Counter({'I': 2, 'But': 2, 'Like': 2, 'Is': 1, 'of': 1, 'thing': 1, 'Nana': 1, 'あなた': 1, 'is': 1})
# [Counter(..) for .. in ..]
[Counter({'I': 2,
          'But': 2,
          'Like': 2,
          'Is': 1,
          'of': 1,
          'thing': 1,
          'Nana': 1,
          'you': 1,
          'is': 1}),
 Counter({'I': 1, 'Is': 1, 'ramen': 1, 'But': 1, 'Like': 1, 'is': 1}),
 Counter({'Fuji Mountain': 1, 'Is': 1, 'No. 1 in Japan': 1, 'high': 1, 'Mountain': 1, 'is': 1})]
# sum(.., Counter())
Counter({'I': 3,
         'Is': 3,
         'But': 3,
         'Like': 3,
         'is': 3,
         'of': 1,
         'thing': 1,
         'Nana': 1,
         'you': 1,
         'ramen': 1,
         'Fuji Mountain': 1,
         'No. 1 in Japan': 1,
         'high': 1,
         'Mountain': 1})
# sum_counts.keys
dict_keys(['I', 'Is', 'of', 'thing', 'But', 'Like', 'Nana', 'あなた', 'is', 'ramen', 'Fuji Mountain', 'No. 1 in Japan', 'high', 'Mountain'])
# [[count[..] for .. in .. ] for .. in ..]
[[2, 1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0],
 [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
 [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]]

Calculation of BoW by scikit-learn

It can be implemented by hand as above, but scikit-learn provides ** sklearn.feature_extraction.text.CountVectorizer **, a class that computes BoW, so the implementation uses it.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer=tokenize)  # pass a callable (function or method) as tokenizer to specify how sentences are split
vectorizer.fit(texts)              # build the vocabulary
bow = vectorizer.transform(texts)  # compute the BoW
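A runnable sketch of the same flow, with a toy English corpus and str.split standing in for the book's MeCab-based tokenize (both are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ['i like you', 'i like ramen']              # toy corpus (assumption)
vectorizer = CountVectorizer(tokenizer=str.split)   # str.split stands in for tokenize
vectorizer.fit(texts)                               # build the vocabulary
bow = vectorizer.transform(texts).toarray()         # BoW as a dense array

print(vectorizer.vocabulary_)  # word -> column index, e.g. {'i': 0, 'like': 1, 'you': 3, 'ramen': 2}
print(bow)
```

transform() returns a scipy sparse matrix; toarray() densifies it so the rows can be compared with the hand-rolled BoW above (note that CountVectorizer assigns column indices in sorted vocabulary order, so the column order may differ).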

01.4 Classifier

In the context of machine learning, taking a feature vector as input and outputting its class ID is called classification, and the object or method that performs it is called a classifier.

Bundle scikit-learn components with Pipeline

Each component provided by scikit-learn (CountVectorizer, SVC, etc.) is designed with a unified API (fit(), predict(), transform(), etc.), so they can be bundled together with ** sklearn.pipeline.Pipeline **.

pipeline example


from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('classifier', SVC()),
])

# vectorizer.fit() +
# vectorizer.transform() +
# classifier.fit()
pipeline.fit(texts, labels)

# vectorizer.transform() +
# classifier.predict()
pipeline.predict(texts)

01.5 Evaluation

Evaluate the performance of machine learning systems with quantitative indicators.

Evaluate the dialogue agent

There are various metrics, but here we look at accuracy. As in the example below, accuracy is the proportion of the test data for which the model's prediction matches the test label.

from dialogue_agent import DialogueAgent
from sklearn.metrics import accuracy_score

dialogue_agent = DialogueAgent()
dialogue_agent.train(<train_text>, <train_label>)

predictions = dialogue_agent.predict(<test_text>)

print(accuracy_score(<test_label>, predictions))

Execution result


0.37234042...

The accuracy is still only about 37%.
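What accuracy_score computes can be written out in a few lines. This is a minimal stdlib sketch of the metric (for class labels, not the sklearn implementation itself), with made-up labels for illustration:

```python
def accuracy(test_labels, predictions):
    """Fraction of predictions that match the test labels."""
    matches = sum(t == p for t, p in zip(test_labels, predictions))
    return matches / len(test_labels)

# Toy example: 3 of 4 predictions match the labels
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```

So an accuracy of 0.372... means the agent predicted the correct class ID for roughly 37% of the test sentences.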
