This is a personal memo I keep while reading *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Chapter 2, Step 01.
- Personal Mac: macOS Mojave version 10.14.6
- Docker version: 19.03.2 for both client and server
Experience some of the building blocks of natural language processing programming through a simple dialogue agent:
- Tokenization (word segmentation)
- Feature vectorization
- Classification
- Evaluation
# Run this in the directory that contains the script to execute, dialogue_agent.py
# The CSV files required for execution are assumed to be in the same directory
# docker run -it -v $(pwd):/usr/src/app/ <docker image>:<tag> python <script to execute>
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python dialogue_agent.py
What we want to build is "a system that, given a sentence as input, predicts the class the sentence belongs to and outputs its class ID". This problem setting is called text classification.
# Sketch of how the dialogue agent system runs
```
dialogue_agent = DialogueAgent()
dialogue_agent.train(training_data)
predicted_class = dialogue_agent.predict('<input sentence>')
```
Splitting a sentence into words is called tokenization (word segmentation). It is not needed for languages such as English that put spaces between words. **MeCab** is widely used as software for tokenizing Japanese. Tokenization that also attaches part-of-speech information is called **morphological analysis**.
If you want only the surface forms, use parseToNode. (mecab-python3 seems to have a bug in versions older than 0.996.2 that keeps this from working properly.)
import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('<input sentence>')

# node.surface is an empty string for the first and last nodes
while node:
    print(node.surface)
    node = node.next
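Each node also carries part-of-speech information in node.feature, which is what makes this morphological analysis rather than plain tokenization. A minimal sketch (my own, not from the book; the layout of node.feature depends on the dictionary, and with IPADIC the first CSV field is the part of speech):

```python
import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('<input sentence>')
while node:
    if node.surface:  # skip the empty BOS/EOS nodes
        # node.feature is a CSV string; with IPADIC its first field is the part of speech
        print(node.surface, node.feature.split(',')[0])
    node = node.next
```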
If you pass **-Owakati** as an argument to MeCab.Tagger(), you can output just the tokenization result, with words separated by spaces (' '), the same as running `$ mecab -Owakati` on the command line.
However, when a word that itself contains a half-width space appears, the delimiter cannot be distinguished from the space inside the word, and the output cannot be split correctly. **This implementation should be avoided.** (Some dictionaries do contain words that include spaces.)
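The test scripts below import a tokenize function from tokenizer. The book's exact implementation is not reproduced here, but a minimal sketch built on parseToNode (avoiding -Owakati for the reason above) could look like this:

```python
# tokenizer.py -- a minimal sketch; the book's actual implementation may differ
import MeCab

tagger = MeCab.Tagger()

def tokenize(text):
    """Split text into a list of surface forms using MeCab."""
    node = tagger.parseToNode(text)
    tokens = []
    while node:
        if node.surface:  # skip the empty BOS/EOS nodes
            tokens.append(node.surface)
        node = node.next
    return tokens
```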
Representing one sentence as one fixed-length vector puts it in a format a computer can compute with.
Bag of Words
As shown below, each sentence is represented by a vector of fixed length (here, length 10).
Bag of Words example
# 私は私のことが好きなあなたが好きです ("I like you, who likes me")
bow0 = [2, 1, 0, 2, 1, 2, 1, 1, 1, 1]
# 私はラーメンが好きです ("I like ramen")
bow1 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
Code details are omitted; check the result of the list comprehensions.
test_bag_of_words.py
from tokenizer import tokenize
import pprint

texts = [
    '私は私のことが好きなあなたが好きです',
    '私はラーメンが好きです',
    '富士山は日本一高い山です',
]

tokenized_texts = [tokenize(text) for text in texts]
pprint.pprint(tokenized_texts)

# Initialize one zero vector of vocabulary size (14) per text
bow = [[0] * 14 for i in range(len(tokenized_texts))]
pprint.pprint(bow)
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words.py
[['私', 'は', '私', 'の', 'こと', 'が', '好き', 'な', 'あなた', 'が', '好き', 'です'],
 ['私', 'は', 'ラーメン', 'が', '好き', 'です'],
 ['富士山', 'は', '日本一', '高い', '山', 'です']]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
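The omitted part of the script fills in these zero vectors. A sketch of that counting step (my reconstruction, assuming the vocabulary maps each newly seen word to the next free index):

```python
# Build the vocabulary: word -> column index, in order of first appearance
vocabulary = {}
for tokenized_text in tokenized_texts:
    for token in tokenized_text:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

# Count each word's occurrences per text
bow = [[0] * len(vocabulary) for _ in tokenized_texts]
for i, tokenized_text in enumerate(tokenized_texts):
    for token in tokenized_text:
        bow[i][vocabulary[token]] += 1
```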
Bag of Words can be implemented more simply using the Counter class from the collections module in the Python standard library.
The script below prints the intermediate steps along the way; the final vectorization result is the same as in the Bag of Words implementation above.
test_bag_of_words_counter_ver.py
from collections import Counter
from tokenizer import tokenize
import pprint

texts = [
    '私は私のことが好きなあなたが好きです',
    '私はラーメンが好きです',
    '富士山は日本一高い山です',
]
tokenized_texts = [tokenize(text) for text in texts]

# Word frequencies of a single sentence
print('# Counter(..)')
print(Counter(tokenized_texts[0]))

# Word frequencies of each sentence
counts = [Counter(tokenized_text) for tokenized_text in tokenized_texts]
print('# [Counter(..) for .. in ..]')
pprint.pprint(counts)

# Merge the per-sentence frequencies into overall frequencies
sum_counts = sum(counts, Counter())
print('# sum(.., Counter())')
pprint.pprint(sum_counts)

# The vocabulary is the set of words that appear
vocabulary = sum_counts.keys()
print('# sum_counts.keys')
print(vocabulary)

# BoW: for each sentence, the count of every vocabulary word
print('# [[count[..] for .. in .. ] for .. in ..]')
pprint.pprint([[count[word] for word in vocabulary] for count in counts])
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words_counter_ver.py
# Counter(..)
Counter({'私': 2, 'が': 2, '好き': 2, 'は': 1, 'の': 1, 'こと': 1, 'な': 1, 'あなた': 1, 'です': 1})
# [Counter(..) for .. in ..]
[Counter({'私': 2,
          'が': 2,
          '好き': 2,
          'は': 1,
          'の': 1,
          'こと': 1,
          'な': 1,
          'あなた': 1,
          'です': 1}),
 Counter({'私': 1, 'は': 1, 'ラーメン': 1, 'が': 1, '好き': 1, 'です': 1}),
 Counter({'富士山': 1, 'は': 1, '日本一': 1, '高い': 1, '山': 1, 'です': 1})]
# sum(.., Counter())
Counter({'私': 3,
         'は': 3,
         'が': 3,
         '好き': 3,
         'です': 3,
         'の': 1,
         'こと': 1,
         'な': 1,
         'あなた': 1,
         'ラーメン': 1,
         '富士山': 1,
         '日本一': 1,
         '高い': 1,
         '山': 1})
# sum_counts.keys
dict_keys(['私', 'は', 'の', 'こと', 'が', '好き', 'な', 'あなた', 'です', 'ラーメン', '富士山', '日本一', '高い', '山'])
# [[count[..] for .. in .. ] for .. in ..]
[[2, 1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]]
BoW can be implemented by hand as above, but scikit-learn provides **sklearn.feature_extraction.text.CountVectorizer**, a class that computes BoW, so the implementation uses that instead.
from sklearn.feature_extraction.text import CountVectorizer

# Pass a callable (a function or method) as tokenizer to specify how sentences are split
vectorizer = CountVectorizer(tokenizer=tokenize)
vectorizer.fit(texts)              # build the vocabulary (dictionary)
bow = vectorizer.transform(texts)  # compute the BoW
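To inspect the result (not shown in the book excerpt; both of these are standard CountVectorizer API):

```python
# The learned word -> column index mapping
print(vectorizer.vocabulary_)
# transform() returns a sparse matrix; convert to a dense array to print
print(bow.toarray())
```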
In machine learning, taking a feature vector as input and outputting the class ID it belongs to is called classification, and the object or method that does this is called a classifier.
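As a concrete example, a minimal sketch of training scikit-learn's SVC (the SVM classifier that also appears in the pipeline below) on the BoW vectors from above; the labels list here is hypothetical, one class ID per text:

```python
from sklearn.svm import SVC

labels = [0, 0, 1]  # hypothetical class IDs, one per text

classifier = SVC()
classifier.fit(bow, labels)     # learn from feature vectors and class IDs
print(classifier.predict(bow))  # output a class ID for each input vector
```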
The components scikit-learn provides (CountVectorizer, SVC, etc.) are designed around a unified API of fit(), predict(), transform(), and so on, so they can be bundled together with **sklearn.pipeline.Pipeline**.
pipeline example
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('classifier', SVC()),
])

# vectorizer.fit() +
# vectorizer.transform() +
# classifier.fit()
pipeline.fit(texts, labels)

# vectorizer.transform() +
# classifier.predict()
pipeline.predict(texts)
Evaluate the performance of a machine learning system with quantitative metrics.
There are many possible metrics, but here we look at accuracy. As in the example below, accuracy is the proportion of the test data for which the model's prediction matches the test label.
from sklearn.metrics import accuracy_score
from dialogue_agent import DialogueAgent

dialogue_agent = DialogueAgent()
dialogue_agent.train(<train_text>, <train_label>)
predictions = dialogue_agent.predict(<test_text>)
print(accuracy_score(<test_label>, predictions))
Execution result
0.37234042...
At this point, the accuracy is still only about 37%.
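For intuition about what that number means: accuracy_score simply computes the fraction of predictions that match the labels. A toy check with made-up labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 3 of the 4 predictions match, so accuracy is 0.75
print(accuracy_score([0, 1, 2, 1], [0, 1, 2, 2]))  # => 0.75
```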