This is a personal memo I keep while reading *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Chapter 2, Step 01.
- Personal Mac: macOS Mojave version 10.14.6
- Docker version: 19.03.2 for both client and server
Experience some of the building blocks of natural language processing programming through a simple dialogue agent:
- Tokenization (word segmentation)
- Feature vectorization
- Classification
- Evaluation
# Run this in the directory that contains the script to execute, dialogue_agent.py
# The CSV files required for execution are assumed to be in the same directory
# docker run -it -v $(pwd):/usr/src/app/ <docker image>:<tag> python <script to execute>
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python dialogue_agent.py
What we want to build is "a system that, given a sentence as input, predicts the class the sentence belongs to and outputs its class ID". This problem setting is called text classification.
# Sketch of how the dialogue agent system runs
```
dialogue_agent = DialogueAgent()
dialogue_agent.train(training_data)
predicted_class = dialogue_agent.predict('<input sentence>')
```
Splitting a sentence into words is called tokenization (word segmentation). It is not needed for languages such as English that put spaces between words. **MeCab** is widely used as software for tokenizing Japanese. Tokenization that also attaches part-of-speech information is called **morphological analysis**.
If you want only the surface forms, use parseToNode. (mecab-python3 seems to have a bug in versions older than 0.996.2 that keeps this from working properly.)
import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('<input sentence>')

# node.surface is an empty string for the first and last nodes
while node:
    print(node.surface)
    node = node.next
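Each node also carries part-of-speech information in node.feature, which is what makes this morphological analysis rather than plain tokenization. A minimal sketch (my own, not from the book; the layout of node.feature depends on the dictionary, and with IPADIC the first CSV field is the part of speech):

```python
import MeCab

tagger = MeCab.Tagger()
node = tagger.parseToNode('<input sentence>')
while node:
    if node.surface:  # skip the empty BOS/EOS nodes
        # node.feature is a CSV string; with IPADIC its first field is the part of speech
        print(node.surface, node.feature.split(',')[0])
    node = node.next
```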
If you pass **-Owakati** as an argument to MeCab.Tagger(), you can output just the tokenization result, with words separated by spaces (' '), the same as running `$ mecab -Owakati` on the command line.
However, when a word that itself contains a half-width space appears, the delimiter cannot be distinguished from the space inside the word, and the output cannot be split correctly. **This implementation should be avoided.** (Some dictionaries do contain words that include spaces.)
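The test scripts below import a tokenize function from tokenizer. The book's exact implementation is not reproduced here, but a minimal sketch built on parseToNode (avoiding -Owakati for the reason above) could look like this:

```python
# tokenizer.py -- a minimal sketch; the book's actual implementation may differ
import MeCab

tagger = MeCab.Tagger()

def tokenize(text):
    """Split text into a list of surface forms using MeCab."""
    node = tagger.parseToNode(text)
    tokens = []
    while node:
        if node.surface:  # skip the empty BOS/EOS nodes
            tokens.append(node.surface)
        node = node.next
    return tokens
```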
Representing one sentence as one fixed-length vector puts it in a format a computer can compute with.
Bag of Words
As shown below, each sentence is represented by a vector of fixed length (here, length 10).
Bag of Words example
# 私は私のことが好きなあなたが好きです ("I like you, who likes me")
bow0 = [2, 1, 0, 2, 1, 2, 1, 1, 1, 1]
# 私はラーメンが好きです ("I like ramen")
bow1 = [1, 0, 1, 1, 0, 1, 1, 0, 0, 1]
Code details are omitted; check the result of the list comprehensions.
test_bag_of_words.py
from tokenizer import tokenize
import pprint

texts = [
    '私は私のことが好きなあなたが好きです',
    '私はラーメンが好きです',
    '富士山は日本一高い山です',
]

tokenized_texts = [tokenize(text) for text in texts]
pprint.pprint(tokenized_texts)

# Initialize one zero vector of vocabulary size (14) per text
bow = [[0] * 14 for i in range(len(tokenized_texts))]
pprint.pprint(bow)
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words.py
[['私', 'は', '私', 'の', 'こと', 'が', '好き', 'な', 'あなた', 'が', '好き', 'です'],
 ['私', 'は', 'ラーメン', 'が', '好き', 'です'],
 ['富士山', 'は', '日本一', '高い', '山', 'です']]
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
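The omitted part of the script fills in these zero vectors. A sketch of that counting step (my reconstruction, assuming the vocabulary maps each newly seen word to the next free index):

```python
# Build the vocabulary: word -> column index, in order of first appearance
vocabulary = {}
for tokenized_text in tokenized_texts:
    for token in tokenized_text:
        if token not in vocabulary:
            vocabulary[token] = len(vocabulary)

# Count each word's occurrences per text
bow = [[0] * len(vocabulary) for _ in tokenized_texts]
for i, tokenized_text in enumerate(tokenized_texts):
    for token in tokenized_text:
        bow[i][vocabulary[token]] += 1
```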
Bag of Words can be implemented more simply using the Counter class from the collections module in the Python standard library.
The script below prints the intermediate steps along the way; the final vectorization result is the same as in the Bag of Words implementation above.
test_bag_of_words_counter_ver.py
from collections import Counter
from tokenizer import tokenize
import pprint

texts = [
    '私は私のことが好きなあなたが好きです',
    '私はラーメンが好きです',
    '富士山は日本一高い山です',
]
tokenized_texts = [tokenize(text) for text in texts]

# Word frequencies of a single sentence
print('# Counter(..)')
print(Counter(tokenized_texts[0]))

# Word frequencies of each sentence
counts = [Counter(tokenized_text) for tokenized_text in tokenized_texts]
print('# [Counter(..) for .. in ..]')
pprint.pprint(counts)

# Merge the per-sentence frequencies into overall frequencies
sum_counts = sum(counts, Counter())
print('# sum(.., Counter())')
pprint.pprint(sum_counts)

# The vocabulary is the set of words that appear
vocabulary = sum_counts.keys()
print('# sum_counts.keys')
print(vocabulary)

# BoW: for each sentence, the count of every vocabulary word
print('# [[count[..] for .. in .. ] for .. in ..]')
pprint.pprint([[count[word] for word in vocabulary] for count in counts])
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_bag_of_words_counter_ver.py
# Counter(..)
Counter({'私': 2, 'が': 2, '好き': 2, 'は': 1, 'の': 1, 'こと': 1, 'な': 1, 'あなた': 1, 'です': 1})
# [Counter(..) for .. in ..]
[Counter({'私': 2,
          'が': 2,
          '好き': 2,
          'は': 1,
          'の': 1,
          'こと': 1,
          'な': 1,
          'あなた': 1,
          'です': 1}),
 Counter({'私': 1, 'は': 1, 'ラーメン': 1, 'が': 1, '好き': 1, 'です': 1}),
 Counter({'富士山': 1, 'は': 1, '日本一': 1, '高い': 1, '山': 1, 'です': 1})]
# sum(.., Counter())
Counter({'私': 3,
         'は': 3,
         'が': 3,
         '好き': 3,
         'です': 3,
         'の': 1,
         'こと': 1,
         'な': 1,
         'あなた': 1,
         'ラーメン': 1,
         '富士山': 1,
         '日本一': 1,
         '高い': 1,
         '山': 1})
# sum_counts.keys
dict_keys(['私', 'は', 'の', 'こと', 'が', '好き', 'な', 'あなた', 'です', 'ラーメン', '富士山', '日本一', '高い', '山'])
# [[count[..] for .. in .. ] for .. in ..]
[[2, 1, 1, 1, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]]
BoW can be implemented by hand as above, but scikit-learn provides **sklearn.feature_extraction.text.CountVectorizer**, a class that computes BoW, so the implementation uses that instead.
from sklearn.feature_extraction.text import CountVectorizer

# Pass a callable (a function or method) as tokenizer to specify how sentences are split
vectorizer = CountVectorizer(tokenizer=tokenize)
vectorizer.fit(texts)              # build the vocabulary (dictionary)
bow = vectorizer.transform(texts)  # compute the BoW
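To inspect the result (not shown in the book excerpt; both of these are standard CountVectorizer API):

```python
# The learned word -> column index mapping
print(vectorizer.vocabulary_)
# transform() returns a sparse matrix; convert to a dense array to print
print(bow.toarray())
```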
In machine learning, taking a feature vector as input and outputting the class ID it belongs to is called classification, and the object or method that does this is called a classifier.
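As a concrete example, a minimal sketch of training scikit-learn's SVC (the SVM classifier that also appears in the pipeline below) on the BoW vectors from above; the labels list here is hypothetical, one class ID per text:

```python
from sklearn.svm import SVC

labels = [0, 0, 1]  # hypothetical class IDs, one per text

classifier = SVC()
classifier.fit(bow, labels)     # learn from feature vectors and class IDs
print(classifier.predict(bow))  # output a class ID for each input vector
```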
The components scikit-learn provides (CountVectorizer, SVC, etc.) are designed around a unified API of fit(), predict(), transform(), and so on, so they can be bundled together with **sklearn.pipeline.Pipeline**.
pipeline example
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokenize)),
    ('classifier', SVC()),
])

# vectorizer.fit() +
# vectorizer.transform() +
# classifier.fit()
pipeline.fit(texts, labels)

# vectorizer.transform() +
# classifier.predict()
pipeline.predict(texts)
Evaluate the performance of a machine learning system with quantitative metrics.
There are many possible metrics, but here we look at accuracy. As in the example below, accuracy is the proportion of the test data for which the model's prediction matches the test label.
from sklearn.metrics import accuracy_score
from dialogue_agent import DialogueAgent

dialogue_agent = DialogueAgent()
dialogue_agent.train(<train_text>, <train_label>)
predictions = dialogue_agent.predict(<test_text>)
print(accuracy_score(<test_label>, predictions))
Execution result
0.37234042...
At this point, the accuracy is still only about 37%.
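For intuition about what that number means: accuracy_score simply computes the fraction of predictions that match the labels. A toy check with made-up labels:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: 3 of the 4 predictions match, so accuracy is 0.75
print(accuracy_score([0, 1, 2, 1], [0, 1, 2, 2]))  # => 0.75
```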