[Python] Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2, Step 06 Memo "Classifier"

Contents

This is a personal memo I am keeping as I read "Introduction to Natural Language Processing Application Development in 15 Steps". This time I summarize my own takeaways from Chapter 2, Step 06.

Preparation

- Personal MacPC: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

A classifier is trained on pairs of feature vectors and class IDs, and can then predict the class ID for a given feature vector.

06.1 Making full use of classifiers

Because scikit-learn classifiers share a common interface, you can easily switch classifiers just by assigning a different instance to the `classifier` variable.

```python
# SVC
from sklearn.svm import SVC
classifier = SVC()

# RandomForest
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
```
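Not in the book, but a minimal runnable sketch of why the swap works: every scikit-learn classifier exposes the same `fit`/`predict` interface, so the surrounding code does not change at all (the iris dataset here is just a stand-in for the book's data):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # toy data, just for illustration

# The calling code is identical whichever classifier is assigned.
for classifier in (SVC(), RandomForestClassifier()):
    classifier.fit(X, y)              # learn from feature vectors and class IDs
    print(classifier.predict(X[:3]))  # predict class IDs for feature vectors
```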

06.2 SVM (Support Vector Machine)

- Relatively high performance can be obtained stably
- Easy to apply to tasks with relatively little training data
- The kernel method can handle complex problems

The outline of the basic SVM is omitted. A problem that cannot be solved by dividing the feature space with a straight line or a plane (a linear decision boundary) is called a **linearly inseparable** problem. The following methods deal with this:

- Soft margin
- Kernel method

Hard margin SVM, soft margin SVM

There are two variants: the simple **hard margin SVM**, which allows no training sample to cross the boundary, and the **soft margin SVM**, which sets the boundary so as to allow samples to protrude past it while keeping the protrusion as small as possible. The degree of tolerance for protrusion can be specified at instance creation time (the `C` parameter, default 1.0).
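A minimal sketch (not from the book) of how this tolerance is set in scikit-learn; note that a smaller `C` means a softer margin:

```python
from sklearn.svm import SVC

# Smaller C tolerates more boundary violations (softer margin);
# larger C penalizes violations more strongly (closer to a hard margin).
softer = SVC(C=0.1)
harder = SVC(C=100.0)
default = SVC()  # C=1.0
```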

Kernel

When the data cannot be separated even with some tolerance for boundary violations, the kernel method is used: the feature vectors are mapped into a feature space of higher dimension than the original one, and the decision boundary is set there. The kernel type can be specified at instance creation time.

- RBF kernel (Gaussian kernel): the default, and the most orthodox choice
- Polynomial kernel: popular in natural language processing
- Others
- No kernel (linear SVM)
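A minimal sketch (not from the book) of selecting the kernel when creating an `SVC` instance:

```python
from sklearn.svm import SVC

clf_rbf = SVC(kernel='rbf')        # the default (Gaussian) kernel
clf_poly = SVC(kernel='poly')      # polynomial kernel
clf_linear = SVC(kernel='linear')  # no kernel trick: a linear SVM
```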

Classes provided by scikit-learn

These include `sklearn.svm.SVC`, which supports the kernels above, and `sklearn.svm.LinearSVC`, a faster implementation limited to the linear case.

06.3 Ensemble

A method that combines multiple classifiers into a single classifier is called an ensemble. The explanation of decision trees themselves is omitted, but decision tree ensembles have the following properties:

- No pre-processing of features is required
- Few parameters need to be set by the designer

Typical ensemble methods are bagging and boosting.

- Bagging: train multiple models separately, each using part of the data, and finally combine their results
  - Random Forest
- Boosting: train using part of the data, then repeat training many times, each round building on the previous result
  - Gradient Boosted Decision Trees (GBDT)

| Item | Random Forest | GBDT |
| --- | --- | --- |
| Ensemble method | Bagging | Boosting |
| Class | `sklearn.ensemble.RandomForestClassifier()` | `sklearn.ensemble.GradientBoostingClassifier()` |
| Decision trees built | A few deep trees | Many shallow trees |
| Runtime | - | Faster and more memory-efficient than Random Forest |
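Not from the book, but a minimal runnable sketch comparing the two classes on a toy dataset (the iris data and the cross-validation setup are assumptions made just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for clf in (RandomForestClassifier(n_estimators=30), GradientBoostingClassifier()):
    scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(type(clf).__name__, round(scores.mean(), 3))
```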

06.4 k-nearest neighbor method

This is a classifier that selects the k feature vectors in the training data closest to the input feature vector and takes a majority vote over their class IDs. The predicted class ID can change depending on the value of k, and if the distribution of class IDs in the training data is skewed, the majority vote may not give the desired result.
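A minimal sketch (not from the book) of the corresponding scikit-learn class; `n_neighbors` is the k described above:

```python
from sklearn.neighbors import KNeighborsClassifier

# k is passed as n_neighbors (default 5); an odd k avoids
# ties in binary classification.
classifier = KNeighborsClassifier(n_neighbors=3)
```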

Distances for measuring closeness

| Type | Description |
| --- | --- |
| Euclidean distance | Straight-line length of the vector in the space |
| Manhattan distance | Sum of the lengths along each axis |
| Minkowski distance | A generalization covering both the Euclidean and Manhattan distances |
| Levenshtein distance | Distance between strings: the number of single-character insertions, deletions, and substitutions needed to turn one string into the other |
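Not from the book: a small sketch of the first three distances using SciPy (the sample vectors are arbitrary):

```python
from scipy.spatial import distance

a, b = (0, 0), (3, 4)
print(distance.euclidean(a, b))       # 5.0  (straight-line distance)
print(distance.cityblock(a, b))       # 7    (Manhattan distance)
print(distance.minkowski(a, b, p=2))  # 5.0  (p=2 reduces to Euclidean)
```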

Parametric and nonparametric

| Item | Parametric | Nonparametric |
| --- | --- | --- |
| Parameter settings | Care is needed with the parameters so that the decision boundary is set properly | No parameters related to the decision boundary need to be considered |
| Computational cost | Computing the decision boundary takes time during training; prediction cost is roughly constant | Training cost is essentially zero; prediction cost grows with the amount of training data |
| Training data required | Relatively little | Relatively much |
| Example classifier | SVM | k-nearest neighbor method |
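Not in the book, but a small sketch that makes the cost trade-off concrete on random data (the sizes are arbitrary): the SVM pays at fit time, while k-NN pays at predict time.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((3000, 20))        # 3000 random feature vectors
y = rng.integers(0, 2, 3000)      # random binary class IDs

for clf in (SVC(), KNeighborsClassifier()):
    t0 = time.perf_counter(); clf.fit(X, y)
    t1 = time.perf_counter(); clf.predict(X[:500])
    t2 = time.perf_counter()
    print(f"{type(clf).__name__}: fit {t1 - t0:.3f}s, predict {t2 - t1:.3f}s")
```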

06.5 Apply to Dialogue Agent

Additions / changes from the previous chapter

  1. Classifier: SVM → Random Forest
  2. TF-IDF ngram_range: (1, 3) → (1, 2)
```python
pipeline = Pipeline([
    # ('vectorizer', TfidfVectorizer(tokenizer=self._tokenize, ngram_range=(1, 3))),
    ('vectorizer', TfidfVectorizer(tokenizer=self._tokenize, ngram_range=(1, 2))),
    # ('classifier', SVC()),
    ('classifier', RandomForestClassifier(n_estimators=30)),
])
```
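For reference (not shown in the snippet above), the imports this pipeline relies on; the surrounding class and `self._tokenize` come from the book's dialogue agent implementation and are assumed here:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
```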

Execution result


```python
# In evaluate_dialogue_agent.py, change the name of the imported module as needed
from dialogue_agent import DialogueAgent
```

```sh
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.61702127
```

- Basic implementation (Step 01): 37.2%
- Added pre-processing (Step 02): 43.6%
- Pre-processing + changed feature extraction (Step 04): 58.5%
- Pre-processing + changed feature extraction + changed classifier (Step 06): 61.7%
