[PYTHON] Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2, Step 05 Memo "Feature Conversion"

Contents

This is a personal memo written as I work through "Introduction to Natural Language Processing Application Development in 15 Steps". This time, I note my own takeaways from Chapter 2, Step 05.

Preparation

- Personal Mac: macOS Mojave 10.14.6
- docker version: 19.03.2 for both Client and Server

Chapter overview

Step 04 covered feature extraction methods, and in the upcoming Step 06 a classifier will be trained on the extracted feature vectors. Step 05 covers dimensionality reduction methods that, in between those steps, process the feature vectors into the shape the classifier needs.

- Latent Semantic Analysis (LSA)
- Principal Component Analysis (PCA)

05.1 Feature preprocessing

BoW vectorizes the occurrence counts of words, so "the distribution of feature vector values tends to be very biased."

- Solution at feature extraction time: TF-IDF and similar methods from Step 04
- Solution by processing the feature vectors after extraction: with sklearn.preprocessing.QuantileTransformer, the values are mapped into the range from 0 to 1 and their distribution is made uniform

The example in the book was hard to follow, so I checked the behavior myself.

test_quantileTransformer.py


import numpy as np
import MeCab
import pprint

from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_extraction.text import CountVectorizer

def _tokenize(text):
    # The tokenizer body was elided in the original; a minimal MeCab
    # wakati-gaki (word-splitting) tokenizer is assumed here.
    tagger = MeCab.Tagger('-Owakati')
    return tagger.parse(text).strip().split()

texts = [
    'Cars, cars, cars run fast',
    'The bike runs fast',
    'Bicycle runs slowly',
    'Tricycle runs slowly',
    'Programming is fun',
    'Python is Python Python is Python Python is fun',
]

vectorizer = CountVectorizer(tokenizer=_tokenize, max_features=5)
bow = vectorizer.fit_transform(texts)
pprint.pprint(bow.toarray())

qt = QuantileTransformer()
qtd = qt.fit_transform(bow)
pprint.pprint(qtd.toarray())

Execution example


array([[0, 3, 0, 1, 3],
       [0, 1, 0, 1, 0],
       [0, 2, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [5, 5, 0, 0, 0]], dtype=int64)
array([[0.00000000e+00, 7.99911022e-01, 0.00000000e+00, 9.99999900e-01,
        9.99999900e-01],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 6.00000000e-01, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00],
       [9.99999900e-01, 9.99999900e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00]])

- Elements (5,0) and (0,4): the counts are high, but the word does not appear in the other sentences, so the converted values are almost 1.
- Columns (:,2) and (:,3): the counts are only 1, but most other sentences have 0, so the nonzero entries are converted to almost 1.
- Column (:,1): counts of 1, 2, 3, and 5 all appear, so the converted values also differ:
  - 1 before conversion: almost 0 after conversion
  - 2 before conversion: about 0.6 after conversion
  - 3 before conversion: about 0.8 after conversion
  - 5 before conversion: almost 1 after conversion

05.2 Latent Semantic Analysis (LSA) / 05.3 Principal Component Analysis (PCA)

| Contents | LSA | PCA |
| --- | --- | --- |
| Overview | A method that, from a set of feature vectors such as BoW representing the relationship between documents and words, obtains vectors that express each document at the level of the "meaning" behind the "words" | A method that finds the "directions in which the data points are most widely scattered" |
| Mathematical operation | SVD (singular value decomposition) | EVD (eigenvalue decomposition) |
| Implementation | svd = sklearn.decomposition.TruncatedSVD(); svd.fit_transform() | evd = sklearn.decomposition.PCA(); evd.fit_transform() |
| Importance of each dimension | Refer to singular_values_ to see the importance of each dimension after compression | Refer to explained_variance_ratio_ to obtain the cumulative contribution ratio |
| Dimensionality reduction | Specify n_components when instantiating | Specify n_components when instantiating |
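To make the table concrete, here is a minimal sketch that runs both decompositions on the small BoW matrix from the execution example above; the variable names (bow, svd, evd) and n_components=2 are illustrative choices, not from the book.


import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

# The document-word matrix from the execution example above
bow = np.array([[0, 3, 0, 1, 3],
                [0, 1, 0, 1, 0],
                [0, 2, 1, 1, 0],
                [0, 1, 1, 1, 0],
                [0, 1, 0, 0, 0],
                [5, 5, 0, 0, 0]], dtype=float)

# LSA: dimensionality reduction via SVD
svd = TruncatedSVD(n_components=2)
lsa_feat = svd.fit_transform(bow)                # shape (6, 2)
print(svd.singular_values_)                      # importance of each compressed dimension

# PCA: dimensionality reduction via eigenvalue decomposition
evd = PCA(n_components=2)
pca_feat = evd.fit_transform(bow)                # shape (6, 2)
print(np.cumsum(evd.explained_variance_ratio_))  # cumulative contribution ratio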

Points to consider with both methods

LSA-Topic model

A topic model asks "whether one sentence and another sentence have the same meaning" without explicitly attaching a class ID to the training data; since the correct answer (class ID) is not given explicitly, it is a kind of "unsupervised learning".
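As a minimal sketch of this "unsupervised" point (assuming the bow matrix from the example above): TruncatedSVD is fit on the feature vectors alone, with no class IDs, and each row of components_ can be read as the word weights of one latent topic.


from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2)
svd.fit(bow)                  # only feature vectors, no labels (class IDs)
print(svd.components_.shape)  # (number of topics, number of words)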

PCA-whitening

By decorrelating the components of the vectors (multiplying the target vectors by the eigenvectors obtained by PCA) and normalizing them to mean 0 and variance 1, the data's original "spread in each axis direction" is erased, which can be expected to improve classification performance.
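A minimal sketch of whitening with scikit-learn's PCA, where whiten=True rescales each decorrelated component to unit variance; the random data X here is purely illustrative.


import numpy as np
from sklearn.decomposition import PCA

# Illustrative data whose axes have very different spreads
X = np.random.RandomState(0).randn(100, 5) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

pca = PCA(whiten=True)
X_white = pca.fit_transform(X)
print(X_white.var(axis=0))  # each component now has variance close to 1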

PCA-Visualization method

Since a high-dimensional vector can be converted into a low-dimensional vector, it can also be used as a visualization method.
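For example, a minimal sketch (assuming the bow matrix from the earlier example and matplotlib) that compresses the document vectors to 2 dimensions and scatter-plots them:


import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Compress the document vectors to 2 dimensions for plotting
points = PCA(n_components=2).fit_transform(bow)
plt.scatter(points[:, 0], points[:, 1])
plt.show()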

05.4 Application / Implementation

Simply rewriting TruncatedSVD to PCA did not work (PCA cannot take sparse input).

Execution example


    def train(self, texts, labels):
        vectorizer = TfidfVectorizer(tokenizer=self._tokenize, ngram_range=(1, 3))
        bow = vectorizer.fit_transform(texts).toarray()

        pca = PCA(n_components=500)
        pca_feat = pca.fit_transform(bow)

        classifier = SVC()
        classifier.fit(pca_feat, labels)

        self.vectorizer = vectorizer
        self.pca = pca
        self.classifier = classifier

    def predict(self, texts):
        bow = self.vectorizer.transform(texts).toarray()
        pca_feat = self.pca.transform(bow)
        return self.classifier.predict(pca_feat)

It runs once the Pipeline notation is dropped and the (sparse) output of the vectorizer is converted to a dense array with toarray() before being passed to PCA.
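As a minimal sketch of that difference (using a plain TfidfVectorizer for brevity and assuming texts is a list of strings, with an illustrative n_components=2): TruncatedSVD accepts the sparse matrix directly, while PCA needs it converted to a dense array first.


from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
sparse_feat = vectorizer.fit_transform(texts)   # scipy sparse matrix

svd_feat = TruncatedSVD(n_components=2).fit_transform(sparse_feat)   # sparse input is fine
pca_feat = PCA(n_components=2).fit_transform(sparse_feat.toarray())  # PCA requires a dense array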
