This is a personal memo I am keeping as I read Introduction to Natural Language Processing Applications in 15 Steps. This time I write down my own takeaways from Chapter 2, Step 05.
- Personal MacPC: macOS Mojave version 10.14.6
- docker version: 19.03.2 for both Client and Server
Step 04 covered feature extraction methods, and in the upcoming Step 06 a classifier will be trained on the extracted feature vectors. Step 05, which sits between the two, covers dimensionality reduction methods that process the feature vectors into a shape suitable for the classifier.
- Latent Semantic Analysis (LSA)
- Principal Component Analysis (PCA)
BoW vectorizes the occurrence frequency of words, so "the distribution of feature vector values tends to be heavily skewed."
- Addressed at feature extraction time
  - TF-IDF etc. from Step 04
- Addressed by post-processing the extracted feature vectors
  - sklearn.preprocessing.QuantileTransformer maps the values into the range from 0 to 1 and makes their distribution uniform.
The example in the reference book was hard for me to follow, so I check it myself below.
test_quantileTransformer.py
```python
import numpy as np
import MeCab
import pprint
from sklearn.preprocessing import QuantileTransformer
from sklearn.feature_extraction.text import CountVectorizer


def _tokenize(text):
    # The tokenizer body is omitted in the book; as an assumption here,
    # split the text into tokens with MeCab's wakati (space-separated) output.
    return MeCab.Tagger('-Owakati').parse(text).strip().split()


texts = [
    'Cars, cars, cars run fast',
    'The bike runs fast',
    'Bicycle runs slowly',
    'Tricycle runs slowly',
    'Programming is fun',
    'Python is Python Python is Python Python is fun',
]

vectorizer = CountVectorizer(tokenizer=_tokenize, max_features=5)
bow = vectorizer.fit_transform(texts)
pprint.pprint(bow.toarray())

qt = QuantileTransformer()
qtd = qt.fit_transform(bow)
pprint.pprint(qtd.toarray())
```
Execution example
```
array([[0, 3, 0, 1, 3],
       [0, 1, 0, 1, 0],
       [0, 2, 1, 1, 0],
       [0, 1, 1, 1, 0],
       [0, 1, 0, 0, 0],
       [5, 5, 0, 0, 0]], dtype=int64)
array([[0.00000000e+00, 7.99911022e-01, 0.00000000e+00, 9.99999900e-01,
        9.99999900e-01],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 6.00000000e-01, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 9.99999900e-01, 9.99999900e-01,
        0.00000000e+00],
       [0.00000000e+00, 9.99999998e-08, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00],
       [9.99999900e-01, 9.99999900e-01, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00]])
```
- Elements (5, 0) and (0, 4) have high counts that do not appear in the other sentences, so their transformed values are almost 1.
- Columns (:, 2) and (:, 3) contain counts of only 1, but since many of the other sentences have a count of 0, the transformed values are almost 1.
- Column (:, 1) contains a variety of counts (1, 2, 3, 5), so the transformed values also vary:
  - count 1 before transformation: almost 0 after transformation
  - count 2 before transformation: about 0.6 after transformation
  - count 3 before transformation: about 0.8 after transformation
  - count 5 before transformation: almost 1 after transformation
| Contents | LSA | PCA |
|---|---|---|
| Overview | A method that, from a set of feature vectors expressing the relationship between documents and words (such as BoW), obtains vectors that represent documents at the level of the "meanings" behind the "words". | A method that finds the "directions in which the data points are most widely spread". |
| Mathematical operation | SVD (Singular Value Decomposition) | EVD (Eigenvalue Decomposition) |
| Implementation | svd = sklearn.decomposition.TruncatedSVD(); svd.fit_transform(...) | evd = sklearn.decomposition.PCA(); evd.fit_transform(...) |
| Importance of each dimension | Refer to singular_values_ to see the importance of each dimension after compression. | Refer to explained_variance_ratio_ to see each dimension's contribution ratio (its cumulative sum gives the cumulative contribution ratio). |
| Dimensionality reduction | Specify n_components when instantiating | Specify n_components when instantiating |
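To make the table concrete, here is a small sketch of my own (not from the book) that applies TruncatedSVD and PCA to the BoW matrix from the example above and looks at the attributes listed in the table; n_components=2 is an arbitrary example value.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, PCA

# The BoW matrix from the QuantileTransformer example above
# (rows: documents, columns: word counts).
bow = np.array([
    [0, 3, 0, 1, 3],
    [0, 1, 0, 1, 0],
    [0, 2, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [5, 5, 0, 0, 0],
])

# LSA: TruncatedSVD (also accepts sparse input).
svd = TruncatedSVD(n_components=2)
lsa_feat = svd.fit_transform(bow)
print(lsa_feat.shape)        # (6, 2)
print(svd.singular_values_)  # importance of each compressed dimension

# PCA (requires dense input).
evd = PCA(n_components=2)
pca_feat = evd.fit_transform(bow)
print(pca_feat.shape)                            # (6, 2)
print(evd.explained_variance_ratio_)             # contribution ratio per dimension
print(np.cumsum(evd.explained_variance_ratio_))  # cumulative contribution ratio
```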
One point to note: a topic model, which asks "does this sentence mean the same thing as that one?", does not explicitly give the training data a correct answer (class ID), so it is a kind of "unsupervised learning".
By decorrelating the components of the vectors (multiplying the target vectors by the eigenvectors obtained with PCA) and normalizing them to mean 0 and variance 1, the data's original "spread along each axis direction" is erased (so-called whitening), which can be expected to improve classification performance.
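As a minimal sketch of this point (my own toy data, not from the book), scikit-learn's PCA(whiten=True) decorrelates the components and scales them to roughly unit variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
# Correlated 2-D toy data whose spread differs strongly between directions.
x = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

whitened = PCA(whiten=True).fit_transform(x)

# After whitening: mean ~0, standard deviation ~1, components ~uncorrelated.
print(whitened.mean(axis=0))
print(whitened.std(axis=0))
print(np.corrcoef(whitened.T)[0, 1])
```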
Since dimensionality reduction converts high-dimensional vectors into low-dimensional ones, it can also be used as a visualization technique.
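For example (a quick sketch of my own using scikit-learn's toy digits dataset, unrelated to the book's data), 64-dimensional vectors can be projected down to 2 dimensions and scatter-plotted:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Project the 64-dimensional digit images down to 2 dimensions for plotting.
digits = load_digits()
points = PCA(n_components=2).fit_transform(digits.data)

plt.scatter(points[:, 0], points[:, 1], c=digits.target, cmap='tab10', s=10)
plt.colorbar(label='digit label')
plt.show()
```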
Simply rewriting TruncatedSVD to PCA did not work (PCA cannot take sparse input).
Execution example
```python
def train(self, texts, labels):
    vectorizer = TfidfVectorizer(tokenizer=self._tokenize, ngram_range=(1, 3))
    # .toarray() densifies the sparse TF-IDF matrix so that PCA accepts it
    bow = vectorizer.fit_transform(texts).toarray()

    pca = PCA(n_components=500)
    pca_feat = pca.fit_transform(bow)

    classifier = SVC()
    classifier.fit(pca_feat, labels)

    self.vectorizer = vectorizer
    self.pca = pca
    self.classifier = classifier

def predict(self, texts):
    bow = self.vectorizer.transform(texts).toarray()
    pca_feat = self.pca.transform(bow)
    return self.classifier.predict(pca_feat)
```
It runs once you drop the pipeline notation, densify the vectorizer's (sparse) output with toarray(), and then feed the result to PCA.
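Alternatively, as a sketch of my own (not from the book), the pipeline notation can be kept by inserting a densifying step with FunctionTransformer between the vectorizer and PCA; the data and parameter values below are placeholders.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vectorizer', TfidfVectorizer(ngram_range=(1, 3))),
    # Densify the sparse TF-IDF matrix before it reaches PCA.
    ('densify', FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)),
    ('pca', PCA(n_components=2)),  # placeholder value for n_components
    ('classifier', SVC()),
])

# Placeholder data just to show that the pipeline fits end to end.
texts = ['foo bar baz', 'bar baz qux', 'foo qux quux', 'baz qux quux']
labels = [0, 0, 1, 1]
pipeline.fit(texts, labels)
print(pipeline.predict(['foo bar qux']))
```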