[PYTHON] Try the book "Introduction to Natural Language Processing Application Development in 15 Steps" - Chapter 2, Step 04 Memo: "Feature Extraction"

Contents

This is a memo to myself as I read Introduction to Natural Language Processing Application Development in 15 Steps. This time I cover Chapter 2, Step 04, noting the points I found important.

Preparation

- Personal Mac: macOS Mojave 10.14.6
- Docker: version 19.03.2 (both Client and Server)

Chapter overview

The dialogue agents built so far used BoW for feature extraction. This step covers a variety of feature extraction methods, such as TF-IDF, BM25, and N-grams, with the goal of converting strings into appropriate feature vectors.

04.1 Bag of Words revisited

The nature of the Bag of Words

BoW vectorizes the frequency with which each word occurs. It captures some similarity in sentence meaning: for example, sentences containing words such as "I" and "like" all express a personal preference to some degree.

On the other hand, it contains no word order information, so it has weaknesses as well as strengths. Sections 04.2 and 04.3 below introduce improvements that take corpus-wide word frequency and sentence length into account, while 04.4 and 04.5 introduce improvements that change how text is split into tokens.

The name Bag of Words apparently comes from the image of breaking a sentence down into words, throwing them into a bag, and counting them while ignoring their order.
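
A minimal sketch of this with scikit-learn's CountVectorizer (the sample sentences are my own):

from sklearn.feature_extraction.text import CountVectorizer

# two sample sentences: same words, different order
texts = ['you like ramen', 'ramen you like']

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(texts)

print(vectorizer.vocabulary_)  # {'you': 2, 'like': 0, 'ramen': 1}
print(bow.toarray())
# both rows are identical: BoW ignores word order
# [[1 1 1]
#  [1 1 1]]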

Unknown word

When you build the dictionary with CountVectorizer's .fit and vectorize with .transform, the sentence sets used for the two steps can be separated; any word encountered at transform time that is not in the fitted dictionary (an unknown word) is simply ignored.
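
A minimal sketch of this behavior (sample sentences are my own):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(['you like ramen'])  # vocabulary: like, ramen, you

# 'sushi' was never seen during .fit, so it is an unknown word
vec = vectorizer.transform(['you like sushi'])
print(vec.toarray())  # [[1 0 1]] -- 'sushi' is silently ignored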

04.2 TF-IDF

BoW problems and TF-IDF solutions

--BoW problem: Words that characterize sentences and words that do not characterize sentences are treated equally. --TF-IDF solution: reduce the contribution of words that do not characterize the sentence ――Words that appear widely in various sentences are general words, so they are not important in expressing the meaning of each sentence.

Calculation of TF-IDF by Scikit-learn

Use TfidfVectorizer instead of CountVectorizer (BoW).
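
A minimal sketch of the swap (sample texts are my own):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ['you like ramen', 'you like soba', 'you hate natto']

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

# 'you' appears in every text, so its weight is reduced relative to
# rarer, more characteristic words such as 'ramen' or 'natto'
print(vectorizer.vocabulary_)
print(tfidf.toarray())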

TF-IDF calculation method

The final value is the product of the term frequency TF (Term Frequency) and IDF (Inverse Document Frequency), the logarithm of the reciprocal of the document frequency:

TF-IDF(t, d) = TF(t, d) · IDF(t)

- TF: grows larger the more frequently the word appears in the sentence
- IDF: grows smaller the more sentences the word appears in
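
As a worked sketch of these definitions (toy data is my own; note that TF-IDF definitions vary, and scikit-learn's TfidfVectorizer adds smoothing and L2 normalization, so its numbers will differ):

import math

docs = [['you', 'like', 'ramen'],
        ['you', 'like', 'soba'],
        ['you', 'hate', 'natto']]
N = len(docs)

def tf(term, doc):
    # raw count of the term in this sentence
    return doc.count(term)

def idf(term):
    # log of the reciprocal of the document frequency
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf('you', docs[0]))    # 0.0   -- appears everywhere, not characteristic
print(tf_idf('ramen', docs[0]))  # ~1.10 -- appears only here, characteristic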

04.3 BM25

BM25 is a modification of TF-IDF that takes sentence length into account.
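
The book introduces BM25 only conceptually at this point; for reference, a minimal sketch of the commonly cited Okapi BM25 term weight (my own illustration; k1 and b are conventional default parameters):

import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term weight: TF saturates as it grows, and is
    normalized by sentence length; b controls how strongly the
    length normalization applies."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))

# the same term frequency contributes less in a longer sentence
print(bm25_weight(tf=2, df=3, n_docs=10, doc_len=5, avg_doc_len=10))
print(bm25_weight(tf=2, df=3, n_docs=10, doc_len=20, avg_doc_len=10))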

04.4 Word N-gram

So far, each token produced by word separation has been treated as one dimension, which makes the approach a **word uni-gram** method. Treating each pair of adjacent words as one dimension is called a **word bi-gram**; groups of three words are a **word tri-gram**, and in general these are **word N-grams**.

#Word-separated result
Tokyo / from / Osaka / to / go

# uni-gram: 5 dimensions
1. Tokyo
2. from
3. Osaka
4. to
5. go

# bi-gram: 4 dimensions
1. Tokyo-from
2. from-Osaka
3. Osaka-to
4. to-go

Things to consider

Using word N-grams makes it possible to extract features that retain some of the word order information that BoW ignores. On the other hand, **as N grows, the dimensionality increases and the features become sparse, so generalization performance drops**; this trade-off needs to be weighed when using N-grams.

Use with BoW and TF-IDF by Scikit-learn

Pass an ngram_range=(minimum, maximum) argument to the CountVectorizer or TfidfVectorizer constructor. Given the minimum and maximum, every N-gram size within that range is included in the feature vector (for example, a dictionary can be built from both uni-grams and bi-grams at once, as in the sketch below).
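
A minimal sketch (sample text is my own):

from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2): use both uni-grams and bi-grams as dimensions
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['go from Tokyo to Osaka'])

print(sorted(vectorizer.vocabulary_))
# ['from', 'from tokyo', 'go', 'go from', 'osaka', 'to', 'to osaka', 'tokyo', 'tokyo to']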

04.5 Character N-gram

The idea is to build a BoW over characters rather than words, treating each sequence of N characters as one vocabulary entry.

Character N-gram features (points to consider)

It is robust to variation in word spelling, and since it requires no morphological analysis (word separation) in the first place, it also handles compound words and unknown words well. On the other hand, **its ability to distinguish words or sentences that are similar as character strings but different in meaning may drop, and because Japanese has many distinct characters, the number of dimensions can grow large**.
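
A minimal sketch of character bi-grams with scikit-learn (sample text is my own):

from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' builds the vocabulary from character bi-grams,
# so no morphological analysis (word separation) is needed
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit(['tokyo'])

print(sorted(vectorizer.vocabulary_))  # ['ky', 'ok', 'to', 'yo']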

04.6 Combination of multiple features

Just as multiple N-gram sizes can be combined into a single feature vector with word N-grams, different kinds of features can also be combined.

#When combining after computing each feature vector separately
import scipy.sparse

bow1 = bow1_vectorizer.fit_transform(texts)
bow2 = bow2_vectorizer.fit_transform(texts)

# stack the two sparse matrices horizontally (column-wise)
feature = scipy.sparse.hstack((bow1, bow2))

# When using sklearn.pipeline.FeatureUnion
from sklearn.pipeline import FeatureUnion

combined = FeatureUnion(
  [
    ('bow', word_bow_vectorizer),
    ('char_bigram', char_bigram_vectorizer),
  ])

feature = combined.fit_transform(texts)

Things to consider when connecting multiple features

- The dimensionality grows larger.
- Concatenating features with different properties can reduce accuracy (a scaling sketch follows below):
  - their value ranges may differ greatly
  - their sparseness may differ greatly
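
One common mitigation for the value-range problem (my own illustration, not from the book) is to scale each feature block before concatenating; MaxAbsScaler works directly on sparse matrices:

import scipy.sparse
from sklearn.preprocessing import MaxAbsScaler

# scale each block column-wise to [-1, 1] so that a block with a large
# value range does not dominate the others after concatenation
bow1_scaled = MaxAbsScaler().fit_transform(bow1)
bow2_scaled = MaxAbsScaler().fit_transform(bow2)

feature = scipy.sparse.hstack((bow1_scaled, bow2_scaled))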

04.7 Other ad hoc features

The following features can also be added.

- Sentence length (example below)
- Number of sentences when the text is split on punctuation (example below)
- Number of occurrences of a specific word

Combining ad hoc features with Scikit-learn

The script below prints intermediate results so each stage can be checked.

test_sklearn_adhoc_union.py


###Main source omitted

import pprint

print('# num_sentences - \'Hello. Good evening.\':')
print([sent for sent in rx_periods.split(texts[0]) if len(sent) > 0])

print('\n# [{} for .. in ..]')
print([{text} for text in texts])

textStats = TextStats()
print('\n# TextStats.fit():' + str(type(textStats.fit(texts))))
fitTransformTextStats = textStats.fit_transform(texts)
print('\n# TextStats.fit_transform():'+ str(type(fitTransformTextStats)))
pprint.pprint(fitTransformTextStats)

dictVectorizer = DictVectorizer()
print('\n# DictVectorizer.fit():' + str(type(dictVectorizer.fit(fitTransformTextStats))))
fitTransformDictVectorizer = dictVectorizer.fit_transform(textStats.transform(texts))
print('\n# DictVectorizer.fit_transform():' + str(type(fitTransformDictVectorizer)))
pprint.pprint(fitTransformDictVectorizer.toarray())

countVectorizer = CountVectorizer(analyzer = 'char', ngram_range = (2, 2))
print('\n# CountVectorizer.fit():' + str(type(countVectorizer.fit(texts))))

Execution result


$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_sklearn_adhoc_union.py
# num_sentences - 'Hello. Good evening.':
['Hello', 'Good evening']

# [{} for .. in ..]
[{'Hello. Good evening.'}, {'I want to eat yakiniku'}]

# TextStats.fit():<class '__main__.TextStats'>

# TextStats.fit_transform():<class 'list'>
[{'length': 12, 'num_sentences': 2}, {'length': 7, 'num_sentences': 1}]

# DictVectorizer.fit():<class 'sklearn.feature_extraction.dict_vectorizer.DictVectorizer'>

# DictVectorizer.fit_transform():<class 'scipy.sparse.csr.csr_matrix'>
array([[12.,  2.],
       [ 7.,  1.]])

# CountVectorizer.fit():<class 'sklearn.feature_extraction.text.CountVectorizer'>
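
The main source of the script is omitted above; judging from the printed output, the TextStats class is presumably a transformer along these lines (my own hypothetical reconstruction, including the rx_periods pattern):

import re
from sklearn.base import BaseEstimator, TransformerMixin

# assumed sentence-delimiter pattern; the real rx_periods is in the omitted source
rx_periods = re.compile(r'[.。]+')

class TextStats(BaseEstimator, TransformerMixin):
    """Ad hoc features: text length and number of sentences."""

    def fit(self, x, y=None):
        return self  # stateless: nothing to learn

    def transform(self, texts):
        return [{
            'length': len(text),
            'num_sentences': len([s for s in rx_periods.split(text) if len(s) > 0]),
        } for text in texts]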

04.8 Vector space model

Picture the two- or three-dimensional vector spaces of linear algebra.

The role of the classifier

For binary classification (deciding which of two classes an input belongs to) in, say, a three-dimensional vector space, the boundary used to make the decision is called the separating surface or decision boundary.

- Learning: the process of drawing a boundary in the vector space that satisfies the training data
- Prediction: the process of determining which side of the boundary a newly input feature vector falls on (see the sketch below)
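
A minimal sketch of this learning/prediction split in scikit-learn (the classifier choice and toy data are my own):

from sklearn.svm import SVC

# three-dimensional feature vectors with binary labels (toy data)
X_train = [[0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 0]]
y_train = [0, 1, 0, 1]

classifier = SVC()
classifier.fit(X_train, y_train)        # learning: draw the decision boundary

print(classifier.predict([[1, 1, 1]]))  # prediction: which side is this vector on?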

04.9 Application to dialogue agent

Additions / changes from the previous chapter

  1. Features: BoW → TF-IDF
  2. Add word N-gram (uni-gram, bi-gram, tri-gram)
~~

pipeline = Pipeline([
  # ('vectorizer', CountVectorizer(tokenizer=tokenizer)),  # ↓ replaced by
  ('vectorizer', TfidfVectorizer(
      tokenizer=tokenizer,
      ngram_range=(1, 3))),
~~

Execution result


# evaluate_dialogue_agent.py: fixed the name of the module it loads
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing_and_tfidf import DialogueAgent

$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.58510638

- Plain implementation (Step 01): 37.2%
- With preprocessing added (Step 02): 43.6%
- Preprocessing + revised feature extraction (Step 04): 58.5%
