This is a memo to myself as I read *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Step 04 of Chapter 2.
- Personal Mac: macOS Mojave 10.14.6
- docker: version 19.03.2 (both Client and Server)
The dialogue agents built so far used BoW for feature extraction. The aim here is to learn other feature extraction methods such as TF-IDF, BM25, and N-grams, and to convert strings into more appropriate feature vectors.
BoW vectorizes the frequency of occurrence of each word. Sentences expressing one's tastes share words such as "I" and "like", so their BoW vectors are similar, which lets BoW capture some of the semantic similarity between sentences.
On the other hand, BoW contains no word-order information, so it has both strengths and weaknesses. Sections 04.2 and 04.3 below introduce improvements that take into account word frequency across the whole corpus and sentence length, while 04.4 and 04.5 introduce improvements that change how sentences are split into tokens.
The name Bag of Words apparently comes from the image of breaking sentences into words, throwing them into a bag, and counting the words while ignoring their order.
When building a dictionary with CountVectorizer's `.fit` and vectorizing with `.transform`, you can pass different sentence sets to each: the set passed to `.fit` determines the vocabulary, and words outside that vocabulary are ignored when vectorizing.
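A minimal sketch of this fit/transform split (the English sentences here are my own stand-ins for the book's Japanese examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the dictionary from one set of sentences...
vectorizer = CountVectorizer()
vectorizer.fit(['I like ramen', 'I like sushi'])

# ...then vectorize a different sentence with that fixed vocabulary.
# A word absent from the dictionary ('curry') is simply ignored.
bow = vectorizer.transform(['I like curry'])
print(sorted(vectorizer.vocabulary_))  # ['like', 'ramen', 'sushi']
print(bow.toarray())                   # [[1 0 0]]
```

(Note that CountVectorizer's default tokenizer drops one-character tokens such as "I".)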
04.2 TF-IDF
- BoW's problem: words that characterize a sentence and words that do not are treated equally.
- TF-IDF's solution: reduce the contribution of words that do not characterize a sentence.
  - Words that appear widely across many sentences are general words, so they are not important for expressing the meaning of any particular sentence.
Use TfidfVectorizer instead of CountVectorizer (BoW).
The final value is the product of the term frequency TF (Term Frequency) and IDF (Inverse Document Frequency), the logarithm of the inverse of the document frequency.
TF-IDF(t, d) = TF(t, d) · IDF(t)
- TF: larger when the word appears frequently in the sentence
- IDF: smaller when the word appears in many sentences
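A minimal sketch of this effect with TfidfVectorizer (my own toy sentences): "like" appears in every sentence, so its IDF, and hence its final weight, is low, while "ramen" appears only once and characterizes the first sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'I like ramen',
    'I like sushi',
    'I like soba',
]

# 'like' occurs in all three sentences -> low IDF -> low weight;
# 'ramen' occurs in only one sentence -> high IDF -> high weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

vocab = vectorizer.vocabulary_
row0 = tfidf.toarray()[0]
print(row0[vocab['like']] < row0[vocab['ramen']])  # True
```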
04.3 BM25
This is a modification of TF-IDF that also takes sentence length into account.
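The book does not spell out the formula here; for reference, a minimal sketch of the standard Okapi BM25 term weight (the parameters `k1=1.2` and `b=0.75` are common defaults, my own choice, not values from the book):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: this document's length vs. the corpus average
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# The same term counts weigh less in a longer-than-average document.
short = bm25_weight(tf=2, df=3, n_docs=10, doc_len=5, avg_doc_len=10)
long_ = bm25_weight(tf=2, df=3, n_docs=10, doc_len=20, avg_doc_len=10)
print(short > long_)  # True
```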
04.4 word N-gram
So far, each word produced by word segmentation was treated as one dimension, so the methods can be called **word uni-gram** methods. Treating every sequence of two consecutive words as one dimension is called **word bi-gram**; grouping three words is **word tri-gram**, and in general these are **word N-grams**.
```
# Word segmentation result
Tokyo/from/Osaka/to/go

# uni-gram: 5 dimensions
1. Tokyo
2. from
3. Osaka
4. to
5. go

# bi-gram: 4 dimensions
1. Tokyo/from
2. from/Osaka
3. Osaka/to
4. to/go
```
Using word N-grams makes it possible to extract features that retain some of the word-order information that BoW ignores. On the other hand, **as N grows, the number of dimensions increases and the features become sparser, so generalization performance drops**. This trade-off needs to be considered when using N-grams.
Pass the argument `ngram_range=(minimum, maximum)` to the CountVectorizer or TfidfVectorizer constructor.
Given a minimum and a maximum, all N-grams within the specified range become part of the feature vector. (For example, a dictionary can be built from both uni-grams and bi-grams.)
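A minimal sketch with `ngram_range=(1, 2)`, using an English stand-in for the Tokyo/Osaka example above: the dictionary ends up with 5 uni-grams plus 4 bi-grams, 9 dimensions in total.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) puts both uni-grams and bi-grams in one dictionary.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['Tokyo from Osaka to go'])
print(sorted(vectorizer.vocabulary_))
# 5 uni-grams + 4 bi-grams = 9 dimensions, e.g. 'tokyo from', 'from osaka', ...
```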
04.5 character N-gram
The idea is to build a BoW whose vocabulary consists of sequences of N characters, rather than words.
It is robust to spelling variations, and since it performs no morphological analysis (word segmentation) in the first place, it also handles compound words and unknown words well. On the other hand, **its ability to distinguish strings that look similar but have different meanings may drop, and because Japanese has many kinds of characters, the number of dimensions tends to grow**.
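In scikit-learn this just means switching the analyzer unit from words to characters; a minimal sketch (my own toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' makes each dimension a character N-gram instead of a
# word, so no word segmentation (morphological analysis) is needed.
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit(['tokyo'])
print(sorted(vectorizer.vocabulary_))  # ['ky', 'ok', 'to', 'yo']
```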
Just as multiple word N-grams can be combined into a single feature vector, different kinds of features can be combined as well.
# When combining after computing each feature vector
bow1 = bow1_vectorizer.fit_transform(texts)
bow2 = bow2_vectorizer.fit_transform(texts)
feature = scipy.sparse.hstack((bow1, bow2))
# When using sklearn.pipeline.FeatureUnion
combined = FeatureUnion(
[
('bow', word_bow_vectorizer),
('char_bigram', char_bigram_vectorizer),
])
feature = combined.fit_transform(texts)
Points to note when combining features:

- The number of dimensions grows.
- Connecting features with different properties may reduce accuracy:
  - their value ranges can differ greatly
  - their sparseness can differ greatly
The following features can also be added.
- sentence length (example below)
- number of sentences when split at punctuation marks (example below)
- number of occurrences of a specific word
Check the intermediate results along the way.
test_sklearn_adhoc_union.py
### Main source omitted
import pprint
print('# num_sentences - \'Hello. Good evening.\':')
print([sent for sent in rx_periods.split(texts[0]) if len(sent) > 0])
print('\n# [{} for .. in ..]')
print([{text} for text in texts])
textStats = TextStats()
print('\n# TextStats.fit():' + str(type(textStats.fit(texts))))
fitTransformTextStats = textStats.fit_transform(texts)
print('\n# TextStats.fit_transform():'+ str(type(fitTransformTextStats)))
pprint.pprint(fitTransformTextStats)
dictVectorizer = DictVectorizer()
print('\n# DictVectorizer.fit():' + str(type(dictVectorizer.fit(fitTransformTextStats))))
fitTransformDictVectorizer = dictVectorizer.fit_transform(textStats.transform(texts))
print('\n# DictVectorizer.fit_transform():' + str(type(fitTransformDictVectorizer)))
pprint.pprint(fitTransformDictVectorizer.toarray())
countVectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
print('\n# CountVectorizer.fit():' + str(type(countVectorizer.fit(texts))))
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_sklearn_adhoc_union.py
# num_sentences - 'Hello. Good evening.':
['Hello', 'Good evening']
# [{} for .. in ..]
[{'Hello. Good evening.'}, {'I want to eat yakiniku'}]
# TextStats.fit():<class '__main__.TextStats'>
# TextStats.fit_transform():<class 'list'>
[{'length': 12, 'num_sentences': 2}, {'length': 7, 'num_sentences': 1}]
# DictVectorizer.fit():<class 'sklearn.feature_extraction.dict_vectorizer.DictVectorizer'>
# DictVectorizer.fit_transform():<class 'scipy.sparse.csr.csr_matrix'>
array([[12., 2.],
[ 7., 1.]])
# CountVectorizer.fit():<class 'sklearn.feature_extraction.text.CountVectorizer'>
Imagine the two- or three-dimensional vector spaces of linear algebra.
In binary classification over such a vector space (deciding which class a point belongs to), the boundary used at decision time is called the discriminant surface or decision boundary.
- Learning: drawing a boundary in the vector space that satisfies the training data
- Prediction: deciding which side of the boundary a newly input feature vector falls on
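A minimal sketch of these two steps with a linear classifier on 2-D vectors (the data and the choice of scikit-learn's SVC are my own, not from the book):

```python
from sklearn.svm import SVC

# Two-dimensional feature vectors with binary labels.
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

# Learning: fit a boundary that separates the two classes.
classifier = SVC(kernel='linear')
classifier.fit(X, y)

# Prediction: decide which side of the boundary new points fall on.
print(classifier.predict([[0, 0.5], [2, 2.5]]))  # [0 1]
```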
Additions / changes from the previous chapter
```python
pipeline = Pipeline([
    # ('vectorizer', CountVectorizer(tokenizer=tokenizer)),  # changed to:
    ('vectorizer', TfidfVectorizer(
        tokenizer=tokenizer,
        ngram_range=(1, 3))),
```
Execution result
# Fixed the name of the module loaded in evaluate_dialogue_agent.py
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing_and_tfidf import DialogueAgent
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.58510638
- Plain implementation (Step 01): 37.2%
- With preprocessing added (Step 02): 43.6%
- Preprocessing + changed feature extraction (Step 04): 58.5%