This is a memo to myself as I read *Introduction to Natural Language Processing Applications in 15 Steps*. This time I note my own takeaways from Step 04 of Chapter 2.
- Personal Mac: macOS Mojave 10.14.6
- docker: version 19.03.2 (both Client and Server)
The dialogue agents built so far used BoW for feature extraction. The aim here is to learn other feature extraction methods such as TF-IDF, BM25, and N-grams, and to convert strings into more appropriate feature vectors.
BoW vectorizes the frequency of occurrence of each word. Sentences expressing one's tastes share words such as "I" and "like", so their BoW vectors are similar, which lets BoW capture some of the semantic similarity between sentences.
On the other hand, BoW contains no word-order information, so it has both strengths and weaknesses. Sections 04.2 and 04.3 below introduce improvements that take into account word frequency across the whole corpus and sentence length, while 04.4 and 04.5 introduce improvements that change how sentences are split into tokens.
The name Bag of Words apparently comes from the image of breaking sentences into words, throwing them into a bag, and counting the words while ignoring their order.
When building a dictionary with CountVectorizer's `.fit` and vectorizing with `.transform`, you can pass different sentence sets to each: the set passed to `.fit` determines the vocabulary, and words outside that vocabulary are ignored when vectorizing.
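A minimal sketch of this fit/transform split (the English sentences here are my own stand-ins for the book's Japanese examples):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Build the dictionary from one set of sentences...
vectorizer = CountVectorizer()
vectorizer.fit(['I like ramen', 'I like sushi'])

# ...then vectorize a different sentence with that fixed vocabulary.
# A word absent from the dictionary ('curry') is simply ignored.
bow = vectorizer.transform(['I like curry'])
print(sorted(vectorizer.vocabulary_))  # ['like', 'ramen', 'sushi']
print(bow.toarray())                   # [[1 0 0]]
```

(Note that CountVectorizer's default tokenizer drops one-character tokens such as "I".)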
04.2 TF-IDF
- BoW's problem: words that characterize a sentence and words that do not are treated equally.
- TF-IDF's solution: reduce the contribution of words that do not characterize a sentence.
  - Words that appear widely across many sentences are general words, so they are not important for expressing the meaning of any particular sentence.
Use TfidfVectorizer instead of CountVectorizer (BoW).
The final value is the product of the term frequency TF (Term Frequency) and IDF (Inverse Document Frequency), the logarithm of the inverse of the document frequency.
TF-IDF(t, d) = TF(t, d) · IDF(t)
- TF: larger when the word appears frequently in the sentence
- IDF: smaller when the word appears in many sentences
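A minimal sketch of this effect with TfidfVectorizer (my own toy sentences): "like" appears in every sentence, so its IDF, and hence its final weight, is low, while "ramen" appears only once and characterizes the first sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    'I like ramen',
    'I like sushi',
    'I like soba',
]

# 'like' occurs in all three sentences -> low IDF -> low weight;
# 'ramen' occurs in only one sentence -> high IDF -> high weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)

vocab = vectorizer.vocabulary_
row0 = tfidf.toarray()[0]
print(row0[vocab['like']] < row0[vocab['ramen']])  # True
```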
04.3 BM25
This is a modification of TF-IDF that also takes sentence length into account.
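The book does not spell out the formula here; for reference, a minimal sketch of the standard Okapi BM25 term weight (the parameters `k1=1.2` and `b=0.75` are common defaults, my own choice, not values from the book):

```python
import math

def bm25_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document.

    tf: term frequency in the document
    df: number of documents containing the term
    doc_len / avg_doc_len: this document's length vs. the corpus average
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

# The same term counts weigh less in a longer-than-average document.
short = bm25_weight(tf=2, df=3, n_docs=10, doc_len=5, avg_doc_len=10)
long_ = bm25_weight(tf=2, df=3, n_docs=10, doc_len=20, avg_doc_len=10)
print(short > long_)  # True
```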
04.4 word N-gram
So far, each word produced by word segmentation was treated as one dimension, so the methods can be called **word uni-gram** methods. Treating every sequence of two consecutive words as one dimension is called **word bi-gram**; grouping three words is **word tri-gram**, and in general these are **word N-grams**.
```
# Word segmentation result
Tokyo/from/Osaka/to/go

# uni-gram: 5 dimensions
1. Tokyo
2. from
3. Osaka
4. to
5. go

# bi-gram: 4 dimensions
1. Tokyo/from
2. from/Osaka
3. Osaka/to
4. to/go
```
Using word N-grams makes it possible to extract features that retain some of the word-order information that BoW ignores. On the other hand, **as N grows, the number of dimensions increases and the features become sparser, so generalization performance drops**. This trade-off needs to be considered when using N-grams.
Pass the argument `ngram_range=(minimum, maximum)` to the CountVectorizer or TfidfVectorizer constructor.
Given a minimum and a maximum, all N-grams within the specified range become part of the feature vector. (For example, a dictionary can be built from both uni-grams and bi-grams.)
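A minimal sketch with `ngram_range=(1, 2)`, using an English stand-in for the Tokyo/Osaka example above: the dictionary ends up with 5 uni-grams plus 4 bi-grams, 9 dimensions in total.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) puts both uni-grams and bi-grams in one dictionary.
vectorizer = CountVectorizer(ngram_range=(1, 2))
vectorizer.fit(['Tokyo from Osaka to go'])
print(sorted(vectorizer.vocabulary_))
# 5 uni-grams + 4 bi-grams = 9 dimensions, e.g. 'tokyo from', 'from osaka', ...
```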
04.5 character N-gram
The idea is to build a BoW whose vocabulary consists of sequences of N characters, rather than words.
It is robust to spelling variations, and since it performs no morphological analysis (word segmentation) in the first place, it also handles compound words and unknown words well. On the other hand, **its ability to distinguish strings that look similar but have different meanings may drop, and because Japanese has many kinds of characters, the number of dimensions tends to grow**.
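In scikit-learn this just means switching the analyzer unit from words to characters; a minimal sketch (my own toy input):

```python
from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' makes each dimension a character N-gram instead of a
# word, so no word segmentation (morphological analysis) is needed.
vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
vectorizer.fit(['tokyo'])
print(sorted(vectorizer.vocabulary_))  # ['ky', 'ok', 'to', 'yo']
```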
Just as multiple word N-grams can be combined into a single feature vector, different kinds of features can be combined as well.
# When combining after computing each feature vector
bow1 = bow1_vectorizer.fit_transform(texts)
bow2 = bow2_vectorizer.fit_transform(texts)
feature = scipy.sparse.hstack((bow1, bow2))
# When using sklearn.pipeline.FeatureUnion
combined = FeatureUnion(
[
('bow', word_bow_vectorizer),
('char_bigram', char_bigram_vectorizer),
])
feature = combined.fit_transform(texts)
Points to note when combining features:

- The number of dimensions grows.
- Connecting features with different properties may reduce accuracy:
  - their value ranges can differ greatly
  - their sparseness can differ greatly
The following features can also be added.
- sentence length (example below)
- number of sentences when split at punctuation marks (example below)
- number of occurrences of a specific word
Check the intermediate results along the way.
test_sklearn_adhoc_union.py
### Main source omitted
import pprint
print('# num_sentences - \'Hello. Good evening.\':')
print([sent for sent in rx_periods.split(texts[0]) if len(sent) > 0])
print('\n# [{} for .. in ..]')
print([{text} for text in texts])
textStats = TextStats()
print('\n# TextStats.fit():' + str(type(textStats.fit(texts))))
fitTransformTextStats = textStats.fit_transform(texts)
print('\n# TextStats.fit_transform():'+ str(type(fitTransformTextStats)))
pprint.pprint(fitTransformTextStats)
dictVectorizer = DictVectorizer()
print('\n# DictVectorizer.fit():' + str(type(dictVectorizer.fit(fitTransformTextStats))))
fitTransformDictVectorizer = dictVectorizer.fit_transform(textStats.transform(texts))
print('\n# DictVectorizer.fit_transform():' + str(type(fitTransformDictVectorizer)))
pprint.pprint(fitTransformDictVectorizer.toarray())
countVectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
print('\n# CountVectorizer.fit():' + str(type(countVectorizer.fit(texts))))
Execution result
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python test_sklearn_adhoc_union.py
# num_sentences - 'Hello. Good evening.':
['Hello', 'Good evening']
# [{} for .. in ..]
[{'Hello. Good evening.'}, {'I want to eat yakiniku'}]
# TextStats.fit():<class '__main__.TextStats'>
# TextStats.fit_transform():<class 'list'>
[{'length': 12, 'num_sentences': 2}, {'length': 7, 'num_sentences': 1}]
# DictVectorizer.fit():<class 'sklearn.feature_extraction.dict_vectorizer.DictVectorizer'>
# DictVectorizer.fit_transform():<class 'scipy.sparse.csr.csr_matrix'>
array([[12., 2.],
[ 7., 1.]])
# CountVectorizer.fit():<class 'sklearn.feature_extraction.text.CountVectorizer'>
Imagine the two- or three-dimensional vector spaces of linear algebra.
In binary classification over such a vector space (deciding which class a point belongs to), the boundary used at decision time is called the discriminant surface or decision boundary.
- Learning: drawing a boundary in the vector space that satisfies the training data
- Prediction: deciding which side of the boundary a newly input feature vector falls on
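A minimal sketch of these two steps with a linear classifier on 2-D vectors (the data and the choice of scikit-learn's SVC are my own, not from the book):

```python
from sklearn.svm import SVC

# Two-dimensional feature vectors with binary labels.
X = [[0, 0], [0, 1], [2, 2], [2, 3]]
y = [0, 0, 1, 1]

# Learning: fit a boundary that separates the two classes.
classifier = SVC(kernel='linear')
classifier.fit(X, y)

# Prediction: decide which side of the boundary new points fall on.
print(classifier.predict([[0, 0.5], [2, 2.5]]))  # [0 1]
```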
Additions / changes from the previous chapter
```python
pipeline = Pipeline([
    # ('vectorizer', CountVectorizer(tokenizer=tokenizer)),  # changed to:
    ('vectorizer', TfidfVectorizer(
        tokenizer=tokenizer,
        ngram_range=(1, 3))),
```
Execution result
# Fixed the name of the module loaded in evaluate_dialogue_agent.py
from dialogue_agent import DialogueAgent
↓
from dialogue_agent_with_preprocessing_and_tfidf import DialogueAgent
$ docker run -it -v $(pwd):/usr/src/app/ 15step:latest python evaluate_dialogue_agent.py
0.58510638
- Plain implementation (Step 01): 37.2%
- With preprocessing added (Step 02): 43.6%
- Preprocessing + changed feature extraction (Step 04): 58.5%