[PYTHON] Text data preprocessing (vectorization, TF-IDF)

Text vectorization

To apply a machine learning algorithm to text data, it is necessary to convert raw data, which is a list of words, into a numerical feature vector.

Bag-of-words: The idea of ignoring grammar and word order and treating sentences as a set of words.
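For instance, a bag-of-words representation can be built with nothing more than collections.Counter (a small illustration, not part of the gensim tutorial below):

from collections import Counter

# two sentences with the same words in a different order
bow_a = Counter("the cat chased the dog".lower().split())
bow_b = Counter("the dog chased the cat".lower().split())

print(bow_a)           # Counter({'the': 2, 'cat': 1, 'chased': 1, 'dog': 1})
print(bow_a == bow_b)  # True -- only the counts matter, word order is ignored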

Here, gensim, a library for topic modeling, is used. For practice, we use the short nine-document corpus described in the official tutorial.

from gensim import corpora

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Common words such as "a", "the", "for", "of", and "and" carry no meaning in bag-of-words, so they are excluded as stop words. Also, to remove the effect of spelling variants and of capitalization at the beginning of sentences, all letters are converted to lowercase.

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

Words that appear only once in the corpus do not have much information and are excluded.

# remove words that appear only once
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

Here are the results.

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Up to this point, the text has simply been split on whitespace and tokenized. Next, assign an id to each word and count the number of occurrences. The gensim.corpora.dictionary.Dictionary class represents the mapping between tokens and ids.

dictionary = corpora.Dictionary(texts)

print(dictionary)
print(dictionary.token2id)
Dictionary(12 unique tokens: ['human', 'interface', 'computer', 'survey', 'user']...)
{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 
'user': 4, 'system': 5, 'response': 6, 'time': 7, 
'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

There are 12 unique words in the corpus. Each sentence is represented by a 12-dimensional vector containing the count of each word. Use doc2bow to vectorize new text.

new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (2, 1)]

The result is returned in the form [(word_id, word_count), ...]. In this case, "human" with id = 0 and "computer" with id = 2 each occur once. Words with zero occurrences are skipped. Also, "interaction" is not included in the dictionary and is ignored.
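For readability, the ids can be mapped back to tokens with the dictionary (a quick sketch, not part of the original tutorial):

# look up each id in the dictionary to see which token it stands for
print([(dictionary[word_id], count) for word_id, count in new_vec])
# [('human', 1), ('computer', 1)]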

Vectorize the original text.

corpus = [dictionary.doc2bow(text) for text in texts]
for c in corpus:
    print(c)
[(0, 1), (1, 1), (2, 1)]
[(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(1, 1), (4, 1), (5, 1), (8, 1)]
[(0, 1), (5, 2), (8, 1)]
[(4, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(3, 1), (10, 1), (11, 1)]
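Each vector above is sparse (only words with nonzero counts are stored). If a dense 12-dimensional matrix is needed, for example to feed scikit-learn, gensim provides matutils.corpus2dense (a minimal sketch):

from gensim import matutils

# convert the sparse BoW corpus into a dense (documents x 12 words) count matrix
dense = matutils.corpus2dense(corpus, num_terms=len(dictionary)).T
print(dense.shape)  # (9, 12)
print(dense[0])     # word counts of the first document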

TF-IDF

TF (term frequency): how often a word occurs within a document
IDF (inverse document frequency): the inverse of the document frequency, i.e. a weight that grows as fewer documents contain the word

Words that appear extremely often in the corpus may not carry much useful information. Feeding raw word counts (TF) directly to a classifier lets rarer but more meaningful words get buried under these very frequent ones. Therefore, in addition to TF, the features are weighted by IDF.

IDF is defined as follows, where $n_d$ is the total number of documents and $\mathrm{df}(d,t)$ is the number of documents containing the word $t$. The fewer documents contain the word $t$, the larger the IDF (rare words receive a larger weight).

$$\mathrm{idf}(t) = \log\frac{1 + n_d}{1 + \mathrm{df}(d,t)} + 1$$
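To see how the weighting behaves, the formula can be evaluated by hand for this corpus (a small check; this only illustrates the formula above, and gensim's TfidfModel applies its own default IDF weighting internally):

import math

n_d = 9  # total number of documents in the corpus
# 'system' appears in 3 documents, 'minors' in 2
for word, df in [("system", 3), ("minors", 2)]:
    idf = math.log((1 + n_d) / (1 + df)) + 1
    print(word, idf)
# the rarer word 'minors' receives the larger IDF weight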

Transform the corpus with TF-IDF using gensim.

from gensim import models
# step 1 -- initialize a model
tfidf = models.TfidfModel(corpus)

# step 2 -- use the model to transform vectors
corpus_tfidf = tfidf[corpus]
for doc in corpus_tfidf:
    print(doc)
[(1, 0.5710059809418182), (4, 0.4170757362022777), (5, 0.4170757362022777), (8, 0.5710059809418182)]
[(0, 0.49182558987264147), (5, 0.7184811607083769), (8, 0.49182558987264147)]
[(4, 0.45889394536615247), (6, 0.6282580468670046), (7, 0.6282580468670046)]
[(9, 1.0)]
[(9, 0.7071067811865475), (10, 0.7071067811865475)]
[(9, 0.5080429008916749), (10, 0.5080429008916749), (11, 0.695546419520037)]
[(3, 0.6282580468670046), (10, 0.45889394536615247), (11, 0.6282580468670046)]

Here, the corpus that was used to fit the model was itself transformed by TF-IDF, but of course any document vector can be transformed (as long as it belongs to the same vector space).
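For example, the new_vec created earlier with doc2bow can be passed through the trained model in the same way (a short sketch):

# transform the BoW vector of the unseen document with the trained TF-IDF model
print(tfidf[new_vec])
# returns [(word_id, tfidf_weight), ...] in the same vector space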
