**This post is the Day 4 article of Escapism Advent Calendar 2013.**
This post uses the HTML fetched with the Bing API in the Day 2 article, so it will be easier to follow if you read Day 2 (/items/62291ba328de9d12bd30) first.
- Looked into a Python library called scikit-learn.
- Calculated the tf-idf of the words in the HTML saved on Day 2.
- Checked how words map to their tf-idf values.
scikit-learn official text feature extraction documentation
Calculating the tf-idf of words in tweets with scikit-learn
https://github.com/katryo/tfidf_with_sklearn
Fork me!
- tf-idf is the value of **tf * idf**. It is assigned to a word of a document within a document set. Words with a high tf-idf can be considered important, and the value can be used to weight words in information retrieval.
- tf (Term Frequency) is **the number of occurrences of the word (term) in the document / the total number of words in the document**. It grows when the word is used many times in the document.
- idf (Inverse Document Frequency) is the reciprocal of df. In practice the logarithm is taken to make it easier to handle, so it becomes **log(1 / df)**. The base of the log is usually 2, but e or 10 are also used.
- df (Document Frequency) is **the number of documents in which the word appears / the total number of documents**. It grows when the word is used across a wide range of topics, so Japanese particles like "は" and "を", and English words like "is" and "that", have very large df. It is a value given to a word within a document set.
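To make the definitions concrete, here is a minimal sketch (my own toy example, not part of the code below) that computes tf-idf by hand following the definitions above. Note that scikit-learn's own formula adds smoothing and normalization, so its numbers will differ.

```python
import math

# Toy document set: each document is a list of tokens (made-up example data)
docs = [
    ['stomach', 'pain', 'stomach', 'medicine'],
    ['stomach', 'cancer', 'surgery'],
    ['hay', 'fever', 'medicine', 'mask'],
]

def tf(term, doc):
    # occurrences of the term in the document / total number of words in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # df = documents containing the term / total number of documents; idf = log2(1 / df)
    df = sum(1 for d in docs if term in d) / len(docs)
    return math.log2(1 / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf('stomach', docs[0], docs))   # 0.5  * log2(3/2) ≈ 0.29
print(tfidf('medicine', docs[0], docs))  # 0.25 * log2(3/2) ≈ 0.15
print(tfidf('pain', docs[0], docs))      # 0.25 * log2(3/1) ≈ 0.40
```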
Below, I will explain while partially translating the scikit-learn official text feature extraction documentation and adding my own understanding.
When extracting features from text with scikit-learn, three processes are required.
- tokenizing: Converting text into a bag-of-words. For English it is enough to split on whitespace and then remove noise such as symbols, but for Japanese you need a morphological analyzer such as MeCab or KyTea. Since scikit-learn does not include a Japanese morphological analyzer, this step has to be handled separately.
- counting: Counting the frequency of occurrence of each word in each individual document.
- normalizing and weighting: Calculating tf-idf from the word frequencies, the number of words per document, and the number of documents, and converting it into a value that is easy to use.
In scikit-learn, these three steps together are called **vectorization**. The TfidfVectorizer described later can do all three steps at once, and if you have already completed part of the procedure, you can also start from the middle.
Incidentally, scikit-learn can compute tf-idf not only over bag-of-words but also over n-grams of two or more consecutive words, but I will not do that this time.
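For reference, a minimal sketch of what that would look like; ngram_range is a standard TfidfVectorizer parameter (note that it only applies to the built-in analyzers, not to a custom analyzer function like the one used later):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf over unigrams and bigrams of whitespace-separated English text
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x = vectorizer.fit_transform(['the stomach hurts', 'the stomach medicine works'])
print(vectorizer.get_feature_names())  # includes bigrams such as 'the stomach'
```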
**CountVectorizer** CountVectorizer in sklearn.feature_extraction.text handles tokenizing and counting. Since the result of counting is represented as a vector, it is called a Vectorizer.
The official documentation explains it here.
**TfidfTransformer** TfidfTransformer, also in sklearn.feature_extraction.text, is responsible for normalizing and weighting. Its fit_transform method calculates tf-idf from just the per-document word occurrence counts and even normalizes the result. See here in the official documentation.
**TfidfVectorizer** Combines the functionality of CountVectorizer and TfidfTransformer. The trinity, the combined form. This is convenient when extracting features directly from raw text.
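A minimal sketch (toy data of my own, not from the repository) showing that running CountVectorizer and then TfidfTransformer gives the same result as TfidfVectorizer alone:

```python
import numpy as np
from sklearn.feature_extraction.text import (CountVectorizer, TfidfTransformer,
                                             TfidfVectorizer)

corpus = ['the stomach hurts', 'the stomach medicine works well']

# Two steps: counting, then normalizing and weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One step: straight from raw text
tfidf_one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True
```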
Now for the main subject. There are 36,934 words in the 400 web pages retrieved with the 8 queries. From these, I will print the words whose tf-idf in a given document is greater than 0.1.
First of all, calculating tf-idf is fairly expensive, so let's compute it once and pickle the result.
set_tfidf_with_sklearn_to_fetched_pages.py
import utils
import constants
import pickle
import os
from sklearn.feature_extraction.text import TfidfVectorizer
def is_bigger_than_min_tfidf(term, terms, tfidfs):
    '''
    Used as [term for term in terms if is_bigger_than_min_tfidf(term, terms, tfidfs)].
    Looks up the tf-idf value of each listed term in order and
    returns True if it is greater than MIN_TFIDF.
    '''
    if tfidfs[terms.index(term)] > constants.MIN_TFIDF:
        return True
    return False


def tfidf(pages):
    # analyzer is a function that takes a string and returns a list of strings
    vectorizer = TfidfVectorizer(analyzer=utils.stems, min_df=1, max_df=50)
    corpus = [page.text for page in pages]
    x = vectorizer.fit_transform(corpus)

    # Everything below has nothing to do with the return value.
    # I just wanted to see what the high-tfidf words look like.
    terms = vectorizer.get_feature_names()
    tfidfs = x.toarray()[constants.DOC_NUM]
    print([term for term in terms if is_bigger_than_min_tfidf(term, terms, tfidfs)])
    print('A total of %i words were found on %i pages.' % (len(terms), len(pages)))
    return x, vectorizer  # x is received in main as tfidf_result


if __name__ == '__main__':
    utils.go_to_fetched_pages_dir()
    pages = utils.load_all_html_files()  # load the saved HTML files and set the tag-stripped text on each page
    tfidf_result, vectorizer = tfidf(pages)  # tfidf_result is the x of the tfidf function
    pkl_tfidf_result_path = os.path.join('..', constants.TFIDF_RESULT_PKL_FILENAME)
    pkl_tfidf_vectorizer_path = os.path.join('..', constants.TFIDF_VECTORIZER_PKL_FILENAME)
    with open(pkl_tfidf_result_path, 'wb') as f:
        pickle.dump(tfidf_result, f)
    with open(pkl_tfidf_vectorizer_path, 'wb') as f:
        pickle.dump(vectorizer, f)
Inside the tfidf function, the vectorizer is created like this:
vectorizer = TfidfVectorizer(analyzer=utils.stems, min_df=1, max_df=50)
The analyzer argument takes a function that, given a string, returns a list of strings. By default the text is split on whitespace and single-character tokens such as symbols are dropped, but for Japanese you have to build and set such a function yourself using a morphological analyzer. The utils.stems function analyzes the text morphologically with MeCab, converts each word to its stem, and returns the stems as a list. It is defined in utils.py, which is shown later.
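In other words, the contract is just "string in, list of strings out". A minimal sketch of a stand-in analyzer (a toy example that splits on whitespace; the real one uses MeCab, as shown in utils.py below):

```python
def toy_analyzer(text):
    # Takes a string and returns a list of token strings;
    # anything with this signature can be passed as analyzer=
    return [token.lower() for token in text.split()]

# vectorizer = TfidfVectorizer(analyzer=toy_analyzer, min_df=1, max_df=50)
```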
What the tfidf function prints is the words whose tf-idf value exceeds 0.1 among the words appearing in one of the result pages retrieved for the query "upset stomach". The result is shown later.
The utils module that appears in the code is shown below; it is a collection of helper functions that are handy in various situations.
utils.py
import MeCab
import constants
import os
import pdb
from web_page import WebPage
def _split_to_words(text, to_stem=False):
    """
    input: 'All to myself'
    output: tuple(['all', 'myself', 'of', 'How', 'What'])
    """
    tagger = MeCab.Tagger('mecabrc')  # you can use another Tagger
    mecab_result = tagger.parse(text)
    info_of_words = mecab_result.split('\n')
    words = []
    for info in info_of_words:
        # When parsed by MeCab, the output ends with an empty string '', preceded by 'EOS'
        if info == 'EOS' or info == '':
            break
        # info => 'Nana\t particle,Final particle,*,*,*,*,Nana,Na,Na'
        info_elems = info.split(',')
        # The 6th element (index 6) holds the base form. If it is '*', use the 0th element instead.
        if info_elems[6] == '*':
            # info_elems[0] => 'Van Rossum\t noun'
            words.append(info_elems[0][:-3])
            continue
        if to_stem:
            # convert to the stem (base form)
            words.append(info_elems[6])
            continue
        # use the word as it appears
        words.append(info_elems[0][:-3])
    return words


def words(text):
    words = _split_to_words(text=text, to_stem=False)
    return words


def stems(text):
    stems = _split_to_words(text=text, to_stem=True)
    return stems


def load_all_html_files():
    pages = []
    for query in constants.QUERIES:
        pages.extend(load_html_files_with_query(query))
    return pages


def load_html_files_with_query(query):
    pages = []
    for i in range(constants.NUM_OF_FETCHED_PAGES):
        with open('%s_%s.html' % (query, str(i)), 'r') as f:
            page = WebPage()
            page.html_body = f.read()
            page.remove_html_tags()
            pages.append(page)
    return pages


def load_html_files():
    """
    Use on the assumption that the HTML files are in the current directory
    """
    pages = load_html_files_with_query(constants.QUERY)
    return pages


def go_to_fetched_pages_dir():
    if not os.path.exists(constants.FETCHED_PAGES_DIR_NAME):
        os.mkdir(constants.FETCHED_PAGES_DIR_NAME)
    os.chdir(constants.FETCHED_PAGES_DIR_NAME)
And the constants are as follows.
constants.py
FETCHED_PAGES_DIR_NAME = 'fetched_pages'
QUERIES = 'Stomach leaning Caries pollinosis measures Depression Machine fracture Stiff shoulder Documents'.split(' ')
NUM_OF_FETCHED_PAGES = 50
NB_PKL_FILENAME = 'naive_bayes_classifier.pkl'
DOC_NUM = 0
MIN_TFIDF = 0.1
TFIDF_RESULT_PKL_FILENAME = 'tfidf_result.pkl'
TFIDF_VECTORIZER_PKL_FILENAME = 'tfidf_vectorizer.pkl'
If you look at the order of QUERIES, you can see that the "upset stomach" query comes first. The DOC_NUM constant was created for this experiment and is used to specify the 0th file in the "upset stomach" category, that is, the file "upset stomach_0.html".
Now, let's run this code.
$ python set_tfidf_with_sklearn_to_fetched_pages.py
Even with scikit-learn, calculating tf-idf takes time; it took 25.81 seconds in my environment. Here is the result.
['gaJsHost', 'https', 'Dripping', 'Burn', 'Aerophagia', 'Hyperacidity', 'breast', 'cooking', 'Foodstuff', 'Hiatal hernia']
A total of 36934 words were found on 400 pages.
These words do feel related to an upset stomach. Among the words in upset stomach_0.html, these 10 words were found to have a tf-idf exceeding 0.1.
gaJsHost and https appear to come from JavaScript code on the page. Hmm. I would like to get rid of this kind of noise, but I cannot think of a good general way. It might simply be better to drop words made up only of alphabet characters.
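For example, one possible approach (my own idea, not in the repository) would be to wrap utils.stems and drop alphabet-only tokens:

```python
import re

ALPHABET_ONLY = re.compile(r'^[a-zA-Z]+$')

def stems_without_ascii_words(text):
    # Same as stems(), but drops tokens made up only of ASCII letters,
    # which would filter out noise like 'gaJsHost' and 'https'
    return [w for w in stems(text) if not ALPHABET_ONLY.match(w)]
```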
By the way, words like "hiatal hernia" are not included in MeCab's IPADIC (see this article for the origin of IPADIC), so you need to strengthen the dictionary by adding terms from Wikipedia and Hatena Keyword. Please google how to do that.
As the official page explains, the tf-idf calculation result is returned as a scipy csr_matrix. This is a sparse (mostly zero) matrix that holds, for each document, the tf-idf of each word as a decimal between 0 and 1.
(Pdb) type(x)
<class 'scipy.sparse.csr.csr_matrix'>
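As a quick check of just how sparse it is (shape and nnz are standard csr_matrix attributes; the shape comes from this experiment's 400 pages and 36,934 words):

```python
print(x.shape)   # (400, 36934): one row per document, one column per word
print(x.nnz)     # number of non-zero entries
print(x.nnz / (x.shape[0] * x.shape[1]))  # density: fraction of entries that are non-zero
```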
I did not know how the tf-idf values were mapped to words (I found out later), so I ran a simple experiment using pdb.set_trace().
The methods used are get_feature_names() and inverse_transform() on the TfidfVectorizer, and toarray() on the scipy.sparse.csr_matrix.
First, I checked the WebPage with document number 0: it was a page called upset-stomach.com. Let's find out how the words appearing on this page are represented.
After pickling the tf-idf calculation result, I ran the following code.
play_with_tfidf.py
# -*- coding: utf-8 -*-
import pickle
import constants
import pdb
def is_bigger_than_min_tfidf(term, terms, tfidfs):
    '''
    Used as [term for term in terms if is_bigger_than_min_tfidf(term, terms, tfidfs)].
    Looks up the tf-idf value of each listed term in order and
    returns True if it is greater than MIN_TFIDF.
    '''
    if tfidfs[terms.index(term)] > constants.MIN_TFIDF:
        return True
    return False


if __name__ == '__main__':
    with open(constants.TFIDF_VECTORIZER_PKL_FILENAME, 'rb') as f:
        vectorizer = pickle.load(f)
    with open(constants.TFIDF_RESULT_PKL_FILENAME, 'rb') as f:
        x = pickle.load(f)

    pdb.set_trace()

    terms = vectorizer.get_feature_names()
    for i in range(3):
        tfidfs = x.toarray()[i]
        print([term for term in terms if is_bigger_than_min_tfidf(term, terms, tfidfs)])
pdb.set_trace() creates a breakpoint, and from there you can inspect values in an interactive session, so all kinds of checks can be done.
(Pdb) vectorizer.inverse_transform(x)[0]
> array(['Hiatal hernia', 'Foodstuff', 'diet remedy', 'Operation', 'Reflux esophagitis', 'cooking', 'breast', 'Hyperacidity', 'Stomach pain',
'Gastric ulcer', 'Gastroptosis', 'Gastric cancer', 'Aerophagia', 'Chinese herbal medicine', 'Construction', 'Chronic gastritis', 'Duodenal ulcer', 'medical insurance',
'Disclaimer', 'Corporate information', 'polyp', 'pot', 'care', 'American family life insurance company', 'Aflac', 'go',
'Burn', 'Regarding', 'Dripping', 'unescape', 'try', 'ssl', 'protocol',
'javascript', 'inquiry', 'https', 'gaJsHost', 'ga', 'err',
'comCopyright', 'analytics', 'Inc', 'Cscript', 'CROSSFINITY',
'=\'"', "='", ':"', '.")', '."', ')\u3000', '(("', '("%', "'%",
'"))'],
dtype='<U26')
The term "hiatal hernia" is rare and seems to rarely appear on other pages, so I decided to use it as a marker.
(Pdb) vectorizer.get_feature_names().index('Hiatal hernia')
36097
It turned out to be word number 36097. So what is the tf-idf value of the 36097th word in the 0th document (i.e., upset-stomach.com)?
(Pdb) x.toarray()[0][36097]
0.10163697033184078
Quite high. In document number 0, the word with index 36097 has a tf-idf of 0.10163697033184078. It is unlikely that such a high (and, above all, non-zero) tf-idf value would show up at index 36097 by coincidence: x.toarray() is a very sparse matrix and most of its elements should be 0. We can therefore conclude that the word order returned by vectorizer.get_feature_names() and the order of the tf-idf values returned by x.toarray() are the same.
In this way, I confirmed that the word list keeps the same order. I believe the official documentation also states somewhere that the order of the words is preserved.
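Incidentally, instead of searching with terms.index(), the vectorizer's vocabulary_ attribute (a standard CountVectorizer/TfidfVectorizer attribute mapping each term to its column index) gives the same information directly:

```python
col = vectorizer.vocabulary_['Hiatal hernia']  # the marker word used above
print(col)                   # 36097
print(x.toarray()[0][col])   # 0.10163697033184078
print(x[0, col])             # same value, without densifying the whole matrix
```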
After that, I removed pdb.set_trace() and ran play_with_tfidf.py again.
['gaJsHost', 'https', 'Dripping', 'Burn', 'Aerophagia', 'Hyperacidity', 'breast', 'cooking', 'Foodstuff', 'Hiatal hernia']
['Dripping', 'Disgusting', 'Burn', 'Stomach pain', 'breast', 'Pass']
['TVCM', 'Gusuru', 'Dripping', 'Drinking', 'もDripping', 'Burn', 'Burnる', 'Ri', 'action', 'Science', 'Sacron', 'Cerbere', 'triple', 'Veil', 'hangover', 'Weak', 'Arrange', 'mucus', 'Stomach pain', 'Stomach medicine', 'breast', 'Fullness']
These words have high tf-idf values (0.1 seems to be quite high), and they should be useful as features when computing the similarity between documents and the upset stomach category.
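For example, a minimal sketch (not part of the original code) of computing document-to-document similarity directly from the tf-idf matrix with scikit-learn's cosine_similarity, which accepts the sparse matrix as-is:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Similarity of document 0 (upset stomach_0.html) to all 400 documents
similarities = cosine_similarity(x[0], x)
print(similarities.shape)    # (1, 400)
print(similarities[0][:5])   # similarity of document 0 to the first five documents
```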
scikit-learn is convenient.
I posted the code on Github.
https://github.com/katryo/tfidf_with_sklearn
Next, I want to implement the tf-idf calculation myself and compare it with scikit-learn's.