[PYTHON] Select features using text data

What to introduce in this article

Is it necessary to select features?

If you are reading this, you have probably already tried some kind of machine learning on feature values (hereinafter "features") extracted from text data, for example document classification.

A quick search on Qiita turns up several "I tried it" articles, such as "I tried to automatically classify Morning Musume blogs" and "Natural language processing with R: attempting document classification with Naive Bayes".

In document classification, the basic approach is to build a matrix that uses words as features. This is called the frequency matrix.

Now, one question comes up here: __There are many words that have nothing to do with the classification. Is that okay?__

It's a good question. It's not okay. If there are many features unrelated to the classification, they act as noise, and noise gets in the way of improving classification performance. That's a problem.

This is where the idea of __"just keep the relevant features"__ comes in. Yes, this is what is called feature selection.

Feature selection has two benefits.

  1. It improves model performance, provided you do it properly before rushing into a machine learning algorithm (some algorithms, such as Random Forest, include feature selection internally, but that is another story).
  2. It makes the data easier to observe.

I made a package that makes it easy to select features.

Doing feature selection seriously is quite a hassle, so I turned the feature selection methods into a package.

It works with Python 3.x. Python 2.x support is planned soon.

Supported methods

Package features

(Probably) fast

All internal processing uses scipy sparse matrices. In addition, every part that can be parallelized runs with multiprocessing, so it is reasonably fast.

It takes care of the preprocessing for you

If you pass in a dict of documents that are already split into morphemes, it will even build the sparse matrix for you.

For example, the input dict looks like this:


input_dict = {
    "label_a": [
        ["I", "aa", "aa", "aa", "aa", "aa"],
        ["bb", "aa", "aa", "aa", "aa", "aa"],
        ["I", "aa", "hero", "some", "ok", "aa"]
    ],
    "label_b": [
        ["bb", "bb", "bb"],
        ["bb", "bb", "bb"],
        ["hero", "ok", "bb"],
        ["hero", "cc", "bb"],
    ],
    "label_c": [
        ["cc", "cc", "cc"],
        ["cc", "cc", "bb"],
        ["xx", "xx", "cc"],
        ["aa", "xx", "cc"],
    ]
}
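
As a rough illustration of what "building a matrix" from such a dict involves, the sketch below counts, for each label, how many documents contain each word. It is not the package's code: it uses plain Python dicts instead of scipy sparse matrices, and the name `build_doc_frequency` is hypothetical.

```python
from collections import Counter


def build_doc_frequency(labeled_docs):
    """Return (vocabulary, {label: Counter(word -> number of docs containing it)})."""
    vocab = sorted({w for docs in labeled_docs.values() for doc in docs for w in doc})
    doc_freq = {}
    for label, docs in labeled_docs.items():
        counter = Counter()
        for doc in docs:
            # set(doc): count each word at most once per document
            counter.update(set(doc))
        doc_freq[label] = counter
    return vocab, doc_freq


# A cut-down version of the input dict, used only for illustration
sample = {
    "label_a": [["I", "aa", "aa"], ["bb", "aa", "aa"]],
    "label_b": [["bb", "bb", "bb"], ["hero", "ok", "bb"]],
}
vocab, doc_freq = build_doc_frequency(sample)
# Dense label-by-word rows; a missing word in a Counter reads as 0
matrix = [[doc_freq[label][w] for w in vocab] for label in sample]
```

With a realistic vocabulary most entries of such a matrix are zero, which is why a sparse representation is a natural fit.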

Let's try it out

Since I went to the trouble of building it, let's give it a try. I put the IPython notebook I used in a Gist.

The IPython notebook uses scipy, a morphological analysis wrapper package, and the feature selection package (version 0.9).

I prepared text in 5 genres. I collected text that seemed suitable from the web and copied it. (~~This is collective intelligence~~)

There are 5 genres. [^1]

I tried PMI and SOA.
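
For readers who have not met PMI before, here is a minimal sketch of the textbook definition, PMI(word, label) = log2(P(word, label) / (P(word) P(label))), estimated from document counts. The function name and the counts are hypothetical, and the package may weight or smooth its scores differently, so these numbers are not meant to match the results shown in this article.

```python
import math


def pmi(n_word_label, n_word, n_label, n_total):
    """Textbook PMI from document counts: log2(P(w, c) / (P(w) * P(c)))."""
    if n_word_label == 0:
        return float("-inf")  # the word never occurs with the label
    p_joint = n_word_label / n_total
    p_word = n_word / n_total
    p_label = n_label / n_total
    return math.log2(p_joint / (p_word * p_label))


# Hypothetical counts: 8 documents in total, 4 of them carrying the label.
# A word found only in the label's documents gets a positive score.
score_related = pmi(n_word_label=4, n_word=4, n_label=4, n_total=8)
# A word spread evenly across labels scores zero.
score_independent = pmi(n_word_label=2, n_word=4, n_label=4, n_total=8)
```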

Here are some excerpts from the results.

PMI results

Here are the results in descending order of score.

{'label': 'iranian_cities', 'score': 0.67106056632551592, 'word': 'population'},
{'label': 'conan_movies', 'score': 0.34710665998172219, 'word': 'Appearance'},
 {'label': 'av_actress', 'score': 0.30496452198069324, 'word': 'AV actress'},
 {'label': 'av_actress', 'score': 0.26339266409673928, 'word': 'Appearance'},
{'label': 'av_actress', 'score': 0.2313987055319647, 'word': 'Female'},

The list is full of words that make you say, "Well, yeah, of course."

Words that are obviously related to a label get a high weight, so as feature selection this counts as a success.

From the standpoint of observing the data, though, nothing especially insightful stands out.

Conversely, what about the entries with low scores?
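
To connect these scores back to feature selection, the sketch below keeps only the words whose score clears a threshold and then filters a tokenized document with them. The record layout follows the output shown above; the function names, the rounded scores, and the threshold value are assumptions made for illustration.

```python
def select_features(score_records, threshold):
    """Return the set of words that have at least one score >= threshold."""
    return {r["word"] for r in score_records if r["score"] >= threshold}


def filter_document(tokens, selected_words):
    """Drop tokens that were not selected as features."""
    return [t for t in tokens if t in selected_words]


records = [
    {"label": "iranian_cities", "score": 0.67, "word": "population"},
    {"label": "av_actress", "score": 0.23, "word": "Female"},
    {"label": "conan_movies", "score": 5.7e-06, "word": "To"},
]
# The threshold is arbitrary here; in practice it would be tuned
selected = select_features(records, threshold=0.1)
filtered = filter_document(["population", "To", "Female", "3"], selected)
```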

 {'label': 'av_actress', 'score': 5.7340738217327128e-06, 'word': 'Man'},
 {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': '3'},
 {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'To'},
 {'label': 'conan_movies', 'score': 5.7340738217327128e-06, 'word': 'Notation'},
 {'label': 'terror', 'score': 5.7340738217327128e-06, 'word': 'Mold'}

Hmm? The results here are a mixed bag. They look like words that play a functional role in the documents. The number "3" is mixed in, which is a morphological analysis error (this often happens when using MeCab's NEologd dictionary).

Function-like words ended up with low scores. In that respect, it seems to be working.

SOA results

The order has changed slightly. The results often look similar (probably) because the SOA formula is built on PMI.
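
To make the relationship to PMI concrete, here is a sketch that assumes the commonly cited definition SOA(word, label) = PMI(word, label) - PMI(word, not label). With document counts, the two logarithms collapse into a single ratio. The function name and the counts are hypothetical, and the package's implementation may differ, for example in how it handles zero counts.

```python
import math


def soa(n_word_label, n_word, n_label, n_total):
    """SOA(w, c) = PMI(w, c) - PMI(w, not c), simplified to one log2 ratio."""
    n_word_other = n_word - n_word_label  # docs with the word outside the label
    n_other = n_total - n_label           # docs outside the label
    if n_word_label == 0 or n_word_other == 0:
        raise ValueError("undefined without smoothing when a count is zero")
    return math.log2((n_word_label * n_other) / (n_label * n_word_other))


# Hypothetical counts: 12 documents, 4 with the label, a word seen in 5 documents.
# Word concentrated in the label: 4 of its 5 documents carry the label.
score_strong = soa(n_word_label=4, n_word=5, n_label=4, n_total=12)
# Word seen mostly elsewhere: only 1 of its 5 documents carries the label.
score_weak = soa(n_word_label=1, n_word=5, n_label=4, n_total=12)
```

Unlike plain PMI, this score goes negative when a word is seen mostly outside the label.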

[{'label': 'conan_movies', 'score': 5.3625700793847084, 'word': 'Appearance'},
 {'label': 'iranian_cities', 'score': 5.1604646721932461, 'word': 'population'},
 {'label': 'av_actress', 'score': 5.1395513523987937, 'word': 'AV actress'},
 {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Sa'},
 {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Hmm'},
 {'label': 'av_actress', 'score': 4.8765169465650002, 'word': 'Female'},
 {'label': 'terror', 'score': 4.8765169465650002, 'word': 'Syria'},

Now, let's look at the low-scoring part. A low SOA score can be interpreted as "irrelevant to the label".

{'label': 'terror', 'score': -1.4454111483223628, 'word': 'population'},
 {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'By the way'},
 {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'thing'},
 {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'During ~'},
 {'label': 'iranian_cities', 'score': -1.6468902498643583, 'word': 'Manufacturing'},
 {'label': 'iranian_cities', 'score': -2.009460329249066, 'word': 'thing'},
 {'label': 'airplane', 'score': -3.3923174227787602, 'word': 'Man'}]

Somehow, this doesn't quite click.

If you look at the frequencies in the documents, these words appear only once or twice. In other words, their relationship with the label is weak, so it is reasonable for the negative values to become large.

Summary

In this article, I talked about feature selection and a package that makes feature selection easy.

This time, I did not check document classification performance after selecting features.

However, these methods have proven sufficiently effective in previous studies. Please use them for document classification tasks.

You can install it with pip install DocumentFeatureSelection.

Supplement

From version 1.0 of the package, the input data can be designed flexibly.

As one example, if you want to design features as bigrams of (surface word, POS), you can pass an array of tuples like this. Here, (("he", "N"), ("is", "V")) is one feature.

input_dict_tuple_feature = {
    "label_a": [
        [ (("he", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),) ],
        [ (("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("guy", "N"),) ],
        [ (("i", "N"), ("am", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),) ]
    ],
    "label_b": [
        [ (("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("girl", "N"),) ],
        [ (("you", "N"), ("are", "V")), (("very", "ADV"), ("awesome", "ADJ")), (("girl", "N"),) ],
        [ (("she", "N"), ("is", "V")), (("very", "ADV"), ("good", "ADJ")), (("guy", "N"),) ]
    ]
}

Since tuples can be given as the features of an input sentence, users can design features freely. [^2] For example, this is useful for tasks such as the following:

- When you want to use (surface word, some tag) as a feature
- When you want to use the edge labels of a dependency parse as features
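
As a sketch of how such tuple features might be prepared, the helper below groups (surface word, POS) pairs into consecutive, non-overlapping tuples of up to two, which is the shape used in the example dict above. The helper name `to_bigram_features` is hypothetical and not part of the package.

```python
def to_bigram_features(tagged_tokens):
    """Group (surface, POS) pairs into non-overlapping tuples of up to two."""
    return [
        tuple(tagged_tokens[i:i + 2])
        for i in range(0, len(tagged_tokens), 2)
    ]


tagged = [("he", "N"), ("is", "V"), ("very", "ADV"), ("good", "ADJ"), ("guy", "N")]
features = to_bigram_features(tagged)

# Tuples are hashable, so each feature can serve directly as a dict key
feature_counts = {}
for feature in features:
    feature_counts[feature] = feature_counts.get(feature, 0) + 1
```

The leftover token at the end becomes a one-element tuple, as with `(("guy", "N"),)` in the example input.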


[^1]: I'm often asked, "Why does the sample text include adult videos and Persian or Iran?" That's because I like adult videos. And because I studied Persian, I am attached to it.

[^2]: Even before this, you could force the issue by joining the surface word and the tag into a single str such as surface_tag. But that isn't very elegant, is it? And doesn't it need pre-processing and post-processing? That is what I thought, so I added this feature.
