[PYTHON] Feature extraction by TF method using the result of morphological analysis

Introduction

This article explains the TF (Term Frequency) method as a feature extraction technique for implementing a document classifier.

1. Morphological analysis

Document classification relies on the words in a document. Unlike English, Japanese is written without spaces between words, so each sentence in a document must first be divided into words. Dividing a sentence into words and estimating the part of speech of each word is called morphological analysis.

Here, we use the open source morphological analyzer MeCab: http://taku910.github.io/mecab/
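
As a rough illustration of what morphological analysis produces, the sketch below parses text in MeCab's default output format (one `surface<TAB>features` line per word, terminated by `EOS`) and keeps only selected parts of speech. The sample is the well-known example sentence from the MeCab documentation as analyzed with the IPA dictionary. `SAMPLE` and `parse_mecab_output` are hypothetical names introduced for this example, and MeCab itself is not called.

```python
# Analysis of "すもももももももものうち" in MeCab's default output format.
# Assumption: IPA dictionary; other dictionaries use different feature fields.
SAMPLE = """\
すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も\t助詞,係助詞,*,*,*,*,も,モ,モ
もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
も\t助詞,係助詞,*,*,*,*,も,モ,モ
もも\t名詞,一般,*,*,*,*,もも,モモ,モモ
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
"""


def parse_mecab_output(text, keep_pos=None):
    """Return (surface, part_of_speech) pairs from MeCab-style output text."""
    pairs = []
    for line in text.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, feature = line.split("\t", 1)
        pos = feature.split(",")[0]  # first feature field is the part of speech
        if keep_pos is None or pos in keep_pos:
            pairs.append((surface, pos))
    return pairs


words = parse_mecab_output(SAMPLE)
nouns = [surface for surface, _ in parse_mecab_output(SAMPLE, keep_pos={"名詞"})]
print(words)
print(nouns)
```

The first comma-separated feature field is the part of speech (名詞 is noun, 助詞 is particle), which is the field a part-of-speech filter inspects.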

2. Feature extraction

In a classification problem, the information in the data that is used for classification is generally called a feature, and the work of extracting features from the data is called feature extraction. In document classification, the words in a document are used as features.

The number of times each word occurs in a document is often used as the word's weight. This weighting scheme is called the TF (Term Frequency) method. In the TF method, words that occur more frequently are regarded as more characteristic of the document. Note that the order in which the words appear is not taken into account.
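
The TF weighting can be sketched with `collections.Counter`. The word list here is a made-up, already-segmented example (in practice it would come from morphological analysis), and `term_frequency` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def term_frequency(words):
    """Count how often each word occurs; word order is ignored."""
    return Counter(words)


# Made-up, pre-segmented document (assumption: output of a morphological analyzer).
doc = ["掃除", "機", "は", "コンパクト", "で", "掃除", "が", "楽"]
tf = term_frequency(doc)
print(tf.most_common(1))  # [('掃除', 2)]

# Reordering the words yields exactly the same features.
same_words_reordered = list(reversed(doc))
print(term_frequency(same_words_reordered) == tf)  # True
```

Because only counts are kept, two documents containing the same words in a different order receive identical features.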

3. Implementation code

tf.py


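import fileinput
import sys
from collections import Counter
from pathlib import Path

import MeCab as mc

# Parts of speech to keep, as named by MeCab's IPA dictionary:
# adjective, verb, noun, adverb, auxiliary verb, symbol, particle.
TARGET_POS = ["形容詞", "動詞", "名詞", "副詞", "助動詞", "記号", "助詞"]

def mecab_analysis(text):
    """Split text into words and keep those with a target part of speech."""
    t = mc.Tagger("-Ochasen")
    t.parse('')  # parse once first; a common workaround so node.surface is read correctly
    node = t.parseToNode(text)
    output = []
    while node:
        if node.surface != "":
            # The first comma-separated field of node.feature is the part of speech.
            word_type = node.feature.split(",")[0]
            if word_type in TARGET_POS:
                output.append(node.surface)
        node = node.next
    return output

# Treat each line of the file given as the first argument as one document.
if len(sys.argv) > 1 and Path(sys.argv[1]).exists():
    for line in fileinput.input():
        # Remove double quotes and backslashes before analysis.
        line = line.replace('"', '')
        line = line.replace('\\', '')
        words = mecab_analysis(line)
        counter = Counter(words)
        # Print "word:count" pairs in descending order of frequency.
        for word, count in counter.most_common():
            if len(word) > 0:
                print("%s:%d " % (word, count), end="")
        print("")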

4. How to execute

Target file

test.txt


It is compact and can be leaned against the corner of the room so that it can be used immediately when you want to use it. The suction power is also wonderful, was there so much dust? I am surprised. The price is reasonable and it is a recommended product.
I decided to buy after seeing this review. My vacuum cleaner is made by Hitachi (purchased last year), but it doesn't fit in the mouthpiece, and it feels like it's stuck with the attached attachment. But since it's wobbly, it naturally comes off many times while I'm wearing it. Sure, you can vacuum the futon without sucking it in, but I was disappointed that it didn't fit the same Hitachi product. I was looking forward to seeing how much dust it would collect, so I replaced it with a new paper carton and then cleaned it. After finishing all the steps, I looked inside the paper carton, but it was irresistible. Isn't it sucking in dust because it's wobbly? I'm disappointed again.

Each line is treated as one input document.

Execution method

Execute the command as follows.

python3 tf.py test.txt > output.txt

The target file is passed as the first argument, and the result of feature extraction by the TF method is redirected to output.txt.

Output

output.txt


To:3 、:3 。:3 of:2:2:2 too:2 compact:At 1:1 room:1 corner:1 leaning:1 Hey:1 Use:1 want:1 time:1 Immediately:1 Use:1 suction:1 force:1 wonderfully:1 こんなTo:1 Dust:1 is:1 Oh:1:1 ?:1 and:1 Surprise:1 price:1 Affordable:1 Recommended:1 product:At 1す:1 
hand:8 。:7 to:7 of:6:6:5:4 、:4 Cleaning:3 is:3:At 3:3 is:3 purchase:2 machines:2 Hitachi:Made of 2:2 fit:2:2:2:2 wobbly:2 pieces:2:2 not:2 disappointed:2:2 Dust:2:2 paper:2 pack:2 only:2:2 here:1 Review:1 look:1 decision:1 out:1 (:1 Last year:1 ):1 suck:1 mouth:1 included:1 attachment:1 somehow:1 stick:1:1 feeling:1 so:1 Naturally:1 in the middle:1 what:Once:1 too:1 off:1 End:1 Certainly:1 cloth:1 suck込ま:1 cloth団:1 multiply:1 thing:1 can:1:1 なんhand:1 which:First place:1 Accumulate:1 fun:1 new:1 Replacement:From 1:1 one:1 way:1 end:1 medium:1 peep:1 accumulated:1 no:1:1:1 suck込ん:1ょ:1:1 ?:1 again:1 

Each output line corresponds to one input line and lists its word:count pairs in descending order of frequency.
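
For downstream use, each output line can be read back into a dictionary. The sketch below assumes the `word:count ` format printed by tf.py and assumes that words contain no whitespace. `parse_tf_line` is a hypothetical helper, and the sample line is made up rather than taken from output.txt. It splits on the last colon so that a word which is itself `:` still parses.

```python
def parse_tf_line(line):
    """Convert one 'word:count word:count ...' line into a {word: count} dict."""
    features = {}
    for item in line.split():  # assumption: words themselves contain no whitespace
        # Split on the LAST colon so a word that is itself ":" is handled.
        word, _, count = item.rpartition(":")
        if word and count.isdigit():
            features[word] = int(count)
    return features


sample = "掃除:3 。:2 紙:1 ::1 "  # made-up line in the tf.py output format
parsed = parse_tf_line(sample)
print(parsed)
```

Items that do not match the `word:count` shape are skipped silently here; a stricter reader might raise an error instead.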
