Learning Japanese text categories with tf-idf and Random Forest ~ Tuning

The previous article, Learning Japanese text categories with tf-idf and random forest ~ livedoor news, seemed to leave room for tuning, so in this post I will do my best to tune it.

Code to use

It is the same as the previous article.

import glob
import random
import numpy as np
from natto import MeCab
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def load_livedoor_news_corpus():
    category = {
        'dokujo-tsushin': 1,
        'it-life-hack':2,
        'kaden-channel': 3,
        'livedoor-homme': 4,
        'movie-enter': 5,
        'peachy': 6,
        'smax': 7,
        'sports-watch': 8,
        'topic-news':9
    }
    docs  = []
    labels = []

    for c_name, c_id in category.items():
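        # Match only files that begin with the category name; each category
        # directory in the livedoor corpus also contains a LICENSE.txt that
        # should not be treated as an article.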
        files = glob.glob("./text/{c_name}/{c_name}*.txt".format(c_name=c_name))

        text = ''
        for file in files:
            with open(file, 'r', encoding='utf-8') as f:  # the livedoor corpus files are UTF-8
                lines = f.read().splitlines() 

                url = lines[0]
                datetime = lines[1]
                subject = lines[2]
                body = "\n".join(lines[3:])
                text = subject + "\n" + body

            docs.append(text)
            labels.append(c_id)
    
    return docs, labels

docs, labels = load_livedoor_news_corpus()


random.seed()
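# Note: random.seed() with no argument seeds from the system time, so the shuffle
# below (and every score reported in this article) can vary from run to run;
# passing a fixed value, e.g. random.seed(0), would make the split reproducible.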
indices = list(range(len(docs)))
random.shuffle(indices)

split_size = 7000

train_data = [docs[i] for i in indices[0:split_size]]
train_labels = [labels[i] for i in indices[0:split_size]]
test_data = [docs[i] for i in indices[split_size:]]
test_labels = [labels[i] for i in indices[split_size:]]



def tokenize(text):
    tokens = []
    with MeCab('-F%f[0],%f[6]') as nm:
        for n in nm.parse(text, as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos() and n.is_nor():
                klass, word = n.feature.split(',', 1)
                if klass in ['名詞']:  # '名詞' = noun (MeCab part-of-speech tag)
                    tokens.append(word)

    return tokens
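
# Quick sanity check of the tokenizer (hypothetical example; the exact output
# depends on the installed MeCab dictionary, assumed here to be IPADIC):
#   print(tokenize('吾輩は猫である'))  # -> noun base forms such as ['吾輩', '猫']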


vectorizer = TfidfVectorizer(tokenizer=tokenize)
train_matrix = vectorizer.fit_transform(train_data)
test_matrix = vectorizer.transform(test_data)


clf2 = RandomForestClassifier(n_estimators=100, max_features=3000, oob_score=True)
clf2.fit(train_matrix, train_labels)

print(clf2.score(train_matrix, train_labels))
print(clf2.score(test_matrix, test_labels))

result

Dataset    Score
Training   1.0
Test       0.901

Since the accuracy on the training data is 100%, the model appears to be overfitting, so it seems worth narrowing down the parts of speech to reduce the number of features.

Limit part of speech to nouns only

def tokenize(text):
    tokens = []
    with MeCab('-F%f[0],%f[6]') as nm:
        for n in nm.parse(text, as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos() and n.is_nor():
                klass, word = n.feature.split(',', 1)
                if klass in ['名詞']:  # '名詞' = noun (MeCab part-of-speech tag)
                    tokens.append(word)

    return tokens

result

Dataset    Score
Training   1.0
Test       0.918

The accuracy on the test data improved to 91.8%.

Increase max_features in Random Forest

Measured with the get_feature_names method of TfidfVectorizer, there are 31,258 features.

By default, RandomForestClassifier uses max_features = sqrt(n_features), which is only about 176 here. That feels too small, so I will increase it a little.
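
For reference, this is roughly how those two numbers can be checked (in newer scikit-learn releases the method is get_feature_names_out instead of get_feature_names):

n_features = len(vectorizer.get_feature_names())
print(n_features)                 # 31258 here
print(int(np.sqrt(n_features)))   # about 176, the default max_features for classification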

clf2 = RandomForestClassifier(n_estimators=100, max_features=num_features)
clf2.fit(train_matrix, train_labels)

result

num_features = 1000

Dataset    Score
Training   1.0
Test       0.931

num_features = 3000

Dataset    Score
Training   1.0
Test       0.937

The accuracy on the test data improved to 93.7%.
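
As a side note (not from the original article), a small loop over candidate values makes this kind of comparison easier to repeat; the sketch below reuses the matrices and labels built above:

for num_features in [1000, 3000]:
    clf = RandomForestClassifier(n_estimators=100, max_features=num_features)
    clf.fit(train_matrix, train_labels)
    print(num_features, clf.score(test_matrix, test_labels))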

Use oob_score

The scikit-learn API documentation says only "Whether to use out-of-bag samples to estimate the generalization error."

It is hard to tell from the documentation alone how this affects the result, so I simply tried it.

clf2 = RandomForestClassifier(n_estimators=100, max_features=3000, oob_score=True)
clf2.fit(train_matrix, train_labels)
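
With oob_score=True, the fitted forest also exposes an out-of-bag estimate of the accuracy, computed from the samples each tree did not see during bootstrap sampling; it can be read from the oob_score_ attribute:

print(clf2.oob_score_)  # out-of-bag accuracy estimate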

result

Dataset    Score
Training   1.0
Test       0.948

In this case, it was better to set oob_score=True.

Summary

Before tuning, the accuracy on the test data was 90.1%; after tuning, it eventually improved to 94.8%.
