[PYTHON] I tried the simplest method of multi-label document classification

What is document classification?

One of the tasks of NLP is document classification: given a document, estimate the label that should be attached to it.

Document classification can be broadly divided into the following two types according to the nature of the label attached to the document.

--Topic classification
  --Documents are labeled by topic
  --A common example is news articles labeled as politics, sports, entertainment, etc.
  --Some datasets have only two classes, and some are multi-label (multi-label ones are more common)
  --Applied to news article recommendation, etc.

--Sentiment analysis
  --Documents are labeled as to whether they are positive or negative
  --Some datasets are binary, and some use more classes (3 labels of positive, neutral, negative, etc.)
  --Also used for marketing research

Document classification model

There are many ways to solve these document classification problems. The following two are typical approaches (I think there are others as well).

--Create a document vector and classify it with a machine learning method
  --How to make a document vector
    --Tf-idf
    --Bag of embeddings (mean or maximum of the distributed representations of the words in the document)
  --How to classify
    --Logistic Regression
    --Naive Bayes
    --Support Vector Machine
    --Random Forest, XGBoost
    --And so on

--Feed the raw text into a neural network
  --LSTM
  --BERT fine-tuning
  --And so on
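
As a rough illustration of the "bag of embedding" option in the list above, here is a minimal sketch of mean and max pooling. The three-dimensional word vectors and the function name document_vector are made up for illustration; real vectors would come from a pretrained model such as word2vec.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors (made-up values for illustration).
embeddings = {
    "oil": np.array([1.0, 0.0, 2.0]),
    "price": np.array([0.0, 3.0, 1.0]),
    "rises": np.array([2.0, 1.0, 0.0]),
}

def document_vector(tokens, pooling="mean"):
    """Pool the vectors of known tokens into one fixed-size document vector.

    Assumes at least one token is present in the embedding table.
    """
    vectors = np.stack([embeddings[t] for t in tokens if t in embeddings])
    if pooling == "mean":
        return vectors.mean(axis=0)  # element-wise average over words
    if pooling == "max":
        return vectors.max(axis=0)   # element-wise maximum over words
    raise ValueError("unknown pooling: {}".format(pooling))

tokens = ["oil", "price", "rises"]
mean_vec = document_vector(tokens, "mean")
max_vec = document_vector(tokens, "max")
print(mean_vec)
print(max_vec)
```

Either way, the document vector has the same dimension as the word vectors, regardless of how many words the document contains.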

I tried the easiest way

Although I say I tried the easiest one, I'm not really sure which method is actually the easiest (hey). This time, I will classify Tf-idf vectors with an SVM (with a linear kernel). Tf-idf is a vector whose elements are the frequency of occurrence of each word in a document multiplied by the importance of that word. Therefore, the dimension of the document vector is equal to the vocabulary size.
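
To make the point about the vector dimension concrete, here is a small sketch using sklearn's TfidfVectorizer with its default settings on three made-up sentences (not the Reuters data). Note that sklearn's defaults apply a smoothed idf and L2 normalization, so the exact weights differ from a textbook tf x idf.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration.
docs = [
    "oil prices rise",
    "oil exports fall",
    "wheat prices fall",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

vocab_size = len(vectorizer.vocabulary_)
print(X.shape)                        # (number of documents, vocabulary size)
print(sorted(vectorizer.vocabulary_))

# In the first document, "rise" appears in fewer documents than "oil",
# so it receives the larger weight.
rise_weight = X[0, vectorizer.vocabulary_["rise"]]
oil_weight = X[0, vectorizer.vocabulary_["oil"]]
print(rise_weight > oil_weight)
```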

SVM with linear kernel seems to be a little difficult to explain, so I will omit it.

This time, I will use the implementation included in sklearn.

Since the model is simple (?), I will try a corpus that is a little complicated (multi-label: each document can be given several topics). The corpus is the Reuters news article collection, with about 10,000 documents and 90 labels.
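
For reference, "multi-label" means that the target for each document is a set of topics rather than a single one. A minimal sketch of how such label sets become 0/1 vectors with sklearn's MultiLabelBinarizer (the label lists here are made up for illustration):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each document has one or more topics (toy labels, not the real corpus).
doc_labels = [["grain", "wheat"], ["trade"], ["grain"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(doc_labels)  # one column per label, one row per document

print(list(mlb.classes_))  # column order: labels sorted alphabetically
print(Y)
```

With 90 labels, each Reuters document would be represented by a 90-dimensional 0/1 vector in the same way.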

First download the corpus

The Python module nltk provides the Reuters corpus, so I will use that.

First, if nltk is not installed, install it:

pip install nltk

Then type the following in a python interactive shell:

python
>>> import nltk
>>> nltk.download("reuters")

A directory called nltk_data is then created under the user's home directory, and the data is stored in that directory. ____ is inside.

Implementation code


import re
import codecs
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from nltk.stem.porter import PorterStemmer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn import metrics

#Location of the downloaded Reuters corpus (adjust to your environment)
path = "../nltk_data/corpora/reuters/"
#Load the stopword list shipped with the corpus (one word per line).
#Strip the trailing newline so that the words actually match in tokenize()
with open(path+"stopwords") as sw:
    stopwords = [x.strip() for x in sw]

#Define tokenizer
stemmer = PorterStemmer()
token_pattern = re.compile('[a-zA-Z]+')

def tokenize(text):
    #Lowercase, drop stopwords, stem, then keep tokens that start with a letter
    #and are at least min_length characters long.
    #(word_tokenize needs the NLTK "punkt" tokenizer data to be downloaded)
    min_length = 3
    words = [word.lower() for word in word_tokenize(text)]
    words = [word for word in words if word not in stopwords]
    tokens = [stemmer.stem(word) for word in words]
    return [token for token in tokens
            if token_pattern.match(token) and len(token) >= min_length]
    
#Get document id and its category from data
with codecs.open("../nltk_data/corpora/reuters/cats.txt", "r", "utf-8", "ignore") as categories:
    train_docs_id = [line.split(" ")[0][9:] for line in categories if line.split(" ")[0][:9] == 'training/']
    categories.seek(0)
    test_docs_id = [line.split(" ")[0][5:] for line in categories if line.split(" ")[0][:5] == 'test/']
    categories.seek(0)
    train_docs_cat = [line.strip("\n").split(" ")[1:] for line in categories if line.split(" ")[0][:9] == 'training/']
    categories.seek(0)
    test_docs_cat = [line.strip("\n").split(" ")[1:] for line in categories if line.split(" ")[0][:5] == 'test/']

#List documents
train_docs = []
test_docs = []
for num in train_docs_id:
    with codecs.open(path+"training/"+num, "r", "utf-8", "ignore") as doc:
        train_docs.append(" ".join([line.strip(" ") for line in doc.read().split("\n")]))
for num in test_docs_id:
    with codecs.open(path+"test/"+num, "r", "utf-8", "ignore") as doc:
        test_docs.append(" ".join([line.strip(" ") for line in doc.read().split("\n")]))

#Generate document vectors from the document list with sklearn's TfidfVectorizer
vectorizer = TfidfVectorizer(tokenizer=tokenize)
vectorised_train_documents = vectorizer.fit_transform(train_docs)
vectorised_test_documents = vectorizer.transform(test_docs)

#Convert the multi-label categories into binary (0 or 1) label vectors
mlb = MultiLabelBinarizer()
train_labels = mlb.fit_transform(train_docs_cat)
test_labels = mlb.transform(test_docs_cat)

# Classifier
#Try different values of the regularization parameter C
param_list = [0.001, 0.01, 0.1, 0.5, 1, 10, 100]
print("parameter       test_f1                 train_f1")
for C in param_list:
    #Train one binary linear SVM per label (one-vs-rest)
    classifier = OneVsRestClassifier(LinearSVC(C=C, random_state=42))
    classifier.fit(vectorised_train_documents, train_labels)
    predictions = classifier.predict(vectorised_test_documents)
    train_predictions = classifier.predict(vectorised_train_documents)
    #Macro-averaged F1 on the test set and on the training set
    ftest = metrics.f1_score(test_labels, predictions, average="macro")
    ftrain = metrics.f1_score(train_labels, train_predictions, average="macro")
    print("c={}:\t{}\t{}".format(C, ftest, ftrain))

Running the above code gives the following result:

parameter       test_f1                 train_f1
c=0.001:	0.009727246626471432	0.007884179312750742
c=0.01:	0.02568945815128711	0.02531440097069285
c=0.1:	0.20504347026711428	0.26430270726815386
c=0.5:	0.3908058642922242	0.6699048987962078
c=1:	0.45945765878179573	0.9605946547451458
c=10:	0.5253686991407462	0.9946632502765812
c=100:	0.5312185383446876	0.9949908225328556

As you can see, the model overfits to its heart's content. According to the paper below, the same method should reach a score in the high 80% range ... https://www.aclweb.org/anthology/N19-1408/

If anyone knows why the result is this poor, please let me know.
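
One thing that may be worth checking, offered as a guess rather than a diagnosis: the code above reports macro-averaged F1 (average="macro"), which weights every label equally, so labels with only a few documents can pull the score down a lot. Whether the paper uses the same averaging is not confirmed here. The sketch below uses made-up labels and predictions (not the Reuters results) to show how far macro and micro averaging can diverge when rare labels are never predicted.

```python
import numpy as np
from sklearn import metrics

# Toy multi-label data: label 0 is frequent, labels 1 and 2 are rare.
y_true = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
])
# A classifier that only ever predicts the frequent label.
y_pred = np.array([[1, 0, 0]] * 6)

# zero_division=0 scores a label with no predicted positives as 0 without a warning.
macro = metrics.f1_score(y_true, y_pred, average="macro", zero_division=0)
micro = metrics.f1_score(y_true, y_pred, average="micro", zero_division=0)
print("macro F1: {:.3f}".format(macro))  # about 0.333: two of three labels score 0
print("micro F1: {:.3f}".format(micro))  # about 0.857: dominated by the frequent label
```

Printing both averages for the Reuters run would show whether the gap to the paper comes from the averaging or from the model itself.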
