[Python] Multi-label classification with a random forest in scikit-learn

Preface

This article is based on item ❷ of "Collection and classification of machine learning related information (concept)".

For example, suppose there is a news item saying that a certain company has built a Q&A system using a certain cloud service.

Then, as you can infer from the local file system folder example in ❷, the Internet shortcut for this news item

・Tools / Cloud / A certain service
・Machine learning / Applications / Bots & dialogue systems
・Social trends / Companies / A certain company

must be placed in at least these three locations. These categories are not mutually exclusive, so this is what is called multi-label classification.

Python / scikit-learn offers various algorithms that handle this kind of problem, and their API interface is unified, so you can write code that keeps working when you simply swap in a different algorithm.

So, in this article, let's check the API interface.

The script that processes the actual crawl results is too specific to present on Qiita, so I will publish it on GitHub; this article only deals with a sample script for checking the interface. I chose Random Forest simply because it is easy to use [^1]. The data also uses artificial values so that the correspondence between the numbers is easy to follow, and it is not meaningful data. Please keep this in mind.

Multi-class and multi-label algorithms in scikit-learn

As a framework for handling multi-class and multi-label algorithms in scikit-learn, there is sklearn.multiclass, described in [Translation] scikit-learn 0.18 User Guide 1.12. Multiclass and multilabel algorithms. However, this is a general-purpose module that breaks the problem down into binary classification problems, and it is not optimized for any individual classification algorithm.
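For reference, a minimal sketch of that decomposition approach might look like the following (it is not used in the rest of this article; the LinearSVC estimator and the made-up tag data are only for illustration):

python

# Sketch: multi-label classification by decomposing the problem into
# binary problems with sklearn.multiclass.OneVsRestClassifier.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

X = [[0, 10, 20], [30, 40, 0], [0, 0, 50]]                # made-up feature vectors
tags = [["cloud", "bot"], ["bot"], ["cloud", "company"]]  # made-up tag sets

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tags)      # binary indicator matrix, one column per tag

ovr = OneVsRestClassifier(LinearSVC()).fit(X, Y)
print(mlb.inverse_transform(ovr.predict([[0, 10, 20]])))  # e.g. [('bot', 'cloud')]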

Decision trees, [Random Forest](https://ja.wikipedia.org/wiki/%E3%83%A9%E3%83%B3%E3%83%80%E3%83%A0%E3%83%95%E3%82%A9%E3%83%AC%E3%82%B9%E3%83%88), and the nearest neighbor method implement multi-label classification within the algorithms themselves, so you use each of these algorithms directly.

Sample script

Now let's write a sample script.

random-forest.py


#from sklearn.tree import DecisionTreeClassifier as classifier
from sklearn.ensemble import RandomForestClassifier as classifier
from gensim import matutils

corpus = [[(1,10),(2,20)],[(3,30),(4,40)],[(5,50),(6,60)]]
labels = [[100,500,900],[300,400,800],[200,600,700]]

dense = matutils.corpus2dense(corpus, 7)

print(dense)   #=> (*1)

print(dense.T) #=> (*2)

clf = classifier(random_state=777)
clf.fit(dense.T, labels)

for target in [[[0,10,20, 0, 0, 0, 0]], #=> (*3)
               [[0,10,20,30,40,50,60]], #=> (*4)
               [[0,10,10,0,0,0,0],      #=> (*5)
                [0,0,0,20,20,0,0],
                [0,0,0,0,0,30,30]]]:
    print(clf.predict(target))
    print(clf.predict_proba(target))

classifier

If you change the import source of the classifier as in the commented-out line, classification is performed with that algorithm instead.
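For example, here is a sketch of switching to the decision tree (assuming corpus, dense and labels are defined as in random-forest.py above); as footnote [^2] notes, predict_proba then tends to return hard 0. / 1. values:

python

#from sklearn.ensemble import RandomForestClassifier as classifier
from sklearn.tree import DecisionTreeClassifier as classifier

clf = classifier(random_state=777)   # same constructor argument still works
clf.fit(dense.T, labels)             # same fit()/predict()/predict_proba() calls
print(clf.predict([[0, 10, 20, 0, 0, 0, 0]]))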

corpus and labels

The input corpus is a sparse matrix describing the word frequencies of three documents:

・First document: word ID 1 appears 10 times, word ID 2 appears 20 times, all other words 0 times
・Next document: word ID 3 appears 30 times, word ID 4 appears 40 times, all other words 0 times
・Last document: word ID 5 appears 50 times, word ID 6 appears 60 times, all other words 0 times
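For reference, in a real pipeline a corpus in this (word ID, frequency) format would normally come out of gensim's Dictionary.doc2bow rather than being written by hand; a rough sketch with made-up token lists:

python

from gensim import corpora

# Made-up tokenized documents, only to show how the (word_id, count)
# pairs of a bag-of-words corpus are usually produced.
texts = [["cloud", "bot"] * 10,
         ["qa", "service"] * 20,
         ["news", "company"] * 30]
dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
print(bow_corpus)   # a list of (word_id, count) pairs per document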

The teacher data (training data) labels describe the classification of each document:

・First document: first label 100, second label 500, third label 900
・Next document: first label 300, second label 400, third label 800
・Last document: first label 200, second label 600, third label 700

In other words:

・First label: three classes with values 100, 200, 300
・Second label: three classes with values 400, 500, 600
・Third label: three classes with values 700, 800, 900
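To make the row/column correspondence concrete, here is a small check (not part of the original script) that views labels as an array with one row per document and one column per label:

python

import numpy as np

y = np.array(labels)
print(y.shape)    # (3, 3): 3 documents x 3 labels
print(y[:, 0])    # first label column  -> [100 300 200]
print(y[:, 1])    # second label column -> [500 400 600]
print(y[:, 2])    # third label column  -> [900 800 700]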

sparse → dense conversion

The classification algorithm provided by scikit-learn expects a dense matrix as input, so we have to convert the sparse matrix to a dense matrix.

You can use corpus2dense from the gensim.matutils module for this. Let's look at result (*1).

dense


[[  0.   0.   0.]
 [ 10.   0.   0.]
 [ 20.   0.   0.]
 [  0.  30.   0.]
 [  0.  40.   0.]
 [  0.   0.  50.]
 [  0.   0.  60.]]

Huh?

It is indeed a dense matrix that writes out the zeros explicitly, but the rows are word IDs and the columns are document numbers, so as it stands it does not line up with labels. To feed it into the classification algorithm, the rows and columns have to be transposed (→ result (*2)).

dense.T


[[  0.  10.  20.   0.   0.   0.   0.]
 [  0.   0.   0.  30.  40.   0.   0.]
 [  0.   0.   0.   0.   0.  50.  60.]]

Between dense and dense.T, I think the use cases that want dense.T are overwhelmingly more common, so I am not sure why corpus2dense was specified this way.
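As a cross-check, a minimal numpy sketch that builds the (documents × words) matrix directly from the corpus should reproduce dense.T:

python

import numpy as np

# Build the (documents x words) matrix by hand; for this sample corpus it is
# equivalent to matutils.corpus2dense(corpus, 7).T.
X = np.zeros((len(corpus), 7))
for doc_no, doc in enumerate(corpus):
    for word_id, freq in doc:
        X[doc_no, word_id] = freq
print(X)   # same values as dense.T above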

Classification

・ Classifier generation

python


clf = RandomForestClassifier(random_state=777)

The Random Forest algorithm uses random numbers internally, so if you don't fix random_state, you'll get different results each time.
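A quick sketch to confirm this (assuming classifier, dense and labels as defined in random-forest.py; the concrete probability values themselves may vary between scikit-learn versions):

python

# With random_state fixed, two independently trained forests give
# identical probability estimates.
p1 = classifier(random_state=777).fit(dense.T, labels).predict_proba([[0, 10, 20, 0, 0, 0, 0]])
p2 = classifier(random_state=777).fit(dense.T, labels).predict_proba([[0, 10, 20, 0, 0, 0, 0]])
print(all((a == b).all() for a, b in zip(p1, p2)))   # True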

・ Classification

print(clf.predict(target))
print(clf.predict_proba(target))

In this sample script, predict computes the classification result and predict_proba computes the estimated probability values [^2].

Let's look at the results in order.

(*3) [[0,10,20, 0, 0, 0, 0]]

[[ 100.  500.  900.]]
[array([[ 0.8,  0.1,  0.1]]), array([[ 0.1,  0.8,  0.1]]), array([[ 0.1,  0.1,  0.8]])]

Since this is one of the patterns used for training, we expect the classification to match the teacher data.

・First label: value 100 with probability 0.8, 200 with 0.1, 300 with 0.1 → value 100
・Second label: value 400 with probability 0.1, 500 with 0.8, 600 with 0.1 → value 500
・Third label: value 700 with probability 0.1, 800 with 0.1, 900 with 0.8 → value 900

The teacher data is indeed reproduced.

Note that the values are managed in numerical order, not in order of appearance.
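That ordering can be checked with the classes_ attribute; for a multi-output classifier it is a list with one array of classes per label, and the columns of each predict_proba block follow that order:

python

print(clf.classes_)
# roughly: [array([ 100., 200., 300.]), array([ 400., 500., 600.]), array([ 700., 800., 900.])]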

(*4) [[0,10,20,30,40,50,60]]

[[ 200.  600.  700.]]
[array([[ 0.3,  0.5,  0.2]]), array([[ 0.2,  0.3,  0.5]]), array([[ 0.5,  0.2,  0.3]])]

・First label: value 100 with probability 0.3, 200 with 0.5, 300 with 0.2 → value 200
・Second label: value 400 with probability 0.2, 500 with 0.3, 600 with 0.5 → value 600
・Third label: value 700 with probability 0.5, 800 with 0.2, 900 with 0.3 → value 700

(*5) [[0,10,10,0,0,0,0],[0,0,0,20,20,0,0],[0,0,0,0,0,30,30]]

[[ 100.  500.  900.]
 [ 100.  400.  800.]
 [ 200.  600.  700.]]
[array([[ 0.7,  0.2,  0.1],
        [ 0.4,  0.2,  0.4],
        [ 0.3,  0.5,  0.2]]),
 array([[ 0.1,  0.7,  0.2],
        [ 0.4,  0.4,  0.2],
        [ 0.2,  0.3,  0.5]]),
 array([[ 0.2,  0.1,  0.7],
        [ 0.2,  0.4,  0.4],
        [ 0.5,  0.2,  0.3]])]

First input:

・First label: value 100 with probability 0.7, 200 with 0.2, 300 with 0.1 → value 100
・Second label: value 400 with probability 0.1, 500 with 0.7, 600 with 0.2 → value 500
・Third label: value 700 with probability 0.2, 800 with 0.1, 900 with 0.7 → value 900

Second input:

・First label: value 100 with probability 0.4, 200 with 0.2, 300 with 0.4 → value 100
・Second label: value 400 with probability 0.4, 500 with 0.4, 600 with 0.2 → value 400
・Third label: value 700 with probability 0.2, 800 with 0.4, 900 with 0.4 → value 800

Third input:

・First label: value 100 with probability 0.3, 200 with 0.5, 300 with 0.2 → value 200
・Second label: value 400 with probability 0.2, 500 with 0.3, 600 with 0.5 → value 600
・Third label: value 700 with probability 0.5, 800 with 0.2, 900 with 0.3 → value 700

Note that the outermost dimension of predict_proba is the label No.

When the problem is decomposed into binary classification problems with sklearn.multiclass, the outermost iteration inevitably ends up being the label number [^3], so I think this is a natural specification.
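If you prefer the probabilities grouped per document instead, a small rearrangement works (a sketch over the (*5) result; proba[label_no][doc_no] is the probability vector for one document under one label):

python

proba = clf.predict_proba([[0, 10, 10, 0, 0, 0, 0],
                           [0, 0, 0, 20, 20, 0, 0],
                           [0, 0, 0, 0, 0, 30, 30]])
# Regroup so that the outer index is the document, not the label.
per_doc = [[label_proba[doc_no] for label_proba in proba]
           for doc_no in range(len(proba[0]))]
for doc_no, rows in enumerate(per_doc):
    print(doc_no, rows)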

Afterword

As we have seen, which dimension of each interface matrix corresponds to what is complicated and easy to get confused about, especially in the multi-label case, so I think it was worthwhile to leave this memo.

I would appreciate it if you could point out any misunderstandings.

[^1]: Regarding random forest, there is the article Classifying news articles by scikit-learn and gensim.
[^2]: Depending on the classification algorithm, the value cannot always be interpreted as a probability. For example, with DecisionTree this value seems to come out as either 0. or 1., since the tree "decides" the class outright.
[^3]: I have not actually confirmed this.
