[PYTHON] The most basic clustering analysis with scikit-learn

I touched on scikit-learn and [clustering](http://ja.wikipedia.org/wiki/%E3%83%87%E3%83%BC%E3%82%BF%E3%83%BB%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0) briefly at the very beginning of this blog, but I have to admit the explanation was lacking. I also suspect some readers feel they cannot get scikit-learn to do what they actually want in the first place. So here I will explain basic clustering with scikit-learn once more.

Admittedly, the short answer is "read the original documentation", but I think having this information in Japanese may still be of some help.

Grouping students by their grades

A common case is wanting to divide students into several groups based on their grades in Japanese, math, and English. You could simply rank them by total score across the subjects and split the ranking from the top, but some students are good at Japanese and weak at math, while others are good at math and weak at Japanese. It would be a shame to group such students purely by total score. That is where clustering analysis comes in. By grouping students with similar grade tendencies, you can create a more suitable grouping than one based only on the total.
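To make the baseline concrete, here is a minimal sketch of the naive approach: rank by total score and cut the ranking into equal groups. The six rows are taken from the data set used later in this post; the variable names and the use of `np.array_split` are my own assumptions for illustration.

```python
import numpy as np

# Six students (Japanese, math, English), taken from the data set used later
scores = np.array([
    [96, 100, 100],
    [92, 100, 96],
    [98, 73, 72],
    [75, 84, 85],
    [100, 50, 86],
    [55, 96, 84],
])

totals = scores.sum(axis=1)

# Indices of the students sorted by total score, highest first
order = np.argsort(-totals)

# Cut the ranking into three equally sized groups
groups = np.array_split(order, 3)

for group_no, members in enumerate(groups):
    for i in members:
        print(group_no, scores[i], totals[i])
```

In this sketch the last group pairs a student who is strong in Japanese but weak in math with one who is the opposite, because their totals (236 and 235) are almost identical.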

Let's write the code for K-means clustering and give it a try. For a detailed explanation of how K-means clustering works, please refer to this article. This time, let's divide the students into three groups.

import numpy as np
from sklearn.cluster import KMeans

# Students' Japanese, math, and English scores as an array
features = np.array([
        [  80,  85, 100 ],
        [  96, 100, 100 ],
        [  54,  83,  98 ],
        [  80,  98,  98 ],
        [  90,  92,  91 ],
        [  84,  78,  82 ],
        [  79, 100,  96 ],
        [  88,  92,  92 ],
        [  98,  73,  72 ],
        [  75,  84,  85 ],
        [  92, 100,  96 ],
        [  96,  92,  90 ],
        [  99,  76,  91 ],
        [  75,  82,  88 ],
        [  90,  94,  94 ],
        [  54,  84,  87 ],
        [  92,  89,  62 ],
        [  88,  94,  97 ],
        [  42,  99,  80 ],
        [  70,  98,  70 ],
        [  94,  78,  83 ],
        [  52,  73,  87 ],
        [  94,  88,  72 ],
        [  70,  73,  80 ],
        [  95,  84,  90 ],
        [  95,  88,  84 ],
        [  75,  97,  89 ],
        [  49,  81,  86 ],
        [  83,  72,  80 ],
        [  75,  73,  88 ],
        [  79,  82,  76 ],
        [ 100,  77,  89 ],
        [  88,  63,  79 ],
        [ 100,  50,  86 ],
        [  55,  96,  84 ],
        [  92,  74,  77 ],
        [  97,  50,  73 ],
        ])

# K-means clustering
# In this example, split into three groups (10 is the seed for the Mersenne Twister random number generator)
kmeans_model = KMeans(n_clusters=3, random_state=10).fit(features)

# Get the cluster label assigned to each student
labels = kmeans_model.labels_

# Print the label (group), the scores, and the total of the three subjects
for label, feature in zip(labels, features):
    print(label, feature, feature.sum())
#=>
# 2 [ 80  85 100] 265
# 2 [ 96 100 100] 296
# 0 [54 83 98] 235
# 2 [80 98 98] 276
# 2 [90 92 91] 273
# 1 [84 78 82] 244
# 2 [ 79 100  96] 275
# 2 [88 92 92] 272
# 1 [98 73 72] 243
# 2 [75 84 85] 244
# 2 [ 92 100  96] 288
# 2 [96 92 90] 278
# 1 [99 76 91] 266
# 2 [75 82 88] 245
# 2 [90 94 94] 278
# 0 [54 84 87] 225
# 1 [92 89 62] 243
# 2 [88 94 97] 279
# 0 [42 99 80] 221
# 0 [70 98 70] 238
# 1 [94 78 83] 255
# 0 [52 73 87] 212
# 1 [94 88 72] 254
# 1 [70 73 80] 223
# 2 [95 84 90] 269
# 2 [95 88 84] 267
# 2 [75 97 89] 261
# 0 [49 81 86] 216
# 1 [83 72 80] 235
# 1 [75 73 88] 236
# 1 [79 82 76] 237
# 1 [100  77  89] 266
# 1 [88 63 79] 230
# 1 [100  50  86] 236
# 0 [55 96 84] 235
# 1 [92 74 77] 243
# 1 [97 50 73] 220

Now, the students appear to have been divided into three groups without trouble. Let's look closely at the contents. First, the students labeled 2 form a relatively strong group with high scores overall. Next, the students labeled 1 sit in the middle overall, but include students who are good only at Japanese and students who are not so strong outside of Japanese. This looks like a group of humanities-oriented students. Finally, the students labeled 0 either have slightly lower grades overall or are good at math.
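One way to check this reading is to average each subject per label. The sketch below reuses the scores and the labels exactly as printed above rather than re-running K-means, because the label numbers (and possibly the assignments) can differ between scikit-learn versions. The variable names `sizes` and `means` are my own.

```python
import numpy as np

# Scores (Japanese, math, English), copied from the listing above
features = np.array([
    [80, 85, 100], [96, 100, 100], [54, 83, 98], [80, 98, 98],
    [90, 92, 91], [84, 78, 82], [79, 100, 96], [88, 92, 92],
    [98, 73, 72], [75, 84, 85], [92, 100, 96], [96, 92, 90],
    [99, 76, 91], [75, 82, 88], [90, 94, 94], [54, 84, 87],
    [92, 89, 62], [88, 94, 97], [42, 99, 80], [70, 98, 70],
    [94, 78, 83], [52, 73, 87], [94, 88, 72], [70, 73, 80],
    [95, 84, 90], [95, 88, 84], [75, 97, 89], [49, 81, 86],
    [83, 72, 80], [75, 73, 88], [79, 82, 76], [100, 77, 89],
    [88, 63, 79], [100, 50, 86], [55, 96, 84], [92, 74, 77],
    [97, 50, 73],
])

# Labels copied from the printed output above (not recomputed)
labels = np.array([
    2, 2, 0, 2, 2, 1, 2, 2, 1, 2,
    2, 2, 1, 2, 2, 0, 1, 2, 0, 0,
    1, 0, 1, 1, 2, 2, 2, 0, 1, 1,
    1, 1, 1, 1, 0, 1, 1,
])

# Number of students per label
sizes = np.bincount(labels)

# Mean score per subject for each label (rows: label, columns: subject)
means = np.array([features[labels == k].mean(axis=0) for k in range(3)])

for k in range(3):
    print(k, sizes[k], means[k].round(1))
```

Comparing the rows of `means`, label 2 has the highest averages in math and English, label 1 combines the highest Japanese average with the lowest math average, and label 0 has by far the lowest Japanese average.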

Summary

In this way, instead of splitting the students into three groups by total score from the top, we were able to form groups according to each student's tendencies.

For real problems, too, you can run a clustering analysis in the same way by choosing good features and expressing them as vectors of numbers.
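As a rough illustration of turning features into numeric vectors, the sketch below converts a few records into a feature matrix. The record layout, the field names, and the optional standardization step are all assumptions of mine and are not part of the example above. Standardizing is a common choice when features are on different scales, since K-means relies on Euclidean distance.

```python
import numpy as np

# Hypothetical raw records; the field names are made up for illustration
records = [
    {"name": "A", "japanese": 80, "math": 85, "english": 100},
    {"name": "B", "japanese": 54, "math": 83, "english": 98},
    {"name": "C", "japanese": 98, "math": 73, "english": 72},
]

# Choose which fields become features, in a fixed order
feature_names = ["japanese", "math", "english"]
X = np.array([[r[f] for f in feature_names] for r in records], dtype=float)

# Optional: rescale each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.shape)
print(X_scaled.round(2))
```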

With scikit-learn, K-means clustering itself takes just one line, which gives you a feel for how powerful machine learning libraries are.
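To show what that one line gives you back, here is a small sketch on six made-up students (the data and variable names are hypothetical). It uses `fit_predict`, `cluster_centers_`, and `predict` from `KMeans`; `n_init` is set explicitly because its default has changed between scikit-learn releases.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical scores (Japanese, math, English) forming three obvious pairs
X = np.array([
    [90, 90, 90], [92, 88, 91],   # strong in everything
    [50, 90, 85], [52, 88, 86],   # weak in Japanese
    [95, 55, 75], [97, 52, 73],   # weak in math
])

model = KMeans(n_clusters=3, n_init=10, random_state=10)
labels = model.fit_predict(X)   # fit the model and return one label per row

print(labels)                   # the label numbers themselves are arbitrary
print(model.cluster_centers_)   # mean score vector of each cluster

# Assign a new, unseen student to the nearest cluster center
new_label = model.predict(np.array([[51, 89, 84]]))[0]
print(new_label == labels[2])
```

Because the label numbers are arbitrary, it is safer to compare labels with each other (as in the last line) than to rely on a particular number meaning a particular group.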
