[PYTHON] The most basic clustering analysis with scikit-learn

I touched on [clustering](http://ja.wikipedia.org/wiki/%E3%83%87%E3%83%BC%E3%82%BF%E3%83%BB%E3%82%AF%E3%83%A9%E3%82%B9%E3%82%BF%E3%83%AA%E3%83%B3%E3%82%B0) with scikit-learn briefly at the very beginning of this blog, but I can't deny that the explanation was thin. I also suspect that some readers feel they cannot get scikit-learn to do what they want in the first place. So here I will explain basic clustering with scikit-learn once more.

That said, for the basics the short answer is "read the original documentation". Still, I think having the information available in Japanese may be of some help.

Grouping students by their grades

A common case is wanting to divide students into several groups based on their Japanese, math, and English grades. You could simply rank the students by total score across the subjects and split from the top, but some students are good at Japanese and weak at math, while others are good at math and weak at Japanese. It would be a shame to group such students purely by total score. That is where cluster analysis comes in. By grouping students whose grades show similar tendencies, you can form more suitable groups than by ranking on total score alone.
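To make the weakness of ranking by total score concrete, here is a small sketch of that naive approach. The four rows are taken from the data set used later in this article; the helper name `split_by_total` is made up for illustration and is not part of any library.

```python
import numpy as np

# Four students from the data set used later: [Japanese, math, English]
scores = np.array([
    [98, 73, 72],   # strong in Japanese only, total 243
    [75, 84, 85],   # balanced, total 244
    [42, 99, 80],   # strong in math, weak in Japanese, total 221
    [97, 50, 73],   # strong in Japanese, weak in math, total 220
])

def split_by_total(features, n_groups):
    """Hypothetical helper: rank by total score (highest first), then cut the ranking into n_groups chunks."""
    order = np.argsort(-features.sum(axis=1), kind="stable")
    groups = np.empty(len(features), dtype=int)
    for group_id, chunk in enumerate(np.array_split(order, n_groups)):
        groups[chunk] = group_id
    return groups

groups = split_by_total(scores, 2)
for group, row in zip(groups, scores):
    print(group, row, row.sum())
```

With two groups, the math-oriented student and the Japanese-oriented student with nearly identical totals end up in the same group, even though their profiles are opposite. That is the situation clustering is meant to avoid.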

Let's write some K-means clustering code and try it out. For a detailed explanation of how K-means clustering works, please refer to this article. This time, we will divide the students into three groups.

import numpy as np
from sklearn.cluster import KMeans

# Students' Japanese, math, and English scores, given as an array
features = np.array([
        [  80,  85, 100 ],
        [  96, 100, 100 ],
        [  54,  83,  98 ],
        [  80,  98,  98 ],
        [  90,  92,  91 ],
        [  84,  78,  82 ],
        [  79, 100,  96 ],
        [  88,  92,  92 ],
        [  98,  73,  72 ],
        [  75,  84,  85 ],
        [  92, 100,  96 ],
        [  96,  92,  90 ],
        [  99,  76,  91 ],
        [  75,  82,  88 ],
        [  90,  94,  94 ],
        [  54,  84,  87 ],
        [  92,  89,  62 ],
        [  88,  94,  97 ],
        [  42,  99,  80 ],
        [  70,  98,  70 ],
        [  94,  78,  83 ],
        [  52,  73,  87 ],
        [  94,  88,  72 ],
        [  70,  73,  80 ],
        [  95,  84,  90 ],
        [  95,  88,  84 ],
        [  75,  97,  89 ],
        [  49,  81,  86 ],
        [  83,  72,  80 ],
        [  75,  73,  88 ],
        [  79,  82,  76 ],
        [ 100,  77,  89 ],
        [  88,  63,  79 ],
        [ 100,  50,  86 ],
        [  55,  96,  84 ],
        [  92,  74,  77 ],
        [  97,  50,  73 ],
        ])

# K-means clustering
# In this example the data is divided into three groups
# (10 is used as the seed for the Mersenne Twister random number generator)
kmeans_model = KMeans(n_clusters=3, random_state=10).fit(features)

# Get the label assigned to each student
labels = kmeans_model.labels_

# Print each student's label (group), scores, and three-subject total
for label, feature in zip(labels, features):
    print(label, feature, feature.sum())
#=>
# 2 [ 80  85 100] 265
# 2 [ 96 100 100] 296
# 0 [54 83 98] 235
# 2 [80 98 98] 276
# 2 [90 92 91] 273
# 1 [84 78 82] 244
# 2 [ 79 100  96] 275
# 2 [88 92 92] 272
# 1 [98 73 72] 243
# 2 [75 84 85] 244
# 2 [ 92 100  96] 288
# 2 [96 92 90] 278
# 1 [99 76 91] 266
# 2 [75 82 88] 245
# 2 [90 94 94] 278
# 0 [54 84 87] 225
# 1 [92 89 62] 243
# 2 [88 94 97] 279
# 0 [42 99 80] 221
# 0 [70 98 70] 238
# 1 [94 78 83] 255
# 0 [52 73 87] 212
# 1 [94 88 72] 254
# 1 [70 73 80] 223
# 2 [95 84 90] 269
# 2 [95 88 84] 267
# 2 [75 97 89] 261
# 0 [49 81 86] 216
# 1 [83 72 80] 235
# 1 [75 73 88] 236
# 1 [79 82 76] 237
# 1 [100  77  89] 266
# 1 [88 63 79] 230
# 1 [100  50  86] 236
# 0 [55 96 84] 235
# 1 [92 74 77] 243
# 1 [97 50 73] 220

It looks like the students were successfully divided into three groups. Let's look more closely at the contents. First, the students labeled 2 form a relatively strong group with high scores overall. Next, the students labeled 1 sit in the middle on total score, and include students who are strong only in Japanese as well as students who are weaker in everything except Japanese. This appears to be a group that leans toward the humanities. Finally, the students labeled 0 have slightly lower scores overall or are strong mainly in math.
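One way to check this reading is to compute the size and the per-subject mean of each group. The sketch below does that with NumPy only, using the labels and scores copied from the output printed above, so it does not depend on re-running K-means. The helper name `summarize_groups` is an assumption made for illustration.

```python
import numpy as np

# Each row is (label, Japanese, math, English), copied from the output printed above.
rows = np.array([
    (2, 80, 85, 100), (2, 96, 100, 100), (0, 54, 83, 98), (2, 80, 98, 98),
    (2, 90, 92, 91), (1, 84, 78, 82), (2, 79, 100, 96), (2, 88, 92, 92),
    (1, 98, 73, 72), (2, 75, 84, 85), (2, 92, 100, 96), (2, 96, 92, 90),
    (1, 99, 76, 91), (2, 75, 82, 88), (2, 90, 94, 94), (0, 54, 84, 87),
    (1, 92, 89, 62), (2, 88, 94, 97), (0, 42, 99, 80), (0, 70, 98, 70),
    (1, 94, 78, 83), (0, 52, 73, 87), (1, 94, 88, 72), (1, 70, 73, 80),
    (2, 95, 84, 90), (2, 95, 88, 84), (2, 75, 97, 89), (0, 49, 81, 86),
    (1, 83, 72, 80), (1, 75, 73, 88), (1, 79, 82, 76), (1, 100, 77, 89),
    (1, 88, 63, 79), (1, 100, 50, 86), (0, 55, 96, 84), (1, 92, 74, 77),
    (1, 97, 50, 73),
])
labels, features = rows[:, 0], rows[:, 1:]

def summarize_groups(features, labels):
    """Hypothetical helper: return {label: (member count, mean score per subject)}."""
    return {
        int(label): (int((labels == label).sum()), features[labels == label].mean(axis=0))
        for label in np.unique(labels)
    }

summary = summarize_groups(features, labels)
for label, (count, means) in summary.items():
    # Columns of the mean vector are Japanese, math, English
    print(label, count, np.round(means, 1))
```

In these means, group 0 has a clearly lower Japanese average and a higher math average than group 1, and group 2 has the highest averages overall. A fitted `KMeans` model also exposes its centroids through the `cluster_centers_` attribute, which should give much the same picture.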

Summary

In this way, instead of simply splitting the students into three groups by total-score rank, we were able to group them according to each student's tendencies.

For real-world problems too, you can run a cluster analysis in the same way, as long as you choose the features well and express each item as a vector of numbers.
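As a rough illustration of "expressing each item as a vector of numbers", the sketch below turns a list of records into the kind of array that `KMeans` accepts. The record layout, the field names, and the helper `to_feature_matrix` are all assumptions made for this example.

```python
import numpy as np

# Hypothetical raw records; the field names are made up for illustration.
records = [
    {"name": "A", "japanese": 80, "math": 85, "english": 100},
    {"name": "B", "japanese": 96, "math": 100, "english": 100},
    {"name": "C", "japanese": 54, "math": 83, "english": 98},
]

# The features chosen for clustering, in a fixed column order.
feature_names = ["japanese", "math", "english"]

def to_feature_matrix(records, feature_names):
    """Turn a list of dict records into an (n_samples, n_features) float array."""
    return np.array(
        [[record[name] for name in feature_names] for record in records],
        dtype=float,
    )

features = to_feature_matrix(records, feature_names)
print(features.shape)  # (3, 3)
```

Keeping the feature names in one list makes it easy to add or drop a feature and see how the grouping changes.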

With scikit-learn, the K-means clustering itself takes just one line, which gives you a real sense of how powerful machine learning libraries are.
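For reference, here is a minimal sketch of roughly what that one line does, written with NumPy only. It is a simplified version of the standard K-means procedure (Lloyd's algorithm) with random initial centers, not scikit-learn's actual implementation. The function name `kmeans_sketch` and its parameters are assumptions made for illustration.

```python
import numpy as np

def kmeans_sketch(features, n_clusters, n_iter=100, seed=10):
    """Simplified K-means (Lloyd's algorithm); not scikit-learn's actual implementation."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features, dtype=float)
    # Start from n_clusters distinct data points chosen at random.
    centers = features[rng.choice(len(features), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (Euclidean distance).
        distances = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its members;
        # a center that lost all its members is left where it is.
        new_centers = np.array([
            features[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two clearly separated blobs as a toy check.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels, centers = kmeans_sketch(points, n_clusters=2)
print(labels)
```

Which blob receives label 0 depends on the random initial centers, so only the grouping is meaningful, not the label numbers. The same holds for the labels 0, 1, and 2 in the student example.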