[PYTHON] Classify data by k-means method

Hello, this is Motty. This time, I tried classification (clustering) in Python.

What is classification?

In statistics and machine learning, classification in this sense (clustering) means dividing data into groups whose members have similar features. It is a form of "unsupervised learning" because the grouping is done without labels or criteria given in advance.

KMeans method

The K-means method is an algorithm that partitions data into a given number of clusters (k) using the mean of each cluster. Each data point is assigned to the cluster whose center of gravity (centroid) is nearest, the centroids are then recomputed from the points assigned to them, and these two steps are repeated until the assignment stops changing.
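The assign-and-update loop described above can be sketched in plain NumPy. This is only a minimal illustration, not the scikit-learn implementation used later in this post: the function name kmeans_sketch, its parameters, the random initialization from data points, and the six sample points are all assumptions made for the example.

```python
import numpy as np


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Hypothetical helper: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (an assumed initialization)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of the points assigned to it (keep the old one if empty)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # Centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Two well-separated groups of 2-D points (made-up sample data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 and which receives label 1 depends on the random starting centroids, so only the grouping itself is meaningful, not the label numbers.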


Implemented in Python

KMeans.py


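import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs as mb


clf = KMeans(n_clusters = 3)
N = 100 #Number of samples

# make_blobs returns (features, true labels); only the features are used
dataset = mb(n_samples = N, centers = 3)
features = np.array(dataset[0])
pred = clf.fit_predict(features) # Cluster index (0 to 2) of each sample

# Plot each predicted cluster in its own color
for i in range(3):
    labels = features[pred == i]
    plt.scatter(labels[:,0],labels[:,1])

plt.show()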

I was able to classify the data neatly.

Note that this worked because the data itself is clean, the value of k is appropriate, and the algorithm suits the data. If any of these conditions is not met, the data may not be divided this neatly.

If there is an outlier

NOISE = [25,25] # A single point far away from every blob
features = np.append(features,NOISE).reshape(-1,2) # np.append flattens, so reshape back to (n, 2)

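To look at the effect of such an outlier in numbers rather than only in a plot, here is a minimal sketch that fits KMeans once without and once with the extra point. The fixed random_state values, n_init = 10, and the variable names are assumptions added for reproducibility and are not part of the original script. np.vstack is used in place of np.append plus reshape; both give the same array.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same kind of data as above, with a fixed seed (assumption) so runs are repeatable
features, _ = make_blobs(n_samples=100, centers=3, random_state=0)
NOISE = [25, 25]
with_outlier = np.vstack([features, NOISE])  # Outlier becomes the last row

clean_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
noisy_model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(with_outlier)

# How many points share a cluster with the outlier?
outlier_label = noisy_model.labels_[-1]
outlier_cluster_size = int(np.sum(noisy_model.labels_ == outlier_label))

print("size of the cluster containing the outlier:", outlier_cluster_size)
print("centroids without outlier:\n", clean_model.cluster_centers_)
print("centroids with outlier:\n", noisy_model.cluster_centers_)
```

If the outlier ends up in a cluster of its own, only two centroids are left to describe the three real blobs. If it joins an existing cluster, it pulls that cluster's centroid toward itself. Either way the result on the real data tends to get worse.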

If the number of clusters is not appropriate

dataset = mb(centers = 4) # Four blobs, while KMeans is still set to n_clusters = 3

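One rough way to check whether k matches the data is to compare the within-cluster sum of squares (the inertia_ attribute in scikit-learn) for several values of k. The following is only a sketch under assumptions: the range of k, the fixed random_state, and n_init = 10 are choices made for this example and do not come from the original script.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Data generated from four centers, as in the case above (seed fixed as an assumption)
features, _ = make_blobs(n_samples=100, centers=4, random_state=0)

inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    # inertia_ is the sum of squared distances from each point to its centroid
    inertias[k] = model.inertia_
    print(k, round(model.inertia_, 1))
```

Inertia generally keeps shrinking as k grows, so the smallest value is not the goal. What is usually looked for is the k after which the decrease becomes small, and when blobs overlap that point may not be clear.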

Cases where KMeans is not a suitable algorithm in the first place

makemoons.py


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

clf = KMeans(n_clusters = 2)

X1,y1 = make_moons(noise = 0.05, random_state=0)
pred1 = clf.fit_predict(X1)

for i in range(2):
    labels = X1[pred1 == i]
    plt.scatter(labels[:,0],labels[:,1])

plt.show()

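Because make_moons also returns the true moon each point belongs to, the mismatch can be put into a number. Below is a minimal sketch that compares the KMeans labels with the true labels using the adjusted Rand index from sklearn.metrics. Using this metric, n_init = 10, and the fixed random_state for KMeans are assumptions added for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# y_true holds the moon (0 or 1) each point was generated from
X, y_true = make_moons(noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# 1.0 means the clusters match the moons exactly; values near 0 mean
# the agreement is about what random labeling would give
ari = adjusted_rand_score(y_true, pred)
print("adjusted Rand index:", round(ari, 3))
```

KMeans with two clusters splits the plane with a straight line between the two centroids. The tip of each moon reaches into the hollow of the other, so no straight line can separate them, and the score stays below 1.0.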

Conclusion

There are various classification algorithms, and this time I described one of them, the KMeans method. I would like to cover classification with SVM and random forest later.
