[PYTHON] I implemented the K-means method (clustering method)

This article is written by a data scientist working in the manufacturing industry.
This time, I implemented the K-means method, one of the clustering methods.

What is clustering?

Clustering is the task of dividing a set of samples into groups according to some rule. In machine learning, clustering is categorized as "unsupervised learning."

There are several ways to compute a clustering, but all of them group samples based on the similarity between them. Clustering methods can be broadly divided into "hierarchical clustering" and "non-hierarchical clustering". The K-means method implemented this time is classified as "non-hierarchical clustering".
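To make the two families concrete, here is a minimal sketch on made-up toy data. The data, the variable names, and the choice of Ward linkage are my own assumptions for illustration and are not part of the implementation later in this article. It uses SciPy for the hierarchical side and scikit-learn for the non-hierarchical side.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Toy data (hypothetical): two blobs of 15 points each
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.3, size=(15, 2)),
                    rng.normal(4.0, 0.3, size=(15, 2))])

# Hierarchical: build the full merge tree first, then cut it into 2 clusters
tree = linkage(X_demo, method="ward")
hier_labels = fcluster(tree, t=2, criterion="maxclust")  # labels start at 1

# Non-hierarchical: fix the number of clusters up front and partition directly
flat_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)
```

The hierarchical result can be cut again at a different number of clusters without refitting, whereas K-means has to be run again for each k.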

What is K-means method?

It is a method that partitions the samples into a number of clusters k chosen in advance, using the mean (center) of each cluster. The outline of the algorithm of the K-means method is as follows.

1. Determine k initial values for the cluster centers.
2. Compute the distance between every sample and each of the k cluster centers, and assign each sample to the closest cluster.
3. Recompute the center of each of the k clusters formed.
4. Repeat steps 2 and 3 until the centers no longer change.
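The four steps above can be sketched directly in NumPy. This is only a minimal illustration, not the scikit-learn implementation used later. The function name `kmeans_sketch`, the random choice of initial centers, and the toy data are my own assumptions.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain K-means following the four steps above (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct samples as the initial centers
    # (an assumption; other initialization schemes exist)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every sample to every center, then nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its assigned samples
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two well-separated blobs of 20 points each
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(5.0, 0.3, size=(20, 2))])
toy_labels, toy_centers = kmeans_sketch(X_toy, k=2)
```

Because the result depends on the initial centers, practical implementations usually run the algorithm several times from different starting points and keep the best run.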


Implementation of K-means method

The Python code is below.

# Import the required libraries
import numpy as np
import pandas as pd

# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
# Jupyter/IPython magic; remove this line when running as a plain Python script
%matplotlib inline
sns.set_style('whitegrid')

# Class for standardization
from sklearn.preprocessing import StandardScaler

# Import what you need for the k-means method
from sklearn.cluster import KMeans

First, import the required libraries. This time I will implement it using the iris dataset.

# iris data
from sklearn.datasets import load_iris

# Load the data
iris = load_iris()
iris.keys()

# Store in a data frame
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)
df_iris['target'] = iris.target  # Iris species (correct label)
df_iris.head()

#Scatter plot of 2 variables (color coded by correct label)
plt.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
plt.xlabel('petal_length')
plt.ylabel('petal_width')

Figure: Scatter plot of the two variables

I visualized the data using two variables, petal length and petal width. Next, I would like to visualize all the variables with a scatter plot matrix.

#Scatterplot matrix (color coded by correct label)
sns.pairplot(df_iris, hue='target', height=1.5)

Figure: Scatter plot matrix

Next, I would like to determine the number of clusters using the elbow method. We know that the iris data should be divided into three groups, but in practice clustering is unsupervised learning, so you have to decide the number of clusters yourself. The elbow method is one way to do this: compute the within-cluster sum of squares (WCSS) for a range of cluster counts and look for the point where the curve bends and stops decreasing sharply.

# Elbow method
wcss = []

for i in range(1, 10):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=30, random_state=0)
    # Columns 2 and 3 are petal length and petal width
    kmeans.fit(df_iris.iloc[:, 2:4])
    # inertia_ is the within-cluster sum of squares (WCSS)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 10), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Figure: Elbow method (WCSS versus number of clusters)

Looking at the results of the elbow method, the decrease in WCSS levels off at around three clusters, so there is little to gain from using more than three.
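For reference, the WCSS on the vertical axis is the sum of squared distances from each sample to the center of the cluster it belongs to, and scikit-learn exposes it as `inertia_`. The following is a small sketch that recomputes it by hand for k = 3 on the same two petal columns. The variable names and `n_init=10` are my own choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_petal = load_iris().data[:, 2:4]  # petal length and petal width

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_petal)

# WCSS: squared distance from each sample to the center of its own cluster, summed
assigned_centers = km.cluster_centers_[km.labels_]
wcss_manual = float(((X_petal - assigned_centers) ** 2).sum())

# The two values should agree up to floating-point error
print(wcss_manual, km.inertia_)
```

WCSS always decreases as k grows, which is why the elbow method looks for the bend in the curve rather than the minimum.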

I would like to start modeling from here.

# Modeling
clf = KMeans(n_clusters=3, random_state=1)
clf.fit(df_iris.iloc[:, 2:4])

# Cluster numbers of the training data
clf.labels_

# Assign a cluster number to unknown data
# Here we predict on the training data, so the result is the same as `clf.labels_`
y_pred = clf.predict(df_iris.iloc[:, 2:4])
y_pred
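The model above is fit on the raw petal measurements, and the `StandardScaler` imported at the beginning is not used. K-means relies on Euclidean distance, so when features have very different scales it is common to standardize them first. As a sketch only, assuming you wanted to cluster on all four iris features after standardization, it could look like this. I am not claiming this gives a better result for iris.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_all = load_iris().data  # all four measurements, in cm

# Standardize each column to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_all)

# Same settings as the model above, but fit on the standardized features
km_scaled = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X_scaled)
scaled_labels = km_scaled.labels_
```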

# Compare the actual species with the result of clustering
fig, (ax1, ax2) = plt.subplots(figsize=(16, 4), ncols=2)

# Actual species distribution
ax1.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=df_iris.target, cmap=mpl.cm.jet)
ax1.set_xlabel('petal_length')
ax1.set_ylabel('petal_width')
ax1.set_title('Actual')
# Distribution of clusters assigned by K-means
# (cluster numbers are arbitrary, so colors need not match the left panel)
ax2.scatter(df_iris['petal length (cm)'], df_iris['petal width (cm)'], c=y_pred, cmap=mpl.cm.jet)
ax2.set_xlabel('petal_length')
ax2.set_ylabel('petal_width')
ax2.set_title('Predict')
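The two plots compare the result visually. To put a number on the agreement, one option is a cross-tabulation together with the adjusted Rand index from scikit-learn. This is a sketch I added for illustration; the variable names and `n_init=10` are my own choices, and the block reloads the data so that it runs on its own.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X_petal = iris.data[:, 2:4]  # petal length and petal width

pred = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_petal)

# Rows: true species, columns: cluster number. Cluster numbers are arbitrary,
# so look at which column dominates each row rather than at the diagonal.
table = pd.crosstab(pd.Series(iris.target, name="species"),
                    pd.Series(pred, name="cluster"))
print(table)

# Adjusted Rand index: 1.0 means identical groupings,
# values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, pred)
print(ari)
```

The adjusted Rand index does not depend on how the clusters are numbered, so it can be used even though cluster 0 does not necessarily correspond to species 0.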

In closing

Thank you for reading to the end. This time, I implemented the K-means method.

If you find anything that needs correcting, I would appreciate it if you could let me know.