Introduction

This article describes K-means clustering and principal component analysis (PCA), two methods that are useful when analyzing data with Python's pandas. To understand clustering, it helps to know about the mean, deviation, standardization, and so on beforehand. Knowledge up to the second chapter of the book "You can learn statistics for 4 years in college in 10 hours" is sufficient. The code below assumes that pandas has already been imported as pd.

What is the K-means method?

An algorithm that classifies data into k clusters, where k is a number you choose in advance.
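
To make the idea concrete, here is a minimal sketch of the loop behind K-means: assign every point to its nearest center, then move each center to the mean of the points assigned to it, and repeat. It is a simplified NumPy illustration, not scikit-learn's implementation; the function name simple_kmeans and the toy data are made up for this example.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=10, seed=0):
    """Toy K-means: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Each point gets the label of its nearest center
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated groups of points (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = simple_kmeans(X, k=2)
print(labels)   # the first three points share one label, the last three the other
print(centers)
```

Which group receives label 0 and which receives label 1 depends on the random starting centers, so only the grouping itself is meaningful.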

What is principal component analysis?

A technique for reducing the number of dimensions. (Data with three or more variables is difficult to draw on a plane; reducing it to two dimensions makes it possible to plot the data in an easy-to-understand way.)
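
As a rough illustration of what "reducing the number of dimensions" means, the sketch below turns made-up data with 3 variables into 2 coordinates using only NumPy. It centers the data, finds the directions of largest variance with a singular value decomposition, and projects onto the first two. The data values are invented for the example, and scikit-learn's PCA may differ in details such as the sign of each axis.

```python
import numpy as np

# Made-up data: 5 samples with 3 variables each
X = np.array([[2.0, 4.1, 1.0],
              [3.0, 6.2, 0.8],
              [4.0, 7.9, 1.2],
              [5.0, 10.1, 0.9],
              [6.0, 12.0, 1.1]])

# Center each variable on its mean
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered from most to least variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two directions: 3 variables become 2 coordinates
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)  # (5, 2)
```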

Import the libraries used for K-means from scikit-learn

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Import the library used for principal component analysis from scikit-learn

from sklearn.decomposition import PCA

Data standardization

Standardize the data with the fit_transform method of the imported StandardScaler.

Premise: the data to be analyzed has already been loaded and edited with pandas and is stored in a DataFrame named an_data. This DataFrame holds only numerical columns (such as mean, median, max, and min).

sc = StandardScaler()
clustering_sc = sc.fit_transform(an_data)
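
Standardization rescales every column to mean 0 and standard deviation 1, so that a column with large values does not dominate the distance calculation in K-means. The sketch below checks this on a small stand-in for an_data; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an_data; the column names and values are made up
an_data = pd.DataFrame({
    "mean": [10.0, 12.0, 30.0, 32.0],
    "max": [100.0, 140.0, 300.0, 260.0],
})

sc = StandardScaler()
clustering_sc = sc.fit_transform(an_data)  # returns a NumPy array, not a DataFrame

# Each column should now have mean 0 and standard deviation 1
print(clustering_sc.mean(axis=0).round(6))
print(clustering_sc.std(axis=0).round(6))
```

Note that the result is a plain NumPy array, so the column names of an_data are no longer attached to it.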

Clustering with K-means

Setting the KMeans option random_state to a fixed integer makes the result reproducible: running the code again with the same value gives the same clusters. (The default is random_state=None, which uses a different random seed on every run.)

kmeans = KMeans(n_clusters=<Number of clusters>, random_state=0)
clusters = kmeans.fit(clustering_sc)
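
The following sketch runs the same two lines on made-up data so that the result can be inspected. It assumes two clusters and passes n_init explicitly, because the default for that option differs between scikit-learn versions; the data values are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data with two obvious groups, standing in for the standardized array
clustering_sc = np.array([[-1.0, -1.1], [-0.9, -1.0], [-1.1, -0.9],
                          [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])

# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10)
clusters = kmeans.fit(clustering_sc)

print(clusters is kmeans)         # fit() returns the estimator itself
print(clusters.labels_)           # one cluster label per row
print(clusters.cluster_centers_)  # one center per cluster
```

Because fit() returns the estimator itself, clusters.labels_ and kmeans.labels_ refer to the same array.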

Output of clustering results to a table

an_data["result_clustering"] = clusters.labels_
an_data.head()
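
Once the labels are in the table, pandas can summarize each cluster. The sketch below is one possible way to do it, using a small made-up table in place of an_data: it counts the rows per cluster and averages every column within each cluster.

```python
import pandas as pd

# Hypothetical an_data after the cluster labels have been attached
an_data = pd.DataFrame({
    "mean": [10.0, 12.0, 30.0, 32.0, 11.0],
    "max": [100.0, 140.0, 300.0, 260.0, 120.0],
    "result_clustering": [0, 0, 1, 1, 0],
})

# Number of rows in each cluster
counts = an_data.groupby("result_clustering").size()
print(counts)

# Average of every column within each cluster
cluster_means = an_data.groupby("result_clustering").mean()
print(cluster_means)
```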

Principal component analysis

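hoge = clustering_sc
pca = PCA(n_components=2)  # Specify 2 as the number of dimensions to plot on a two-dimensional plane
pca.fit(hoge)
hoge_pca = pca.transform(hoge)
pca_data = pd.DataFrame(hoge_pca)
# Attach the cluster labels so that the points can be plotted per cluster
pca_data["result_clustering"] = an_data["result_clustering"].values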
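
Reducing to two dimensions discards some information, and explained_variance_ratio_ shows how much of the original variance each component keeps. The sketch below uses random made-up data in place of the standardized array, so the printed ratios are not meaningful in themselves.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up data standing in for the standardized array: 50 rows, 4 variables
rng = np.random.default_rng(0)
hoge = rng.normal(size=(50, 4))

pca = PCA(n_components=2)
hoge_pca = pca.fit_transform(hoge)  # same as fit() followed by transform()
pca_data = pd.DataFrame(hoge_pca)

print(pca_data.shape)  # (50, 2)
# Share of the original variance carried by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

If the sum is low, the two-dimensional plot shows only part of the structure in the data and should be read with that in mind.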

Graph output

Graph display preparation

import matplotlib.pyplot as plt
# For inline graph display in Jupyter
%matplotlib inline

Now that the data has been clustered, output a scatter plot with a separate series for each cluster label.

for i in an_data["result_clustering"].unique():
    tmp = pca_data.loc[pca_data["result_clustering"] == i]
    plt.scatter(tmp[0], tmp[1], label=i)
plt.legend()
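
For reference, the whole flow can be sketched as a single script that runs outside Jupyter. Everything about the data here is an assumption: an_data is replaced by synthetic values with three artificial groups, the column names are made up, and clusters.png is a hypothetical output file name. The Agg backend line is only there so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for running as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an_data: three groups of 30 rows, 4 made-up columns
rng = np.random.default_rng(0)
group_centers = np.array([[0, 0, 0, 0], [8, 8, 0, 0], [0, 8, 8, 8]])
values = np.vstack([c + rng.normal(size=(30, 4)) for c in group_centers])
an_data = pd.DataFrame(values, columns=["mean", "median", "max", "min"])

# Standardize, then cluster the standardized values
clustering_sc = StandardScaler().fit_transform(an_data)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
an_data["result_clustering"] = kmeans.fit(clustering_sc).labels_

# Reduce to two dimensions and carry the cluster labels along
pca_data = pd.DataFrame(PCA(n_components=2).fit_transform(clustering_sc))
pca_data["result_clustering"] = an_data["result_clustering"].values

# One scatter series per cluster label
for i in sorted(pca_data["result_clustering"].unique()):
    tmp = pca_data.loc[pca_data["result_clustering"] == i]
    plt.scatter(tmp[0], tmp[1], label=i)
plt.legend()
plt.savefig("clusters.png")  # hypothetical output file name
```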
