[PYTHON] K-means clustering and principal component analysis (beginner)

Introduction

This article describes two techniques that are useful when analyzing data with Python's pandas: the K-means method (data clustering) and principal component analysis. To understand clustering, it helps to know about the mean, deviation, standardization, and similar concepts beforehand. Knowledge up to the second chapter of the book "You can learn statistics for 4 years in college in 10 hours" is sufficient. The code below assumes that pandas has already been imported as pd.

What is K-means method?

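An algorithm that partitions data into k clusters. Each data point is assigned to the cluster whose center (the mean of its members) is nearest.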
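As a rough illustration of the idea, the following is a minimal NumPy sketch of the basic K-means loop (assign each point to the nearest center, then move each center to the mean of its points). The function name kmeans_sketch, the toy data, and the fixed iteration count are assumptions made for this example; this is not how scikit-learn implements KMeans.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=10, seed=0):
    """Toy K-means for illustration only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as the centers
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated groups of three points each (made-up data)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X, k=2)
print(labels)
```

The label numbers themselves are arbitrary; what matters is which points share a label.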

What is principal component analysis?

A technique for reducing the number of dimensions. Data with three or more variables is difficult to draw on a plane, but reducing it to two dimensions makes it possible to plot it in an easy-to-understand way.
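To make the idea of dimension reduction concrete, here is a small sketch that reduces synthetic 3-variable data to 2 dimensions using only NumPy. It centers the data, finds the principal directions with a singular value decomposition, and projects onto the first two. The random data and variable names are assumptions for illustration; the rest of this article uses scikit-learn's PCA class instead.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))        # 50 samples, 3 variables (synthetic)

# PCA works on mean-centered data
X_centered = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the first two directions to get 2-D coordinates
X_2d = X_centered @ Vt[:2].T

print(X_2d.shape)                   # (50, 2)
```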

Import the libraries used for K-means from scikit-learn

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

Import the library used for principal component analysis from scikit-learn

from sklearn.decomposition import PCA

Data standardization

Standardize the data with the fit_transform method of the imported StandardScaler.

Premise: The data to be analyzed has already been read and edited with pandas (or similar) and stored in the variable an_data. This data holds numerical values such as mean, median, max, and min.

sc = StandardScaler()
clustering_sc = sc.fit_transform(an_data)
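As a quick check of what standardization does, the sketch below compares StandardScaler with the manual formula (value minus mean, divided by standard deviation). The small DataFrame named demo is made-up data for illustration. Note that StandardScaler uses the population standard deviation, which corresponds to ddof=0 in pandas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data standing in for an_data
demo = pd.DataFrame({"mean": [10.0, 20.0, 30.0, 40.0],
                     "max": [100.0, 300.0, 200.0, 400.0]})

scaled = StandardScaler().fit_transform(demo)

# Manual standardization: (x - mean) / std, with population std (ddof=0)
manual = (demo - demo.mean()) / demo.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After standardization, every column has mean 0 and standard deviation 1, so variables on different scales contribute equally to the distance calculation in K-means.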

Clustering with K-means

Setting the KMeans option random_state to a fixed number makes the result reproducible: running the clustering again with the same number gives the same result. (The default is random_state=None, which uses a different random number on each run.)

kmeans = KMeans(n_clusters=<Number of clusters>, random_state=0)
clusters = kmeans.fit(clustering_sc)
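The effect of random_state can be checked with a short experiment. The sketch below fits KMeans twice on the same synthetic data with the same random_state and compares the labels. The random data, the choice of 3 clusters, and the explicit n_init=10 are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))   # synthetic data for illustration

# Same data, same random_state: the two runs should agree
labels_a = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X).labels_
labels_b = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X).labels_

print(np.array_equal(labels_a, labels_b))
```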

Output of clustering results to a table

an_data["result_clustering"] = clusters.labels_
an_data.head()
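Once the labels are in the table, a natural next step is to look at how many rows fell into each cluster and what each cluster looks like on average. The following sketch does this on a tiny made-up DataFrame named demo that mimics an_data with its result_clustering column; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical table with cluster labels already attached
demo = pd.DataFrame({"mean": [1.0, 1.2, 8.0, 8.3, 7.9],
                     "result_clustering": [0, 0, 1, 1, 1]})

# Number of rows in each cluster
counts = demo["result_clustering"].value_counts().sort_index()

# Average of each numeric column per cluster
means = demo.groupby("result_clustering").mean()

print(counts)
print(means)
```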

Principal component analysis

hoge = clustering_sc
pca = PCA(n_components=2)  #Specify 2 for the number of dimensions to output to a two-dimensional plane
pca.fit(hoge)
hoge_pca = pca.transform(hoge)
pca_data = pd.DataFrame(hoge_pca)
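Reducing to two dimensions discards some information. One way to see how much is kept is the explained_variance_ratio_ attribute of a fitted PCA object, which gives the share of the total variance captured by each component. The sketch below uses synthetic 4-variable data as an assumption; with real data the ratios will differ.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 4))   # synthetic data, 4 variables

pca_demo = PCA(n_components=2).fit(demo)

# Share of total variance captured by each of the two components
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

If the sum is low, the 2-D plot hides much of the structure in the data and should be read with care.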

Graph output

Graph display preparation

import matplotlib.pyplot as plt
# Display graphs inline in Jupyter
%matplotlib inline

Now that the data has been clustered, plot it as a scatter plot with a separate series for each cluster label.

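# Attach the cluster labels to the PCA result so it can be filtered by cluster
pca_data["result_clustering"] = an_data["result_clustering"].values
for i in an_data["result_clustering"].unique():
    tmp = pca_data.loc[pca_data["result_clustering"] == i]
    plt.scatter(tmp[0], tmp[1], label=i)
plt.legend()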
