[Python] Reduce K-means clustering results in dimension with PCA and plot them on a scatter plot

Surprisingly, I couldn't find an example of this, so I wrote this article. Suppose we have 6 data points, each with 4-dimensional features.

sample.csv


1,2,3,4
1,2,3,5
1,2,4,5
4,3,2,1
5,3,2,1
5,4,2,1

After clustering this data with K-means, we reduce its dimensionality with PCA and plot the result on a scatter plot. See the scikit-learn documentation for KMeans and PCA (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and the matplotlib documentation for pyplot.

sample.py


# -*- coding: UTF-8 -*-
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load sample.csv
users = np.loadtxt('./sample.csv', delimiter=",")

# Clustering with K-means
model = KMeans(n_clusters=2).fit(users)

# Dimensionality reduction with PCA
pca = PCA(n_components=2)
users_r = pca.fit_transform(users)

# Plot the results on a scatter plot
plt.figure()
for (i, label) in enumerate(model.labels_):
    if label == 0:
        plt.scatter(users_r[i, 0], users_r[i, 1], c='red')
    elif label == 1:
        plt.scatter(users_r[i, 0], users_r[i, 1], c='blue')
plt.show()
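
Incidentally, the per-point loop can be replaced with a single scatter call by passing the label array as the color argument. This is only a sketch, assuming the same users_r and model variables from the script above:

# Color each point by its cluster label in one call (sketch, reuses users_r and model)
plt.figure()
plt.scatter(users_r[:, 0], users_r[:, 1], c=model.labels_, cmap='bwr')
plt.show()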

The following scatter plot is obtained (figure_1.png).
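
If you want to check how much of the original 4-dimensional information the two principal components retain, the fitted PCA object exposes explained_variance_ratio_. A minimal sketch, assuming the pca object from the script above:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
# The sum indicates how faithfully the 2D scatter plot represents the 4D data
print(pca.explained_variance_ratio_.sum())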
