This article describes the K-means method (data clustering) and principal component analysis (PCA), two techniques that are useful when analyzing data with Python's pandas. To understand clustering, it helps to know about the mean, deviation, and standardization beforehand. Knowledge up to the second chapter of the book "You can learn statistics for 4 years in college in 10 hours" is enough. The code below assumes that pandas has already been imported as pd.
K-means method: an algorithm that classifies data into k clusters
Principal component analysis: a technique for reducing the number of dimensions. Data with three or more variables is difficult to draw on a plane, so reducing it to two dimensions makes it possible to plot it in an easy-to-understand way.
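To make "classifying into k clusters" concrete, here is a minimal NumPy sketch of the usual K-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its points. The function name simple_kmeans and the sample data are hypothetical, and this is only an illustration of the idea, not how scikit-learn implements KMeans.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=20, seed=0):
    """Toy K-means: alternate between assigning points and moving centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if the cluster is empty)
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

# Two well-separated blobs of hypothetical 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = simple_kmeans(X, k=2)
print(labels)
```

In practice the scikit-learn KMeans class used in the rest of this article is the better choice; the sketch is only meant to show what the algorithm is doing.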
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
Standardize the data with fit_transform of the StandardScaler loaded above.
Premise: the data to be analyzed has been read and edited with pandas beforehand and stored in the variable below. (It holds numerical columns such as mean, median, max, and min.) Data variable name: an_data
sc = StandardScaler()
clustering_sc = sc.fit_transform(an_data)
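As a quick check of what standardization does, the following sketch builds a small stand-in for an_data (the values are made up for illustration) and confirms that every column ends up with a mean of about 0 and a standard deviation of about 1.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for an_data: columns on very different scales
an_data = pd.DataFrame({
    "mean": [4.0, 5.5, 3.0, 8.0, 6.5],
    "median": [4.0, 5.0, 3.0, 7.0, 6.0],
    "max": [10.0, 14.0, 7.0, 30.0, 21.0],
    "min": [1.0, 2.0, 1.0, 3.0, 2.0],
})

sc = StandardScaler()
clustering_sc = sc.fit_transform(an_data)  # returns a NumPy array, not a DataFrame

# Per-column mean and (population) standard deviation after scaling
col_means = clustering_sc.mean(axis=0)
col_stds = clustering_sc.std(axis=0)
print(col_means.round(6))
print(col_stds.round(6))
```

Standardizing first matters for K-means because it is distance-based: without it, a column with large values such as max would dominate the result.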
If you set the KMeans option random_state to a number, the same result is reproduced on later runs as long as the same number is specified. (The default is random_state=None, which uses different random numbers each time.)
kmeans = KMeans(n_clusters=<Number of clusters>, random_state=0)
clusters = kmeans.fit(clustering_sc)
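The effect of random_state can be checked with a small sketch like the one below. The data is random and purely hypothetical, and n_init=10 is passed explicitly only because its default has changed between scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))  # hypothetical standardized data

def fit_labels(seed):
    # Fit K-means with a fixed seed and return the cluster label of each row
    return KMeans(n_clusters=4, random_state=seed, n_init=10).fit(X).labels_

same_seed_a = fit_labels(0)
same_seed_b = fit_labels(0)

# Compare the two runs that used the same random_state
print(np.array_equal(same_seed_a, same_seed_b))
```

With the same random_state the two label arrays should match, which is what makes an analysis repeatable when you rerun the notebook.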
Output of clustering results to a table
an_data["result_clustering"] = clusters.labels_
an_data.head()
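Once the labels are in the table, a natural next step is to look at how many rows fell into each cluster and what each cluster looks like on average. The sketch below uses a tiny made-up table with the same column name result_clustering; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical table after the cluster labels have been added
an_data = pd.DataFrame({
    "mean": [4.0, 8.0, 5.0, 1.0, 9.0, 3.0],
    "max": [10.0, 20.0, 12.0, 2.0, 22.0, 8.0],
    "result_clustering": [0, 1, 0, 2, 1, 0],
})

# Number of rows in each cluster
counts = an_data.groupby("result_clustering").size()
# Average of every numerical column per cluster
means = an_data.groupby("result_clustering").mean()

print(counts)
print(means)
```

Comparing the per-cluster averages is often the quickest way to give each cluster a meaningful interpretation.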
hoge = clustering_sc
pca = PCA(n_components=2) #Specify 2 for the number of dimensions to output to a two-dimensional plane
pca.fit(hoge)
hoge_pca = pca.transform(hoge)
pca_data = pd.DataFrame(hoge_pca)
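When reducing to two dimensions, it can be useful to check how much of the original variance the two components keep. The sketch below does this with the explained_variance_ratio_ attribute of a fitted PCA. The data is synthetic (four columns built from two underlying factors), and fit_transform is used as a shorthand for calling fit and then transform.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
# Hypothetical data: 4 columns driven by 2 underlying factors plus small noise
hoge = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=100),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=100),
])

pca = PCA(n_components=2)
hoge_pca = pca.fit_transform(hoge)  # same result as pca.fit(hoge) then pca.transform(hoge)
pca_data = pd.DataFrame(hoge_pca)

# Share of the total variance kept by each of the two components
ratio = pca.explained_variance_ratio_
print(pca_data.shape)
print(ratio, ratio.sum())
```

If the sum of the ratios is low for your own data, the two-dimensional plot hides a lot of the structure and should be read with care.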
Graph display preparation
import matplotlib.pyplot as plt
# For graph display with Jupyter
%matplotlib inline
Now that the data has been clustered, output a scatter plot with one color per cluster label.
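# Attach the cluster labels to the PCA result so rows can be filtered by cluster
pca_data["result_clustering"] = an_data["result_clustering"].to_numpy()
for i in an_data["result_clustering"].unique():
    tmp = pca_data.loc[pca_data["result_clustering"] == i]
    plt.scatter(tmp[0], tmp[1], label=i)
plt.legend()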
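For reference, the whole flow (standardize, cluster, reduce with PCA, plot) can be put together as a single script. This is a sketch on synthetic data: the three blobs, the column names, and the choice of three clusters are all assumptions made for illustration, and the Agg backend is selected only so that the script also runs outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: three blobs in 4 dimensions, 40 rows each
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 6]])
an_data = pd.DataFrame(
    np.vstack([c + rng.normal(size=(40, 4)) for c in centers]),
    columns=["mean", "median", "max", "min"],
)

# Standardize, then cluster
clustering_sc = StandardScaler().fit_transform(an_data)
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10).fit(clustering_sc)
an_data["result_clustering"] = kmeans.labels_

# Reduce to two dimensions and carry the labels over
pca_data = pd.DataFrame(PCA(n_components=2).fit_transform(clustering_sc))
pca_data["result_clustering"] = an_data["result_clustering"].to_numpy()

# One scatter call per cluster so each gets its own color and legend entry
fig, ax = plt.subplots()
for i in sorted(pca_data["result_clustering"].unique()):
    tmp = pca_data.loc[pca_data["result_clustering"] == i]
    ax.scatter(tmp[0], tmp[1], label=i)
ax.legend()
```

Outside Jupyter, the figure can be written to disk with fig.savefig and a file name of your choice.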