[Python] Unsupervised Learning 2: Non-hierarchical Clustering

Aidemy 2020/10/29

Introduction

Hello, I'm Yope! I come from a liberal arts background, but I became interested in the possibilities of AI, so I studied at the AI-specialized school "Aidemy". I would like to share the knowledge I gained there, so I am summarizing it on Qiita. I am very happy that many people read my previous summary article. Thank you! This is the second post on unsupervised learning. Nice to meet you.

What to learn this time
・Types of clustering
・k-means method
・DBSCAN method

Clustering

Hierarchical clustering

・__Hierarchical clustering__ is a method that repeatedly clusters the most similar (closest) pieces of data together.
・For example, with the data a = 1, b = 2, c = 10, d = 15, e = 100, clustering proceeds like this:
「(a,b),c,d,e」=>「(a,b),(c,d),e」=>「((a,b),(c,d)),e」=>「(((a,b),(c,d)),e)」
It ends when all the data has finally been merged into one cluster.
・Because these groups form a hierarchy, this is called hierarchical clustering.
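As a quick illustration (not part of the original article), here is a minimal sketch that reproduces the merge order of the example above using SciPy's single-linkage hierarchical clustering:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The five 1-D data points from the example above
X = np.array([[1], [2], [10], [15], [100]], dtype=float)

# Single linkage merges the closest pair of clusters at each step
Z = linkage(X, method="single")

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size]
# Expected order: (a,b) first, then (c,d), then those two groups, then e
print(Z)
```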

Non-hierarchical clustering

・Non-hierarchical clustering, like hierarchical clustering, is a method of clustering the most similar (closest) data, but it does not create a hierarchical structure.
・In non-hierarchical clustering, a person decides how many clusters to make, and clusters are generated accordingly.
・Non-hierarchical clustering requires less computation than hierarchical clustering, so it is effective when the amount of data is large.

Structure of data used for clustering

・By using __make_blobs()__ (from sklearn.datasets), you can generate data with a specified number of clusters.
・Of the returned variables, X holds the __data points (x, y)__ and Y holds the __cluster labels__.
・About each argument:
n_samples: total number of data points
n_features: number of features (dimensions)
centers: number of clusters
cluster_std: standard deviation within each cluster
shuffle: whether to shuffle the data randomly
random_state: seed setting

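The original code listing here was an image; below is a minimal sketch, assuming the arguments described above:

```python
from sklearn.datasets import make_blobs

# Generate 150 two-dimensional points grouped around 3 cluster centers
X, Y = make_blobs(n_samples=150,
                  n_features=2,
                  centers=3,
                  cluster_std=1.0,
                  shuffle=True,
                  random_state=42)

print(X.shape)  # (150, 2): the data points
print(Y[:10])   # cluster label (0, 1, or 2) for each point
```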

k-means method

・The __k-means method__ is one form of non-hierarchical clustering. It divides the data into clusters that have a similar __spread (distribution)__.
・For the division, an index called SSE is used. SSE is the sum of squared distances between each cluster's __centroid__ and the data points that belong to it. (Details are described later.)
・The method then learns and selects the __centroids__ that minimize this SSE.

・The specific flow of the k-means method is as follows (a from-scratch sketch of these steps follows this list).
(1) Extract k points from the data and use them as the initial centroids.
(2) Assign every data point to its nearest centroid.
(3) Compute the center of gravity of the points gathered at each centroid, and make that center of gravity the new centroid.
(4) Compute the distance between the old and new centroids, and repeat (2) and (3) while they are still far apart.
(5) Finish when the distance is small enough.
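To make the steps concrete, here is a minimal from-scratch sketch of the loop above in NumPy (not part of the original article; the tol value and the assumption that no cluster ever becomes empty are simplifications):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Pick k data points at random as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (2) Assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Move each centroid to the mean of its assigned points
        # (for simplicity, assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # (4)/(5) Stop once the centroids have barely moved
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```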

Execution of k-means method

・Use __KMeans()__ (from sklearn.cluster) to run the k-means method. The main arguments are below.
n_clusters: number of clusters (match "centers" in make_blobs)
init: how to set the initial centroids (set randomly with "random")
n_init: how many times to run step (1) above
max_iter: maximum number of repetitions of steps (2) and (3) above
tol: tolerance for deciding the run has "converged"
random_state: initial seed

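・Code: the original listing was an image; here is a minimal sketch using the arguments above, assuming the make_blobs data X generated earlier:

```python
from sklearn.cluster import KMeans

# Configure k-means with the arguments described above
km = KMeans(n_clusters=3,
            init="random",
            n_init=10,
            max_iter=300,
            tol=1e-4,
            random_state=0)

# fit_predict learns the centroids and returns a cluster label per point
Y_km = km.fit_predict(X)
```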

・Result: (figure) a scatter plot of the data points colored by their assigned cluster.

About SSE

・As explained above, SSE is the sum of squared distances between each cluster's center of gravity (centroid) and its data points, and this index can also be used to __evaluate clustering performance__.
・Since what SSE shows is __"how far the data deviate from their centroid"__, the smaller the value, the more tightly organized the clusters are.
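・In symbols (a standard definition, writing $\mu_i$ for the centroid of cluster $C_i$):

```math
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
```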

・To display the SSE value, run the following. (km is the KMeans instance created in the previous section; scikit-learn stores the SSE in the inertia_ attribute.)
print("Distortion: %.2f" % km.inertia_)

Elbow method

・In the k-means method, you have to decide the number of clusters yourself, but there is a method that can serve as a reference when doing so. It is called the __elbow method__.
・The elbow method plots __"the SSE value as the number of clusters is increased"__.
・In this plot, there is a point where the SSE curve bends, and that point is taken as the optimal number of clusters. Because this bend looks like an elbow, it is called the elbow method.

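・Code: the original listing was an image; a minimal sketch, assuming the same data X and trying 1 to 10 clusters:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit k-means for each candidate number of clusters and record the SSE
distortions = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="random", n_init=10, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)  # SSE for this k

# Plot SSE against the number of clusters and look for the "elbow"
plt.plot(range(1, 11), distortions, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("Distortion (SSE)")
plt.show()
```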

・Result: (figure) the SSE curve falls as the number of clusters increases and bends at the "elbow", which marks the optimal number of clusters.

Other non-hierarchical clustering

DBSCAN
・We have looked at the k-means method as an example of non-hierarchical clustering. Its characteristic is that clusters come out close to circular, because the data gather around each cluster's center of gravity. Therefore, when the clusters are __unbiased in size and shape, clustering accuracy tends to be high__, but otherwise the clustering will not be good.

・A method called DBSCAN can be used in such cases.
・DBSCAN is a method that focuses on __places where more than a certain number of data points gather__, and separates the data around them from the rest.
・Specifically, it uses two parameters, __"min_samples"__ and __"eps"__. The procedure is as follows.
(1) If there are at least "min_samples" data points within radius "eps" of a point, that point is regarded as a core point.
(2) Points within radius "eps" of a core point are regarded as border points.
(3) Points that are neither core points nor border points are regarded as noise points.
(4) Groups of core points form clusters, border points are incorporated into the nearest cluster, and the process finishes.

・In this way, DBSCAN classifies data points into three types, so even biased data can be clustered well.

・To execute DBSCAN, use __DBSCAN()__ (from sklearn.cluster). (In the code below, __metric="euclidean"__ declares that the Euclidean distance is used.)

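The original listing was an image; a minimal sketch, assuming the same data X and illustrative eps / min_samples values:

```python
from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: points required to form a core point
db = DBSCAN(eps=0.5, min_samples=5, metric="euclidean")

# fit_predict returns a cluster label per point; noise points get the label -1
Y_db = db.fit_predict(X)
print(Y_db[:10])
```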

Summary

・There are two types of clustering: hierarchical clustering and __non-hierarchical clustering__. Because of how its algorithm works, non-hierarchical clustering requires setting the number of clusters manually.
・One non-hierarchical clustering method is the __k-means method__. In the k-means method, clusters are generated by repeatedly updating the centers of gravity (centroids).
・SSE can be used as a performance index for the k-means method. The smaller the value, the better the clustering.
・The optimal number of clusters can be found with the __elbow method__, which plots the relationship between the number of clusters and SSE.
・Another non-hierarchical clustering method is DBSCAN. Since DBSCAN forms clusters by looking at the number of data points within a certain range, clustering tends to go well even with biased data.

That's all for this time. Thank you for reading to the end.
