Yesterday I gave an overview of clustering and walked through the steps of actually running it with scikit-learn.
Clustering by scikit-learn (1)
Let's go back to the basics and explore what clustering is in the first place.
In many machine learning algorithms, features are represented as vectors. In linear algebra, a set on which addition and scalar multiplication are defined is called a vector space, and its elements are called vectors.
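As a small illustration of the two vector-space operations mentioned above, here is a minimal sketch using NumPy. The two feature vectors are made-up values chosen only for the example.

```python
import numpy as np

# Two hypothetical feature vectors (four measurements per sample).
a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.2, 2.9, 4.3, 1.3])

# Vector addition: element-wise sum, still a 4-dimensional vector.
total = a + b

# Scalar multiplication: every component is scaled by the same number.
scaled = 2.0 * a

# Combining both operations gives the mean of the two vectors,
# which is the kind of quantity a cluster centre is built from.
mean = 0.5 * (a + b)

print(total)
print(scaled)
print(mean)
```

Because sums and scalar multiples of feature vectors are again feature vectors, averages such as a cluster centre are well defined.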
Roughly speaking, clustering is a method of calculating how similar features are and grouping similar ones.
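To make "how similar features are" concrete, the following sketch measures the Euclidean distance between every pair of samples with `pairwise_distances` from scikit-learn. The three sample vectors are hypothetical, and Euclidean distance is just one possible choice of measure.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Three hypothetical 2-D feature vectors: the first two lie close
# together, the third is far away from both.
features = np.array([
    [1.0, 1.0],
    [1.0, 2.0],
    [8.0, 9.0],
])

# Entry [i, j] is the Euclidean distance between sample i and sample j.
dist = pairwise_distances(features, metric='euclidean')
print(dist)

# Sort the distances from sample 0; index 0 of the sorted order is the
# sample itself (distance 0), so index 1 is its nearest neighbour.
nearest = np.argsort(dist[0])[1]
print(nearest)
```

A smaller distance means more similar samples, so samples 0 and 1 would naturally end up in the same group.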
Regardless of whether the original data is text or images, once its patterns have been recognized and reduced to features, the data can be grouped without supplying any labeled training data (that is, without supervision).
For example, it can be applied to a variety of tasks, such as grouping a large, open-ended set of questionnaire responses by similar respondents, or extracting the skin-colored regions of an image.
As you can see from the above, the key to clustering is how similarity is measured.
I'll walk through the code, following the Clustering section of the scikit-learn tutorial.
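from sklearn import metrics

# Adjusted mutual information: the two labelings agree only partially
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => 0.225042310598

# Two labelings with little in common score close to or below zero
labels_true = [0, 1, 2, 0, 3, 4, 5, 1]
labels_pred = [1, 1, 0, 0, 2, 2, 2, 2]
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))
# => -0.105263430575

# Homogeneity: does each cluster contain members of only one class?
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.homogeneity_score(labels_true, labels_pred))
# => 0.666666666667
# Completeness: are all members of a class assigned to the same cluster?
print(metrics.completeness_score(labels_true, labels_pred))
# => 0.420619835714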
As you can see, scikit-learn provides a variety of measures for scoring how similar two labelings are.
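Two properties of these scores are worth checking by hand. The sketch below reuses the label lists from the examples above. It first renames the predicted clusters (the mapping `renamed` is an arbitrary choice for illustration) to show that the score depends only on the grouping, and then checks that the V-measure is the harmonic mean of homogeneity and completeness.

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Rename the predicted clusters (0 -> 2, 1 -> 0, 2 -> 1).
# The grouping of the samples is unchanged.
renamed = {0: 2, 1: 0, 2: 1}
labels_renamed = [renamed[label] for label in labels_pred]

ami_original = metrics.adjusted_mutual_info_score(labels_true, labels_pred)
ami_renamed = metrics.adjusted_mutual_info_score(labels_true, labels_renamed)
print(ami_original, ami_renamed)

# A labeling that matches the ground truth up to renaming scores 1.0.
perfect = metrics.adjusted_mutual_info_score(labels_true, [1, 1, 1, 0, 0, 0])
print(perfect)

# V-measure combines homogeneity and completeness as a harmonic mean.
h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)
print(v, 2 * h * c / (h + c))
```

This is why the cluster numbers returned by an algorithm do not need to match the class numbers in the ground truth.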
Let's try clustering with yesterday's code. scikit-learn ships with sample datasets, so we will use one of them as is. First, prepare the dataset.
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn import datasets

# Load the iris dataset: X holds the features, y the class labels
dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# Take a peek at the contents
print(X)
print(y)
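Printing the raw arrays produces a lot of output, so a quick summary can be easier to read. The following sketch is an optional addition, not part of yesterday's code; it uses attributes of the object returned by `load_iris` to show the shape of the data and how many samples each class has.

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# (number of samples, number of features): 150 samples, 4 features each
print(X.shape)

# Names of the four measured features and of the three species
print(dataset.feature_names)
print(dataset.target_names)

# Number of samples per class: 50 for each of the three species
counts = np.bincount(y)
print(counts)
```

Each row of `X` is one feature vector, and `y` holds the known species, which clustering itself does not use.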
Let's cluster with yesterday's code.
import numpy as np
from sklearn.cluster import KMeans

# Fit k-means with 3 clusters and read back the cluster label of each sample
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_

# Compute the silhouette coefficient, using Euclidean distance between samples
print(metrics.silhouette_score(X, labels, metric='euclidean'))

# Cluster using yesterday's code
# (make_cluster and write_cluster come from yesterday's code and are not defined in this post)
clusters = make_cluster(X)

# Output the result to a file
write_cluster(clusters, 'out.txt')

# Peek at the contents of the generated clusters
print(clusters)
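Since the iris dataset comes with the true species in `y`, the k-means result can also be scored with the measures introduced earlier. The following is a minimal, self-contained sketch and not part of yesterday's code. Passing `n_init=10` explicitly is an assumption made here so that the run behaves the same across scikit-learn versions.

```python
from sklearn import datasets, metrics
from sklearn.cluster import KMeans

dataset = datasets.load_iris()
X = dataset.data
y = dataset.target

# n_init is set explicitly (an assumption of this sketch, see above).
kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
labels = kmeans_model.labels_

# Internal measure: needs only the features and the cluster labels.
silhouette = metrics.silhouette_score(X, labels, metric='euclidean')

# External measures: compare the clusters with the known species.
ami = metrics.adjusted_mutual_info_score(y, labels)
homogeneity = metrics.homogeneity_score(y, labels)
completeness = metrics.completeness_score(y, labels)

print(silhouette, ami, homogeneity, completeness)
```

The silhouette coefficient can be computed for any dataset, while the other three scores are only available when ground-truth labels exist, as they do here.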
With a powerful clustering library, once the features of the target have been extracted by pattern recognition, grouping becomes easy, and the technique can be applied in a wide range of fields.