[PYTHON] Perform (Visualization> Clustering> Feature Description) with (t-SNE, DBSCAN, Decision Tree)

Summary

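I tried out an idea: visualize the data with t-SNE, cluster the result with DBSCAN, and then describe each cluster with a decision tree. To be honest, I haven't yet figured out how to put it to practical use.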

Try it with sklearn's load_boston (removed in scikit-learn 1.2, so the code here needs an older version)

First import what you need

import numpy as np
from sklearn import datasets
from sklearn.manifold import TSNE
from matplotlib import pyplot as plt
from matplotlib import cm  # colormaps used to color the cluster plots

Load boston and visualize it with TSNE. Visually, some clusters appear.

boston = datasets.load_boston()
model = TSNE(n_components=2)
tsne_result = model.fit_transform(boston.data) 
plt.plot(tsne_result[:,0], tsne_result[:,1], ".")

boston_tsne.png
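
As a side note, t-SNE works on distances between rows, so features with large numeric ranges can dominate the picture. The following is a minimal sketch, not what I ran above: it standardizes the features first and fixes `random_state` so the layout is repeatable. The bundled `load_wine` dataset is only a stand-in so the sketch runs on current scikit-learn, and the scaling step is my assumption, not part of the original experiment.

```python
from sklearn.datasets import load_wine
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): 178 samples, 13 numeric features.
data = load_wine().data

# Put every feature on a comparable scale before computing distances.
scaled = StandardScaler().fit_transform(data)

# random_state makes the embedding reproducible between runs.
embedding = TSNE(n_components=2, random_state=0).fit_transform(scaled)
print(embedding.shape)
```

Whether scaling improves the picture depends on the data, so it is worth comparing both versions by eye.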

Let's cluster with kmeans once for comparison

from sklearn.cluster import MiniBatchKMeans
# The number of clusters `n_clusters` was chosen by eye from the t-SNE plot
kmeans = MiniBatchKMeans(n_clusters=10, max_iter=300)
kmeans_tsne = kmeans.fit_predict(tsne_result)

#Color it nicely
color=cm.brg(np.linspace(0,1,np.max(kmeans_tsne) - np.min(kmeans_tsne)+1))
for i in range(np.min(kmeans_tsne), np.max(kmeans_tsne)+1):
    plt.plot(tsne_result[kmeans_tsne == i][:,0],
             tsne_result[kmeans_tsne == i][:,1],
             ".",
             color=color[i]
             )
    plt.text(tsne_result[kmeans_tsne == i][:,0][0],
             tsne_result[kmeans_tsne == i][:,1][0],
             str(i), color="black", size=16
             )

Clusters (1,5), (2,8), and (4,7,9) are split apart even though they are structurally connected, which is not what I want.

boston_tsne_kmeans.png
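
The splitting seen with k-means can be reproduced on a toy dataset. This is a rough sketch on the two-moons data from `make_moons` (my choice of example, with `eps=0.2` picked by hand), comparing k-means and DBSCAN against the known moon labels using the adjusted Rand index.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: each one is a connected but non-convex shape.
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means partitions by distance to centroids, so it tends to cut across the crescents.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN grows clusters through dense neighborhoods, so it can follow each crescent.
dbscan_labels = DBSCAN(eps=0.2).fit_predict(X)

ari_kmeans = adjusted_rand_score(y, kmeans_labels)
ari_dbscan = adjusted_rand_score(y, dbscan_labels)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})  # -1 marks noise points

print("k-means ARI:", ari_kmeans)
print("DBSCAN ARI:", ari_dbscan, "clusters:", n_dbscan_clusters)
```

A higher adjusted Rand index means the clustering agrees more closely with the true moons.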

Try clustering with DBSCAN

from sklearn.cluster import DBSCAN
# `eps` was found by trial and error
dbscan = DBSCAN(eps=3)
dbscan_tsne = dbscan.fit_predict(tsne_result)

#Color it nicely
color=cm.brg(np.linspace(0,1,np.max(dbscan_tsne) - np.min(dbscan_tsne)+1))
for i in range(np.min(dbscan_tsne), np.max(dbscan_tsne)+1):
    plt.plot(tsne_result[dbscan_tsne == i][:,0],
             tsne_result[dbscan_tsne == i][:,1],
             ".",
             color=color[i+1]
             )
    plt.text(tsne_result[dbscan_tsne == i][:,0][0],
             tsne_result[dbscan_tsne == i][:,1][0],
             str(i), color="black", size=16
             )

With DBSCAN, connected islands end up in the same cluster, which is what I want. (-1 is the label for noise points that do not belong to any cluster.)

boston_tsne_dbscan.png
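
I picked `eps` by trial and error, but a common heuristic is to look at the sorted distance to each point's k-th nearest neighbor and choose a value near where the curve bends. The sketch below is an assumption-laden illustration of that heuristic, not what I did: synthetic blobs stand in for the t-SNE result, and `k = 4` is chosen because DBSCAN's default `min_samples=5` counts the point itself.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic 2-D points standing in for the t-SNE embedding (assumption).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

k = 4  # min_samples - 1, since min_samples includes the point itself
nn = NearestNeighbors(n_neighbors=k).fit(X)

# Calling kneighbors() without arguments excludes each point from its own neighbors.
distances, _ = nn.kneighbors()

# Distance to the k-th neighbor for every point, sorted ascending.
k_dist = np.sort(distances[:, -1])

# Upper quantiles give a rough range of eps candidates to try.
for q in (0.90, 0.95, 0.99):
    print(f"{int(q * 100)}th percentile of k-distance: {np.quantile(k_dist, q):.3f}")
```

Plotting `k_dist` with `plt.plot(k_dist)` and reading off the bend is the usual way to use this; the quantiles are only a quick substitute.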

In addition, generate a decision tree to try to explain each cluster well.

from sklearn import tree
clf = tree.DecisionTreeClassifier()
# DBSCAN labels start at -1 (the noise cluster); fit() sets clf.classes_ from the labels
clf.fit(boston.data, dbscan_tsne)

#Generate a graphviz dot file
with open("boston_tsne_dt.dot", 'w') as f:
    tree.export_graphviz(
        clf,
        out_file=f,
        feature_names=boston.feature_names,
        filled=True,
        rounded=True,  
        special_characters=True,
        impurity=False,
        proportion=False,
        # class names must be an indexable list, in the same order as the fitted classes
        class_names=[str(c) for c in np.unique(dbscan_tsne)]
    )

Convert the dot file to PNG from the shell with Graphviz:

dot -T png boston_tsne_dt.dot > boston_tsne_dt.png

The result is shown in the figure below.

boston_tsne_dt.png
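
If Graphviz is not installed, the same tree can be printed as indented text rules with `sklearn.tree.export_text`. This is a small sketch under assumptions: synthetic blobs replace the Boston data, DBSCAN runs on the features directly instead of on a t-SNE embedding, the feature names `f0`..`f3` are made up, and `max_depth=3` is added only to keep the output short.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with 4 features (assumption; stands in for boston.data).
X, _ = make_blobs(n_samples=300, centers=3, n_features=4,
                  cluster_std=0.5, random_state=0)

# Cluster labels to be explained; -1 would mark noise points.
labels = DBSCAN(eps=1.5).fit_predict(X)

# A shallow tree keeps the description of each cluster readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)

# Hypothetical feature names, just for illustration.
feature_names = [f"f{i}" for i in range(X.shape[1])]
rules = export_text(clf, feature_names=feature_names)
print(rules)
```

Each leaf line ends with `class: <label>`, so the path to it reads as a rule describing that cluster.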

For reference, draw the target (house price) of each cluster.

plt.boxplot([boston.target[dbscan_tsne == i]
             for i in range(np.min(dbscan_tsne), 
                            np.max(dbscan_tsne)+1)],
            labels=range(np.min(dbscan_tsne), 
                         np.max(dbscan_tsne)+1)
            )

boston_tsne_price.png

Consideration

To summarize what I was interested in,

However, I am quite doubtful whether this actually yields useful information. Incidentally, even if you add boston.target to the input data as an extra feature, the result is much the same.
