[PYTHON] DBSCAN (clustering) with scikit-learn
A note on the outline of DBSCAN, one of the clustering algorithms, and on simple parameter tuning.
There didn't seem to be a complete article on it in Japanese, so I wrote this note.
Please note that the outline of DBSCAN below is a (rough) translation of the Wikipedia article.
What is DBSCAN
- Abbreviation for Density-Based Spatial Clustering of Applications with Noise
- One of the clustering algorithms
Algorithm overview
- Classify points into three categories
  - Core points: points that have at least minPts neighbors within a radius of ε
  - Reachable points (border points): points that have fewer than minPts neighbors within a radius of ε, but have a core point within a radius of ε
  - Outliers (noise): points that are neither core points nor reachable from any core point
- Form a cluster from each connected group of core points and assign the reachable points to those clusters (a short sketch follows the figure below)
[Figure from Wikipedia]
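As a minimal sketch of how these three categories appear in scikit-learn (the toy data and the eps / min_samples values here are arbitrary assumptions), core points can be read from core_sample_indices_, noise points get the label -1, and everything else is a border point:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Toy data; eps and min_samples below are arbitrary assumptions
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # outliers (noise)
border_mask = ~core_mask & ~noise_mask      # reachable (border) points

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())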
Pros
- Unlike k-means, you don't have to decide the number of clusters in advance
- Can find arbitrarily shaped clusters; does not assume that clusters are spherical
- Robust to outliers (see the sketch after this list)
- There are only two parameters, ε and minPts, and it is relatively easy to decide the range to search over
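As a quick illustration of the robustness to outliers (the data and parameter values are arbitrary assumptions), points far from any dense region are simply labeled -1 instead of being forced into a cluster:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Dense blobs plus a few scattered points (values are arbitrary assumptions)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
rng = np.random.RandomState(0)
X = np.vstack([X, rng.uniform(low=-15, high=15, size=(10, 2))])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# Most of the scattered points end up with the noise label -1
print("points labeled as noise:", np.sum(labels == -1))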
Disadvantages
- The notion of border points is somewhat ambiguous: depending on the order in which points are processed, a border point may end up in different clusters
- Results depend on the distance metric used (see the sketch after this list)
- It is difficult to choose ε and minPts properly when the data is dense; in some cases most points end up in a single cluster
- It is difficult to choose ε without understanding the data
- (Not limited to DBSCAN) High-dimensional data suffers from the curse of dimensionality
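Regarding the dependence on the distance calculation, scikit-learn's DBSCAN exposes a metric parameter. A minimal sketch of how switching it changes the result (data and parameter values are arbitrary assumptions); note that the same eps means very different things under different metrics:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=1)

for metric in ["euclidean", "manhattan", "cosine"]:
    labels = DBSCAN(eps=0.5, min_samples=5, metric=metric).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(metric, "-> clusters:", n_clusters, ", noise:", np.sum(labels == -1))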
Difference from other algorithms
The scikit-learn demo page compares the clustering methods side by side; the second column from the right is DBSCAN. You can see intuitively that the points are clustered based on density. A small sketch in the same spirit follows.
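As a hedged, minimal version of that kind of comparison (the dataset and the parameter values are my own assumptions, not taken from the demo), k-means cuts the two interleaved half-moons apart incorrectly, while DBSCAN typically recovers them by following the density:

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-spherical clusters
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Agreement with the true moon labels (1.0 = perfect); DBSCAN usually scores much higher here
print("k-means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN ARI:", adjusted_rand_score(y_true, db_labels))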
Tuning of ε and minPts
With two-dimensional data you can visualize the result to judge whether the clustering is good, but with three or more dimensions it is hard to visualize and judge.
So I adjusted the parameters by printing the number of outliers and the number of clusters while scanning ε and minPts, as follows.
(Using scikit-learn)
import numpy as np
from sklearn.cluster import DBSCAN

for eps in np.arange(0.1, 3, 0.1):  # range() does not accept floats, so use np.arange
    for minPts in range(1, 20):
        dbscan = DBSCAN(eps=eps, min_samples=minPts).fit(X)
        y_dbscan = dbscan.labels_
        print("eps:", eps, ", minPts:", minPts)
        # Number of outliers (noise points are labeled -1)
        print(len(np.where(y_dbscan == -1)[0]))
        # Number of clusters (labels start at 0, so add 1)
        print(np.max(y_dbscan) + 1)
        # Number of points in cluster 1
        print(len(np.where(y_dbscan == 0)[0]))
        # Number of points in cluster 2
        print(len(np.where(y_dbscan == 1)[0]))
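For reference, a small usage example (the make_blobs data here is just a stand-in assumption for your own feature matrix X) that summarizes all label counts at once with np.unique instead of printing clusters one by one:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in data; replace with your own feature matrix X
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6, random_state=42)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
values, counts = np.unique(labels, return_counts=True)
# -1 is the noise "cluster"; the other entries are the cluster sizes
print(dict(zip(values, counts)))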
DBSCAN related links
- Wikipedia
- https://en.wikipedia.org/wiki/DBSCAN
- Official documentation
- http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
- Official demo page
- http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html
- Qiita
- http://qiita.com/ynakayama/items/46d5f2e49f57c7de98a4
- http://qiita.com/yamaguchiyuto/items/82c57c5d44833f5a33c7
Postscript
The Japanese Wikipedia article has since been updated with additional descriptions. It is easy to understand.