[PYTHON] DBSCAN practices and algorithms

DBSCAN is a "density-based" clustering technique.

図2.png

The paper for which DBSCAN was proposed

Martin Ester，Hans-Peter Kriegel，Jorg Sander，Xiaowei Xu. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proccrdings of 2ndInternational Conference on Knowledge Discovery and Data Mining (KDD-96),1996.

So, the algorithm of this paper is also proposed in scikit-learn.

The DBSCAN algorithm is described in the Wiki as follows.

DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a dense regiona. It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points........ (http://en.wikipedia.org/wiki/DBSCAN)

I will interpret this in my own way.

The area is judged as a cluster depending on how many points there are in the radius ε. Furthermore, as long as the density of the neighborhood exceeds a certain threshold, the cluster can continue to grow, so the entire neighborhood can be regarded as a region.

It looks like this in the figure.

![図.png](https://qiita-image-store.s3.amazonaws.com/0/109578/c1f8377d-ab4f-64e6-3678-05c66faea3e0.png)

If you watch this video, you can understand DBSCAN well (I think). http://www.youtube.com/watch?v=5E097ZLE9Sg

merit (1) Unlike Mean-shift, you do not have to set the number of clusters. (2) Since the cluster is calculated recursively, it is robust against outliers. Demerit (1) High calculation cost and difficult to implement in real-time applications

Finally, the source code is described. Determine the range to be searched by eps (how many points there are from a certain point X), and how many points exist in eps with min_samples to consider it as a cluster.

    def (eps,min_samples):
        db = DBSCAN(eps, min_samples).fit(point_data)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
        labels = db.labels_
        n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

        print('Estimated number of clusters: %d' % n_clusters_)