[PYTHON] Clustering with scikit-learn + DBSCAN

Continuing from yesterday, today I will briefly walk through an example that uses scikit-learn from SciPy and NumPy: Optimizing & Boosting your Python Programming. I have already written about clustering in earlier posts (identifying edible mushrooms, reusing clustering results, and [clustering with scikit-learn](http://qiita.com/ynakayama/items/ab2d89be36d3cdaeb4f2)), so I think it is a familiar method in machine learning.

Clustering with scikit-learn

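This time we use the DBSCAN algorithm. DBSCAN works by finding core points, that is, points that have many data points within a specified radius. Once a core is defined, the process iterates, expanding the cluster through the points within that radius until no more core points can be reached. It is often compared with the popular k-means, and it tends to cope better when the data contains noise.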
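To make the idea of a core point concrete, here is a minimal sketch of just the first step (detecting core points), written with plain NumPy. The function name `find_core_points` and the toy data are hypothetical and only for illustration; the default values `eps=0.5` and `min_samples=5` mirror scikit-learn's DBSCAN defaults, where a point counts itself as one of its own neighbors. This is not how scikit-learn implements it internally, and it skips the cluster expansion step entirely.

```python
import numpy as np

def find_core_points(points, eps=0.5, min_samples=5):
    """Return a boolean mask marking which rows of `points` are core points.

    Hypothetical helper for illustration only: a point is treated as a core
    point when at least `min_samples` points (itself included) lie within
    distance `eps` of it.
    """
    # Pairwise Euclidean distances, shape (n, n)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count neighbors within eps; the point itself is at distance 0
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Five tightly packed points plus one far-away outlier
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [0.1, 0.1], [0.05, 0.05], [5.0, 5.0]])
core_mask = find_core_points(points, eps=0.5, min_samples=5)
print(core_mask)
```

The five packed points each have five neighbors within the radius, so they are core points, while the outlier only has itself and would end up as noise.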

The original book also compares and visualizes these methods. Let's try it right away.

import numpy as np
import matplotlib.pyplot as mpl
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

#First, generate sample data with random numbers
c1 = np.random.randn(100, 2) + 5
c2 = np.random.randn(50, 2)

#Generate and stack uniform distribution
u1 = np.random.uniform(low=-10, high=10, size=100)
u2 = np.random.uniform(low=-10, high=10, size=100)
c3 = np.column_stack([u1, u2])

#Store all data in a 250 x 2 array (100 + 50 + 100 rows)
data = np.vstack([c1, c2, c3])

#Clustering using DBSCAN
# db.labels_ is an array holding the cluster identifier of each data point
db = DBSCAN().fit(data)
labels = db.labels_

#Get the coordinates of the points in each identified cluster
#The two clusters are expected to be labeled 0 and 1, and noise is labeled -1
#Split the data according to these labels
dbc1 = data[labels == 0] #first cluster
dbc2 = data[labels == 1] #second cluster
noise = data[labels == -1] #noise

A distinctive feature of DBSCAN is that noise can be separated from the clusters in this way.
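Because the sample data is random, DBSCAN with its default parameters is not guaranteed to return exactly two clusters, so it can be worth checking what the labels actually contain before splitting on 0 and 1. The following is a small sketch of such a check; `summarize_labels` and `example_labels` are hypothetical names introduced here, and the only assumption taken from scikit-learn is that noise is labeled -1.

```python
import numpy as np

def summarize_labels(labels):
    """Return ({cluster_id: size}, n_noise) for a DBSCAN-style label array.

    Hypothetical helper: assumes noise points carry the label -1.
    """
    labels = np.asarray(labels)
    # Every distinct label except -1 is a cluster identifier
    cluster_ids = sorted(set(labels.tolist()) - {-1})
    sizes = {cid: int((labels == cid).sum()) for cid in cluster_ids}
    n_noise = int((labels == -1).sum())
    return sizes, n_noise

# Hand-written labels standing in for db.labels_
example_labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])
sizes, n_noise = summarize_labels(example_labels)
print(sizes, n_noise)
```

With the data from this article you would pass `db.labels_` instead of `example_labels`. If more than two cluster identifiers show up, adjusting the `eps` and `min_samples` arguments of `DBSCAN` is the usual next step.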

Visualization

Let's visualize it with the familiar matplotlib.

x1, x2 = -12, 12
y1, y2 = -12, 12
fig = mpl.figure()
fig.subplots_adjust(hspace=0.1, wspace=0.1)
ax1 = fig.add_subplot(121, aspect='equal')
ax1.scatter(c1[:, 0], c1[:, 1], lw=0.5, color='#00CC00')
ax1.scatter(c2[:, 0], c2[:, 1], lw=0.5, color='#028E9B')
ax1.scatter(c3[:, 0], c3[:, 1], lw=0.5, color='#FF7800')
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_xlim(x1, x2)
ax1.set_ylim(y1, y2)
ax1.text(-11, 10, 'Original')
ax2 = fig.add_subplot(122, aspect='equal')
ax2.scatter(dbc1[:, 0], dbc1[:, 1], lw=0.5, color='#00CC00')
ax2.scatter(dbc2[:, 0], dbc2[:, 1], lw=0.5, color='#028E9B')
ax2.scatter(noise[:, 0], noise[:, 1], lw=0.5, color='#FF7800')
ax2.xaxis.set_visible(False)
ax2.yaxis.set_visible(False)
ax2.set_xlim(x1, x2)
ax2.set_ylim(y1, y2)
ax2.text(-11, 10, 'DBSCAN identified')
fig.savefig('image.png', bbox_inches='tight')
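Since DBSCAN is often compared with k-means, here is a rough sketch that runs both on data generated the same way as above. The variable names (`blob1`, `blob2`, `background`, `sample`) and the fixed seed are assumptions added for reproducibility, and `n_clusters=2` is chosen by hand because k-means needs the number of clusters in advance. The point is only to show that k-means assigns every point, including the uniform background, to some cluster, whereas DBSCAN has a separate -1 label for noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Same kind of data as in the article, with a fixed seed (an assumption)
rng = np.random.default_rng(0)
blob1 = rng.standard_normal((100, 2)) + 5
blob2 = rng.standard_normal((50, 2))
background = rng.uniform(low=-10, high=10, size=(100, 2))
sample = np.vstack([blob1, blob2, background])

# k-means: every point is forced into one of the two clusters
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sample)

# DBSCAN with default parameters: sparse points may be labeled -1 (noise)
db_labels = DBSCAN().fit_predict(sample)

print("k-means labels:", np.unique(km_labels))
print("DBSCAN labels: ", np.unique(db_labels))
```

Plotting `sample` colored by `km_labels` next to the DBSCAN panel should make the difference visible: the background points are absorbed into the two k-means clusters instead of being set aside.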