[PYTHON] Clustering with scikit-learn + DBSCAN

Continuing from yesterday, today I will briefly explain an example that uses scikit-learn, taken from "SciPy and NumPy: Optimizing & Boosting your Python Programming". I have already covered clustering in "Identify edible mushrooms", "Reuse clustering results", and "Clustering with scikit-learn" (http://qiita.com/ynakayama/items/ab2d89be36d3cdaeb4f2), so I think it is a familiar method in machine learning.

Clustering with scikit-learn

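In contrast to the popular kmeans, the DBSCAN algorithm works by finding core points, that is, points that have many data points within a specified radius. Once a core is defined, the process iterates within that radius until no more core points can be found. DBSCAN is often compared with kmeans on noisy data, because it can label points as noise instead of forcing every point into a cluster.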
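To make the idea of a core point concrete, here is a minimal NumPy sketch of that first step only. The function name `find_core_points` and the sample data are hypothetical and made up for illustration; `eps=0.5` and `min_samples=5` mirror the defaults of scikit-learn's `DBSCAN`. This is not scikit-learn's implementation, which also grows clusters outward from the core points and avoids building a full distance matrix.

```python
import numpy as np

def find_core_points(points, eps=0.5, min_samples=5):
    """Return a boolean mask that is True for core points (hypothetical helper)."""
    # Pairwise Euclidean distances between all points (n x n matrix)
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Count the neighbours within eps; each point also counts itself
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_samples

# Illustrative data: one tight blob plus a single far-away point
rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.1, size=(30, 2))
isolated = np.array([[10.0, 10.0]])
points = np.vstack([dense, isolated])

core_mask = find_core_points(points)
print(core_mask.sum(), "core points out of", len(points))
```

The isolated point has only itself in its neighbourhood, so it can never be a core point. Points like this, which are neither core points nor within the radius of one, are what DBSCAN reports as noise.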

The original work also compares and visualizes these methods. Let's try it right away.

import numpy as np
import matplotlib.pyplot as mpl
from scipy.spatial import distance
from sklearn.cluster import DBSCAN

# First, generate sample data with random numbers: two Gaussian clusters
c1 = np.random.randn(100, 2) + 5
c2 = np.random.randn(50, 2)

# Generate uniformly distributed noise and stack it into a 100 x 2 array
u1 = np.random.uniform(low=-10, high=10, size=100)
u2 = np.random.uniform(low=-10, high=10, size=100)
c3 = np.column_stack([u1, u2])

# Store all data in a 250 x 2 array
data = np.vstack([c1, c2, c3])

# Cluster using DBSCAN with its default parameters
# db.labels_ is an array holding the cluster identifier of each data point
db = DBSCAN().fit(data)
labels = db.labels_

# Get the coordinates of the points in each cluster
# Noise is labeled -1; clusters are numbered from 0
# This split assumes DBSCAN found two main clusters, labeled 0 and 1
dbc1 = data[labels == 0]  # first cluster
dbc2 = data[labels == 1]  # second cluster
noise = data[labels == -1]  # noise

A notable feature of DBSCAN is that noise can be separated out in this way.
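The split above assumes exactly two clusters with labels 0 and 1. Because the data is random, DBSCAN may return more or fewer clusters on a given run, so a more general way to inspect the result is to count the points per label. The following is a minimal sketch; `summarize_labels` and the `example_labels` array are hypothetical, made up for illustration, and stand in for `db.labels_`.

```python
import numpy as np

def summarize_labels(labels):
    """Return ({cluster_id: size}, noise_count) for a DBSCAN-style label array."""
    labels = np.asarray(labels)
    # Every label other than -1 identifies a cluster
    cluster_ids = sorted(int(l) for l in set(labels.tolist()) if l != -1)
    sizes = {cid: int((labels == cid).sum()) for cid in cluster_ids}
    # The label -1 marks noise points
    n_noise = int((labels == -1).sum())
    return sizes, n_noise

# Made-up labels standing in for db.labels_
example_labels = np.array([0, 0, 1, -1, 1, 1, -1, 2])
sizes, n_noise = summarize_labels(example_labels)
print(sizes, n_noise)  # {0: 2, 1: 3, 2: 1} 2
```

Calling `summarize_labels(db.labels_)` on the real result shows whether the two-cluster assumption holds before `dbc1` and `dbc2` are plotted.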

Visualization

Let's visualize it with the familiar matplotlib.

x1, x2 = -12, 12
y1, y2 = -12, 12
fig = mpl.figure()
fig.subplots_adjust(hspace=0.1, wspace=0.1)
ax1 = fig.add_subplot(121, aspect='equal')
ax1.scatter(c1[:, 0], c1[:, 1], lw=0.5, color='#00CC00')
ax1.scatter(c2[:, 0], c2[:, 1], lw=0.5, color='#028E9B')
ax1.scatter(c3[:, 0], c3[:, 1], lw=0.5, color='#FF7800')
ax1.xaxis.set_visible(False)
ax1.yaxis.set_visible(False)
ax1.set_xlim(x1, x2)
ax1.set_ylim(y1, y2)
ax1.text(-11, 10, 'Original')
ax2 = fig.add_subplot(122, aspect='equal')
ax2.scatter(dbc1[:, 0], dbc1[:, 1], lw=0.5, color='#00CC00')
ax2.scatter(dbc2[:, 0], dbc2[:, 1], lw=0.5, color='#028E9B')
ax2.scatter(noise[:, 0], noise[:, 1], lw=0.5, color='#FF7800')
ax2.xaxis.set_visible(False)
ax2.yaxis.set_visible(False)
ax2.set_xlim(x1, x2)
ax2.set_ylim(y1, y2)
ax2.text(-11, 10, 'DBSCAN identified')
fig.savefig('image.png', bbox_inches='tight')

[Figure: image.png. Left panel: "Original"; right panel: "DBSCAN identified".]
