[Translation] scikit-learn 0.18 Tutorial: A tutorial on statistical learning for scientific data processing. Unsupervised learning: seeking representations of the data

A Google translation of http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html, part of the scikit-learn 0.18 statistical learning tutorial for scientific data processing.


Unsupervised learning: seeking representations of the data

Clustering: grouping observations together

Problems solved by clustering

Given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters.

K-means clustering

Note that there are many different clustering criteria and related algorithms. The simplest clustering algorithm is K-means (http://scikit-learn.org/stable/modules/clustering.html#k-means).

>>> from sklearn import cluster, datasets
>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> y_iris = iris.target

>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris) 
KMeans(algorithm='auto', copy_x=True, init='k-means++', ...
>>> print(k_means.labels_[::10])
[1 1 1 1 1 0 0 0 0 0 2 2 2 2 2]
>>> print(y_iris[::10])
[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

Warning: There is absolutely no guarantee of recovering the ground truth. First, choosing the right number of clusters is hard. Second, the algorithm is sensitive to initialization and can fall into local minima, although scikit-learn employs several tricks to mitigate this issue.

[Figures: K-means results with a bad initialization, with 8 clusters, and the ground truth]

**Don't over-interpret clustering results**
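To illustrate this sensitivity to initialization, a minimal sketch (reusing `X_iris` from above; the seeds and the use of `init='random'` with a single initialization per run, `n_init=1`, are illustrative assumptions, not part of the original tutorial):

>>> inertias = [cluster.KMeans(n_clusters=3, init='random', n_init=1,
...                            random_state=seed).fit(X_iris).inertia_
...             for seed in range(5)]
>>> # The five inertia values need not all be equal: each run may have
>>> # settled in a different local minimum.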

Application example: Vector quantization

More generally, clustering, and in particular KMeans, can be seen as a way of choosing a small number of exemplars to compress the information. This problem is sometimes known as vector quantization. For instance, it can be used to posterize an image:

>>> import numpy as np
>>> import scipy as sp
>>> try:
...    face = sp.face(gray=True)
... except AttributeError:
...    from scipy import misc
...    face = misc.face(gray=True)
>>> X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
>>> k_means = cluster.KMeans(n_clusters=5, n_init=1)
>>> k_means.fit(X) 
KMeans(algorithm='auto', copy_x=True, init='k-means++', ...
>>> values = k_means.cluster_centers_.squeeze()
>>> labels = k_means.labels_
>>> face_compressed = np.choose(labels, values)
>>> face_compressed.shape = face.shape
[Figures: raw image / K-means quantization / equal bins / image histogram]
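For comparison with the "equal bins" panel, a minimal sketch of uniform quantization (not in the original tutorial; the equal-width bin edges and midpoint representatives are my assumption): instead of letting K-means place the 5 representative values, space 5 equal-width bins over the pixel range and use each bin's midpoint.

>>> bins = np.linspace(X.min(), X.max(), 6)          # edges of 5 equal-width bins
>>> centers = 0.5 * (bins[:-1] + bins[1:])           # midpoint of each bin
>>> idx = np.clip(np.digitize(X.ravel(), bins) - 1, 0, 4)
>>> face_uniform = centers[idx].reshape(face.shape)  # uniformly quantized image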

Hierarchical agglomerative clustering: Ward

Hierarchical clustering is a type of cluster analysis that aims to build a hierarchy of clusters. In general, the various approaches of this technique are either:

- **Agglomerative** (bottom-up approach): each observation starts in its own cluster, and clusters are iteratively merged in such a way as to minimize a linkage criterion. This approach is particularly interesting when the clusters of interest consist of only a few observations. When the number of clusters is large, it is much more computationally efficient than k-means (see the sketch after this list).
- **Divisive** (top-down approach): all observations start in one cluster, which is split iteratively as one moves down the hierarchy. For estimating a large number of clusters, this approach is both slow (due to all observations starting as one cluster, which is split recursively) and statistically ill-posed.
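A minimal sketch of the agglomerative approach on the iris data loaded earlier (choosing `n_clusters=3` to mirror the known number of classes is my assumption):

>>> from sklearn.cluster import AgglomerativeClustering
>>> ward = AgglomerativeClustering(n_clusters=3, linkage='ward')
>>> labels = ward.fit_predict(X_iris)
>>> labels.shape
(150,)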

Connectivity-constrained clustering

With agglomerative clustering, it is possible to specify which samples can be clustered together by giving a connectivity graph. Graphs in scikit-learn are represented by their adjacency matrix; often, a sparse matrix is used. This can be useful, for instance, to retrieve connected regions (sometimes also referred to as connected components) when clustering an image:

import matplotlib.pyplot as plt
import scipy as sp

from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering
from sklearn.utils.testing import SkipTest
from sklearn.utils.fixes import sp_version

if sp_version < (0, 12):
    raise SkipTest("Skipping because SciPy version earlier than 0.12.0 and "
                   "thus does not include the scipy.misc.face() image.")


###############################################################################
# Generate data
try:
    face = sp.face(gray=True)
except AttributeError:
    # Newer versions of scipy have face in misc
    from scipy import misc
    face = misc.face(gray=True)

# Resize it to 10% of the original size to speed up the processing
face = sp.misc.imresize(face, 0.10) / 255.
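The script above only prepares the image. As a sketch of how the connectivity graph can then be used (following the spirit of the upstream example; `n_clusters=15` is an arbitrary illustrative choice): `grid_to_graph` builds the pixel-neighborhood graph, and Ward clustering is only allowed to merge neighboring pixels, so each cluster forms a connected region of the image.

import numpy as np

X = np.reshape(face, (-1, 1))               # one sample per pixel, one feature (gray level)
connectivity = grid_to_graph(*face.shape)   # adjacency graph of neighboring pixels

# Ward clustering constrained to merge only connected pixels
ward = AgglomerativeClustering(n_clusters=15, linkage='ward',
                               connectivity=connectivity)
ward.fit(X)
label = np.reshape(ward.labels_, face.shape)  # segmented image: one region per cluster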

Feature agglomeration

We have seen that sparsity can be used to mitigate the curse of dimensionality, i.e. an insufficient number of observations compared to the number of features. Another approach is to merge together similar features: feature agglomeration. This approach can be implemented by clustering in the feature direction, in other words, clustering the transposed data.

>>> from sklearn.feature_extraction.image import grid_to_graph
>>> digits = datasets.load_digits()
>>> images = digits.images
>>> X = np.reshape(images, (len(images), -1))
>>> connectivity = grid_to_graph(*images[0].shape)

>>> agglo = cluster.FeatureAgglomeration(connectivity=connectivity,
...                                      n_clusters=32)
>>> agglo.fit(X) 
FeatureAgglomeration(affinity='euclidean', compute_full_tree='auto',...
>>> X_reduced = agglo.transform(X)

>>> X_approx = agglo.inverse_transform(X_reduced)
>>> images_approx = np.reshape(X_approx, images.shape)
The `transform` and `inverse_transform` methods

Some estimators expose a `transform` method, for instance to reduce the dimensionality of the dataset.
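Continuing the digits example above, the shapes make the round trip concrete: the 64 pixel features are agglomerated into 32 clustered features, and `inverse_transform` maps back by assigning each original feature the value of its cluster.

>>> X.shape, X_reduced.shape, X_approx.shape
((1797, 64), (1797, 32), (1797, 64))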

Decompositions: from a signal to components and loadings

Components and loadings

If $X$ is our multivariate data, the problem we are trying to solve is to rewrite it on a different observational basis: we want to learn loadings $L$ and a set of components $C$ such that $X = LC$. Different criteria exist for choosing the components.
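To make the shapes concrete, a toy sketch (names purely illustrative): with $n$ observations, $p$ features, and $k$ components, $X$ is $(n, p)$, the loadings $L$ are $(n, k)$, and the components $C$ are $(k, p)$.

>>> import numpy as np
>>> n, p, k = 100, 3, 2
>>> C = np.random.normal(size=(k, p))  # components: the new basis
>>> L = np.random.normal(size=(n, k))  # loadings: coordinates of each sample in that basis
>>> X_toy = L.dot(C)                   # data rewritten as X = L C
>>> X_toy.shape
(100, 3)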

Principal component analysis: PCA

Principal Component Analysis (PCA) (http://scikit-learn.org/stable/modules/decomposition.html#pca) selects the successive components that explain the maximum variance in the signal.

The point cloud spanned by the observations above is very flat in one direction: one of the three univariate features can almost be exactly computed using the other two. PCA finds the directions in which the data is not flat. When used to transform data, PCA can reduce the dimensionality of the data by projecting it onto a principal subspace.

>>> # Create a signal with only 2 useful dimensions
>>> x1 = np.random.normal(size=100)
>>> x2 = np.random.normal(size=100)
>>> x3 = x1 + x2
>>> X = np.c_[x1, x2, x3]

>>> from sklearn import decomposition
>>> pca = decomposition.PCA()
>>> pca.fit(X)
PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
>>> print(pca.explained_variance_)  
[  2.18565811e+00   1.19346747e+00   8.43026679e-32]

>>> # As you can see, only the first two components are useful
>>> pca.n_components = 2
>>> X_reduced = pca.fit_transform(X)
>>> X_reduced.shape
(100, 2)

Independent component analysis: ICA

Independent Component Analysis (ICA) (http://scikit-learn.org/stable/modules/decomposition.html#ica) selects components so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover **non-Gaussian** independent signals:

>>> # Generate sample data
>>> time = np.linspace(0, 10, 2000)
>>> s1 = np.sin(2 * time)  # Signal 1 : sinusoidal signal
>>> s2 = np.sign(np.sin(3 * time))  # Signal 2 : square signal
>>> S = np.c_[s1, s2]
>>> S += 0.2 * np.random.normal(size=S.shape)  # Add noise
>>> S /= S.std(axis=0)  # Standardize data
>>> # Mix data
>>> A = np.array([[1, 1], [0.5, 2]])  # Mixing matrix
>>> X = np.dot(S, A.T)  # Generate observations

>>> # Compute ICA
>>> ica = decomposition.FastICA()
>>> S_ = ica.fit_transform(X)  # Get the estimated sources
>>> A_ = ica.mixing_.T
>>> np.allclose(X,  np.dot(S_, A_) + ica.mean_)
True

© 2010-2016, scikit-learn developers (BSD License).
