[Translation] scikit-learn 0.18 User Guide 4.5. Random projection

Google translated from http://scikit-learn.org/0.18/modules/random_projection.html. [scikit-learn 0.18 User Guide 4. Dataset Conversion](http://qiita.com/nazoking@github/items/267f2371757516f8c168#4-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88%E5%A4%89%E6%8F%9B)


4.5. Random projection

The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. This module implements two types of unstructured random matrices: Gaussian random matrices and sparse random matrices. The dimensions and distribution of the random projection matrix are controlled so as to preserve the pairwise distances between any two samples of the dataset. Random projection is therefore a suitable approximation technique for distance-based methods.
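
As a quick empirical check (a sketch added for illustration, not part of the original guide), one can verify on random data that pairwise distances survive the projection up to a small distortion, using the Gaussian transformer introduced below:

>>> import numpy as np
>>> from sklearn.random_projection import GaussianRandomProjection
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(50, 10000)
>>> X_new = GaussianRandomProjection(n_components=1000, random_state=42).fit_transform(X)
>>> d_before = euclidean_distances(X)     # pairwise distances in the original space
>>> d_after = euclidean_distances(X_new)  # pairwise distances after projection
>>> mask = ~np.eye(50, dtype=bool)        # ignore the zero diagonal
>>> ratios = d_after[mask] / d_before[mask]
>>> bool(0.9 < ratios.mean() < 1.1)       # distances are roughly preserved
True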

4.5.1. Johnson-Lindenstrauss Lemma

The main theoretical result behind the efficiency of random projection is the Johnson-Lindenstrauss lemma (quoting Wikipedia):

In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.
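
Concretely, for a distortion parameter $eps \in (0, 1)$, the embedding $p$ given by the lemma satisfies, for all pairs of points $u, v$ in the set (bound quoted here for completeness; it is not spelled out in the original section):

(1 - eps) \|u - v\|^2 < \|p(u) - p(v)\|^2 < (1 + eps) \|u - v\|^2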

Knowing only the number of samples, sklearn.random_projection.johnson_lindenstrauss_min_dim conservatively estimates the minimal size of the random subspace needed to guarantee the bounded distortion eps introduced by the random projection:

>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])
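
This estimate is what the transformers below compute internally when n_components is left at its default 'auto' setting; it can also be computed up front and passed explicitly. As a small sketch, the default eps=0.1 together with 100 samples reproduces the 3947 components appearing in the examples below:

>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim, SparseRandomProjection
>>> johnson_lindenstrauss_min_dim(n_samples=100, eps=0.1)
3947
>>> transformer = SparseRandomProjection(n_components=3947)  # equivalent to 'auto' for this case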

4.5.2. Gaussian random projection

sklearn.random_projection.GaussianRandomProjection reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from the distribution $N(0, \frac{1}{n_{components}})$. Here is a small excerpt which illustrates how to use the Gaussian random projection transformer:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
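
The fitted random matrix is stored in the components_ attribute; as a small sanity check (a sketch continuing the run above), its entries should have a standard deviation close to $1 / \sqrt{n_{components}}$:

>>> transformer.components_.shape
(3947, 10000)
>>> bool(np.allclose(transformer.components_.std(), 1 / np.sqrt(3947), atol=1e-3))
True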

4.5.3. Sparse random projection

sklearn.random_projection.SparseRandomProjection reduces the dimensionality by projecting the original input space using a sparse random matrix. Sparse random matrices are an alternative to dense Gaussian random projection matrices: they guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data. If we define s = 1 / density, the elements of the random matrix are drawn from:

\left\{
\begin{array}{c c l}
-\sqrt{\frac{s}{n_{\text{components}}}} & \text{with probability} & 1 / (2s) \\
0 & \text{with probability} & 1 - 1/s \\
+\sqrt{\frac{s}{n_{\text{components}}}} & \text{with probability} & 1 / (2s) \\
\end{array}
\right.

where $n_{\text{components}}$ is the size of the projected subspace. By default, the density of nonzero elements is set to the minimum density recommended by Ping Li et al.: $1 / \sqrt{n_{\text{features}}}$.
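
For illustration only (a sketch; the library draws and stores the matrix in a sparse format rather than materializing it densely like this), entries following the distribution above can be sampled with NumPy:

>>> import numpy as np
>>> n_components, n_features = 1000, 10000
>>> density = 1 / np.sqrt(n_features)   # the default minimum density, here 0.01
>>> s = 1 / density                     # s = 100
>>> rng = np.random.RandomState(0)
>>> elements = rng.choice(
...     [-np.sqrt(s / n_components), 0.0, +np.sqrt(s / n_components)],
...     size=(n_components, n_features),
...     p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])

With s = 100, roughly 99% of the entries are exactly zero, which is what makes a sparse representation worthwhile.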

Here is a small excerpt which illustrates how to use the sparse random projection transformer:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
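
The fitted matrix is kept in sparse format, and the density actually used is exposed as density_; a quick check (a sketch continuing the run above with the default density):

>>> from scipy.sparse import issparse
>>> issparse(transformer.components_)   # stored as a scipy.sparse matrix
True
>>> transformer.density_                # defaults to 1 / sqrt(n_features)
0.01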

© 2010–2016, scikit-learn developers (BSD License).
