[PYTHON] [Translation] scikit-learn 0.18 User Guide 4.3. Data preprocessing

Google translation of http://scikit-learn.org/0.18/modules/preprocessing.html. Back to [scikit-learn 0.18 User Guide 4. Dataset Conversion](http://qiita.com/nazoking@github/items/267f2371757516f8c168).


4.3. Data preprocessing

The sklearn.preprocessing package provides some common utility functions and transformation classes for transforming raw feature vectors into a representation suitable for downstream estimators.

4.3.1. Standardization, mean removal and variance scaling

**Standardization** of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they may behave badly if the individual features do not more or less look like standard normally distributed data, that is, Gaussian with **zero mean and unit variance**. In practice we often ignore the shape of the distribution and simply center the data by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation. For example, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance of the same order. If the variance of one feature is orders of magnitude larger than that of the others, it may dominate the objective function and make the estimator unable to learn from the other features correctly as expected. The function scale provides a quick and easy way to perform this operation on a single array-like dataset:

>>> from sklearn import preprocessing
>>> import numpy as np
>>> X = np.array([[ 1., -1.,  2.],
...               [ 2.,  0.,  0.],
...               [ 0.,  1., -1.]])
>>> X_scaled = preprocessing.scale(X)

>>> X_scaled                                          
array([[ 0.  ..., -1.22...,  1.33...],
       [ 1.22...,  0.  ..., -0.26...],
       [-1.22...,  1.22..., -1.06...]])

The scaled data has zero mean and unit variance:

>>> X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

>>> X_scaled.std(axis=0)
array([ 1.,  1.,  1.])

In addition, the preprocessing module provides a utility class [StandardScaler](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) that implements the Transformer API to compute the mean and standard deviation on a training set, so that the same transformation can be reapplied to the test set later. This class is therefore suitable for use in the early steps of a sklearn.pipeline.Pipeline.

The scaler instance can then be used on new data to transform it in the same way as it did on the training set.
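
As a minimal sketch (using the array X defined above), a StandardScaler can be fitted on the training data, its learned statistics inspected, and the same transformation applied to new samples:

>>> scaler = preprocessing.StandardScaler().fit(X)
>>> scaler.mean_                                      
array([ 1. ...,  0. ...,  0.33...])
>>> scaler.scale_                                     
array([ 0.81...,  0.81...,  1.24...])
>>> scaler.transform([[-1.,  1., 0.]])                
array([[-2.44...,  1.22..., -0.26...]])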

You can disable centering or scaling by passing with_mean=False or with_std=False to the StandardScaler constructor.

4.3.1.1. Scaling features to a range

An alternative standardization is to scale features to lie between a given minimum and maximum value, often between zero and one, or to scale the maximum absolute value of each feature to unit size. This can be achieved using [MinMaxScaler](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler) or [MaxAbsScaler](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.MaxAbsScaler.html#sklearn.preprocessing.MaxAbsScaler), respectively. Motivations for using this scaling include robustness to very small standard deviations of features and preservation of zero entries in sparse data. The following example scales a toy data matrix to the range [0, 1]:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> min_max_scaler = preprocessing.MinMaxScaler()
>>> X_train_minmax = min_max_scaler.fit_transform(X_train)
>>> X_train_minmax
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])

The same instance of the transformer can then be applied to new test data unseen during the fit call: the same scaling and shifting operations are applied, consistent with the transformation performed on the training data.

>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_minmax = min_max_scaler.transform(X_test)
>>> X_test_minmax
array([[-1.5       ,  0.        ,  1.66666667]])

By inspecting the scaler's attributes, you can discover the exact nature of the transformation learned on the training data:

>>> min_max_scaler.scale_                             
array([ 0.5       ,  0.5       ,  0.33...])

>>> min_max_scaler.min_                               
array([ 0.        ,  0.5       ,  0.33...])

If MinMaxScaler is given an explicit feature_range=(min, max), the full formula is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min
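
With the default feature_range=(0, 1), X_scaled equals X_std, which can be checked by hand on the X_train matrix defined above (a small verification sketch):

>>> X_std = (X_train - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
>>> np.allclose(X_std, min_max_scaler.fit_transform(X_train))
True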

MaxAbsScaler works in a very similar way, but scales the training data so that it lies within the range [-1, 1] by dividing each feature by its maximum absolute value. It is meant for data that is already centered at zero or for sparse data. Here is how to use the data from the previous example with this scaler:

>>> X_train = np.array([[ 1., -1.,  2.],
...                     [ 2.,  0.,  0.],
...                     [ 0.,  1., -1.]])
...
>>> max_abs_scaler = preprocessing.MaxAbsScaler()
>>> X_train_maxabs = max_abs_scaler.fit_transform(X_train)
>>> X_train_maxabs
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
>>> X_test = np.array([[ -3., -1.,  4.]])
>>> X_test_maxabs = max_abs_scaler.transform(X_test)
>>> X_test_maxabs                 
array([[-1.5, -1. ,  2. ]])
>>> max_abs_scaler.scale_         
array([ 2.,  1.,  2.])

As with scale, the module also provides the convenience functions minmax_scale and maxabs_scale if you do not want to create an object.
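
For example, a single call to minmax_scale reproduces the result obtained with MinMaxScaler above (a brief sketch):

>>> preprocessing.minmax_scale(X_train)
array([[ 0.5       ,  0.        ,  1.        ],
       [ 1.        ,  0.5       ,  0.33333333],
       [ 0.        ,  1.        ,  0.        ]])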

4.3.1.2. Scaling sparse data

Centering sparse data would destroy the sparsity structure of the data, so it is rarely a sensible thing to do. However, it can make sense to scale sparse inputs, especially if the features are on different scales. MaxAbsScaler and maxabs_scale were specifically designed for scaling sparse data and are the recommended way to do this. However, scale and StandardScaler can accept scipy.sparse matrices as input, as long as with_mean=False is explicitly passed to the constructor. Otherwise a ValueError is raised, since silently centering would break the sparsity and would often crash the execution by unintentionally allocating excessive amounts of memory. RobustScaler cannot be fitted to sparse input, but its transform method can be used on sparse matrices.

The scalers accept both Compressed Sparse Rows and Compressed Sparse Columns format (see scipy.sparse.csr_matrix and scipy.sparse.csc_matrix). Any other sparse input is **converted to the CSR representation**. To avoid unnecessary memory copies, it is recommended to choose the CSR or CSC representation upstream. Finally, if the centered data is expected to be small enough, another option is to explicitly convert the input to an array using the toarray method of sparse matrices.
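
For illustration, MaxAbsScaler applied to a CSR copy of the toy data above keeps the zero entries intact (a brief sketch):

>>> import scipy.sparse as sp
>>> X_train_sparse = sp.csr_matrix(X_train)
>>> preprocessing.MaxAbsScaler().fit_transform(X_train_sparse).toarray()
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])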

4.3.1.3. Scaling of data containing outliers

If your data contains many outliers, scaling using the mean and variance of the data is likely not to work very well. In these cases, you can use robust_scale and [RobustScaler](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.RobustScaler.html#sklearn.preprocessing.RobustScaler) instead. They use more robust estimates of the center and range of your data.
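
A minimal sketch with a hypothetical toy matrix (not from the guide itself):

>>> X_outliers = np.array([[  1., -2.,  2.],
...                        [ -2.,  1.,  3.],
...                        [  4.,  1., -2.],
...                        [100.,  1.,  1.]])   # the first feature of the last sample is an outlier
>>> robust_scaler = preprocessing.RobustScaler()
>>> X_robust = robust_scaler.fit_transform(X_outliers)   # centers each feature on its median, scales by its interquartile range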

- Reference: for further discussion of the importance of centering and scaling data, see this FAQ: Should I normalize / standardize / rescale the data?
- Scaling vs. whitening: it is sometimes not enough to center and scale features independently, since a downstream model may further assume linear independence of the features. To address this issue you can use sklearn.decomposition.PCA or sklearn.decomposition.RandomizedPCA with whiten=True to further remove the linear correlation across features.
- Scaling target variables in regression: scale and StandardScaler work out of the box with one-dimensional arrays. This is very handy for scaling the target / response variables used in regression.

4.3.1.4. Centering kernel matrix

If you have a kernel matrix of a kernel $K$ that computes a dot product in the feature space defined by a function $\phi$, [KernelCenterer](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.KernelCenterer.html#sklearn.preprocessing.KernelCenterer) can transform the kernel matrix so that it contains the inner products in the feature space defined by $\phi$ followed by removal of the mean in that space.
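
A short sketch, reusing the numeric array X defined earlier as the sample matrix:

>>> from sklearn.preprocessing import KernelCenterer
>>> K = np.dot(X, X.T)                                # linear kernel matrix between the samples
>>> K_centered = KernelCenterer().fit_transform(K)    # equivalent to centering the samples in feature space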

4.3.2. Normalization

**Normalization** is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot product or any other kernel to quantify the similarity of pairs of samples. This assumption is the basis of the [Vector Space Model](https://en.wikipedia.org/wiki/Vector_Space_Model) often used in text classification and clustering contexts. The function normalize provides a quick and easy way to perform this operation on a single array-like dataset, using either the l1 or l2 norm:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> X_normalized = preprocessing.normalize(X, norm='l2')

>>> X_normalized                                      
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

The preprocessing module also provides the utility class [Normalizer](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.Normalizer.html#sklearn.preprocessing.Normalizer), which implements the same operation using the Transformer API (even though the fit method is useless in this case: the class is stateless, as this operation treats samples independently). This class is therefore suitable for use in the early steps of a sklearn.pipeline.Pipeline.

>>> normalizer = preprocessing.Normalizer().fit(X)  # fit does nothing
>>> normalizer
Normalizer(copy=True, norm='l2')

The normalizer instance can then be used on sample vectors like any transformer:

>>> normalizer.transform(X)                            
array([[ 0.40..., -0.40...,  0.81...],
       [ 1.  ...,  0.  ...,  0.  ...],
       [ 0.  ...,  0.70..., -0.70...]])

>>> normalizer.transform([[-1.,  1., 0.]])             
array([[-0.70...,  0.70...,  0.  ...]])

- Sparse input: normalize and Normalizer accept both dense array-like and sparse matrices from scipy.sparse as input.
- For sparse input, the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to efficient Cython routines. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

4.3.3. Binarization

4.3.3.1. Feature binarization

Feature binarization is the process of thresholding numerical features to obtain Boolean values. This can be useful for downstream probabilistic estimators that assume the input data is distributed according to a multivariate Bernoulli distribution. For instance, this is the case for sklearn.neural_network.BernoulliRBM. It is also common in the text processing community to use binary feature values (probably to simplify probabilistic reasoning), even if normalized counts (such as term frequencies) or TF-IDF valued features often perform slightly better in practice. As with the Normalizer, the utility class [Binarizer](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.Binarizer.html#sklearn.preprocessing.Binarizer) is meant to be used in the early stages of [sklearn.pipeline.Pipeline](http://scikit-learn.org/0.18/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). The fit method does nothing, as each sample is treated independently of the others:

>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]

>>> binarizer = preprocessing.Binarizer().fit(X)  # fit does nothing
>>> binarizer
Binarizer(copy=True, threshold=0.0)

>>> binarizer.transform(X)
array([[ 1.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  1.,  0.]])

It is possible to adjust the threshold of the Binarizer:

>>> binarizer = preprocessing.Binarizer(threshold=1.1)
>>> binarizer.transform(X)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])

As with the StandardScaler and Normalizer classes, the preprocessing module provides a companion function [binarize](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.binarize.html#sklearn.preprocessing.binarize) to be used when the transformer API is not necessary.
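
For example, the following call reproduces the thresholded output shown above (a brief sketch):

>>> preprocessing.binarize(X, threshold=1.1)
array([[ 0.,  0.,  1.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  0.]])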

- Sparse input: binarize and Binarizer accept both dense array-like and sparse matrices from scipy.sparse as input.
- For sparse input, the data is converted to the **Compressed Sparse Rows representation** (see scipy.sparse.csr_matrix). To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.

4.3.4. Encoding categorical features

Often, features are given as categorical rather than continuous values. For example, a person could have features ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently coded as integers; for instance ["male", "from US", "uses Internet Explorer"] could be expressed as [0, 1, 3], while ["female", "from Asia", "uses Chrome"] would be [1, 2, 1]. Such an integer representation cannot be used directly with scikit-learn estimators, as these expect continuous input and would interpret the categories as being ordered, which is often not desired (i.e. the set of browsers was ordered arbitrarily). One way to convert categorical features to features that can be used with scikit-learn estimators is the one-of-K or one-hot encoding implemented in [OneHotEncoder](http://scikit-learn.org/0.18/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder). This estimator transforms each categorical feature with m possible values into m binary features, with only one active. Continuing the example above:

>>> enc = preprocessing.OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values='auto', sparse=True)
>>> enc.transform([[0, 1, 3]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])

By default, the number of values each feature can take is inferred automatically from the dataset. It is possible to specify this explicitly using the parameter n_values. There are two genders, three possible continents and four web browsers in our dataset. We fit the estimator and then transform a data point. In the result, the first two numbers encode the gender, the next three numbers the continent, and the last four the web browser. Note that, if there is a possibility that the training data is missing some categorical values, you have to set n_values explicitly. For example:

>>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
>>> #Notice that the category values for the 2nd and 3rd features are missing
>>> enc.fit([[1, 2, 3], [0, 2, 0]])  
OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
       handle_unknown='error', n_values=[2, 3, 4], sparse=True)
>>> enc.transform([[1, 0, 0]]).toarray()
array([[ 0.,  1.,  1.,  0.,  0.,  1.,  0.,  0.,  0.]])

"Loading features from [dicts](http://qiita.com/nazoking@github/items/b270288fa38aed0a71bf#421-dicts%E3%81%8B%E3%82%89%E3%81%AE%E7%" 89% B9% E5% BE% B4% E9% 87% 8F% E3% 81% AE% E3% 83% AD% E3% 83% BC% E3% 83% 89) ".

4.3.5. Imputation of missing values

For various reasons, many real-world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets, however, are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and that all have and hold meaning. A basic strategy for using an incomplete dataset is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data that may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e. to infer them from the known part of the data. The Imputer class provides basic strategies for imputing missing values, using either the mean, the median or the most frequent value of the row or column in which the missing values are located. This class also allows for different encodings of missing values. The following snippet shows how to replace missing values, encoded as np.nan, using the mean of the columns (axis 0) that contain the missing values:

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
>>> imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imp.fit([[1, 2], [np.nan, 3], [7, 6]])
Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)
>>> X = [[np.nan, 2], [6, np.nan], [7, 6]]
>>> print(imp.transform(X))                           
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]

The Imputer class also supports sparse matrices.

>>> import scipy.sparse as sp
>>> X = sp.csc_matrix([[1, 2], [0, 3], [7, 6]])
>>> imp = Imputer(missing_values=0, strategy='mean', axis=0)
>>> imp.fit(X)
Imputer(axis=0, copy=True, missing_values=0, strategy='mean', verbose=0)
>>> X_test = sp.csc_matrix([[0, 2], [6, 0], [7, 6]])
>>> print(imp.transform(X_test))                      
[[ 4.          2.        ]
 [ 6.          3.666...]
 [ 7.          6.        ]]

Note that the missing values are encoded as 0 here and are therefore implicitly stored in the matrix. This format is thus suitable when there are more missing values than observed values. Imputer can be used in a Pipeline as a way to build a composite estimator that supports imputation. See [Imputing missing values before building an estimator](http://scikit-learn.org/0.18/auto_examples/missing_values.html#sphx-glr-auto-examples-missing-values-py).
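
A minimal sketch of such a composite estimator (the downstream regressor is an arbitrary illustrative choice):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.tree import DecisionTreeRegressor
>>> pipe = Pipeline([('imputer', Imputer(missing_values='NaN', strategy='mean', axis=0)),
...                  ('regressor', DecisionTreeRegressor())])
>>> # pipe.fit(X_incomplete, y) would impute the missing values before fitting the regressor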

4.3.6. Generation of polynomial features

It is often useful to add complexity to a model by considering nonlinear features of the input data. A simple and common method is to use polynomial features, which give you the higher-order and interaction terms of the features. It is implemented in PolynomialFeatures:

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X                                                 
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)                             
array([[  1.,   0.,   1.,   0.,   0.,   1.],
       [  1.,   2.,   3.,   4.,   6.,   9.],
       [  1.,   4.,   5.,  16.,  20.,  25.]])

The features of X have been transformed from $(X_1, X_2)$ to $(1, X_1, X_2, X_1^2, X_1X_2, X_2^2)$. If you only need the interaction terms between features, you can obtain them with the setting `interaction_only=True`:

>>> X = np.arange(9).reshape(3, 3)
>>> X                                                 
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> poly = PolynomialFeatures(degree=3, interaction_only=True)
>>> poly.fit_transform(X)                             
array([[   1.,    0.,    1.,    2.,    0.,    0.,    2.,    0.],
       [   1.,    3.,    4.,    5.,   12.,   15.,   20.,   60.],
       [   1.,    6.,    7.,    8.,   42.,   48.,   56.,  336.]])

The features of X have been transformed from $(X_1, X_2, X_3)$ to $(1, X_1, X_2, X_3, X_1X_2, X_1X_3, X_2X_3, X_1X_2X_3)$. Note that polynomial features are used implicitly in [kernel methods](https://en.wikipedia.org/wiki/Kernel_method) (for example, sklearn.svm.SVC and [sklearn.decomposition.KernelPCA](http://scikit-learn.org/0.18/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA)) when using polynomial [kernel functions](http://qiita.com/nazoking@github/items/2b16be7f7eac940f2e6a). See [Polynomial interpolation](http://scikit-learn.org/0.18/auto_examples/linear_model/plot_polynomial_interpolation.html#sphx-glr-auto-examples-linear-model-plot-polynomial-interpolation-py) for ridge regression using the created polynomial features.
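
As in that example, polynomial features are typically combined with a linear model inside a pipeline (a brief sketch; the degree and the Ridge estimator are illustrative choices):

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.linear_model import Ridge
>>> model = Pipeline([('poly', PolynomialFeatures(degree=3)),
...                   ('ridge', Ridge())])
>>> # model.fit(x[:, np.newaxis], y) would fit a cubic polynomial to one-dimensional (x, y) data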

4.3.7. Custom transformer

Often, you will want to convert an existing Python function into a transformer to assist with data cleaning or processing. You can implement a transformer from an arbitrary function with FunctionTransformer. For example, to build a transformer that applies a log transformation in a pipeline:

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[ 0.        ,  0.69314718],
       [ 1.09861229,  1.38629436]])

For a complete code example of how to use a FunctionTransformer to select custom features, see "[Using FunctionTransformer to select columns](http://scikit-learn.org/0.18/auto_examples/preprocessing/plot_function_transformer.html#sphx-glr-auto-examples-preprocessing-plot-function-transformer-py)".
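
The referenced example essentially wraps a column-slicing function; a condensed sketch of the idea:

>>> def all_but_first_column(X):
...     return X[:, 1:]
>>> selector = FunctionTransformer(all_but_first_column)
>>> selector.fit_transform(np.array([[0, 1, 2], [3, 4, 5]]))
array([[1, 2],
       [4, 5]])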


[scikit-learn 0.18 User Guide 4. Dataset Conversion](http://qiita.com/nazoking@github/items/267f2371757516f8c168)

© 2010-2016, scikit-learn developers (BSD license).
