[PYTHON] Try Standard Scaler

I briefly looked into sklearn.preprocessing.StandardScaler, so I'm leaving this here as a memo.

StandardScaler standardizes a dataset. By standardizing, features with different scales can be put on a common footing.

For example, think of standardized test scores: suppose one test is scored out of 100 points and another out of 50 points. Even though the scales and units differ, standardization lets you compare the scores without being affected by those differences.

A standardized value is obtained by subtracting the mean from each data point in the set and dividing by the standard deviation.

z_i = \frac{x_i - \mu}{\sigma}

Here μ is the mean, σ is the standard deviation, and `i` is the index of each data point.

The mean is the sum of the values divided by the number of values.

\mu = \frac{1}{n}\sum_{i=1}^{n}x_i

The standard deviation is the square root of the variance.

\sigma = \sqrt{s^2}

The variance is obtained by subtracting the mean from each value in the set, summing the squared differences, and dividing by the number of values.

s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2
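As a quick sanity check of these formulas, here is a small worked example on a toy list (the numbers are made up purely for illustration):

x = [2, 4, 6]

mean = sum(x) / len(x)                              # (2 + 4 + 6) / 3 = 4.0
var = sum((x_i - mean) ** 2 for x_i in x) / len(x)  # (4 + 0 + 4) / 3 ≈ 2.667
std = var ** 0.5                                    # ≈ 1.633
z = [(x_i - mean) / std for x_i in x]               # ≈ [-1.225, 0.0, 1.225]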

First, I would like to implement it myself without using a machine learning library. The `iris` dataset is used as the target for standardization.

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
print(X[:1]) # [[5.1 3.5 1.4 0.2]]

import math

def standardize(x):
    """
    Parameters
    ----------
    x : array to standardize; expected to be a one-dimensional vector of values.

    Returns
    -------
    mean : mean of x
    var : variance of x
    std : standard deviation of x
    z : standardized values of x
    """
    # Mean: sum of the values divided by the number of values
    mean = sum(x) / len(x)

    # Variance: sum of squared differences from the mean, divided by the number of values
    var = sum([(x_i - mean) ** 2 for x_i in x]) / len(x)

    # Standard deviation: square root of the variance
    std = math.sqrt(var)

    # Subtract the mean from each value and divide by the standard deviation
    z = [(x_i - mean) / std for x_i in x]

    return [mean, var, std, z]

# First 7 values of the first column (feature) of the dataset
sample_data = X[:, :1][:7] 
print(sample_data.tolist()) # [[5.1], [4.9], [4.7], [4.6], [5.0], [5.4], [4.6]]

mean, var, std, z = standardize(sample_data)

print(mean) # [4.9]
print(var)  # [0.07428571]
print(std)  # 0.2725540575476989
print(*z)   # [0.73379939] [3.25872389e-15] [-0.73379939] [-1.10069908] [0.36689969] [1.83449846] [-1.10069908]

You can see that the values in sample_data have been transformed into standardized floating-point values.
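As a side note, the same calculation can be written more compactly with NumPy's vectorized operations (a rough sketch; ndarray.std defaults to the population standard deviation, the same definition used above):

import numpy as np

# Vectorized standardization along the first axis (ddof=0, i.e. population variance)
z_np = (sample_data - sample_data.mean(axis=0)) / sample_data.std(axis=0)
print(*z_np)  # should match the standardize() result above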

Next, try using sklearn.preprocessing.StandardScaler.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(sample_data)

print(scaler.mean_) # [4.9]
print(scaler.var_)  # [0.07428571]
print(math.sqrt(scaler.var_)) # 0.2725540575476989
print(*scaler.transform(sample_data))  # [0.73379939] [3.25872389e-15] [-0.73379939] [-1.10069908] [0.36689969] [1.83449846] [-1.10069908]

#The values are the same
print(scaler.mean_ == mean) # [ True]
print(scaler.var_ == var)   # [ True]
print(math.sqrt(scaler.var_) == std) # True
print(*(scaler.transform(sample_data) == z)) # [ True] [ True] [ True] [ True] [ True] [ True] [ True] 

By using StandardScaler in this way, standardization can be achieved with very little code.
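As a small aside, the fitted scaler also exposes the standard deviation directly via scale_, and fit_transform combines the two calls into one:

# scale_ holds the standard deviation used for scaling (the square root of var_)
print(scaler.scale_)  # [0.27255406]

# fit_transform() fits the scaler and transforms the data in a single call
print(*StandardScaler().fit_transform(sample_data))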

Also, since the statistics computed at standardization time, such as mean_ and var_, are retained, a scaler fitted when training a machine learning model can be reused on the data [^1] given at model inference time. For example, a query can be standardized with `processed_query = (np.array(query) - scaler.mean_) / np.sqrt(scaler.var_)` (or simply with `scaler.transform`), which I find very convenient; see the sketch after the footnote.

[^1]: Assuming that the vector given at model inference time comes from the same distribution as the dataset used during training.
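To make the inference-time usage concrete, here is a minimal sketch (`query` is a hypothetical new input; the value 5.0 is taken from the sample data so the expected output can be read off the earlier results):

import numpy as np

# A hypothetical query vector arriving at inference time
query = [[5.0]]

# Using the fitted scaler directly
print(scaler.transform(query))  # [[0.36689969]]

# Equivalent manual computation with the stored statistics
print((np.array(query) - scaler.mean_) / np.sqrt(scaler.var_))  # [[0.36689969]]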
