[PYTHON] Standardize non-normal distribution with robust Z-score

Min-max normalization and Z-score normalization (standardization) are the usual choices for rescaling data. In this post, I try the robust Z-score and compare it with those two methods.

min-max normalization

Min-max normalization rescales the data so that the minimum value is 0 and the maximum value is 1, using the following formula.

x' = \frac{x-min(x)}{max(x)-min(x)}

In Python, you can compute this with minmax_scale or MinMaxScaler in sklearn.preprocessing. This normalization assumes that the data are **uniformly** distributed.
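As a small illustration of the formula above, the following sketch compares a hand-written min-max calculation with minmax_scale. The four-element array is made-up example data, not data from this article.

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Made-up example values (an assumption for illustration only).
x = np.array([2.0, 4.0, 6.0, 10.0])

# Direct evaluation of (x - min(x)) / (max(x) - min(x)).
manual = (x - x.min()) / (x.max() - x.min())

# The same calculation through scikit-learn.
scaled = minmax_scale(x)

print(manual)
print(scaled)
```

Both arrays should agree, with the smallest value mapped to 0 and the largest to 1.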

Z-score Normalization (Standardization)

Z-score normalization rescales the data so that the mean is 0 and the variance is 1, using the following formula. The resulting value is called the **Z-score**. μ is the mean and σ is the standard deviation.

x' = \frac{x-\mu}{\sigma}

In Python, you can compute this with scale or StandardScaler in sklearn.preprocessing. This normalization assumes that the data are **normally** distributed.
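The Z-score formula can be checked the same way. This is a minimal sketch on made-up values; note that scale divides by the population standard deviation (ddof=0), which is also NumPy's default for std.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up example values (an assumption for illustration only).
x = np.array([2.0, 4.0, 6.0, 10.0])

# Direct evaluation of (x - mu) / sigma with the population standard deviation.
manual = (x - x.mean()) / x.std()

# The same calculation through scikit-learn.
scaled = scale(x)

print(manual)
print(scaled)
print(scaled.mean(), scaled.std())
```

After scaling, the mean should be 0 and the standard deviation 1, up to floating-point error.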

What to do if it is neither uniform nor normal?

Real data are often neither uniformly nor normally distributed. While looking into what to do in that case, I found the robust Z-score in the following articles.

Robust z-score: median and quartile, non-normal distribution, standardization including outliers
(Memo) Exclusion of outliers using robust z-score

Below, I tried it in Python.

Implementation of robust Z-score

For more information on robust Z-score, please read the above article. The following is a brief description and implementation.

The Z-score assumes a normal distribution. To apply it to a non-normal distribution, first replace the mean μ with the median and the standard deviation σ with the interquartile range (IQR).

x' = \frac{x-median(x)}{IQR}

This formula can be calculated with robust_scale or RobustScaler in sklearn.preprocessing.
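To make the median/IQR formula concrete, here is a small sketch that evaluates it with NumPy and compares the result with robust_scale. The sample values, including the single extreme value, are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import robust_scale

# Made-up example values with one extreme value (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# IQR = 75th percentile - 25th percentile.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Direct evaluation of (x - median(x)) / IQR.
manual = (x - np.median(x)) / iqr

# The same calculation through scikit-learn (default quantile range 25-75).
scaled = robust_scale(x)

print(manual)
print(scaled)
```

The extreme value does not change the median or the quartiles much, so the other points keep sensible scaled values.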

The scale is then adjusted so that it is compatible with the standard normal distribution. The IQR of the standard normal distribution is F(0.75) - F(0.25) = 1.3489, where F is the inverse of the cumulative distribution function. Dividing the IQR by this value gives the normalized interquartile range (NIQR), which matches the standard deviation when the data are normally distributed.

NIQR = \frac{IQR}{1.3489}

The robust Z-score is obtained by replacing the IQR in the denominator of the median/IQR formula with the NIQR.

robust Z score = \frac{x-median(x)}{NIQR}
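The role of the constant 1.3489 can be checked numerically. The sketch below draws a large normal sample (the seed, sample size, and σ = 2 are arbitrary assumptions) and shows that the NIQR comes out close to the true standard deviation.

```python
import numpy as np
from scipy.stats import norm

# IQR of the standard normal distribution, F(0.75) - F(0.25).
coefficient = norm.ppf(0.75) - norm.ppf(0.25)
print(coefficient)

# Arbitrary normal sample for illustration: mean 0, standard deviation 2.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=2.0, size=100_000)

# NIQR = IQR / coefficient should be close to the standard deviation.
q1, q3 = np.percentile(sample, [25, 75])
niqr = (q3 - q1) / coefficient

print(niqr, sample.std())
```

For normal data, the NIQR and the standard deviation should nearly coincide, which is why the robust Z-score can be read on the same scale as the ordinary Z-score.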

If implemented based on the above, the function will be as follows.


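from scipy.stats import norm
from sklearn.preprocessing import robust_scale


def robust_z(x):
    # IQR of the standard normal distribution (about 1.3489).
    coefficient = norm.ppf(0.75) - norm.ppf(0.25)
    # robust_scale returns (x - median) / IQR; multiplying by the
    # coefficient is the same as dividing by NIQR = IQR / coefficient.
    robust_z_score = robust_scale(x) * coefficient

    return robust_z_score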
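As a sanity check, the function above can be compared with a direct NumPy evaluation of (x - median) / NIQR. In this sketch, robust_z is repeated so that the snippet runs on its own, while robust_z_numpy and the sample values are hypothetical names and data introduced only for the comparison.

```python
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import robust_scale


def robust_z(x):
    # Repeated from the article so that this snippet runs on its own.
    coefficient = norm.ppf(0.75) - norm.ppf(0.25)
    return robust_scale(x) * coefficient


def robust_z_numpy(x):
    # Hypothetical helper: direct evaluation of (x - median) / NIQR.
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    niqr = (q3 - q1) / (norm.ppf(0.75) - norm.ppf(0.25))
    return (x - np.median(x)) / niqr


# Made-up example values (illustration only).
sample = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(robust_z(sample))
print(robust_z_numpy(sample))
```

Multiplying the IQR-scaled value by the coefficient and dividing by the NIQR are the same operation, so the two outputs should match.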

Comparison of three normalizations

Let's compare the three normalizations introduced so far. First, prepare the data. I want data that is neither uniform nor normal, so I combined samples from a uniform distribution and a normal distribution.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from scipy.stats import chisquare, shapiro, norm
from sklearn.preprocessing import minmax_scale, scale, robust_scale

np.random.seed(2020)

#Data that combines a uniform distribution and a normal distribution.
data = np.concatenate((np.random.uniform(low=5.0, high=10.0, size=100),
                       np.random.normal(loc=5.0, scale=1.0, size=100)))

#Draw a histogram.
fig, axes = plt.subplots()
axes.hist(data)
axes.set_title("Histogram of data")
plt.show()

[Figure row.png: histogram of the data]

Statistical tests confirm that this data is neither uniformly nor normally distributed. Uniformity is checked with a chi-square test on the histogram counts, and normality with the Shapiro-Wilk test.

#Calculate the frequency distribution.
hist_data, _ = np.histogram(data, bins="auto")

#Uniformity test (chi-square test)
_, chisquare_p = chisquare(hist_data)
print("Uniformity test (chi-square test) p-value: {}".format(chisquare_p))

#Normality test (Shapiro-Wilk test)
_, shapiro_p = shapiro(data)
print("P-value of normality test (Shapiro-Wilk test): {}".format(shapiro_p))

The results are as follows. Both p-values are smaller than 0.05, so uniformity and normality are both rejected at the 5% significance level.

Uniformity test (chi-square test) p-value: 3.8086163670115985e-09
P-value of normality test (Shapiro-Wilk test): 8.850588528730441e-06

Use this data to calculate min-max normalization, Z-score, and robust Z-score and compare them.

#Normalize each method and put it in the data frame.
score_df = pd.DataFrame(data=np.array([minmax_scale(data), scale(data), robust_z(data)]).T,
                        columns=["min-max", "Z-score", "robust Z-score"])


#Create a graph
fig, axs = plt.subplots(ncols=3, constrained_layout=True)

#x-axis width setting
xrange = {"min-max":(0,1),
          "Z-score":(-2.5,2.5),
          "robust Z-score":(-2.5,2.5)}

#Drawing of each histogram
for i, score_name in enumerate(score_df.columns):
    
    axs[i].hist(score_df[score_name])
    axs[i].set_title(score_name)
    axs[i].set_xlim(xrange[score_name])

plt.show()

The result is shown below. There is not much difference between the methods for this data. The difference may be larger depending on the distribution of the data.

[Figure score.png: histograms of min-max, Z-score, and robust Z-score]

Compare with data with outliers

The "robust" in robust Z-score means that it is robust against **outliers**, and the robust Z-score is also used for outlier detection. So next, let's add outliers to the data and compare again. To make the comparison easy to see, I add a fairly large number of extreme outliers.

#Combine outliers (uniform distribution) into the data.
outlier_data = np.concatenate((data,
                               np.random.uniform(low=19.0, high=20.0, size=15)))

#Normalize each method and put it in the data frame.
outlier_df = pd.DataFrame(data=np.array([minmax_scale(outlier_data), scale(outlier_data), robust_z(outlier_data)]).T,
                          columns=["min-max", "Z-score", "robust Z-score"])

#Combine data frames with no outliers and with outliers.
concat_df = pd.concat([score_df, outlier_df],
               axis=1,
               keys=['without outlier', 'with outlier'])


#Create a graph
fig, axs = plt.subplots(nrows=2, ncols=3, constrained_layout=True)

#x-axis width setting
xrange = {"min-max":(0, 1),
          "Z-score":(-6.5, 6.5),
          "robust Z-score":(-6.5, 6.5)}

#Histogram drawing
for i, (data_name, score_name) in enumerate(concat_df.columns):
    row, col = divmod(i, 3)
    axs[row, col].hist(concat_df[(data_name, score_name)])
    axs[row, col].set_xlim(xrange[score_name])
    
    title = "\n".join([data_name, score_name])    
    axs[row, col].set_title(title)       
    
plt.show()

The result is shown below. The top row is without outliers and the bottom row is with outliers. Min-max normalization is highly sensitive to outliers. The Z-score is also affected by outliers and looks very different from the case without outliers. The robust Z-score is the least affected, and its distribution stays relatively close to the case without outliers.

[Figure outlier.png: histograms of min-max, Z-score, and robust Z-score, without outliers (top) and with outliers (bottom)]
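The reason for this difference can be seen in the statistics each method relies on. The sketch below is an illustration on freshly generated data (the seed and sample sizes are assumptions, not the data plotted above), and summary is a hypothetical helper: it shows how far the mean and standard deviation move when outliers are added, compared with the median and NIQR.

```python
import numpy as np
from scipy.stats import norm

# Freshly generated illustration data (seed and sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
base = rng.normal(loc=5.0, scale=1.0, size=200)
with_outliers = np.concatenate((base, rng.uniform(low=19.0, high=20.0, size=15)))


def summary(x):
    # Hypothetical helper: location and scale statistics used by each method.
    q1, q3 = np.percentile(x, [25, 75])
    niqr = (q3 - q1) / (norm.ppf(0.75) - norm.ppf(0.25))
    return {"mean": x.mean(), "std": x.std(), "median": np.median(x), "NIQR": niqr}


clean = summary(base)
dirty = summary(with_outliers)

# Print each statistic without and with outliers.
for key in clean:
    print(f"{key:>6}: {clean[key]:.3f} -> {dirty[key]:.3f}")
```

The mean and standard deviation shift noticeably, while the median and NIQR change only slightly, which is why the robust Z-score histogram keeps its shape.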

Finally

The robust Z-score gives approximately the same result as the Z-score when the data are normally distributed, so when in doubt I plan to use the robust Z-score. In particular, I felt that it is effective when I want to keep the outliers in the data rather than remove them.
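The agreement between the two scores on normal data can be checked with a short experiment. This is a sketch on simulated data: the seed, sample size, mean, and standard deviation are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import robust_scale, scale

# Simulated normal data (all parameters are arbitrary assumptions).
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=3.0, size=100_000)

# Ordinary Z-score.
z = scale(x)

# Robust Z-score: (x - median) / NIQR.
rz = robust_scale(x) * (norm.ppf(0.75) - norm.ppf(0.25))

# Largest absolute difference between the two scores.
max_diff = np.max(np.abs(z - rz))
print(max_diff)
```

With a large normal sample the two scores should differ only slightly, because the median is close to the mean and the NIQR is close to the standard deviation.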
