[PYTHON] Fill in missing values with Scikit-learn impute

scikit-learn's impute module is used to fill in missing data as a pre-processing step for machine learning. I checked its behavior using some simple data.

Test data creation

import pandas as pd

data = {
    'A': [a for a in range(10)],
    'B': [a * 2 for a in range(10)],
    'C': [a * 3 for a in range(10)],
    'D': [a * 4 for a in range(10)],
}

data = pd.DataFrame(data)
data
A B C D
0 0 0 0 0
1 1 2 3 4
2 2 4 6 8
3 3 6 9 12
4 4 8 12 16
5 5 10 15 20
6 6 12 18 24
7 7 14 21 28
8 8 16 24 32
9 9 18 27 36
import numpy as np

data2 = data.copy()
# assign NaN via .loc (chained indexing such as data2['B'][2] = np.nan is not recommended)
data2.loc[2, 'B'] = np.nan
data2.loc[3, 'C'] = np.nan
data2.loc[5, 'C'] = np.nan
data2.loc[6, 'D'] = np.nan
data2.loc[7, 'D'] = np.nan
data2
A B C D
0 0 0.0 0.0 0.0
1 1 2.0 3.0 4.0
2 2 NaN 6.0 8.0
3 3 6.0 NaN 12.0
4 4 8.0 12.0 16.0
5 5 10.0 NaN 20.0
6 6 12.0 18.0 NaN
7 7 14.0 21.0 NaN
8 8 16.0 24.0 32.0
9 9 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data2)

Imputer_4_1.png

As shown above, we created data in which column B contains one missing value, column C contains two missing values, and column D contains two consecutive missing values.

SimpleImputer

The SimpleImputer class provides basic strategies for imputing missing values. Missing values can be filled with a specified constant, or with a statistic (mean, median, or most frequent value) of each column in which the missing values occur.
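
Note that fit_transform returns a NumPy array, so the column names are lost (which is why the outputs below show columns 0 to 3). A minimal sketch of keeping the original column names, assuming the same data2 DataFrame as above:

from sklearn.impute import SimpleImputer

imp = SimpleImputer()  # strategy='mean' by default
# fit_transform returns a NumPy array; restore the original column names and index
filled = pd.DataFrame(imp.fit_transform(data2), columns=data2.columns, index=data2.index)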

default(mean)

By default, missing values are filled with the mean of each column.

from sklearn.impute import SimpleImputer

imp = SimpleImputer()  # defaults: missing_values=np.nan, strategy='mean'
data3 = pd.DataFrame(imp.fit_transform(data2))
data3
0 1 2 3
0 0.0 0.000000 0.000 0.0
1 1.0 2.000000 3.000 4.0
2 2.0 9.555556 6.000 8.0
3 3.0 6.000000 13.875 12.0
4 4.0 8.000000 12.000 16.0
5 5.0 10.000000 13.875 20.0
6 6.0 12.000000 18.000 16.0
7 7.0 14.000000 21.000 16.0
8 8.0 16.000000 24.000 32.0
9 9.0 18.000000 27.000 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data3)

Imputer_7_1.png

As seen above, filling with the mean can produce unnatural values depending on the nature of the data.

median

You can also fill in the missing values with the median.

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='median')
data4 = pd.DataFrame(imp.fit_transform(data2))
data4
0 1 2 3
0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0
2 2.0 10.0 6.0 8.0
3 3.0 6.0 15.0 12.0
4 4.0 8.0 12.0 16.0
5 5.0 10.0 15.0 20.0
6 6.0 12.0 18.0 14.0
7 7.0 14.0 21.0 14.0
8 8.0 16.0 24.0 32.0
9 9.0 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data4)

Imputer_10_1.png

As with the mean, filling with the median can also produce unnatural values depending on the data.

most_frequent

You can also fill missing values with the most frequent value (mode).

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data5 = pd.DataFrame(imp.fit_transform(data2))
data5
0 1 2 3
0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0
2 2.0 0.0 6.0 8.0
3 3.0 6.0 0.0 12.0
4 4.0 8.0 12.0 16.0
5 5.0 10.0 0.0 20.0
6 6.0 12.0 18.0 0.0
7 7.0 14.0 21.0 0.0
8 8.0 16.0 24.0 32.0
9 9.0 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data5)

Imputer_13_1.png

In this data every value occurs exactly once, so there is no true mode; in that case, the smallest value in the column seems to be used.
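
When a column does contain a repeated value, that value is used to fill the gaps. A minimal sketch with made-up data to confirm this:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({'X': [1.0, 2.0, 2.0, np.nan]})
imp = SimpleImputer(strategy='most_frequent')
print(imp.fit_transform(toy))  # the NaN is filled with the mode, 2.0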

constant

You can also fill missing values with a predetermined constant.

from sklearn.impute import SimpleImputer

imp = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=99)
data6 = pd.DataFrame(imp.fit_transform(data2))
data6
0 1 2 3
0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0
2 2.0 99.0 6.0 8.0
3 3.0 6.0 99.0 12.0
4 4.0 8.0 12.0 16.0
5 5.0 10.0 99.0 20.0
6 6.0 12.0 18.0 99.0
7 7.0 14.0 21.0 99.0
8 8.0 16.0 24.0 32.0
9 9.0 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data6)

Imputer_16_1.png

Well, this one looks quite unnatural!

KNNImputer

The KNNImputer class fills in missing values using the k-Nearest Neighbors approach. By default, nan_euclidean_distances, a Euclidean distance metric that supports missing values, is used to find the nearest neighbors. Each missing value is imputed from the values of the n_neighbors nearest neighbors that have a value for that feature; the neighbors' values are either averaged uniformly or weighted by distance to each neighbor. If a sample has more than one feature missing, its neighbors can differ depending on which feature is being imputed. If fewer than n_neighbors neighbors have a defined distance to the sample, the training-set mean of that feature is used; if at least one neighbor has a defined distance, the (weighted or unweighted) average of those neighbors is used.
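
As mentioned above, the neighbors' values can be weighted by distance instead of being averaged uniformly. A minimal sketch using the weights parameter (the default is weights='uniform'):

from sklearn.impute import KNNImputer

# weight each neighbor by the inverse of its distance instead of plain averaging
imputer = KNNImputer(n_neighbors=2, weights='distance')
pd.DataFrame(imputer.fit_transform(data2))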

n_neighbors=2

Let's explicitly set the number of neighbors to consider to n_neighbors = 2.

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
data7 = pd.DataFrame(imputer.fit_transform(data2))
data7
0 1 2 3
0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0
2 2.0 4.0 6.0 8.0
3 3.0 6.0 9.0 12.0
4 4.0 8.0 12.0 16.0
5 5.0 10.0 15.0 20.0
6 6.0 12.0 18.0 18.0
7 7.0 14.0 21.0 26.0
8 8.0 16.0 24.0 32.0
9 9.0 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data7)

Imputer_19_1.png

It seems that consecutive missing values, as in column D, cannot be filled very well.

default(n_neighbors=5)

By default, the 5 nearest neighbors are considered (n_neighbors=5).

from sklearn.impute import KNNImputer
imputer = KNNImputer()
data8 = pd.DataFrame(imputer.fit_transform(data2))
data8
0 1 2 3
0 0.0 0.0 0.0 0.0
1 1.0 2.0 3.0 4.0
2 2.0 5.2 6.0 8.0
3 3.0 6.0 12.0 12.0
4 4.0 8.0 12.0 16.0
5 5.0 10.0 16.2 20.0
6 6.0 12.0 18.0 23.2
7 7.0 14.0 21.0 23.2
8 8.0 16.0 24.0 32.0
9 9.0 18.0 27.0 36.0
%matplotlib inline
import matplotlib.pyplot as plt
plt.plot(data8)

Imputer_22_1.png

Column D is now filled somewhat better, but in exchange the filled values in columns B and C have become slightly worse.

Summary

There is probably no perfect way to fill in missing values, so consider the characteristics of your data and choose the method that fits it best!
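
Since imputation is a pre-processing step for machine learning, in practice the imputer is often combined with an estimator in a Pipeline so that the fill values learned from the training data are reused at prediction time. A minimal sketch (the target y and the choice of LinearRegression are made up purely for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

# hypothetical features and target built from the test data, just for illustration
X = data2[['A', 'B', 'C']]
y = data2['D'].fillna(data2['D'].mean())

model = make_pipeline(SimpleImputer(strategy='median'), LinearRegression())
model.fit(X, y)           # the imputer learns its fill values from the training data
print(model.predict(X))   # the same fill values are applied before prediction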
