[PYTHON] Feature preprocessing for modeling

Introduction

This is a note on preprocessing features for modeling, written while analyzing the data myself. EDA and the other steps that come before preprocessing are covered in Tips and precautions when analyzing data, so please refer to that article.

The important points in preprocessing are: **handling missing values** and **knowing the type of the feature being preprocessed**.

This article describes preprocessing methods for two feature types: **Numeric (numerical data)** and **Categorical**. Note that there are other kinds of preprocessing for data such as datetime and location information, but they are not covered in this article.


Background

This time, I will explain using the data from a Kaggle competition: the Kaggle House Prices DataSet.

This note is based on a Kaggle House Prices kernel.

Since this is kept as a memo, please note that model accuracy and the like are not a concern here.


Data contents
There are 81 columns in total, of which SalePrice is used as the objective variable (target) for the analysis. The explanatory variables are everything other than SalePrice.

Missing value processing

First, check the contents of the data. If it contains missing values, there are two approaches to dealing with them:

1. Delete the columns or rows that contain missing values.
2. Fill in (impute) the missing values with some other value.

When doing the above, it is important to keep in mind **why you are analyzing the data and what you want to produce as output**. There are broadly two motivations for data analysis: **the first is building a model that predicts the objective variable (target) with high accuracy, and the second is understanding the data**. When building a highly accurate predictive model, carelessly deleting columns and rows that contain missing values can significantly reduce the amount of data, which is not a good idea. In that case, it is better to impute the missing values; a typical choice is the column mean. On the other hand, when the goal is understanding the data, carelessly imputing and over-modifying the data may lead to misunderstanding it. **In other words, how you handle missing values depends on what matters in your analysis.**

Below is code for both approaches: 1. deleting columns and rows that contain missing values, and 2. filling in missing values with other values.


import pandas as pd

## load Data
df = pd.read_csv("train.csv")  # or pd.read_json etc., depending on the file format

## count nan
df.isnull().sum()
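
Since there are 81 columns, the full output is long. As a small additional sketch (my addition, not from the original kernel), you can display only the columns that actually contain missing values:

# Show only the columns that contain at least one missing value
df.isnull().sum()[df.isnull().sum() > 0]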

From this output, we can see that the data contains missing values. Now, I would like to handle the missing values in *LotFrontage*.

Drop columns and rows containing missing values: dropna()

- .dropna(how="???", axis=(0 or 1)): how takes "any" or "all"; axis=0 drops rows, axis=1 drops columns

With "any", a column or row containing even one missing value is deleted. With "all", only columns or rows whose values are all missing are deleted.

df.dropna(how="any", axis=0)


df.dropna(how="all", axis=0)


With how="any", rows along the specified axis are deleted if they contain even one missing value; as a result, the shape of df becomes (0, 81). With how="all", rows are deleted only if all of their values are missing; as a result, the shape of df remains (1460, 81).
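
A quick way to confirm this is to compare the shapes directly (continuing with the df loaded above):

# Every row contains at least one NaN, so how="any" drops them all
df.dropna(how="any", axis=0).shape  # (0, 81)

# No row consists entirely of NaN, so how="all" drops nothing
df.dropna(how="all", axis=0).shape  # (1460, 81)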

**When you want to delete rows/columns that have a missing value in a specific column or row**

- .dropna(subset=["???"]): select specific columns/rows with subset

df.dropna(subset=["LotFrontage"])


By passing subset as an argument, you can delete only the rows/columns that have a missing value in the specified column/row. This argument can work quite effectively.

Fill in missing values with other values: fillna()

- .fillna(???, inplace=bool (True or False)): fills missing values with whatever is given as ???; setting inplace=True modifies the original object

With inplace=True the object is updated in place without using extra memory, but because the original object is overwritten, it becomes harder to reuse. For data that is not particularly large, I therefore recommend creating a new object instead. (Personal opinion...)

# Fill NaN with 0
df.fillna(0)

# Filling multiple columns at once (mean, median, and mode respectively)
df.fillna({'LotFrontage': df["LotFrontage"].mean(),
           'PoolArea': df["PoolArea"].median(),
           'MoSold': df["MoSold"].mode().iloc[0]})

As shown above, the column's mean or median can be passed as the fill value. For the mode, mode() returns a Series (there can be ties), so iloc[0] is used to take the first value.
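
As a minimal illustration of why iloc[0] is needed:

df["MoSold"].mode()          # Series containing the most frequent value(s)
df["MoSold"].mode().iloc[0]  # scalar: the first mode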

fillna supports various other filling methods, so please check the official documentation: pandas.DataFrame.fillna

Numeric (numerical data)

The first thing to do when handling numeric data is **scaling**. Scaling means converting values into a fixed range. For example, suppose you want to predict ice cream sales from variables such as temperature, precipitation, and humidity: each variable has different units and value ranges. Training on them as-is may not go well, so the values need to be adjusted to a common range. This is scaling.

There are several methods for scaling features. In this article, I would like to touch on the three I use relatively often: 1. MinMaxScaler, 2. StandardScaler, 3. log transformation.

1. MinMaxScaler

This converts the values of all features to the same scale: subtract the minimum from every value and divide by the difference between the maximum and the minimum, so that all values fall between 0 and 1. **However, this method has a drawback: because the values are confined to the 0-1 range, the standard deviation becomes small and the influence of outliers is suppressed.** If you need to pay attention to outliers, it can be difficult to account for them with this method.
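
As a minimal NumPy sketch of the same computation (x here is just an illustrative array, not part of the original kernel):

import numpy as np

x = np.array([1.0, 5.0, 10.0])
x_scaled = (x - x.min()) / (x.max() - x.min())  # array([0., 0.444..., 1.])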

sklearn.preprocessing.MinMaxScaler

from sklearn import preprocessing

# First, extract the names of the dtype=number columns from df
numb_columns = df.select_dtypes(include=['number']).columns

# Extract only the dtype=number data
num_df = df[numb_columns]

# MinMaxScaler
mm_scaler = preprocessing.MinMaxScaler()

# fit_transform returns an array
num_df_fit = mm_scaler.fit_transform(num_df)

# Convert the array to a DataFrame
num_df_fit = pd.DataFrame(num_df_fit, columns=numb_columns)

**After the conversion is done, it is a good idea to check that the scaling worked correctly.** For example:


# Check the maximum value of each feature
num_df_fit.max()

# Check the minimum value of each feature
num_df_fit.min()

It's a good idea to run checks like this after every conversion.

2. StandardScaler

This converts the data to a standardized distribution with mean 0 and variance 1. First, subtract the mean so that the values center around 0; then divide by the standard deviation, so that the resulting distribution has mean 0 and standard deviation 1. StandardScaler has the same drawback regarding outliers as MinMaxScaler: outliers affect the computed mean and standard deviation, narrowing the range of the feature. **In particular, since each feature's outliers have different magnitudes, the spread of the converted data can differ significantly from feature to feature.**

sklearn.preprocessing.StandardScaler


# StandardScaler
s_scaler = preprocessing.StandardScaler()

# fit_transform returns an array
num_df_s_fit = s_scaler.fit_transform(num_df)

# Convert the array to a DataFrame
num_df_s_fit = pd.DataFrame(num_df_s_fit, columns=numb_columns)
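
As with MinMaxScaler, it is worth checking the result afterwards; each feature's mean should be approximately 0 and its standard deviation approximately 1:

# Check that each column's mean is ~0 and standard deviation is ~1
num_df_s_fit.mean()
num_df_s_fit.std()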

3. Log transformation

Log transformation was covered in a previous article, so I will only touch on it briefly. Machine learning models often assume a normal distribution, so when a variable does not follow one, a log transformation or similar may be applied to bring it closer to normal. See Tips and precautions when performing data analysis.


import numpy as np

num_df = np.log(num_df)
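
One caveat (my note, not from the original kernel): this dataset contains columns with zeros, such as PoolArea, and np.log(0) evaluates to -inf. In such cases np.log1p, which computes log(1 + x), is often used instead:

# log1p computes log(1 + x), so zeros map to 0 instead of -inf
num_df_log = np.log1p(num_df)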

Categorical

Categorical data is string data such as gender (male/female) or prefectures (Hokkaido/Aomori/Iwate...). When handling such data, it is usually converted to numeric data first. In this article, I would like to touch on the two methods I use relatively often: 1. Label Encoding and 2. One Hot Encoding.

1. Label Encoding

I think this is the most commonly used method. It extracts the unique values of the selected feature and maps each of them to a different integer. This turns a string feature into a numeric one.

sklearn.preprocessing.LabelEncoder

from sklearn import preprocessing
le = preprocessing.LabelEncoder()

# First, extract the names of the dtype=object columns from df
obj_columns = df.select_dtypes(include=["object"]).columns

# Extract only the dtype=object data
obj_df = df[obj_columns]

# Extract the unique values of the Street column
str_uniq = obj_df["Street"].unique()

# LabelEncoder
le.fit(str_uniq)

list(le.classes_)

le.transform(str_uniq)


As above, fit the encoder with **le.fit**, get the unique classes with **list(le.classes_)**, and map the unique values to integers with **le.transform()**.
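
As a small extra sketch, the mapping can also be reversed with **le.inverse_transform()**, which is handy for checking the encoding (for the Street column, the classes are 'Grvl' and 'Pave'):

# Map encoded integers back to the original string labels
le.inverse_transform([0, 1])  # array(['Grvl', 'Pave'], dtype=object)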

Tree-based models (random forests, etc.) are said to handle values produced by LabelEncoder well, while non-tree-based models (regression analysis, etc.) do not benefit from it. **This is because the unique values are converted to integers whose magnitudes carry no meaning, yet a regression model treats those magnitudes as meaningful.**

2. One Hot Encoding

Since LabelEncoder is not suited to regression analysis, One Hot Encoding is used when applying such models. (There may be some misunderstanding on my part here...) This is because One Hot Encoding output is already scaled, with a maximum of 1 and a minimum of 0. Conversely, it generates one binary feature per category, so it is not well suited to tree-based models.

sklearn.preprocessing.OneHotEncoder

from sklearn import preprocessing

# One Hot Encoder
oh = preprocessing.OneHotEncoder()

str_oh = oh.fit_transform(obj_df.values)

Note that transform takes an array as its argument, which is why .values is passed.
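
By default, fit_transform returns a SciPy sparse matrix. A minimal sketch of converting it into a DataFrame (get_feature_names_out assumes a recent scikit-learn; depending on the version, NaNs may need to be filled beforehand):

# Densify the sparse matrix and attach the generated column names
str_oh_df = pd.DataFrame(str_oh.toarray(),
                         columns=oh.get_feature_names_out(obj_columns))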

Summary

In this article, we covered the preprocessing of features for modeling. On Kaggle and elsewhere, **Target Encoding** is often used for categorical data. I could not include it in this article due to my own lack of study, but I would like to add it as soon as I have studied it.

Articles for data analysis

Tips and precautions when analyzing data
