Previous article: [Python] First Data Analysis / Machine Learning (Kaggle) ~ Part2 ~. Following on from that, this time we take on the relatively approachable Kaggle competition "House Prices: Advanced Regression Techniques"!
The goal of this competition is to predict the price of a house from variables describing it. However, there are 80 such variables, and I suddenly got scared ... (laughs)
While wondering "Can I really pull this off?", I once again borrowed the wisdom of my predecessors! lol Reference code ↓↓↓
The general flow is as follows.
And in this article, we'll focus on **feature engineering**!
#Library import
import numpy as np #linear algebra
import pandas as pd #Data processing, csv file operation
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # Ignore unnecessary warnings (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #Statistical manipulation
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Decimal point setting
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) # List the files in the input directory
output
data_description.txt
sample_submission.csv
test.csv
train.csv
Data frame creation
#Data acquisition, data frame creation
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
Let's take a look at the data frame!
##Displaying data frames
train.head(5)
#Number of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' columns
train_ID = train['Id']
test_ID = test['Id']
#ID is unnecessary in the prediction process, so delete it
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#Check again if the ID has disappeared
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
output
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)
The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
The training data has 1460 rows and 80 features, and the test data has 1459 rows and 79 features!
Eh, **way too many features!!** How am I supposed to analyze this?
**For now, let's deal with the missing values and skewness, and convert the categorical data to numbers**!
Let's start with the missing values! As a rule of thumb, *features with a missing-value rate of 15% or more can simply be dropped!*
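As a rough sketch of that rule of thumb (we won't actually use this below; df is a toy DataFrame made up purely for illustration):
import pandas as pd
import numpy as np

# Toy DataFrame just for illustration; the column names are made up
df = pd.DataFrame({
    'mostly_missing': [1.0, np.nan, np.nan, np.nan],
    'complete':       [1.0, 2.0, 3.0, 4.0],
})

missing_ratio = df.isnull().sum() / len(df)              # missing rate per column (0.0 - 1.0)
cols_to_drop = missing_ratio[missing_ratio >= 0.15].index
df = df.drop(columns=cols_to_drop)                       # only 'complete' survives
print(df.columns.tolist())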
To **handle the missing values in one go** for both the training data (train) and the test data (test), we first **combine the two datasets**!
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
output
all_data size is : (2917, 79)
The combined data has 2917 rows and 79 features, so the integration went nicely!
To efficiently identify the features with missing values, we will process them in the following flow:
**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
**2. Visualize the missing rate**
**3. Decide, feature by feature, whether to delete it or fill in a value**
**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]#Extract only variables that contain missing values
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})#Put it in the data frame.
missing_data.head(20)
output
**2. Visualize the missing rate**
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
**3. Decide, feature by feature, whether to delete it or fill in a value** We will now fill in values for the features with missing values shown in the graph above! It's a bit tedious because we go through each feature one by one, but let's do it!
- **PoolQC**: Mostly missing values. Fill the missing values with None, meaning "no pool"
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
- **MiscFeature**: According to the data description, NA means "no other features". Fill missing values with None
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
- **Alley**: According to the data description, NA means "no alley access". Fill missing values with None
all_data["Alley"] = all_data["Alley"].fillna("None")
- **Fence**: NA means "no fence". Fill missing values with None
all_data["Fence"] = all_data["Fence"].fillna("None")
- **FireplaceQu**: NA means "no fireplace". Fill missing values with None
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
- **LotFrontage**: Fill each missing value with the median LotFrontage of the neighboring houses. By the way, median() returns the median.
#Group by Neighborhood and fill missing values with the group's median LotFrontage
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
- **GarageType, GarageFinish, GarageQual and GarageCond**: Fill missing values with None
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
- **GarageYrBlt, GarageArea and GarageCars**: Fill missing values with 0
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
- **BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath**: Fill missing values with 0
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
- **BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2**: Fill the missing values of these categorical features with None
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
- **MasVnrArea and MasVnrType**: Fill MasVnrType with None and MasVnrArea with 0
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
- **MSZoning (the general zoning classification)**: 'RL' is by far the most common value, so fill the missing values with the mode, 'RL'.
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
- **Utilities**: Dropped because nearly every record has the same value, so it adds nothing to the model
all_data = all_data.drop(['Utilities'], axis=1)
- **Functional**: According to the data description, NA means typical, so fill with "Typ"
all_data["Functional"] = all_data["Functional"].fillna("Typ")
- **Electrical**: There is only one missing value; fill it with the mode, 'SBrkr'. (mode() returns the most frequent value)
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
- **KitchenQual**: Only one missing value; fill it with the mode
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
- **Exterior1st and Exterior2nd**: Fill the missing values with the mode
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
- **SaleType**: Fill with the mode
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
- **MSSubClass**: Fill NaN with None
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
Check the remaining missing values
#Check if there are any missing values
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()
No missing values remain. This completes the missing-value processing! There were a lot of them ...
Categorical data is data measured on a nominal or ordinal scale. Roughly speaking, it's non-numeric data!
If the categorical data is left as it is, it can't be analyzed or used for training, so we will convert it to numbers!
Ordinal-scale data is data where only the order is meaningful. For example, fast-food drink sizes "S, M, L" can be encoded as S → 0, M → 1, L → 2. One thing to note about ordinal data is that **you cannot meaningfully compute statistics such as the mean or standard deviation on it**.
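Here is a tiny illustration of ordinal encoding (the drink_size column is made up and is not part of this dataset):
import pandas as pd

# Hypothetical example, not part of the House Prices data
drinks = pd.DataFrame({'drink_size': ['S', 'M', 'L', 'M', 'S']})
size_order = {'S': 0, 'M': 1, 'L': 2}                      # the order carries the meaning
drinks['drink_size_encoded'] = drinks['drink_size'].map(size_order)
print(drinks)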
First, we convert the numeric features that are really categorical/ordinal into strings. (This is so that all the categorical data can be encoded together later.)
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
LabelEncoder() converts both ordinal and nominal data into numbers! Learn the categories with .fit(), then convert them to numbers with .transform(). Reference: How to use LabelEncoder of scikit-learn
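As a quick toy example of the fit/transform flow (the values below are made up just for illustration):
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
lbl.fit(['None', 'Fa', 'TA', 'Gd', 'Ex', 'TA'])   # learn the set of categories
print(lbl.classes_)                               # ['Ex' 'Fa' 'Gd' 'None' 'TA'] (sorted alphabetically)
print(lbl.transform(['TA', 'None', 'Ex']))        # [4 3 0]
Now let's apply it to our data: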
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
output
Shape all_data: (2917, 78)
Since the total floor area is also important, we will add the sum of TotalBsmtSF, 1stFlrSF, and 2ndFlrSF as a new feature!
# Adding total sqfootage feature
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
In machine learning, it is often said that **accuracy improves when the data follows a normal distribution**! So first we check the **skewness** of the current data (how far it deviates from a normal distribution), and then **use a Box Cox transform to bring the data closer to a normal distribution!**
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
As expected, you can see several features that deviate strongly from a normal distribution!
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
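For reference, boxcox1p(x, lam) computes ((1 + x)**lam - 1) / lam (and log1p(x) when lam is 0), so a quick check shows what the transform does to a single value:
from scipy.special import boxcox1p
import numpy as np

x, lam = 100.0, 0.15
print(boxcox1p(x, lam))                    # Box Cox transform of 1 + x
print(((1.0 + x) ** lam - 1.0) / lam)      # same value, computed by hand
print(boxcox1p(x, 0.0), np.log1p(x))       # with lam = 0 it reduces to log1p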
If you use dummy variables, each categorical value gets its own column of 0s and 1s. Reference: [How to create dummy variables and precautions](https://newtechnologylifestyle.net/%E3%83%80%E3%83%9F%E3%83%BC%E5%A4%89%E6%95%B0%E3%81%AE%E4%BD%9C%E3%82%8A%E6%96%B9%E3%81%A8%E6%B3%A8%E6%84%8F%E7%82%B9%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/)
all_data = pd.get_dummies(all_data)
print(all_data.shape)
output
(2917, 220)
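To see concretely what get_dummies does, here is a toy example (the color and size columns are made up and not part of this dataset):
import pandas as pd

toy = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
print(pd.get_dummies(toy))
# The 'color' column is replaced by indicator columns 'color_blue' and 'color_red';
# the numeric 'size' column is left unchanged.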
This completes the data preprocessing!
Let's split the combined learning and test data!
train = all_data[:ntrain]
test = all_data[ntrain:]
This time, we focused on **data preprocessing** for Kaggle's house price prediction competition! With 80 features it felt quite daunting at first, but I think I managed to handle it properly by following the steps below!
Data preprocessing procedure
**1. Combine the train and test data**
**2. Handle missing values**
**3. Handle categorical data**
In the next article, we will actually train a model and make predictions!!
Thank you for reading!!