# Introduction

Previous articles: "[Python] First Data Analysis / Machine Learning (Kaggle)" and "[Python] First Data Analysis / Machine Learning (Kaggle) ~ Part2 ~". Following on from those, this time I took on the relatively gentle Kaggle competition "House Prices: Advanced Regression Techniques"!

The goal of this competition is to estimate the sale price of a house from variables that describe it. However, there are 80 such variables, and I suddenly got scared ... (laughs)

While wondering "Can I really do this?", I once again borrowed the wisdom of my predecessors! Reference code ↓↓↓

The general flow is as follows.

1. Feature engineering
2. **Imputing missing values**: fill in missing values
3. **Transforming**: data conversion (log transformation, etc.)
4. **Label encoding**: encode categorical data
5. **Box-Cox transformation**: transform features to bring them closer to a normal distribution
6. **Getting dummy variables**: convert categorical data to numerical data
7. Modeling (stacking ensemble learning)
8. Base model analysis
9. Second model analysis

# Data acquisition / library import

``````#Library import

import numpy as np #linear algebra
import pandas as pd #Data processing, csv file operation
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # Ignore unnecessary warnings (from sklearn and seaborn)

from scipy import stats
from scipy.stats import norm, skew #Statistical manipulation

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Decimal point setting

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #Check if the file is valid
``````

#### `output`

``````
data_description.txt
sample_submission.csv
test.csv
train.csv
``````

Data frame creation

``````#Data acquisition, data frame creation
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
``````

Let's take a look at the data frame!

``````##Displaying data frames
train.head(5)
``````
``````#Number of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#ID is unnecessary in the prediction process, so delete it
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#Check again if the ID has disappeared
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
``````

#### `output`

``````
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)

The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
``````

The training data has 1460 rows and 80 columns (79 features plus SalePrice), and the test data has 1459 rows and 79 features!

Eh, **too many features!!** How should I analyze this?

**For the time being, let's deal with missing values and skewness, and convert the categorical data to numbers!**

# Data preprocessing

Let's start with the missing values! As a rule of thumb, *a feature with a missing rate of 15% or more can be dropped!*
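As a minimal sketch of that rule of thumb, the snippet below computes the per-column missing rate on a small made-up data frame and drops the columns at or above the threshold. The frame `toy`, its values, and the helper `drop_sparse_columns` are hypothetical and only for illustration; note that the rest of this article fills in missing values instead of dropping columns.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.15):
    """Return (df without columns whose missing rate >= threshold, dropped names)."""
    missing_rate = df.isnull().mean()  # fraction of NaN per column
    to_drop = missing_rate[missing_rate >= threshold].index.tolist()
    return df.drop(columns=to_drop), to_drop

# Made-up example data (not the competition data)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],   # 75% missing
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
})
kept, dropped = drop_sparse_columns(toy)
print(dropped)
```

Whether dropping is the right choice depends on what a missing value means: for columns where NA stands for "the house does not have this", filling in a value keeps that information.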

## 1. Missing value

To **handle missing values in one go** for both the training data (train) and the test data (test), we first **combine the two data sets**!

``````ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
``````

#### `output`

``````
all_data size is : (2917, 79)
``````

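The combined data has 2917 rows and 79 features, so it was integrated nicely! (Note that 1460 + 1459 = 2919; the 2917 shown here implies that two training rows had already been dropped, presumably in an outlier-removal step that is not shown in this article.)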

To identify the features with missing values efficiently, we proceed as follows:

1. **Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
2. **Visualize the missing rates**
3. **Decide, feature by feature, whether to delete it or fill in values**

We will process missing values in this flow!

**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**

``````all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]  # Keep only features that contain missing values (top 30)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})  # Put them in a data frame
missing_data.head(20)
``````


**2. Visualize the missing rates**

``````f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)  # rotation must be a number (or 'vertical'/'horizontal'), not the string '90'
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
``````

**3. Decide, feature by feature, whether to delete it or fill in values** We will now fill in values for the features with missing values shown in the graph above! It is a bit tedious because we look at the features one by one, but let's do it!

- **PoolQC**: Almost all values are missing. Fill the missing values with None, meaning "no pool".

``````all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
``````

- **MiscFeature**: According to the data description, NA means "no miscellaneous feature". Fill the missing values with None.

``````all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
``````

- **Alley**: According to the data description, NA means "no alley access". Fill the missing values with None.

``````all_data["Alley"] = all_data["Alley"].fillna("None")
``````

- **Fence**: NA means "no fence". Fill the missing values with None.

``````all_data["Fence"] = all_data["Fence"].fillna("None")
``````

- **FireplaceQu**: NA means "no fireplace". Fill the missing values with None.

``````all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
``````

- **LotFrontage**: Fill the missing values with the median LotFrontage of the houses in the same neighborhood. By the way, median() returns the median (not the mean).

``````#Group by neighborhood and fill missing values with the median LotFrontage of each group
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
``````
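To make the group-wise fill easier to follow, here is a small sketch on made-up data. The frame `toy` and its values are hypothetical; only the `groupby(...).transform(...)` pattern is taken from the code above.

```python
import numpy as np
import pandas as pd

# Made-up example data (not the competition data)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 20.0, np.nan, 40.0],
})

# transform() returns a result aligned with the original rows,
# so each NaN is replaced by the median of its own group.
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
print(toy)
```

The missing value in group A becomes 70 (the median of 60 and 80) and the one in group B becomes 30 (the median of 20 and 40), rather than both getting a single global value.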

- **GarageType, GarageFinish, GarageQual and GarageCond**: Fill the missing values with None.

``````for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
``````

- **GarageYrBlt, GarageArea and GarageCars**: Fill the missing values with 0.

``````for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
``````

- **BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath**: Fill the missing values with 0.

``````for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
``````

- **BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2**: These are categorical, so fill the missing values with None.

``````for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
``````

- **MasVnrArea and MasVnrType**: Fill the type with None and the area with 0.

``````all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
``````

- **MSZoning (the general zoning classification)**: 'RL' is by far the most common value, so fill the missing values with the mode (RL).

``````all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
``````
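For reference, here is a small sketch of how mode-based filling behaves. The series `zoning` is made up for illustration. `mode()` returns a Series (there can be ties), which is why the code takes `[0]`.

```python
import numpy as np
import pandas as pd

# Made-up example data (not the competition data)
zoning = pd.Series(["RL", "RM", "RL", np.nan, "RL", "FV"])

most_common = zoning.mode()[0]        # mode() returns a Series; take the first entry
filled = zoning.fillna(most_common)   # replace NaN with the most frequent value
print(most_common, filled.isnull().sum())
```

The same pattern is used below for Electrical, KitchenQual, Exterior1st, Exterior2nd and SaleType.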

- **Utilities**: Dropped, because almost every house has the same value ("AllPub"), so it contributes little to prediction.

``````all_data = all_data.drop(['Utilities'], axis=1)
``````

- **Functional**: According to the data description, NA means typical, so fill the missing values with Typ.

``````all_data["Functional"] = all_data["Functional"].fillna("Typ")
``````

- **Electrical**: There is only one missing value. Fill it with the mode, 'SBrkr'. mode() returns the most frequent value.

``````all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
``````

- **KitchenQual**: There is only one missing value, so fill it with the mode.

``````all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
``````

- **Exterior1st and Exterior2nd**: Fill the missing values with the mode.

``````all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
``````

- **SaleType**: Fill the missing values with the mode.

``````all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
``````

- **MSSubClass**: Fill NaN with None.

``````all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
``````

Check the remaining missing values

``````#Check if there are any missing values
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head()
``````

No missing values remain. This completes the processing of missing values! There were a lot of them ...

## 2. Category data processing

Categorical data is data measured on a nominal scale or an ordinal scale. Roughly speaking, it is non-numeric data!

If categorical data is left as-is, most models cannot analyze or learn from it, so we will convert it to numbers!

### Processing of ordinal scale data

Ordinal scale data is data whose values have a meaningful order but no meaningful distance between them. An example is fast food drink sizes "S, M, L" encoded as S → 0, M → 1, L → 2. One thing to note about ordinal data is that **numerical calculations such as the mean or standard deviation are not meaningful**.

First, convert the numeric columns that are really categories into strings, so that all categorical data can be converted to numbers together later.

``````#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
``````

### Convert category data to numeric data

LabelEncoder() converts the ordinal data and the nominal data to numbers together! .fit() learns the set of labels in a column and .transform() converts them to integers. Reference: How to use LabelEncoder of scikit-learn

``````
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape
print('Shape all_data: {}'.format(all_data.shape))
``````

#### `output`

``````
Shape all_data: (2917, 78)
``````
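One point worth keeping in mind: LabelEncoder assigns integer codes in the sorted (alphabetical) order of the labels, so the codes do not necessarily follow the quality order of grades such as Ex, Gd, TA, Fa, Po. If that order matters for a model, an explicit mapping is one alternative. The sketch below uses pandas only; the dictionary `quality_order` and the sample series are hypothetical and not part of the reference code.

```python
import pandas as pd

# Hypothetical explicit ordering, worst to best; "None" marks an absent feature.
quality_order = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

# Made-up example data (not the competition data)
kitchen = pd.Series(["Gd", "TA", "Ex", "Fa", "TA"])
encoded = kitchen.map(quality_order)   # codes follow the quality order

# For comparison: codes assigned in alphabetical order of the labels.
alphabetical = {label: i for i, label in enumerate(sorted(kitchen.unique()))}
print(encoded.tolist())
print(alphabetical)
```

With the alphabetical codes, "Ex" (the best grade) gets the smallest number, which is harmless for tree-based models but can mislead models that treat the code as a magnitude.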

### Supplement 1: Addition of new features

Since the total floor area is also important, we will add the sum of TotalBsmtSF, 1stFlrSF and 2ndFlrSF as a new feature!

``````# Adding total sqfootage feature
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
``````

## 3. Transform data toward a normal distribution

In machine learning, it is often said that **models tend to be more accurate when the data follows a normal distribution**! So, **first look at the skewness of the current data (how asymmetric each distribution is) and then use the Box-Cox transformation to bring the data closer to a normal distribution!**

### 1. First of all, the skewness of the data

``````numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
``````

As expected, you can see several features that are clearly skewed rather than normally distributed!
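As a rough illustration of what skewness measures, the sketch below compares a made-up right-skewed sample before and after a log transformation. It uses pandas' `Series.skew()`, which applies a bias correction and therefore gives slightly different numbers from the default of `scipy.stats.skew` used above; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values: each one is double the previous one.
area = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])

skew_before = area.skew()            # clearly positive: long right tail
skew_after = np.log1p(area).skew()   # log1p compresses the large values
print(skew_before, skew_after)
```

A skewness near 0 indicates a roughly symmetric distribution, and a large positive value indicates a long tail on the right, which is common for areas and prices.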

### 2. Convert to normal distribution with BoxCox!

``````skewness = skewness[abs(skewness['Skew']) > 0.75]  # filter on the column so that only rows with |skew| > 0.75 remain
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)

#all_data[skewed_features] = np.log1p(all_data[skewed_features])
``````
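For readers wondering what `boxcox1p` computes: it applies the Box-Cox transformation to 1 + x, that is ((1 + x)^λ - 1) / λ, which becomes log(1 + x) when λ = 0. The NumPy sketch below only illustrates that formula; the helper name `boxcox1p_np` and the sample values are hypothetical, and in practice `scipy.special.boxcox1p` should be used as in the code above.

```python
import numpy as np

def boxcox1p_np(x, lam):
    """Box-Cox transform of 1 + x: ((1 + x)**lam - 1) / lam, or log1p(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    if lam == 0:
        return np.log1p(x)
    return ((1.0 + x) ** lam - 1.0) / lam

# Made-up example values spanning several orders of magnitude
values = np.array([0.0, 1.0, 10.0, 1000.0])
transformed = boxcox1p_np(values, 0.15)
print(transformed)
```

The transformation maps 0 to 0, keeps the order of the values, and compresses large values much more than small ones, which is what pulls in a long right tail. Working on 1 + x is what allows features that contain zeros to be transformed.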

## 4. Finally, create dummy variables for the categorical features!

A dummy variable represents each category of a categorical feature as its own column containing 0 or 1. Reference: [How to create dummy variables and precautions](https://newtechnologylifestyle.net/%E3%83%80%E3%83%9F%E3%83%BC%E5%A4%89%E6%95%B0%E3%81%AE%E4%BD%9C%E3%82%8A%E6%96%B9%E3%81%A8%E6%B3%A8%E6%84%8F%E7%82%B9%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/)

``````all_data = pd.get_dummies(all_data)
print(all_data.shape)
``````

#### `output`

``````
(2917, 220)
``````
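To see why the number of columns grows so much, here is a small sketch on made-up data. The frame `toy` and its values are hypothetical; only the call to `pd.get_dummies` is taken from the code above.

```python
import pandas as pd

# Made-up example data (not the competition data)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "SaleType": ["WD", "New", "WD"],   # non-numeric column
})

# Numeric columns are kept as they are; each category of a
# non-numeric column becomes its own indicator column.
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
```

One non-numeric column with two categories turns into two indicator columns, so the remaining string columns in all_data expand into many columns in the same way.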

# 3. Feature engineering completed

This completes the data preprocessing!

Let's split the combined data back into the training data and the test data!

``````train = all_data[:ntrain]
test = all_data[ntrain:]
``````
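The whole combine, preprocess, split round trip can be sketched on made-up data as follows. The frames `toy_train` and `toy_test` and their values are hypothetical; the pattern mirrors the `ntrain` bookkeeping used in this article.

```python
import pandas as pd

# Made-up example data (not the competition data)
toy_train = pd.DataFrame({"LotArea": [8450, 9600, 11250],
                          "SalePrice": [208500, 181500, 223500]})
toy_test = pd.DataFrame({"LotArea": [11622, 14267]})

n_train = toy_train.shape[0]
target = toy_train["SalePrice"].values            # keep the target separately
combined = pd.concat((toy_train, toy_test)).reset_index(drop=True)
combined = combined.drop(columns=["SalePrice"])   # test rows have no target

# ... shared preprocessing would go here ...

# The first n_train rows are the training data, the rest are the test data.
train_part = combined[:n_train]
test_part = combined[n_train:]
print(train_part.shape, test_part.shape)
```

This only works if the preprocessing in between neither reorders nor removes rows, because the split relies purely on row position.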

# Summary

This time, we focused on **data preprocessing** for Kaggle's house price prediction competition! There are 80 features, and it felt quite difficult at first, but I think I was able to handle it properly by following the steps below!

Data preprocessing procedure:

1. **Integration of train and test data**
2. **Missing value processing**
3. **Category data processing**

In the next article, we will actually train a model and make predictions!

Thank you for reading!