[Python] Data analysis, machine learning practice (Kaggle) -Data preprocessing-

Introduction

This is a follow-up to my previous articles, "[Python] First Data Analysis / Machine Learning (Kaggle)" and "[Python] First Data Analysis / Machine Learning (Kaggle) ~ Part2 ~". This time I took on another relatively gentle Kaggle competition, "House Prices: Advanced Regression Techniques"!

The goal of this competition is to predict the sale price of a house from variables describing the house. However, there are about 80 of these variables, and I suddenly got scared ... (laughs)

While wondering "Can I really do this?", I once again borrowed the wisdom of my predecessors! (laughs) Reference code ↓↓↓

The general flow is as follows.

  1. Feature engineering
  2. **Imputing missing values**: fill in missing values
  3. **Transforming**: data conversion (log transformation, etc.)
  4. **Label Encoding**: encoding categorical data
  5. **Box-Cox Transformation**: transformation to bring the data closer to a normal distribution
  6. **Getting dummy variables**: convert categorical data to numerical data
  7. Modeling (stacking ensemble learning)
  8. Base model analysis
  9. Second model analysis

In this article, we'll focus on **feature engineering**!

Data acquisition / library import

#Library import

import numpy as np #linear algebra
import pandas as pd #Data processing, csv file operation
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn #Ignore unnecessary warnings(from sklearn and seaborn)


from scipy import stats
from scipy.stats import norm, skew #Statistical manipulation

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Decimal point setting

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) #List the files in the input directory

output


data_description.txt
sample_submission.csv
test.csv
train.csv

Data frame creation

#Data acquisition, data frame creation
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

Let's take a look at the data frame!

##Displaying data frames
train.head(5)
[Screenshot: output of train.head(5)]
#Number of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

#Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

#ID is unnecessary in the prediction process, so delete it
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

#Check again if the ID has disappeared
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))

output


The train data size before dropping Id feature is : (1460, 81) 
The test data size before dropping Id feature is : (1459, 80) 

The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79) 

The training data has 1460 rows and 80 columns (79 features plus the target SalePrice), and the test data has 1459 rows and 79 features!

Eh, **too many features!!** How should I analyze this?

**For the time being, let's deal with missing values, skewness, and categorical data**!

Data preprocessing

Let's start with the missing values! As a rule of thumb, *a feature whose missing rate is 15% or more is a candidate for deletion!*
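As a rough illustration of that rule of thumb, the sketch below computes the missing rate per column on a small made-up frame and drops the columns at or above a threshold. The frame `toy`, the helper `drop_sparse_columns`, and the 15% default are assumptions for illustration only; the rest of this article fills the missing values instead of dropping the columns.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold=0.15):
    """Return a copy of df without columns whose missing rate is >= threshold.

    Hypothetical helper for illustration; not part of the reference code.
    """
    missing_rate = df.isnull().mean()  # fraction of NaN per column
    keep = missing_rate[missing_rate < threshold].index
    return df[keep].copy()

# Made-up data: 'PoolQC' is mostly missing, the other columns are complete
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
    "Electrical": ["SBrkr", "SBrkr", "SBrkr", "FuseA"],
})

reduced = drop_sparse_columns(toy)
print(list(reduced.columns))
```

Whether dropping is the right call depends on what the missing value means. In this data set, NA often means "the house does not have this item", which is why filling can be the better choice.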

1. Missing values

To **handle missing values in one go** for both the training data (train) and the test data (test), we first **combine them** into a single data frame!

ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

output


all_data size is : (2917, 79)

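The combined data has 2917 rows and 79 features, so the two sets were merged nicely! (Note that 1460 + 1459 = 2919, so this output presumably comes from a run in which two rows, for example outliers, had already been removed from train; that step is not shown in this article.)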


To identify and handle the features with missing values efficiently, we will proceed in this flow:

  1. **Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
  2. **Visualize the missing rate**
  3. **For each feature, decide whether to delete it or fill in a value**

**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30] #Keep only features with missing values (top 30 by missing rate)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na}) #Put the result in a data frame
missing_data.head(20)

output

[Screenshot: table of the missing ratio per feature]

**2. Visualize the missing rate**

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90) #rotation takes the angle in degrees as a number
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
[Screenshot: bar chart "Percent missing data by feature"]

**3. For each feature, decide whether to delete it or fill in a value**

We will fill in values for the features with missing values shown in the graph above! It is a bit tedious because we consider the features one by one, but let's do it!

- **PoolQC**: Almost all values are missing. NA means "no pool", so fill missing values with "None"

all_data["PoolQC"] = all_data["PoolQC"].fillna("None")

- **MiscFeature**: According to the data description, NA means "no other features". Fill missing values with "None"

all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")

- **Alley**: According to the data description, NA means "no alley access". Fill missing values with "None"

all_data["Alley"] = all_data["Alley"].fillna("None")

- **Fence**: NA means "no fence". Fill missing values with "None"

all_data["Fence"] = all_data["Fence"].fillna("None")

- **FireplaceQu**: NA means "no fireplace". Fill missing values with "None"

all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")

- **LotFrontage**: Fill missing values with the median LotFrontage of the houses in the same neighborhood. Note that median() returns the median, not the mean

#Group by neighborhood and fill missing values with the group's median LotFrontage
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

- **GarageType, GarageFinish, GarageQual and GarageCond**: Fill missing values with "None"

for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')

- **GarageYrBlt, GarageArea and GarageCars**: Fill missing values with 0

for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

- **BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath**: Fill missing values with 0

for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)

- **BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2**: These are categorical, so fill missing values with "None"

for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

- **MasVnrArea and MasVnrType**: Fill the type with "None" and the area with 0

all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)

- **MSZoning (the general zoning classification)**: 'RL' is the most frequent value (the mode), so fill missing values with 'RL'

all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

- **Utilities**: Dropped because it does not help prediction (nearly all records share the same value)

all_data = all_data.drop(['Utilities'], axis=1)

- **Functional**: According to the data description, NA should be treated as typical, so fill missing values with "Typ"

all_data["Functional"] = all_data["Functional"].fillna("Typ")

- **Electrical**: There is only one missing value. Fill it with the mode, 'SBrkr'. mode() returns the most frequent value

all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])

- **KitchenQual**: There is only one missing value. Fill it with the mode

all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])

- **Exterior1st and Exterior2nd**: Fill missing values with the mode

all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])

- **SaleType**: Fill missing values with the mode

all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])

- **MSSubClass**: Fill NaN with "None"

all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
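The neighborhood-based fill used for LotFrontage and the mode-based fill used for columns such as KitchenQual are easier to see on a tiny example. The sketch below uses made-up values; the frame `toy` and the series `kitchen` are assumptions for illustration, not competition data.

```python
import numpy as np
import pandas as pd

# Made-up data: two neighborhoods, each with one missing LotFrontage
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 30.0, np.nan, 50.0],
})

# transform() returns a result aligned with the original rows,
# so each NaN is replaced by the median of its own group
filled = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(filled.tolist())

# Mode-based fill for a categorical column
kitchen = pd.Series(["TA", "Gd", "TA", None])
kitchen_filled = kitchen.fillna(kitchen.mode()[0])  # mode()[0] is the most frequent value
print(kitchen_filled.tolist())
```

The missing value in neighborhood A becomes 70 (the median of 60 and 80) and the one in B becomes 40, rather than both receiving a single global value.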

Check the remaining missing values

#Check if there are any missing values
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()

No missing values remain. This completes the handling of missing values! There were a lot of them ...

2. Categorical data processing

Categorical data is data measured on a nominal or ordinal scale. Roughly speaking, it is non-numeric data!

Most models cannot learn from categorical data as-is, so we will convert it to numbers!

Processing of ordinal scale data

Ordinal scale data is data whose values are meaningful only in their order. For example, fast-food drink sizes "S, M, L" can be encoded as S → 0, M → 1, L → 2. One thing to note is that **numerical calculations such as the mean or standard deviation are not meaningful for ordinal data**.
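The drink-size example can be written down directly. This is only a sketch with made-up values: the series `sizes` and the mapping `size_order` are assumptions for illustration, and it shows two common ways to keep the real order S < M < L when encoding.

```python
import pandas as pd

sizes = pd.Series(["M", "S", "L", "S"])

# Option 1: an explicit mapping that encodes the real order S < M < L
size_order = {"S": 0, "M": 1, "L": 2}
encoded = sizes.map(size_order)
print(encoded.tolist())

# Option 2: an ordered categorical; the codes follow the given category order
cat = pd.Categorical(sizes, categories=["S", "M", "L"], ordered=True)
print(list(cat.codes))
```

Both give the same integers here. The integers only express rank: the gap between S and M is not claimed to equal the gap between M and L.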

First, convert the columns that are stored as numbers but are really categories into strings, so that all the categorical data can be encoded together later.

#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)


#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)

Convert categorical data to numeric data

LabelEncoder() converts the ordinal and the nominal data to numbers in the same way! .fit() learns the set of labels in a column and .transform() converts each label to an integer. Reference: How to use LabelEncoder of scikit-learn


from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 
        'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(all_data[c].values)) 
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape        
print('Shape all_data: {}'.format(all_data.shape))

output


Shape all_data: (2917, 78)
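To see what LabelEncoder actually assigns, here is a minimal sketch on a made-up list of quality labels (the list `quality` is an assumption for illustration). It uses only fit(), transform(), inverse_transform() and the classes_ attribute.

```python
from sklearn.preprocessing import LabelEncoder

quality = ["TA", "Gd", "Ex", "None", "Gd"]

encoder = LabelEncoder()
encoder.fit(quality)                # learn the set of distinct labels
codes = encoder.transform(quality)  # map each label to an integer

print(list(encoder.classes_))  # labels in sorted order
print(codes.tolist())          # index of each label in classes_
print(list(encoder.inverse_transform(codes)))  # back to the original labels
```

Note that the integers follow the sorted order of the label strings, not the quality ranking, so "Ex" gets the smallest code here. If the true order matters for a model, an explicit mapping may be the safer choice.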

Supplement 1: Addition of new features

Since the total area of all floors is also important, we will add the sum of TotalBsmtSF, 1stFlrSF and 2ndFlrSF as a new feature!

# Adding total sqfootage feature 
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']

3. Transforming data toward a normal distribution

In machine learning, it is often said that **models tend to be more accurate when the data follows a normal distribution**! So we first **look at the skewness of the data (a measure of how asymmetric the distribution is) and then apply a Box-Cox transformation to bring the data closer to a normal distribution!**

1. First, check the skewness of the data

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
[Screenshot: top 10 features by skewness]

As expected, several features are clearly not normally distributed!

2. Bring the data closer to a normal distribution with Box-Cox!

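#Keep only features with |skew| > 0.75 (filter on the 'Skew' column so that rows are actually dropped)
skewness = skewness[skewness['Skew'].abs() > 0.75]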
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
    
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
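For reference, boxcox1p(x, lam) computes ((1 + x)^lam - 1) / lam when lam is not 0, and log(1 + x) when lam is 0. The sketch below checks this on made-up right-skewed values and compares the skewness before and after; the array `x` is an assumption for illustration.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Made-up right-skewed data (a few very large values)
x = np.array([0.0, 1.0, 2.0, 3.0, 5.0, 8.0, 20.0, 100.0])
lam = 0.15

transformed = boxcox1p(x, lam)
by_formula = ((1.0 + x) ** lam - 1.0) / lam  # definition for lam != 0

print(np.allclose(transformed, by_formula))
print(skew(x), skew(transformed))  # skewness before and after

# With lam = 0 the transform reduces to log(1 + x)
print(np.allclose(boxcox1p(x, 0.0), np.log1p(x)))
```

The transformed values are still right-skewed, but much less so than the raw ones. The value lam = 0.15 is the one used in the code above; other values may suit other data better.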

4. Finally, convert the categorical features to dummy variables!

A dummy variable represents categorical data as columns of 0 and 1, one column per category. Reference: [How to create dummy variables and precautions](https://newtechnologylifestyle.net/%E3%83%80%E3%83%9F%E3%83%BC%E5%A4%89%E6%95%B0%E3%81%AE%E4%BD%9C%E3%82%8A%E6%96%B9%E3%81%A8%E6%B3%A8%E6%84%8F%E7%82%B9%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/)

all_data = pd.get_dummies(all_data)
print(all_data.shape)

output


(2917, 220)
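The jump in the number of columns comes from every remaining string column being expanded into one column per category. A minimal sketch on a made-up frame (the frame `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "MSZoning": ["RL", "RM", "RL"],
})

# Numeric columns are left as they are; string columns become 0/1 indicator columns
dummies = pd.get_dummies(toy)
print(list(dummies.columns))
print(dummies["MSZoning_RL"].astype(int).tolist())
```

Depending on the pandas version, the indicator columns are stored as integers or booleans, which is why the example casts them with astype(int) before printing.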

5. Preprocessing completed

This completes the data preprocessing!

Let's split the combined data back into training and test data!

train = all_data[:ntrain]
test = all_data[ntrain:]
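The whole combine, preprocess, split pattern can be checked on toy data. The sketch below is only an illustration under made-up values (the names `train_toy`, `test_toy` and `n_train` are assumptions); it shows that both parts end up with the same columns even when a category appears only in the test data.

```python
import pandas as pd

train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "RL"], "SalePrice": [200, 150, 210]})
test_toy = pd.DataFrame({"MSZoning": ["RM", "FV"]})  # 'FV' never appears in train_toy

n_train = train_toy.shape[0]
y = train_toy["SalePrice"].values  # keep the target separately

# Combine, drop the target, and preprocess both sets in one go
combined = pd.concat((train_toy, test_toy)).reset_index(drop=True)
combined = combined.drop(columns=["SalePrice"])
combined = pd.get_dummies(combined)

# Split by position; iloc makes the positional slicing explicit
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
print(train_part.shape, test_part.shape)
print(list(train_part.columns) == list(test_part.columns))
```

One trade-off of this pattern is that statistics such as medians and modes are computed with the test rows included. That is common practice in competitions, but fitting the preprocessing on the training data alone is the more cautious choice when an unbiased validation score matters.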

Summary

This time, we focused on **data preprocessing** for Kaggle's house price prediction competition! With about 80 features it was quite daunting at first, but I think I handled it properly by following the steps below!

Data preprocessing procedure

  1. **Integration of train and test data**
  2. **Missing value processing**
  3. **Categorical data processing**

In the next article, we will actually train a model and make predictions!!

Thank you for reading!!
