Previous article: [Python] First Data Analysis / Machine Learning (Kaggle) ~ Part2 ~. Following on from that, this time we take on the relatively approachable Kaggle competition "House Prices: Advanced Regression Techniques"!
The goal of this competition is to predict the price of a house from variables describing it. However, there are 80 such variables, and I suddenly got scared ... (laughs)
While wondering "Can I really pull this off?", I once again borrowed the wisdom of my predecessors! lol Reference code ↓↓↓
The general flow is as follows.
And in this article, we'll focus on **feature engineering**!
#Library import
import numpy as np #linear algebra
import pandas as pd #Data processing, csv file operation
%matplotlib inline
import matplotlib.pyplot as plt # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # Ignore unnecessary warnings (from sklearn and seaborn)
from scipy import stats
from scipy.stats import norm, skew #Statistical manipulation
pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x)) #Decimal point setting
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8")) # List the files in the input directory
output
data_description.txt
sample_submission.csv
test.csv
train.csv
Data frame creation
#Data acquisition, data frame creation
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
Let's take a look at the data frame!
##Displaying data frames
train.head(5)
#Number of samples and features
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))
#Save the 'Id' columns
train_ID = train['Id']
test_ID = test['Id']
#ID is unnecessary in the prediction process, so delete it
train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)
#Check again if the ID has disappeared
print("\nThe train data size after dropping Id feature is : {} ".format(train.shape))
print("The test data size after dropping Id feature is : {} ".format(test.shape))
output
The train data size before dropping Id feature is : (1460, 81)
The test data size before dropping Id feature is : (1459, 80)
The train data size after dropping Id feature is : (1460, 80)
The test data size after dropping Id feature is : (1459, 79)
The training data has 1460 rows and 80 features, and the test data has 1459 rows and 79 features!
Eh, **way too many features!!** How am I supposed to analyze this?
**For now, let's deal with the missing values and skewness, and convert the categorical data to numbers**!
Let's start with the missing values! As a rule of thumb, *features with a missing-value rate of 15% or more can simply be dropped!*
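As a rough sketch of that rule of thumb (we won't actually use this below; df is a toy DataFrame made up purely for illustration):
import pandas as pd
import numpy as np

# Toy DataFrame just for illustration; the column names are made up
df = pd.DataFrame({
    'mostly_missing': [1.0, np.nan, np.nan, np.nan],
    'complete':       [1.0, 2.0, 3.0, 4.0],
})

missing_ratio = df.isnull().sum() / len(df)              # missing rate per column (0.0 - 1.0)
cols_to_drop = missing_ratio[missing_ratio >= 0.15].index
df = df.drop(columns=cols_to_drop)                       # only 'complete' survives
print(df.columns.tolist())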
To **handle the missing values in one go** for both the training data (train) and the test data (test), we first **combine the two datasets**!
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
output
all_data size is : (2917, 79)
The combined data has 2917 rows and 79 features, so the integration went nicely!
To efficiently identify the features with missing values, we will process them in the following flow:
**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
**2. Visualize the missing rate**
**3. Decide, feature by feature, whether to delete it or fill in a value**
**1. Calculate the missing rate of every feature and put the features with a missing rate > 0 into a new data frame**
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]#Extract only variables that contain missing values
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})#Put it in the data frame.
missing_data.head(20)
output
**2. Visualize the missing rate**
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
**3. Decide, feature by feature, whether to delete it or fill in a value** We will now fill in values for the features with missing values shown in the graph above! It's a bit tedious because we go through each feature one by one, but let's do it!
- **PoolQC**: Mostly missing values. Fill the missing values with None, meaning "no pool"
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
- **MiscFeature**: According to the data description, NA means "no other features". Fill missing values with None
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
- **Alley**: According to the data description, NA means "no alley access". Fill missing values with None
all_data["Alley"] = all_data["Alley"].fillna("None")
- **Fence**: NA means "no fence". Fill missing values with None
all_data["Fence"] = all_data["Fence"].fillna("None")
- **FireplaceQu**: NA means "no fireplace". Fill missing values with None
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
- **LotFrontage**: Fill each missing value with the median LotFrontage of the neighboring houses. By the way, median() returns the median.
#Group by Neighborhood and fill missing values with the group's median LotFrontage
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
lambda x: x.fillna(x.median()))
- **GarageType, GarageFinish, GarageQual and GarageCond**: Fill missing values with None
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
- **GarageYrBlt, GarageArea and GarageCars**: Fill missing values with 0
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
- **BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, BsmtFullBath and BsmtHalfBath**: Fill missing values with 0
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF','TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
- **BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1 and BsmtFinType2**: Fill the missing values of these categorical features with None
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')
- **MasVnrArea and MasVnrType**: Fill MasVnrType with None and MasVnrArea with 0
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
- **MSZoning (the general zoning classification)**: 'RL' is by far the most common value, so fill the missing values with the mode, 'RL'.
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
- **Utilities**: Dropped because nearly every record has the same value, so it adds nothing to the model
all_data = all_data.drop(['Utilities'], axis=1)
- **Functional**: According to the data description, NA means typical, so fill with "Typ"
all_data["Functional"] = all_data["Functional"].fillna("Typ")
- **Electrical**: There is only one missing value; fill it with the mode, 'SBrkr'. (mode() returns the most frequent value)
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
- **KitchenQual**: Only one missing value; fill it with the mode
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
- **Exterior1st and Exterior2nd**: Fill the missing values with the mode
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
- **SaleType**: Fill with the mode
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
- **MSSubClass**: Fill NaN with None
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")
Check the remaining missing values
#Check if there are any missing values
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head()
No missing values remain. This completes the missing-value processing! There were a lot of them ...
Categorical data is data measured on a nominal or ordinal scale. Roughly speaking, it's non-numeric data!
If the categorical data is left as it is, it can't be analyzed or used for training, so we will convert it to numbers!
Ordinal-scale data is data where only the order is meaningful. For example, fast-food drink sizes "S, M, L" can be encoded as S → 0, M → 1, L → 2. One thing to note about ordinal data is that **you cannot meaningfully compute statistics such as the mean or standard deviation on it**.
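Here is a tiny illustration of ordinal encoding (the drink_size column is made up and is not part of this dataset):
import pandas as pd

# Hypothetical example, not part of the House Prices data
drinks = pd.DataFrame({'drink_size': ['S', 'M', 'L', 'M', 'S']})
size_order = {'S': 0, 'M': 1, 'L': 2}                      # the order carries the meaning
drinks['drink_size_encoded'] = drinks['drink_size'].map(size_order)
print(drinks)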
First, we convert the numeric features that are really categorical/ordinal into strings. (This is so that all the categorical data can be encoded together later.)
#MSSubClass=The building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)
#Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)
#Year and month sold are transformed into categorical features.
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
LabelEncoder() converts both ordinal and nominal data into numbers! Learn the categories with .fit(), then convert them to numbers with .transform(). Reference: How to use LabelEncoder of scikit-learn
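As a quick toy example of the fit/transform flow (the values below are made up just for illustration):
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
lbl.fit(['None', 'Fa', 'TA', 'Gd', 'Ex', 'TA'])   # learn the set of categories
print(lbl.classes_)                               # ['Ex' 'Fa' 'Gd' 'None' 'TA'] (sorted alphabetically)
print(lbl.transform(['TA', 'None', 'Ex']))        # [4 3 0]
Now let's apply it to our data: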
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
'ExterQual', 'ExterCond','HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
'YrSold', 'MoSold')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
output
Shape all_data: (2917, 78)
Since the total floor area is also important, we will add the sum of TotalBsmtSF, 1stFlrSF, and 2ndFlrSF as a new feature!
# Adding total sqfootage feature
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
In machine learning, it is often said that **accuracy improves when the data follows a normal distribution**! So first we check the **skewness** of the current data (how far it deviates from a normal distribution), and then **use a Box Cox transform to bring the data closer to a normal distribution!**
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
As expected, you can see several features that deviate strongly from a normal distribution!
skewness = skewness[abs(skewness['Skew']) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    #all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)
#all_data[skewed_features] = np.log1p(all_data[skewed_features])
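For reference, boxcox1p(x, lam) computes ((1 + x)**lam - 1) / lam (and log1p(x) when lam is 0), so a quick check shows what the transform does to a single value:
from scipy.special import boxcox1p
import numpy as np

x, lam = 100.0, 0.15
print(boxcox1p(x, lam))                    # Box Cox transform of 1 + x
print(((1.0 + x) ** lam - 1.0) / lam)      # same value, computed by hand
print(boxcox1p(x, 0.0), np.log1p(x))       # with lam = 0 it reduces to log1p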
If you use dummy variables, each categorical value gets its own column of 0s and 1s. Reference: [How to create dummy variables and precautions](https://newtechnologylifestyle.net/%E3%83%80%E3%83%9F%E3%83%BC%E5%A4%89%E6%95%B0%E3%81%AE%E4%BD%9C%E3%82%8A%E6%96%B9%E3%81%A8%E6%B3%A8%E6%84%8F%E7%82%B9%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/)
all_data = pd.get_dummies(all_data)
print(all_data.shape)
output
(2917, 220)
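To see concretely what get_dummies does, here is a toy example (the color and size columns are made up and not part of this dataset):
import pandas as pd

toy = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})
print(pd.get_dummies(toy))
# The 'color' column is replaced by indicator columns 'color_blue' and 'color_red';
# the numeric 'size' column is left unchanged.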
This completes the data preprocessing!
Let's split the combined learning and test data!
train = all_data[:ntrain]
test = all_data[ntrain:]
This time, we focused on **data preprocessing** for Kaggle's house price prediction competition! With 80 features it felt quite daunting at first, but I think I managed to handle it properly by following the steps below!
Data preprocessing procedure
**1. Combine the train and test data**
**2. Handle missing values**
**3. Handle categorical data**
In the next article, we will actually train a model and make predictions!!
Thank you for reading!!