The first step of data analysis: looking over the data and pre-processing it (this time, mainly looking it over)
We use the data from House Prices: Advanced Regression Techniques, a Kaggle competition for learners. The task is to predict house prices from data about the houses.
House Prices: Advanced Regression Techniques
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Check what kind of data is included
!ls ../input/house-prices-advanced-regression-techniques
↓↓↓
data_description.txt sample_submission.csv test.csv train.csv
`train.csv` is the training data.
`test.csv` is the test data.
`sample_submission.csv` is an example of the format in which answers should be submitted.
`data_description.txt` is the description of each column.
Next, read each CSV file into a pandas.DataFrame with `pd.read_csv`.
TEST_PATH = "../input/house-prices-advanced-regression-techniques/test.csv"
TRAIN_PATH = "../input/house-prices-advanced-regression-techniques/train.csv"
SUBMISSION_PATH = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"
test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)
sample_submission = pd.read_csv(SUBMISSION_PATH)
Let's take a look at the contents of the training data. The first 5 rows of the table are displayed below.
train_data.head()
↓↓↓
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape ... MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg ... NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg ... NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 ... NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 ... NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 ... NaN 0 12 2008 WD Normal 250000
5 rows × 81 columns
There are 81 columns, which is a lot. Note that SalePrice is missing from the test data, because predicting SalePrice is the task.
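To confirm which columns exist only in the training data, the column indexes can be compared directly. The following is a minimal sketch on toy frames; `toy_train` and `toy_test` and their values are hypothetical stand-ins for `train_data` and `test_data`, not the real files.

```python
import pandas as pd

# Hypothetical miniature versions of train_data and test_data.
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11250, 9550]})

# Columns that appear in the training frame but not in the test frame.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()
print(only_in_train)
```

Running the same comparison on the real `train_data` and `test_data` should list only the target column.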
Well, I checked only the first 5 rows, but how many rows are there in total?
print(len(train_data), len(test_data))
1460 1459
Well, you can see that the training data consists of 1460 rows and 81 columns.
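Rows and columns can also be read in one step from the `shape` attribute. This is a small sketch on a hypothetical toy frame (`toy` is not part of the original notebook):

```python
import pandas as pd

# Hypothetical toy frame with 3 rows and 2 columns.
toy = pd.DataFrame({"Id": [1, 2, 3], "SalePrice": [208500, 181500, 223500]})

# shape is a (rows, columns) tuple; len() returns only the row count.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)
```

Applied to the real data, `train_data.shape` would report the row and column counts together.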
Now we have a grasp of the size of the data. Looking at the table above, there are some missing values (NaN). Let's also check how many missing values there are in each column.
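`isnull()` determines whether each value is missing (it returns bool values, that is, `True` or `False`), and `sum()` computes the total number of missing values by treating `True` as 1 and `False` as 0. `count()` counts the non-missing values. Because the result of `isnull()` itself contains no missing values, `count()` here returns the total number of rows, so dividing the two gives the fraction of missing values per column.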
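The counting logic is easier to see on a tiny example. The sketch below uses a hypothetical four-row frame (`toy`, with made-up values) and walks through each intermediate result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: PoolQC has 3 missing values out of 4 rows.
toy = pd.DataFrame(
    {
        "PoolQC": [np.nan, np.nan, "Gd", np.nan],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

mask = toy.isnull()        # boolean frame: True where a value is missing
total = mask.sum()         # True counts as 1, so this is the missing count
n_rows = mask.count()      # mask has no NaN, so this is the row count
percent = total / n_rows   # fraction of missing values per column

print(pd.concat([total, percent], axis=1, keys=["Total", "Percent"]))

# Taking the mean of the boolean mask yields the same fraction directly.
shortcut = mask.mean()
```

Note that `toy["PoolQC"].count()`, called on the original column rather than on the mask, would count only the single non-missing value.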
total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
↓↓↓
Total Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000
Now, there are two possible ways to handle missing values: fill them in with some other value, or delete the rows or columns that contain them. For example, looking at PoolQC in the top row, more than 99% of its values are missing, so let's delete the column as follows.
train_data = train_data.drop('PoolQC',axis=1)
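The other options mentioned above can be sketched in the same way. The example below is only an illustration on a hypothetical toy frame; filling with the median and the 0.5 threshold are assumptions chosen for the demo, not choices made in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with made-up values.
toy = pd.DataFrame(
    {
        "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
        "LotFrontage": [65.0, np.nan, 68.0, 60.0],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# Option 1: fill missing values (here with the column median, an assumption).
filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())

# Option 2: delete the rows where a particular column is missing.
rows_dropped = toy.dropna(subset=["LotFrontage"])

# Option 3: delete every column whose missing fraction exceeds a threshold.
THRESHOLD = 0.5  # hypothetical cut-off
missing_ratio = toy.isnull().mean()
cols_to_drop = missing_ratio[missing_ratio > THRESHOLD].index
cols_dropped = toy.drop(columns=cols_to_drop)

print(list(cols_dropped.columns))
```

Which option is appropriate depends on the column; a threshold-based drop simply generalizes the single `drop('PoolQC', axis=1)` call to several columns at once.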
Checking the missing values in the same way as before to confirm that the column is really gone:
Total Percent
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000
Sure enough, the top row (PoolQC) is gone. Next, let's look at the data not only as a data frame but also in graphs. (That is what matplotlib.pyplot and seaborn were imported for.)
This post has gotten long, so that's for next time!