[PYTHON] First step of data analysis (number of data, table display, missing values)

The first step of data analysis is to look over the data and pre-process it (this time, mainly looking over). We use data from House Prices: Advanced Regression Techniques, a Kaggle practice competition. The data describes houses, and the task is to predict their sale prices.

Check the contents of the data

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

Check what kind of data is included

!ls ../input/house-prices-advanced-regression-techniques

↓↓↓

data_description.txt  sample_submission.csv  test.csv  train.csv

train.csv is the training data, test.csv is the test data, sample_submission.csv is an example of the format in which answers should be submitted, and data_description.txt is the description of each column.

Next, read each CSV file into a pandas.DataFrame with pd.read_csv

TEST_PATH = "../input/house-prices-advanced-regression-techniques/test.csv"
TRAIN_PATH = "../input/house-prices-advanced-regression-techniques/train.csv"
SUBMISSION_PATH = "../input/house-prices-advanced-regression-techniques/sample_submission.csv"

test_data = pd.read_csv(TEST_PATH)
train_data = pd.read_csv(TRAIN_PATH)
sample_submission = pd.read_csv(SUBMISSION_PATH)

Let's take a look at the contents of the training data. head() displays the first 5 rows of the table, as shown below

train_data.head()

↓↓↓


	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	...	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	        RL	        65.0	    8450	Pave	NaN  	Reg	        ... NaN	        0	    2	    2008	WD	        Normal	        208500
1	2	20	        RL	        80.0	    9600	Pave	NaN	    Reg	        ... NaN	        0	    5	    2007	WD	        Normal	        181500
2	3	60	        RL	        68.0	    11250	Pave	NaN 	IR1	        ... NaN	        0	    9	    2008	WD	        Normal	        223500
3	4	70	        RL	        60.0	    9550	Pave	NaN 	IR1	        ... NaN	        0	    2	    2006	WD	        Abnorml	        140000
4	5	60	        RL	        84.0	    14260	Pave	NaN 	IR1	        ... NaN	        0	    12	    2008	WD	        Normal	        250000
5 rows × 81 columns

There are 81 columns, which is quite a lot. By the way, SalePrice is missing from the test data, because predicting SalePrice is the task.
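One way to confirm which column the test data lacks is to compare the column names of the two frames. The following is only a sketch: toy_train and toy_test are hypothetical stand-ins for train_data and test_data, with made-up values, so that it runs without the Kaggle files.

```python
import pandas as pd

# Hypothetical stand-ins for train_data and test_data (not the real Kaggle files).
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "LotArea": [8450, 9600],
    "SalePrice": [208500, 181500],
})
toy_test = pd.DataFrame({
    "Id": [3, 4],
    "LotArea": [11250, 9550],
})

# Column names present in the training frame but absent from the test frame.
only_in_train = toy_train.columns.difference(toy_test.columns)
print(list(only_in_train))
```

With the real data, the same call on train_data.columns and test_data.columns should list the target column.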

So far we have only looked at the first 5 rows, but how many rows are there in total?

print(len(train_data), len(test_data))
1460 1459

So the training data consists of 1460 rows and 81 columns, and the test data has 1459 rows.
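The row and column counts can also be read in one step from the shape attribute of a DataFrame. A minimal sketch, using a hypothetical three-row frame named toy in place of train_data:

```python
import pandas as pd

# Hypothetical small frame standing in for train_data.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SalePrice": [208500, 181500, 223500],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len() on a DataFrame returns the number of rows only.
print(len(toy))
```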

Checking for missing values

Now we know how much data there is. Looking at the table above, some values are missing (shown as NaN). Let's also check how many missing values there are in each column.

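isnull() determines whether each value is missing and returns the result as bool values (True or False). sum() then calculates the total number of missing values per column, treating True = 1 and False = 0. count() counts the non-missing values; since the result of isnull() contains no missing values itself, train_data.isnull().count() gives the number of rows for every column, and dividing by it gives the ratio of missing values.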
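To see what each of these calls returns on something small enough to check by hand, here is a sketch with a hypothetical two-column frame named toy (the names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" has 2 missing values, column "b" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": ["x", "y", "z", "w"],
})

mask = toy.isnull()         # DataFrame of bool: True where a value is missing
n_missing = mask.sum()      # True counts as 1, so this is the missing count per column
n_rows = mask.count()       # mask has no NaN of its own, so this is the row count
ratio = n_missing / n_rows  # fraction of missing values per column

print(ratio)
```

As a side note, toy.isnull().mean() gives the same ratios in a single call, because the mean of a bool column is the fraction of True values.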


total = train_data.isnull().sum().sort_values(ascending=False)
percent = (train_data.isnull().sum()/train_data.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

↓↓↓


	         Total	Percent
PoolQC	     1453	0.995205
MiscFeature	 1406	0.963014
Alley	     1369	0.937671
Fence	     1179	0.807534
FireplaceQu	 690	0.472603
LotFrontage	 259	0.177397
GarageCond	 81	    0.055479
GarageType	 81	    0.055479
GarageYrBlt	 81	    0.055479
GarageFinish 81 	0.055479
GarageQual	 81 	0.055479
BsmtExposure 38 	0.026027
BsmtFinType2 38 	0.026027
BsmtFinType1 37 	0.025342
BsmtCond	 37 	0.025342
BsmtQual	 37 	0.025342
MasVnrArea	 8  	0.005479
MasVnrType	 8  	0.005479
Electrical	 1  	0.000685
Utilities	 0  	0.000000

Now, possible ways to handle missing values are to fill them in with some other value, or to delete the rows or columns that contain them. For example, PoolQC in the top row is missing in about 99.5% of the rows, so let's delete that column as follows.
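The options mentioned here (filling, deleting rows, deleting columns) can be sketched as follows. This is an illustration under assumptions, not a recommendation for this dataset: the frame toy is hypothetical, its values are made up, and both the median fill and the 0.5 threshold are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are borrowed from the dataset, values are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Option 1: fill the missing values with something else (here, the column median).
filled = toy.copy()
filled["LotFrontage"] = filled["LotFrontage"].fillna(filled["LotFrontage"].median())

# Option 2: delete the rows that have a missing value in a given column.
rows_dropped = toy.dropna(subset=["LotFrontage"])

# Option 3: delete the columns whose missing ratio exceeds a threshold.
threshold = 0.5  # arbitrary cutoff chosen for this sketch
missing_ratio = toy.isnull().mean()
cols_dropped = toy.drop(columns=missing_ratio[missing_ratio > threshold].index)

print(filled["LotFrontage"].tolist())
print(len(rows_dropped))
print(list(cols_dropped.columns))
```

Which option is appropriate depends on the column: deleting a row also discards its other values, while filling changes the distribution of the filled column.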

train_data = train_data.drop('PoolQC',axis=1)
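Note that drop returns a new DataFrame instead of modifying the original, which is why the result is assigned back to train_data above. A small sketch with a hypothetical frame named toy shows the difference:

```python
import pandas as pd

# Hypothetical toy frame standing in for train_data.
toy = pd.DataFrame({"Id": [1, 2], "PoolQC": [None, None]})

# drop returns a new frame; toy itself is left untouched.
dropped = toy.drop("PoolQC", axis=1)

print("PoolQC" in toy.columns)      # the original still has the column
print("PoolQC" in dropped.columns)  # the returned frame does not
```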

Check the missing values in the same way as before to confirm that the column is really gone:


	         Total	Percent
MiscFeature	 1406	0.963014
Alley	     1369	0.937671
Fence	     1179	0.807534
FireplaceQu	 690	0.472603
LotFrontage	 259	0.177397
GarageCond	 81	    0.055479
GarageType	 81	    0.055479
GarageYrBlt	 81	    0.055479
GarageFinish 81 	0.055479
GarageQual	 81 	0.055479
BsmtExposure 38 	0.026027
BsmtFinType2 38 	0.026027
BsmtFinType1 37 	0.025342
BsmtCond	 37 	0.025342
BsmtQual	 37 	0.025342
MasVnrArea	 8  	0.005479
MasVnrType	 8  	0.005479
Electrical	 1  	0.000685
Utilities	 0  	0.000000

Sure enough, the top row (PoolQC) has disappeared. Next, I would like to look at the data not only as a data frame but also as graphs. (That is why I imported matplotlib.pyplot and seaborn.)

This post has gotten long, so that will be next time!
