1. Introduction

This is my third attempt at the Kaggle housing analysis. In my previous attempts the score stayed around 0.17, and it did not improve even when I changed the model.

This time, I followed CRISP-DM as a standard process.

CRISP-DM (CRoss-Industry Standard Process for Data Mining) is a process model advocated by Shearer et al.

Standard processes for data analysis include CRISP-DM and KDD. KDD focuses more on the data analysis part than CRISP-DM does (an explanation of KDD is omitted this time).

The CRISP-DM process proceeds in the following order: (1) business understanding → (2) data understanding → (3) data preparation → (4) modeling → (5) evaluation → (6) deployment.

Figure 1: CRISP-DM

I would like to share what I thought about at each of these steps. This is Part 1; I will cover the process over several posts.

2. Business understanding

The challenge in this competition is to predict the price of a home. So I imagined what factors would affect the price of a house.

==================== My guesses ====================

"Location": generally, close to urban areas and train stations, convenient transportation, luxury homes

"House size": site area, number of floors, building size

"Extras": comes with a pool, tennis court, etc.

"New construction" or "used": I feel this is quite important (how much does age matter?)

"Quality": I think the quality of the materials is an important factor

These factors are hard to spell out precisely, but I think thinking about them is very important for prediction.
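As a rough sketch, the guesses above can be tied to concrete columns of train.csv (the full column list appears in section 3). The grouping below is only my assumption for illustration, and `FACTOR_COLUMNS` and `find_factor_columns` are hypothetical names, not part of the original analysis.

```python
# Hypothetical mapping from the guessed price factors to column names.
# The grouping is an assumption made for illustration; the names themselves
# are taken from the column list of train.csv.
FACTOR_COLUMNS = {
    "location": ["Neighborhood", "MSZoning", "Condition1", "Condition2"],
    "house size": ["LotArea", "GrLivArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF"],
    "extras": ["PoolArea", "PoolQC", "Fireplaces", "GarageCars", "MiscFeature"],
    "new or used": ["YearBuilt", "YearRemodAdd", "YrSold"],
    "quality": ["OverallQual", "OverallCond", "ExterQual", "KitchenQual"],
}


def find_factor_columns(columns, factor_columns=FACTOR_COLUMNS):
    """Return, per factor, the hypothesized columns that are actually present."""
    available = set(columns)
    return {
        factor: [name for name in names if name in available]
        for factor, names in factor_columns.items()
    }
```

Once the data is loaded, calling `find_factor_columns(df_train.columns)` would show which of the guessed factors have matching columns to examine first.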

3. Data understanding

Now, at last, let's look at the Kaggle data.

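# 1-1. Read data
import pandas as pd

# Load the training and test sets from the Kaggle input directory
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
# Show the first five rows
df_train.head()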

Output result

	Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
0	1	60	RL	65.0	8450	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	2	2008	WD	Normal	208500
1	2	20	RL	80.0	9600	Pave	NaN	Reg	Lvl	AllPub	...	NaN	NaN	NaN	5	2007	WD	Normal	181500
2	3	60	RL	68.0	11250	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	9	2008	WD	Normal	223500
3	4	70	RL	60.0	9550	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	2	2006	WD	Abnorml	140000
4	5	60	RL	84.0	14260	Pave	NaN	IR1	Lvl	AllPub	...	NaN	NaN	NaN	12	2008	WD	Normal	250000
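Several columns in the rows above (Alley, PoolQC, Fence, MiscFeature) show NaN, so a natural next check is how many values are missing per column. The following is a minimal sketch, not the method used in this series; `missing_summary` is a hypothetical helper name, and the small `sample` frame copies only a few columns from the five rows shown above so the example runs without the Kaggle files.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count missing values per column, keeping only columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


# A few columns copied from the five rows shown above, for illustration only.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "LotFrontage": [65.0, 80.0, 68.0, 60.0, 84.0],
    "Alley": [np.nan] * 5,
    "PoolQC": [np.nan] * 5,
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})
print(missing_summary(sample))
```

On the real data, `missing_summary(df_train)` would give the same kind of table for all 81 columns.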

# 1-2. Check data structure
print(df_train.shape)
print(df_test.shape)
df_train.columns

Output result

(1460, 81)
(1459, 80)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

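The training data has 81 columns: the target SalePrice plus 80 others. Counting Id, that gives 80 explanatory variables (79 if the Id column is excluded). The test data has the same 80 columns without SalePrice.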
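To make the column count concrete, here is a minimal sketch of separating the target from the explanatory variables and grouping them by type. It is not the preprocessing of the next post; `split_features_target` and `count_by_dtype` are hypothetical helper names, and the small `sample` frame copies a few columns from the first five rows of train.csv so the example runs on its own.

```python
import pandas as pd


def split_features_target(df, target="SalePrice", id_col="Id"):
    """Separate the target column and drop the identifier from the features."""
    y = df[target]
    X = df.drop(columns=[target, id_col])
    return X, y


def count_by_dtype(X):
    """Split feature names into numeric and non-numeric (categorical) groups."""
    numeric = X.select_dtypes(include="number").columns.tolist()
    categorical = X.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


# A few columns copied from the first five rows of train.csv, for illustration.
sample = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "MSSubClass": [60, 20, 60, 70, 60],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

X, y = split_features_target(sample)
numeric, categorical = count_by_dtype(X)
print(len(X.columns), numeric, categorical)
```

Applied to the full `df_train`, the same split would leave 79 feature columns (81 minus Id and SalePrice).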

That is all for this post, for reasons of space. Next time, we will finally move on to data preprocessing.

[PYTHON] Kaggle ~ Housing Analysis ③ ~ Part1
