[PYTHON] [Hands-on for beginners] Read kaggle's "Predicting Home Prices" line by line (Part 5: Dummy categorical variables)

Theme

This is the fifth installment of my notes on a hands-on session in which everyone tackles Kaggle's famous "House Prices" problem. It is more of a memo than a commentary, but I hope it helps someone somewhere. I'd like to think the end is finally coming into view.

Today's work

Dummy encoding of categorical variables

Roughly speaking, this replaces string values with numbers: each category of each string column becomes its own indicator column.

import pandas as pd  # alldata itself comes from the earlier parts of this series

# List the categorical features (columns whose dtype is object)
cat_cols = alldata.dtypes[alldata.dtypes == 'object'].index.tolist()
# List the numerical features (columns of every other dtype)
num_cols = alldata.dtypes[alldata.dtypes != 'object'].index.tolist()
# List the columns required for data splitting and submission
other_cols = ['Id', 'WhatIsData']
# Remove the extra elements from the feature lists
cat_cols.remove('WhatIsData')  # flag that distinguishes training data from test data
num_cols.remove('Id')  # Id is not a feature
# Convert the categorical variables to dummy variables
alldata_cat = pd.get_dummies(alldata[cat_cols])
# Integrate the data
all_data = pd.concat([alldata[other_cols], alldata[num_cols], alldata_cat], axis=1)

List features of categorical variables

Oh, several operations are stacked on top of each other here, which makes the line look a little mysterious. So I will just look at the output of the whole expression at once. It takes the index labels (that is, the column names) of only those columns whose dtype is object and puts them in a list.

cat_cols = alldata.dtypes[alldata.dtypes=='object'].index.tolist()

[Screenshot of the output]
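To unpack the stacked operations, here is a minimal sketch that runs the same expression one step at a time. The DataFrame `toy` and its values are made up for illustration and are not the real Kaggle table. The string columns are created with dtype=object explicitly so that the sketch does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical toy data standing in for alldata (values are invented).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
    "WhatIsData": pd.Series(["Train", "Train", "Test"], dtype=object),
})

# Step 1: dtypes is a Series that maps each column name to its dtype.
dtypes = toy.dtypes

# Step 2: a boolean mask that is True where the dtype is object.
is_object = dtypes == "object"

# Step 3: keep only the True entries and take their index (the column names).
cat_cols = dtypes[is_object].index.tolist()
print(cat_cols)
```

In this toy case the flag column WhatIsData ends up in the list as well, because it holds strings too.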

List the features of numerical variables

num_cols = alldata.dtypes[alldata.dtypes!='object'].index.tolist()

This works the same way as listing the categorical features, only with != instead of ==, so I will omit the explanation.
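As a cross-check, pandas also provides DataFrame.select_dtypes, which expresses the same split by dtype. The sketch below compares the two spellings on a made-up frame; the alternative is shown only for comparison and is not what the code in this series uses.

```python
import pandas as pd

# Hypothetical toy data (values are invented).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450.0, 9600.0, 11250.0],
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
})

# The approach used in this article: every column whose dtype is not object.
num_cols = toy.dtypes[toy.dtypes != "object"].index.tolist()

# Alternative spelling: let pandas select the columns by dtype.
num_cols_alt = toy.select_dtypes(exclude=["object"]).columns.tolist()
cat_cols_alt = toy.select_dtypes(include=["object"]).columns.tolist()

print(num_cols, num_cols_alt, cat_cols_alt)
```

Note that `!= 'object'` treats every non-object column as numerical, so boolean or datetime columns, if any were present, would land in num_cols too.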

List columns required for data splitting and submission

other_cols = ['Id','WhatIsData']

As you can see, the list includes the column added in Part 2. These names are removed from the feature lists in the next step, and alldata[other_cols] is joined back in at the final concatenation.

Remove extra elements from the list

This removes the unnecessary elements from the lists. The earlier output confirms that cat_cols did contain an item called WhatIsData.

cat_cols.remove('WhatIsData') #Training data / test data distinction flag removal
num_cols.remove('Id') #Id removal
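One detail worth knowing when re-running a notebook cell: list.remove changes the list in place, returns None, and raises ValueError if the value is no longer there. A small sketch with made-up list contents:

```python
# Hypothetical list contents, for illustration only.
cat_cols = ["Street", "WhatIsData", "Alley"]

# remove() mutates the list in place and returns None.
result = cat_cols.remove("WhatIsData")
print(cat_cols)
print(result)

# Removing the same value a second time fails, because it is already gone.
try:
    cat_cols.remove("WhatIsData")
except ValueError:
    second_removal_succeeded = False
else:
    second_removal_succeeded = True
print(second_removal_succeeded)
```

So if the removal cell is executed twice without rebuilding cat_cols first, the second run stops with an error.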

Dummy categorical variables

alldata_cat = pd.get_dummies(alldata[cat_cols])

It feels a little strange. It is so convenient that you just pass the data to a function and it does everything for you... I like this side of Python.

Here is the output of alldata_cat = pd.get_dummies(alldata[cat_cols]). Amazing, it really has changed.

[Screenshot of the output]
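To make the change concrete, here is a minimal sketch of pd.get_dummies on a tiny made-up frame. The column values are invented for illustration; only the function call itself comes from the code above.

```python
import pandas as pd

# Hypothetical toy data with two string columns (values are invented).
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
    "Alley": pd.Series(["Grvl", "Pave", "Grvl"], dtype=object),
})

# Each distinct value of each column becomes its own indicator column,
# named <original column>_<value>.
dummies = pd.get_dummies(toy)
print(dummies.columns.tolist())
print(dummies)
```

Every row has exactly one "on" entry per original column. Depending on the pandas version, the indicator values are shown as True/False or as 1/0.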

Data integration

all_data = pd.concat([alldata[other_cols],alldata[num_cols],alldata_cat],axis=1)

This is exactly what it looks like: concat joins alldata[other_cols], alldata[num_cols], and alldata_cat side by side, column-wise (axis=1). (I've reached the point where I can say "it's just what it looks like.")
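A minimal sketch of what axis=1 does, using three small made-up frames in place of the real pieces. It also shows that concat matches rows by index label rather than by position, which works here because all three pieces are cut from the same alldata.

```python
import pandas as pd

# Hypothetical stand-ins for alldata[other_cols], alldata[num_cols], alldata_cat.
other = pd.DataFrame({"Id": [1, 2, 3]})
nums = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
cats = pd.DataFrame({"Street_Grvl": [0, 0, 1], "Street_Pave": [1, 1, 0]})

# axis=1 places the frames next to each other, matching rows by index label.
combined = pd.concat([other, nums, cats], axis=1)
print(combined)

# If the index labels differ, rows are not paired up by position;
# the result is the union of the labels, with NaN filling the gaps.
nums_shifted = nums.copy()
nums_shifted.index = [1, 2, 3]
misaligned = pd.concat([other, nums_shifted], axis=1)
print(misaligned)
```

The second result has four rows instead of three, a useful symptom to look for if a concatenated table ever comes out longer than expected.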

That's it.

I think this installment moved along at a good pace. Reading and understanding the code took less time than I expected, so it feels like I'm getting used to it. I will keep at it. Now that the data has been shaped, it's finally time to analyze it. I'm looking forward to it.
