I started studying programming (Python) around December 2018 and have been working on Kaggle for the last few months.
Along the way there were many moments when I wondered "how do I do this?", and I made progress by looking things up one at a time. This time I will focus on preprocessing and summarize what I learned.
The dataset is the familiar Titanic competition on Kaggle. https://www.kaggle.com/c/titanic
import matplotlib
import matplotlib.pyplot as plt
matplotlib.style.use('ggplot')
import seaborn as sns
import numpy as np
import pandas as pd
First, read the data.
df_train = pd.read_csv(r"C:///train.csv")
df_test = pd.read_csv(r"C:///test.csv")
print(df_train.shape)
print(df_test.shape)
The training data has shape (891, 12) and the test data has shape (418, 11).
Let's display the first five rows to see what kind of data it contains.
df_train.head()
df_test.head()
The data looks like this.

As df_test.head() shows, the test data has the same columns as the training data except that the target variable "Survived" is absent.
Let's look at the summary information for the training and test data together.
df_train.info()
print("-"*48)
df_test.info()
This gives a rough view of the number of non-null entries and the data type of each column.
 
df_train.describe()
 
This displays summary statistics for the numeric columns.
df_train.describe(include=['O'])
 
For the categorical variables Name, Sex, Ticket, Cabin, and Embarked, this displays the count, the number of unique values, the most frequent category, and how often that category occurs.
The next step is important to me because I stumbled over it so much in the early days.
In the end, the model is trained on the training data and applied to the test data separately. However, repeating preprocessing such as missing value handling and categorical variable encoding for each dataset is cumbersome, so we combine the two first and split them again later.
# Create a new column called TrainFlag: True for training data, False for test data.
df_train["TrainFlag"] = True
df_test["TrainFlag"] = False
# Combine training and test data.
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead.
df_all = pd.concat([df_train, df_test])
# PassengerId is probably not useful as a feature, so I want to remove it from the columns.
# However, it is needed when submitting predictions for the test data later,
# so keep it as the index instead of deleting it completely.
df_all = df_all.set_index("PassengerId")
If you look at df_all now, it looks like this. The index is PassengerId, and on the far right is the TrainFlag column we just added. True marks training data and False (not shown here) marks test data.
 
Next, count the missing values in each column and sort the counts in descending order.
df_all.isnull().sum().sort_values(ascending=False)
 
With this few variables the full list is still readable, but as the number of variables grows it becomes hard to scan the missing value counts of every explanatory variable.
So let's narrow the list down to the variables that actually have missing values and sort those in descending order.
# Build the mask from df_all; a mask built from df_train would hide values missing only in the test data.
df_all.isnull().sum()[df_all.isnull().sum() > 0].sort_values(ascending=False)
 
Now only the variables with missing values are listed, in descending order! Survived appears only because the 418 test rows have no label, so it needs no processing. Fare has a single missing value, which comes from the test data.
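When there are many columns, it can also help to see the share of missing rows next to the raw count. The following is a minimal sketch and not part of the original workflow: missing_summary is a hypothetical helper, and the toy DataFrame stands in for df_all with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    # Keep only columns with at least one missing value, largest first
    counts = counts[counts > 0].sort_values(ascending=False)
    return pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })


# Toy data standing in for df_all (values are made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same call would be missing_summary(df_all).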
◆Cabin According to df_all.shape, the combined training and test data has 1,309 rows. Cabin is missing in 1,014 of them, so this time I will exclude the whole column from the analysis and skip missing value processing for it.
◆Age Age also has missing values, but far fewer. I will not go into the details this time, but age seems likely to affect the model, so we will fill in its missing values.
There are several ways to do this; this time I take the orthodox approach and fill them with the mean.
df_all["Age"] = df_all["Age"].fillna(df_all["Age"].mean())
◆Embarked Running df_all.describe(include=['O']) shows that Embarked has only 3 unique values and that most rows are "S", so this time we fill the missing values with "S".
 
df_all["Embarked"] = df_all["Embarked"].fillna("S")
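df_all.isnull().sum()[df_all.isnull().sum() > 0].sort_values(ascending=False)
Age and Embarked no longer appear. Cabin remains but will be dropped below, and Survived is missing only for the unlabeled test rows. Fare still has one missing value from the test data, which must also be filled before modeling; apart from that, missing value processing is complete.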
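The single missing Fare in the test data can be handled in the same spirit as Age. The following is a minimal sketch, assuming the median is an acceptable fill value (that choice is my assumption, and the mean used for Age would work the same way). The toy frame stands in for df_all and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_all; the values are made up for illustration
df_all = pd.DataFrame(
    {"Fare": [7.25, 71.28, np.nan, 8.05], "TrainFlag": [True, True, False, False]},
    index=pd.Index([1, 2, 892, 893], name="PassengerId"),
)

# Fill the missing Fare with the median of the known fares (an assumed choice)
fare_median = df_all["Fare"].median()
df_all["Fare"] = df_all["Fare"].fillna(fare_median)

# Confirm that no missing Fare values remain
print(df_all["Fare"].isnull().sum())
```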
I will omit the detailed examination this time, but based on data analysis I assume that Cabin, Name, PassengerId, and Ticket are unnecessary for building this model.
Let's remove these columns. PassengerId was already moved to the index, so only the other three need to be dropped.
df_all = df_all.drop(["Cabin", "Name", "Ticket"], axis=1)
Next, convert the remaining categorical variables to dummy variables.
df_all = pd.get_dummies(df_all, drop_first=True)
If you check with df_all.head(), you can see that the categorical variables have been encoded like this.
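To make the effect of drop_first=True concrete, here is a small sketch on made-up data (the toy frame is an assumption for illustration, not the Titanic data itself). Each categorical column becomes one column per category, except that the first category in sorted order is dropped, since it is implied when all the other columns are zero.

```python
import pandas as pd

# Toy data with two categorical columns and one numeric column
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Age": [22.0, 38.0, 26.0],
})

# Only the object columns are encoded; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Here "female" and "C" are the dropped categories, so a row with Sex_male equal to 0 is female, and a row with both Embarked_Q and Embarked_S equal to 0 embarked at C.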

That completes this thoroughly orthodox preprocessing; after this, we move on to building the model in earnest.
This is very basic material for intermediate users and above, but at first it was hard to make progress while looking each of these things up, and it stressed me out every time.
I hope this helps people in the same situation deepen their understanding.
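Before model building, the combined data has to be divided back into training and test data using TrainFlag. The following is a minimal sketch on a toy stand-in for the preprocessed df_all (its values and the names X_train, y_train, and X_test are my assumptions, not from the steps above).

```python
import pandas as pd

# Toy stand-in for the preprocessed df_all; the values are made up
df_all = pd.DataFrame(
    {
        "Survived": [0.0, 1.0, float("nan"), float("nan")],
        "Age": [22.0, 38.0, 34.5, 47.0],
        "Sex_male": [1, 0, 1, 0],
        "TrainFlag": [True, True, False, False],
    },
    index=pd.Index([1, 2, 892, 893], name="PassengerId"),
)

# Rows flagged True are training data; the flag itself is not a feature
df_train = df_all[df_all["TrainFlag"]].drop("TrainFlag", axis=1)
# Test rows have no label, so drop Survived as well
df_test = df_all[~df_all["TrainFlag"]].drop(["TrainFlag", "Survived"], axis=1)

# Separate features and target for model building
y_train = df_train["Survived"].astype(int)
X_train = df_train.drop("Survived", axis=1)
X_test = df_test
print(X_train.shape, X_test.shape)
```

Because PassengerId was kept as the index, X_test.index still holds the IDs needed for the submission file.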