When analyzing data with machine learning, it is rare to take a freshly obtained CSV file and jump straight into model building with
"OK! Let's do regression!"
"OK! Let's classify!"
Rather, the hurdles before that point are quite high for beginners.
So, this time I have put together a summary of data preprocessing.
I borrowed the data from Kaggle's Titanic competition. By the way, I am using Jupyter. (If you know a good way to embed Jupyter input and output in a Qiita article, please let me know ...)
In[1]
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('./train.csv')
df = df.set_index('PassengerId') # Set a unique column as the index
print(df.shape)
df.head()
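By the way, read_csv can also set the index in one step; a minimal equivalent sketch (my own addition, not in the original notebook):
df = pd.read_csv('./train.csv', index_col='PassengerId')  # index_col replaces the separate set_index call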
In[2]
df = df.drop(['Name', 'Ticket'], axis=1) #Drop columns that are not needed for analysis
df.head()
In[3]
print(df.info())
#print(df.dtypes) # Use this if you only want to check the dtypes
df.isnull().sum(axis=0)
#df.isnull().any(axis=0) # Use this to check only whether any nulls are present
out[3]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB
None
Survived 0
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
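As an extra check (my own addition), the missing-value ratio per column is also easy to see:
# Share of missing values per column, largest first
print((df.isnull().sum() / len(df)).sort_values(ascending=False))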
In[4]
#Count the distinct values of the nominal-scale (categorical) columns
import collections
c = collections.Counter(df['Sex'])
print('Sex:',c)
c = collections.Counter(df['Cabin'])
print('Cabin:',len(c))
c = collections.Counter(df['Embarked'])
print('Embarked:',c)
out[4]
Sex: Counter({'male': 577, 'female': 314})
Cabin: 148
Embarked: Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
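By the way, pandas can do the same counting without collections; a minimal equivalent sketch using value_counts and nunique:
print(df['Sex'].value_counts(dropna=False))       # frequency of each category, including NaN
print(df['Embarked'].value_counts(dropna=False))
print(df['Cabin'].nunique())                      # number of distinct non-NaN values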
In[5]
df = df.drop(['Cabin'], axis=1) # Dropped because it seems hard to use for analysis
df = df.dropna(subset = ['Embarked']) # Embarked has only a few missing values, so drop those rows with dropna
df = df.fillna(method = 'ffill') # Fill the remaining column (Age) forward from the previous row
print(df.isnull().any(axis=0))
df.shape
out[5]
Survived False
Pclass False
Sex False
Age False
SibSp False
Parch False
Fare False
Embarked False
dtype: bool
(889, 8)
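Forward fill is only one option for Age; a common alternative (shown here as a sketch, not what I did above) is to fill with the median:
df['Age'] = df['Age'].fillna(df['Age'].median())  # median imputation instead of forward fill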
Convert the object (string) type columns to numeric type using label encoding.
In[6]
from sklearn.preprocessing import LabelEncoder
for column in ['Sex','Embarked']:
    le = LabelEncoder()
    le.fit(df[column])
    df[column] = le.transform(df[column])
df.head()
You can see that Sex and Embarked have been label encoded.
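If you want to map the numbers back to the original labels later, it helps to keep each encoder around; a small sketch assuming the loop above is rewritten to store them in a dict:
encoders = {}
for column in ['Sex', 'Embarked']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])  # fit_transform combines fit and transform
    encoders[column] = le

print(encoders['Sex'].classes_)                          # e.g. ['female' 'male'] -> 0, 1
print(encoders['Sex'].inverse_transform(df['Sex'][:5]))  # recover the original strings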
Once label encoded, you can also get a rough overview of the data with seaborn's pairplot and the like.
sns.pairplot(df);
I think it is better to select only the continuous variables and look at those. (Strictly speaking, some of the columns below are not continuous variables ...)
df_continuous = df[['Age','SibSp','Parch','Fare']]
sns.pairplot(df_continuous);
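A correlation heatmap is another quick way to get an overview of these columns (my own addition):
sns.heatmap(df_continuous.corr(), annot=True, cmap='coolwarm')  # pairwise correlations with values shown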
Next, apply one-hot encoding to the nominal-scale data, including columns that are numeric but not truly ordinal. I didn't know how to use scikit-learn's OneHotEncoder well, so I used pandas' get_dummies.
In[7]
df = pd.get_dummies(df, columns = ['Pclass','Embarked'])
df.head()
You can see that Pclass and Embarked have been one-hot encoded.
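For completeness, here is roughly how scikit-learn's OneHotEncoder could be used in place of get_dummies; this is an unverified sketch that assumes it runs instead of In[7] (the sparse flag is called sparse_output in newer scikit-learn and sparse in older versions, and get_feature_names_out requires scikit-learn >= 1.0):
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn
encoded = ohe.fit_transform(df[['Pclass', 'Embarked']])  # dense 0/1 array
encoded_df = pd.DataFrame(encoded,
                          columns=ohe.get_feature_names_out(['Pclass', 'Embarked']),
                          index=df.index)
df = pd.concat([df.drop(['Pclass', 'Embarked'], axis=1), encoded_df], axis=1)
df.head()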
I think this flow is common to many data analysis tasks, so I hope it serves as a reference for preprocessing.
Comments, ideas for articles, etc. are welcome.