When analyzing data with machine learning, it is rare to take a freshly obtained CSV file and jump straight into model building with
"OK! Let's do regression!"
"OK! Let's classify!"
Rather, the hurdles before that point are quite high for beginners.
So, this time I have put together a summary of data preprocessing.
I borrowed the data from Kaggle's Titanic competition. By the way, I am using Jupyter. (If you know a good way to embed Jupyter input and output in a Qiita article, please let me know ...)
In[1]
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
df = pd.read_csv('./train.csv')
df = df.set_index('PassengerId') # Set a unique column as the index
print(df.shape)
df.head()
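By the way, read_csv can also set the index in one step; a minimal equivalent sketch (my own addition, not in the original notebook):
df = pd.read_csv('./train.csv', index_col='PassengerId')  # index_col replaces the separate set_index call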
In[2]
df = df.drop(['Name', 'Ticket'], axis=1) #Drop columns that are not needed for analysis
df.head()
In[3]
print(df.info())
#print(df.dtypes) # Use this if you only want to check the dtypes
df.isnull().sum(axis=0)
#df.isnull().any(axis=0) # Use this to check only whether any nulls are present
out[3]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 9 columns):
Survived 891 non-null int64
Pclass 891 non-null int64
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 69.6+ KB
None
Survived 0
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
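As an extra check (my own addition), the missing-value ratio per column is also easy to see:
# Share of missing values per column, largest first
print((df.isnull().sum() / len(df)).sort_values(ascending=False))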
In[4]
#Count the distinct values of the nominal-scale (categorical) columns
import collections
c = collections.Counter(df['Sex'])
print('Sex:',c)
c = collections.Counter(df['Cabin'])
print('Cabin:',len(c))
c = collections.Counter(df['Embarked'])
print('Embarked:',c)
out[4]
Sex: Counter({'male': 577, 'female': 314})
Cabin: 148
Embarked: Counter({'S': 644, 'C': 168, 'Q': 77, nan: 2})
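By the way, pandas can do the same counting without collections; a minimal equivalent sketch using value_counts and nunique:
print(df['Sex'].value_counts(dropna=False))       # frequency of each category, including NaN
print(df['Embarked'].value_counts(dropna=False))
print(df['Cabin'].nunique())                      # number of distinct non-NaN values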
In[5]
df = df.drop(['Cabin'], axis=1) # Dropped because it seems hard to use for analysis
df = df.dropna(subset = ['Embarked']) # Embarked has only a few missing values, so drop those rows with dropna
df = df.fillna(method = 'ffill') # Fill the remaining column (Age) forward from the previous row
print(df.isnull().any(axis=0))
df.shape
out[5]
Survived False
Pclass False
Sex False
Age False
SibSp False
Parch False
Fare False
Embarked False
dtype: bool
(889, 8)
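Forward fill is only one option for Age; a common alternative (shown here as a sketch, not what I did above) is to fill with the median:
df['Age'] = df['Age'].fillna(df['Age'].median())  # median imputation instead of forward fill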
Convert the object (string) type columns to numeric type using label encoding.
In[6]
from sklearn.preprocessing import LabelEncoder
for column in ['Sex','Embarked']:
    le = LabelEncoder()
    le.fit(df[column])
    df[column] = le.transform(df[column])
df.head()
You can see that Sex and Embarked have been label encoded.
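If you want to map the numbers back to the original labels later, it helps to keep each encoder around; a small sketch assuming the loop above is rewritten to store them in a dict:
encoders = {}
for column in ['Sex', 'Embarked']:
    le = LabelEncoder()
    df[column] = le.fit_transform(df[column])  # fit_transform combines fit and transform
    encoders[column] = le

print(encoders['Sex'].classes_)                          # e.g. ['female' 'male'] -> 0, 1
print(encoders['Sex'].inverse_transform(df['Sex'][:5]))  # recover the original strings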
Once label encoded, you can also get a rough overview of the data with seaborn's pairplot and the like.
sns.pairplot(df);
I think it is better to select only the continuous variables and look at those. (Strictly speaking, some of the columns below are not continuous variables ...)
df_continuous = df[['Age','SibSp','Parch','Fare']]
sns.pairplot(df_continuous);
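A correlation heatmap is another quick way to get an overview of these columns (my own addition):
sns.heatmap(df_continuous.corr(), annot=True, cmap='coolwarm')  # pairwise correlations with values shown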
Next, apply one-hot encoding to the nominal-scale data, including columns that are numeric but not truly ordinal. I didn't know how to use scikit-learn's OneHotEncoder well, so I used pandas' get_dummies.
In[7]
df = pd.get_dummies(df, columns = ['Pclass','Embarked'])
df.head()
You can see that Pclass and Embarked have been one-hot encoded.
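For completeness, here is roughly how scikit-learn's OneHotEncoder could be used in place of get_dummies; this is an unverified sketch that assumes it runs instead of In[7] (the sparse flag is called sparse_output in newer scikit-learn and sparse in older versions, and get_feature_names_out requires scikit-learn >= 1.0):
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False)  # use sparse=False on older scikit-learn
encoded = ohe.fit_transform(df[['Pclass', 'Embarked']])  # dense 0/1 array
encoded_df = pd.DataFrame(encoded,
                          columns=ohe.get_feature_names_out(['Pclass', 'Embarked']),
                          index=df.index)
df = pd.concat([df.drop(['Pclass', 'Embarked'], axis=1), encoded_df], axis=1)
df.head()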
I think this flow is common to many data analysis tasks, so I hope it serves as a reference for preprocessing.
Comments, ideas for articles, etc. are welcome.