[Python] First data analysis / machine learning (Kaggle)


Just before starting my first job (spring break of my fourth year at university), I suddenly decided I wanted to become a data scientist, so I started working on Kaggle as a first step.

This time, I worked on **Titanic**, the tutorial problem of **Kaggle**.

I have experience with statistical analysis from my research, but I don't understand machine learning at all, so I decided to study by following the code of an extremely skilled Kaggler!

By the way, here is the reference: ✔︎Introduction to Ensembling/Stacking in Python. This is the notebook listed as "Most voted" under Notebooks (as of March 10, 2020).

I also referred to this article, which itself draws on the code above lol: Ensemble (Stacking) Learning & Machine Learning Tutorial in Kaggle with Python

Introduction

The Titanic competition that we are working on this time provides data on passengers, such as age, gender, number of family members on board, ticket class, and whether they survived.

Based on that data, the work proceeds as follows: **Data preprocessing → Data visualization → Stacking model construction → Prediction on test data → Evaluation**

Finally, the model built from the training data predicts whether each passenger in the test data survived. How accurate this prediction is determines the score.
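The Titanic competition scores a submission by accuracy, that is, the share of test passengers whose survival is predicted correctly. Here is a minimal sketch of that calculation. The arrays `y_true` and `y_pred` are made-up values for illustration, not competition data.

```python
import numpy as np

# Hypothetical labels: 1 = survived, 0 = died
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = share of passengers whose label was predicted correctly
accuracy = (y_true == y_pred).mean()
print(accuracy)
```

Four of the five made-up predictions match, so the accuracy here is 0.8.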

Since there is a lot to cover, this article focuses on **"preprocessing" and "data visualization"**!

Model building is covered in the next article! Click here for the sequel ↓ ↓ ↓ [Python] First Data Analysis / Machine Learning (Kaggle) ~ Part2 ~

Now, let's start preprocessing the data!!

Data preprocessing

Library import

Roughly, these are the libraries used this time:

--Mathematical and statistical processing: numpy, pandas
--Regular expression operations (searching or replacing words or numbers in a specified form): re
--A collection of machine learning models: sklearn
--Gradient boosting decision tree algorithm: XGBoost
--Graphs and figures: matplotlib, seaborn, plotly
--Show / hide warnings: warnings
--Five base models
  - RandomForestClassifier
  - AdaBoostClassifier
  - GradientBoostingClassifier
  - ExtraTreesClassifier
  - SVC
--Splitting the data into folds for cross-validation of these 5 models: KFold
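KFold itself does not train anything. It only splits the row indices into training and validation folds, and each model is then fitted fold by fold. A minimal sketch with toy data, assuming a recent scikit-learn where `KFold` lives in `sklearn.model_selection` (the array `X` is made up):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features each

kf = KFold(n_splits=5)  # no shuffling, so the folds are consecutive blocks
folds = list(kf.split(X))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample lands in the validation part of exactly one fold, which is what makes out-of-fold predictions possible in stacking.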


import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
import plotly.graph_objs as go
import plotly.tools as tls

import warnings

#5 models
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC

#K-fold cross-validation (splits the data into folds)
from sklearn.model_selection import KFold

Data acquisition


train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

#Keep the PassengerId column of the test data for later use
PassengerId = test['PassengerId']


[Screenshot: output]

Data description

--PassengerId: Passenger ID
--Survived: Survival flag (survived: 1, died: 0)
--Pclass: Ticket class
--Name: Passenger's name
--Sex: Gender
--Age: Age
--SibSp: Number of siblings / spouses on board
--Parch: Number of parents / children on board
--Fare: Fare
--Cabin: Room number
--Embarked: Port of embarkation

Feature engineering

We will process the acquired data so that it can be easily analyzed. This **preprocessing seems to be quite important** in machine learning, so bear with me, because this part is long!!

Basically, all missing values are filled in and all character data is converted to numerical data.
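As a small illustration of this idea, the sketch below fills a missing value and then maps string categories to integers with `fillna` and `map`. The mini data frame `df` is made up for the example; only the column names mirror the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical mini data set with a missing value and string categories
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "C"],
})

# Fill the missing value first, then map each string to an integer code
df["Embarked"] = df["Embarked"].fillna("S")
df["Sex"] = df["Sex"].map({"female": 0, "male": 1}).astype(int)
df["Embarked"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)

print(df)
print(df.isnull().sum().sum())  # number of remaining missing values
```

Filling before mapping matters: `map` leaves NaN as NaN, and `astype(int)` would then fail.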


full_data = [train, test]

#Passenger name length
train['name_length'] = train['Name'].apply(len)
test['name_length'] = test['Name'].apply(len)

#1 if there is room number data, 0 if there is a missing value
train['Has_Cabin'] = train['Cabin'].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test['Cabin'].apply(lambda x: 0 if type(x) == float else 1)

#Define family size as "number of siblings / spouses" + "number of parents / children" + 1 (the passenger)
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

#IsAlone is 1 if the passenger has no family on board
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

#Fill missing values of the port of embarkation with 'S', the most common port
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')

#Fill missing fares with the median fare of the training data
#Divide the fare into 4 groups
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

#Fill in missing ages before dividing age into 5 groups
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    #List of random values to put in missing values
    #Random integers within one standard deviation of the mean
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size = age_null_count)
    #Use .loc so the values are written to the DataFrame itself, not to a copy
    dataset.loc[dataset['Age'].isnull(), 'Age'] = age_null_random_list
    #Convert data to int type
    dataset['Age'] = dataset['Age'].astype(int)

#Equal-width bins, consistent with the 16 / 32 / 48 / 64 boundaries used below
train['CategoricalAge'] = pd.cut(train['Age'], 5)

#Define get_title, a function that extracts the title (Mr, Miss, ...) from a name
def get_title(name):
    title_search = re.search(r'([A-Za-z]+)\.', name)
    #If a title is found, take it out and return it
    if title_search:
        return title_search.group(1)
    return ""

#Apply get_title to create the Title feature
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)

    #Group rare titles into 'Rare' and unify spelling variants
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    #0 for women, 1 for men
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)

    #Map the 5 title groups to numbers
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

    #Map the 3 ports of embarkation to numbers
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

    #Divide the fare into 4 groups
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']                               = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare']                                  = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

    #Divide age into 5 groups
    dataset.loc[ dataset['Age'] <= 16, 'Age']                          = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age']                           = 4

#Remove unnecessary features
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)
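
The two less obvious steps in the code above are the random fill of missing ages and the binning. The sketch below replays both on a made-up `age` series. The fixed seed and the variable names are assumptions for illustration, not part of the original notebook. It also contrasts `pd.cut` (bins of equal width) with `pd.qcut` (bins with roughly equal counts).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed (an assumption, for reproducibility)

# Hypothetical ages with two missing values
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 80.0])

# Draw one random integer per missing value from [mean - std, mean + std)
low = int(age.mean() - age.std())
high = int(age.mean() + age.std())
fill_values = rng.randint(low, high, size=age.isnull().sum())

filled = age.copy()
filled.loc[filled.isnull()] = fill_values
filled = filled.astype(int)

# pd.cut: bins of equal width; pd.qcut: bins holding roughly equal counts
equal_width = pd.cut(filled, 5)
equal_count = pd.qcut(filled, 4)

print(filled.tolist())
print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

Printing the bin edges this way is how fixed thresholds like the fare and age boundaries in the code above can be read off before hard-coding them.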

Coding techniques I learned

--How to use the lambda function
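As a reminder of what the lambda does in the Has_Cabin step, here is a minimal sketch on a made-up `cabin` series. A lambda is a small anonymous function, and `apply` calls it once per element. The names `cabin_flag` and `vectorized` are hypothetical helpers added for comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin column: NaN (a float) marks a missing room number
cabin = pd.Series(["C85", np.nan, "E46", np.nan])

# Lambda version, as in the preprocessing code
has_cabin = cabin.apply(lambda x: 0 if type(x) == float else 1)

# The same logic written as a named function
def cabin_flag(x):
    return 0 if type(x) == float else 1

named = cabin.apply(cabin_flag)

# A vectorized alternative that avoids the per-element function call
vectorized = cabin.notnull().astype(int)

print(has_cabin.tolist())  # [1, 0, 1, 0]
```

All three versions give the same flags. The lambda is convenient when the function is short and used only once.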

Data visualization

Preprocessing is finally over!! Let's check that all the data is now numerical!


[Screenshot: the preprocessed data]

Pearson Correlation Heatmap

Check the correlation between features on the heat map.


colormap = plt.cm.RdBu
plt.figure(figsize = (14,12))
plt.title('Pearson Correlation of Features', y = 1.05, size = 15)
sns.heatmap(train.astype(float).corr(), linewidths=0.1, vmax=1.0, square = True, cmap=colormap, linecolor='white', annot=True)

[Screenshot: Pearson correlation heat map]

From this plot, we can see that most features are not strongly correlated with each other.

Weakly correlated features carry little redundant information → there are few useless features → this is important for building a learning model. (Parch and FamilySize have a relatively strong correlation, but I leave them as they are.)
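To make the numbers in the heat map concrete, the sketch below computes the Pearson coefficient two ways on made-up passengers: with `DataFrame.corr()`, which uses Pearson by default, and directly from the definition. The data values are hypothetical. FamilySize is derived from Parch, which is why the two tend to be correlated.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; FamilySize is derived from SibSp and Parch
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3, 1, 0],
    "Parch": [0, 0, 2, 1, 5, 0],
})
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# DataFrame.corr() uses the Pearson coefficient by default
corr = df.corr()
print(corr.round(2))

# The same coefficient from its definition:
# sum of products of deviations, divided by the root of the product of squared deviations
x, y = df["Parch"], df["FamilySize"]
dx, dy = x - x.mean(), y - y.mean()
manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())
print(manual)
```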

Pair Plot

A pair plot shows how the data is distributed for each pair of features.


g = sns.pairplot(train[['Survived', 'Pclass', 'Sex', 'Age', 'Parch', 'Fare', 'Embarked',
       'FamilySize', 'Title']], hue='Survived', palette = 'seismic', height=1.2, diag_kind = 'kde', diag_kws=dict(fill=True), plot_kws=dict(s=10) )

[Screenshot: pair plot]


In this article, I worked on Kaggle's tutorial "Titanic." The flow was as follows:

  1. Import library
  2. Data acquisition
  3. Data preprocessing
     --Fill in missing values (insert random values close to the average, etc.)
     --Divide data into groups (binning)
     --Convert categorical data to numbers
  4. Data visualization
     --Pearson Correlation Heatmap
     --Pair plot

It has been quite difficult so far, but the actual model building starts from here, so I will keep doing my best!!
