[PYTHON] Machine learning starting from scratch (machine learning learned with Kaggle)

Target audience and purpose of this page

Target audience

・ Those who know the outline of machine learning
・ Or those who have read "Machine learning from scratch (overview of machine learning)"

Purpose

・ Understand Kaggle
・ Understand the actual flow of machine learning
・ Practice using Kaggle tutorials
・ Practice using scikit-learn

Agenda

What is Kaggle?
Why Kaggle?
Kaggle Tutorial (Titanic: Machine Learning from Disaster)
Which data to use?
Data preprocessing
Data used
Learning method to use
Learning execution and cross-validation
Optimization
Machine learning flow
Introducing some corporate contests

This page is a re-edited version of the presentation. If you would like to see the original presentation, please click here. https://www.edocr.com/v/vlzyelxe/tflare/Kaggle_-Machine-learning-to-learn-at-Kaggle

1. What is Kaggle?

To summarize, at the risk of oversimplifying: Kaggle is a site where companies and researchers post data science and machine learning problems, and participants compete to solve them. Some competitions offer prize money. (Participants can also publish and explain their solution code, and discuss it through comments.)

2. Why Kaggle?

・ Books and similar explanations often use datasets prepared for explanation, so it is hard to get a feel for real work.
・ You can understand the actual flow of machine learning, because you have to carry out even the steps that books tend to omit.
・ The rankings are motivating. (You can compete and collaborate with data analysts around the world.)
・ Prize money is offered (some competitions award as much as $1.5 million).

3. Kaggle Tutorial (Titanic: Machine Learning from Disaster)

Predict whether passengers survived the sinking of the Titanic.
・ Training data (891 rows x 12 columns, CSV): some data is missing
・ Test data (418 rows x 11 columns, CSV): some data is missing
・ Learn from the training data, then predict survival for the test data.

4. Which data to use?

・ PassengerId: Sequential number assigned to each row
・ Survived: Survival (0 = No, 1 = Yes); exists only in the training data
・ Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
・ Name: Name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings and spouses aboard the Titanic
・ Parch: Number of parents and children aboard the Titanic
・ Ticket: Ticket number
・ Fare: Passenger fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

`Execution code`


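import numpy as np
import pandas as pd

# Load the Kaggle Titanic CSV files, reading Age as float64.
train = pd.read_csv("train.csv", dtype={"Age": np.float64})
test  = pd.read_csv("test.csv", dtype={"Age": np.float64})

# Show the first 10 rows of the training data.
train.head(10)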

`Execution code`


# Correlation between numeric columns (text columns such as Name are skipped).
train_corr = train.corr(numeric_only=True)
train_corr

5. Data preprocessing

It appears that the columns other than PassengerId can be used. Some columns cannot be used for analysis in their current form, so convert them to usable (numeric) data. Some values are also missing, so fill them in.
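
One way to see which columns need this treatment is to count the missing values per column and list the columns that are not yet numeric. The following is a minimal sketch on a small made-up frame; the values are hypothetical and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of some Titanic columns (values made up for illustration).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "Q"],
    "Sex": ["male", "female", "female", "male"],
})

# Count missing values per column; a nonzero count means the column needs filling.
missing_counts = sample.isnull().sum()
print(missing_counts)

# List the columns that are not numeric and would need numeric encoding.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

Running the same two calls on the real `train` and `test` frames should show where filling and encoding are needed.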

`Execution code`


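def correct_data(titanic_data):
    # Fill missing ages with the median age.
    titanic_data.Age = titanic_data.Age.fillna(titanic_data.Age.median())

    # Encode sex as a number: male = 0, female = 1.
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1]).astype(int)

    # Fill missing embarkation ports with "S", then encode: C = 0, S = 1, Q = 2.
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2]).astype(int)

    # Fill missing fares with the median fare.
    titanic_data.Fare = titanic_data.Fare.fillna(titanic_data.Fare.median())

    return titanic_data

# Note: correct_data modifies the DataFrame it receives in place.
train_data = correct_data(train)
test_data  = correct_data(test)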

`Execution code`


# Recompute the correlations; Sex and Embarked are now numeric and included.
train_corr = train.corr(numeric_only=True)
train_corr

6. Data used

This time, we will use the following items.
・ Ticket class
・ Sex
・ Age
・ Number of siblings and spouses aboard the Titanic
・ Number of parents and children aboard the Titanic
・ Passenger fare
・ Port of embarkation

7. Learning method to use

・ Logistic regression
・ Support vector machine
・ K-nearest neighbors
・ Decision tree
・ Random forest
・ Neural network

Reference: for details on the learning methods, see the following book.
"Machine Learning Starting with Python: Fundamentals of Feature Engineering and Machine Learning Learned with scikit-learn" https://www.oreilly.co.jp/books/9784873117980/

8. Learning execution and cross-validation

Specify the data and learning method.

`Execution code`


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import cross_val_score

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

models = []

models.append(("LogisticRegression",LogisticRegression()))
models.append(("SVC",SVC()))
models.append(("LinearSVC",LinearSVC()))
models.append(("KNeighbors",KNeighborsClassifier()))
models.append(("DecisionTree",DecisionTreeClassifier()))
models.append(("RandomForest",RandomForestClassifier()))
models.append(("MLPClassifier",MLPClassifier(solver='lbfgs', random_state=0)))

Perform cross-validation.

Cross-validation splits the dataset into several parts (here, 3). Each part is used in turn as test data while the remaining parts are used as training data. Evaluating every split and combining the results gives a more stable estimate of accuracy.

[Figure: Machine_learning_to_learn_at_Kaggle_key.png]
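
The splitting described above can be made visible with a small sketch. It uses six hypothetical samples and plain `KFold` without shuffling, which is an assumption made for readability: when `cross_val_score` is given an integer `cv` and a classifier, scikit-learn uses stratified folds by default, so the actual index assignment on the Titanic data will differ.

```python
import numpy as np
from sklearn.model_selection import KFold

# Six hypothetical samples, identified only by their index.
X = np.arange(6).reshape(-1, 1)

# Split into 3 folds without shuffling; each fold serves as the test set once.
kfold = KFold(n_splits=3)
splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in kfold.split(X)
]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each row contributes to the evaluation exactly once.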

`Execution code`


results = []
names = []
for name,model in models:
    result = cross_val_score(model, train_data[predictors], train_data["Survived"],  cv=3)
    names.append(name)
    results.append(result)

The scores from the three splits are averaged for evaluation. Random forest gave the best result.

`Execution code`


for i in range(len(names)):
    print(names[i],results[i].mean())

LogisticRegression 0.785634118967
SVC 0.687991021324
LinearSVC 0.58810325477
KNeighbors 0.701459034792
DecisionTree 0.766554433221
RandomForest 0.796857463524
MLPClassifier 0.785634118967

Train a random forest on the training data, make predictions for the test data, and submit the result as a CSV file.

`Execution code`


alg = RandomForestClassifier()
alg.fit(train_data[predictors], train_data["Survived"])

predictions = alg.predict(test_data[predictors])

submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": predictions
    })

submission.to_csv('submission.csv', index=False)

9. Optimization

Accuracy

The accuracy was 0.74163, which ranked 7043rd out of 7922 participants. That is a little disappointing, so I'll optimize it.

Hyperparameter optimization

Grid search tries every combination of the candidate hyperparameter values and selects the best one automatically. Note, however, that it can take a long time to run.
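
A rough sense of the cost can be had by counting the combinations before running anything. The sketch below assumes the same candidate values as the grid used in this article and uses scikit-learn's `ParameterGrid` only to count them.

```python
from sklearn.model_selection import ParameterGrid

# Candidate values, assumed to match the grid used in this article.
parameters = {
    "n_estimators": [5, 10, 20, 30, 50, 100, 300],
    "max_depth": [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
    "random_state": [0],
}

# Number of hyperparameter combinations: 7 * 10 * 1.
n_combinations = len(ParameterGrid(parameters))

# With 3-fold cross-validation, each combination is trained 3 times.
n_folds = 3
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

With its default settings, `GridSearchCV` also refits the best combination once more on the full training data, so adding values to either list increases the run time quickly.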

`Execution code`


from sklearn.model_selection import GridSearchCV

# Candidate values for each hyperparameter; every combination is tried.
parameters = {
        'n_estimators'      : [5, 10, 20, 30, 50, 100, 300],
        'max_depth'         : [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
        'random_state'      : [0],
}
gsc = GridSearchCV(RandomForestClassifier(), parameters, cv=3)
gsc.fit(train_data[predictors], train_data["Survived"])

Let's apply the hyperparameters found by the grid search.

The accuracy rose to 0.77990, moving up to 4129th out of 7922 participants.
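
The article does not show the code for this step. One way it could look is sketched below, under stated assumptions: the data is synthetic (generated with `make_classification` as a stand-in for the seven Titanic features), and the grid is deliberately small so that the sketch runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Titanic features (7 columns, like `predictors`).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# A deliberately small grid so the sketch finishes quickly.
parameters = {"n_estimators": [5, 10], "max_depth": [3, 5], "random_state": [0]}
gsc = GridSearchCV(RandomForestClassifier(), parameters, cv=3)
gsc.fit(X_train, y_train)

# Best hyperparameter combination and its mean cross-validation score.
print(gsc.best_params_, gsc.best_score_)

# best_estimator_ has already been refit on all training data (refit=True by default).
predictions = gsc.best_estimator_.predict(X_test)
```

On the real data, `gsc.best_estimator_.predict(test_data[predictors])` would take the place of the earlier `alg.predict(...)` call when building the submission.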

Optimization by changing data preprocessing

When I published the code on Kaggle, I received a comment saying that it is better to compute the fill values for missing data from the test data rather than from the training data. I tried it. The modified code is shown below.

`Execution code`


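def correct_data(train_data, test_data):

    # Fill missing values in the training data with medians computed from the test data.
    train_data.Age = train_data.Age.fillna(test_data.Age.median())
    train_data.Fare = train_data.Fare.fillna(test_data.Fare.median())

    # Fill missing values in the test data with its own medians.
    test_data.Age = test_data.Age.fillna(test_data.Age.median())
    test_data.Fare = test_data.Fare.fillna(test_data.Fare.median())

    # Apply the encoding steps shared by both datasets.
    train_data = correct_data_common(train_data)
    test_data = correct_data_common(test_data)

    return train_data, test_data

def correct_data_common(titanic_data):
    # Encode sex as a number: male = 0, female = 1.
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1]).astype(int)
    # Fill missing embarkation ports with "S", then encode: C = 0, S = 1, Q = 2.
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2]).astype(int)

    return titanic_data

# Reload train.csv and test.csv first: the earlier correct_data already filled
# these columns in place, so nothing would be left to fill here.
train_data, test_data = correct_data(train, test)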

The accuracy rose to 0.79426, moving up to 2189th out of 7922 participants.

Plan for further optimization

・ Analyze the names. (They contain titles such as Mr., Mrs., and Miss, which may help with prediction.)

・ Use a different learning method (e.g., XGBoost, LightGBM)
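
As a starting point for the name analysis, the title could be pulled out of each name with a regular expression. This is a minimal sketch, assuming names follow a "Surname, Title. Given names" pattern; the names below are hypothetical, and whether the extracted title actually improves the score is untested here.

```python
import pandas as pd

# Hypothetical names in an assumed "Surname, Title. Given names" format.
names = pd.Series([
    "Smith, Mr. John",
    "Smith, Mrs. Jane (Mary Brown)",
    "Jones, Miss. Anna",
    "Taylor, Master. Tom",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())
```

The resulting titles would still need to be converted to numbers, in the same way as Sex and Embarked, before being added to `predictors`.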

10. Machine learning flow

[Figure: Machine_learning_to_learn_at_Kaggle_key2.png]

Method selection: algorithm cheat sheet

[Figure: "Choosing the right estimator", scikit-learn 0.19.0 documentation]

Hyperparameter selection: grid search

Model evaluation: cross-validation

11. Introducing some corporate contests

Prudential Life Insurance Assessment
・ Can you make buying life insurance easier?
・ Calculate the risk level from the attributes of the life insurance applicant
・ Prize of $30,000
・ Already finished (the code can be viewed)
・ https://www.kaggle.com/c/prudential-life-insurance-assessment

Zillow Prize: Zillow’s Home Value Prediction (Zestimate)
・ Can you improve the algorithm that changed the world of real estate?
・ Predict the error between the Zestimate and the actual selling price, taking into account all the features of a home
・ Prize of $1.2 million
・ Ends after 4 months
・ https://www.kaggle.com/c/zillow-prize-1