[PYTHON] Machine learning starting from scratch (machine learning learned with Kaggle)

Target audience and purpose of this page

Target audience

・ Those who know the outline of machine learning
・ Or those who have read "Machine learning from scratch (overview of machine learning)"

Purpose

・ Understand Kaggle
・ Understand the actual flow of machine learning
・ Practice using the Kaggle tutorial
・ Practice using scikit-learn

Agenda

  1. What is Kaggle?
  2. Why Kaggle?
  3. Kaggle Tutorial (Titanic: Machine Learning from Disaster)
  4. Which data should we use?
  5. Data preprocessing
  6. Data to use
  7. Learning method to use
  8. Learning execution and cross-validation
  9. Optimization
  10. Machine learning flow
  11. Introducing some corporate contests

This page is a re-edited version of a presentation. The original presentation is available here: https://www.edocr.com/v/vlzyelxe/tflare/Kaggle_-Machine-learning-to-learn-at-Kaggle

1. What is Kaggle?

To put it simply, at the risk of some oversimplification: Kaggle is a site where companies and researchers post data science and machine learning problems for participants to solve. Some of the competitions carry prize money. Participants can also publish and explain the code they used, and there are features for communicating through comments and the like.

2. Why Kaggle?

・ Books and similar explanations often use datasets prepared for illustration, so it is hard to get a real feel for the work.
・ You have to carry out even the steps that books tend to skip, so you come to understand the actual flow of machine learning.
・ The ranking is motivating. (You can compete and collaborate with data analysts around the world.)
・ Prize money is available (some competitions award as much as $1.5 million).

3. Kaggle Tutorial (Titanic: Machine Learning from Disaster)

Predict whether each passenger survived the sinking of the Titanic.
・ Training data (CSV, 891 rows x 12 columns); some values are missing
・ Test data (CSV, 418 rows x 11 columns); some values are missing
・ Learn from the training data and predict survival for the passengers in the test data.

4. Which data should we use?

・ PassengerId: Sequential number assigned to each row
・ Survived: Survival (0 = No, 1 = Yes); exists only in the training data
・ Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
・ Name: Name
・ Sex: Gender
・ Age: Age
・ SibSp: Number of siblings and spouses aboard the Titanic
・ Parch: Number of parents and children aboard the Titanic
・ Ticket: Ticket number
・ Fare: Passenger fare
・ Cabin: Cabin number
・ Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

Execution code


import numpy as np
import pandas as pd
train = pd.read_csv("train.csv", dtype={"Age": np.float64}, )
test  = pd.read_csv("test.csv", dtype={"Age": np.float64}, )
train.head(10)

(Figure kaggle_titanic1.png: first 10 rows of the training data)

Execution code


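# numeric_only=True skips text columns such as Name and Ticket.
# It needs pandas 1.5 or later; on pandas 2.x, corr() without it raises an error on text columns.
train_corr = train.corr(numeric_only=True)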
train_corr

(Figure kaggle_titanic3.png: correlation matrix of the numeric columns)

5. Data preprocessing

It looks like every column other than PassengerId can be used. Some columns cannot be used for analysis in their current form, so convert them to usable (numeric) data. Some values are also missing, so fill them in.
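Before filling anything in, it can help to count the missing values in each column. The following is a minimal sketch on a small hand-made DataFrame; the rows are invented stand-ins, not the real Titanic data, and `sample` is a hypothetical name. On the real data, the same call would be made on `train` and `test`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Embarked": ["S", "C", None, "S"],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
missing_counts = sample.isnull().sum()
print(missing_counts)
```

Columns with a non-zero count are the ones that need a fill rule such as the median or the most common value.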

Execution code


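def correct_data(titanic_data):
    # Fill missing ages with the median age
    titanic_data.Age = titanic_data.Age.fillna(titanic_data.Age.median())

    # Encode sex as a number: male = 0, female = 1
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1])

    # Fill missing boarding ports with "S", then encode: C = 0, S = 1, Q = 2
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])

    # Fill missing fares with the median fare
    titanic_data.Fare = titanic_data.Fare.fillna(titanic_data.Fare.median())

    return titanic_data

# Note: correct_data modifies the DataFrame in place, so train_data and train are the same object
train_data = correct_data(train)
test_data  = correct_data(test)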

Execution code


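# numeric_only=True skips the remaining text columns (Name, Ticket, Cabin).
# It needs pandas 1.5 or later; on pandas 2.x, corr() without it raises an error on text columns.
train_corr = train.corr(numeric_only=True)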
train_corr

(Figure kaggle_titanic.png: correlation matrix after preprocessing)
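To make a correlation table like this easier to read, one option is to pull out only the Survived column and sort it. The following is a minimal sketch on invented toy data (not the real Titanic values); `toy` and `corr_with_target` are hypothetical names.

```python
import pandas as pd

# Toy data with already-numeric columns (values are made up for illustration).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 1, 3, 2, 3],
    "Sex":      [0, 1, 1, 0, 1, 0],
    "Fare":     [7.3, 71.3, 53.1, 8.1, 13.0, 30.0],
})

# Take the Survived column of the correlation matrix, drop its
# self-correlation, and sort from most positive to most negative.
corr_with_target = (
    toy.corr()["Survived"]
       .drop("Survived")
       .sort_values(ascending=False)
)
print(corr_with_target)
```

Values near +1 or -1 suggest a column that moves with survival; values near 0 suggest little linear relationship.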

6. Data to use

This time, we will use the following items.
・ Ticket class
・ Sex
・ Age
・ Number of siblings and spouses aboard the Titanic
・ Number of parents and children aboard the Titanic
・ Passenger fare
・ Port of embarkation

7. Learning method to use

・ Logistic regression
・ Support vector machine
・ K-nearest neighbors
・ Decision tree
・ Random forest
・ Neural network

Reference: for details on these learning methods, see the book "Machine Learning Starting with Python: Feature Engineering and Machine Learning Basics Learned with scikit-learn" (O'Reilly Japan). https://www.oreilly.co.jp/books/9784873117980/

8. Learning execution and cross-validation

Specify the data and learning method.

Execution code


from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import cross_val_score

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

models = []

models.append(("LogisticRegression",LogisticRegression()))
models.append(("SVC",SVC()))
models.append(("LinearSVC",LinearSVC()))
models.append(("KNeighbors",KNeighborsClassifier()))
models.append(("DecisionTree",DecisionTreeClassifier()))
models.append(("RandomForest",RandomForestClassifier()))
models.append(("MLPClassifier",MLPClassifier(solver='lbfgs', random_state=0)))

Perform cross-validation.

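In cross-validation, the dataset is split into several parts (here, 3), and each part takes a turn as the test data while the remaining parts are used for training. Evaluating every split and combining the scores gives a more stable estimate of accuracy than a single split.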

(Figure: diagram of 3-fold cross-validation)
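To make the idea concrete, here is a minimal sketch of roughly what a 3-fold run does internally, written as an explicit loop. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic set), and `manual_scores` and `library_scores` are hypothetical names. For classifiers, scikit-learn's `cross_val_score` with `cv=3` uses a stratified 3-fold split, so the sketch uses `StratifiedKFold` to match.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 7 features (not the Titanic data).
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Split into 3 folds; each fold serves as test data exactly once.
folds = StratifiedKFold(n_splits=3)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)

# The one-line equivalent, for comparison.
library_scores = cross_val_score(model, X, y, cv=3)
print(manual_scores, manual_scores.mean())
```

The mean of the three fold scores is the number used to compare models.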

Execution code


results = []
names = []
for name,model in models:
    result = cross_val_score(model, train_data[predictors], train_data["Survived"],  cv=3)
    names.append(name)
    results.append(result)
    

Average the three fold scores for each model and compare them. Random forest gave the best result.

Execution code


for name, result in zip(names, results):
    print(name, result.mean())
    
LogisticRegression 0.785634118967
SVC 0.687991021324
LinearSVC 0.58810325477
KNeighbors 0.701459034792
DecisionTree 0.766554433221
RandomForest 0.796857463524
MLPClassifier 0.785634118967

Train a random forest on the training data, make predictions on the test data, and save the result as a CSV file to submit to Kaggle.

Execution code


alg = RandomForestClassifier()
alg.fit(train_data[predictors], train_data["Survived"])

predictions = alg.predict(test_data[predictors])

submission = pd.DataFrame({
        "PassengerId": test_data["PassengerId"],
        "Survived": predictions
    })

submission.to_csv('submission.csv', index=False)
    

9. Optimization

Accuracy

The accuracy was 0.74163, which ranked 7043rd out of 7922 participants. That is a little disappointing, so let's optimize.

Optimization

Grid search optimizes hyperparameters automatically by trying every combination of the candidate values you specify. Note, however, that it can take a long time to run.

Execution code


from sklearn.model_selection import GridSearchCV

# Candidate values to try; grid search evaluates every combination
parameters = {
        'n_estimators'      : [5, 10, 20, 30, 50, 100, 300],
        'max_depth'         : [3, 5, 10, 15, 20, 25, 30, 40, 50, 100],
        'random_state'      : [0],
}
gsc = GridSearchCV(RandomForestClassifier(), parameters, cv=3)
gsc.fit(train_data[predictors], train_data["Survived"])

Let's apply the hyperparameters found by the grid search and submit again.
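The original does not show how the optimized result was applied. One way to do it is sketched below, on synthetic stand-in data with a deliberately small grid so it runs quickly; `search`, `small_grid`, and `best_model` are hypothetical names, and this is not necessarily how the submitted model was built.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=7, random_state=0)

# A deliberately small grid so the sketch runs in seconds.
small_grid = {
    "n_estimators": [5, 10],
    "max_depth": [3, 5],
    "random_state": [0],
}
search = GridSearchCV(RandomForestClassifier(), small_grid, cv=3)
search.fit(X, y)

# Best combination found and its mean cross-validated accuracy.
print(search.best_params_)
print(search.best_score_)

# With the default refit=True, best_estimator_ has already been retrained
# on all the data passed to fit(), so it can predict directly.
best_model = search.best_estimator_
predictions = best_model.predict(X[:5])
```

With the article's variables, the analogous step would be to call `gsc.best_estimator_.predict(test_data[predictors])` and write the submission file as before.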

The accuracy rose to 0.77990, which moved me up to 4129th out of 7922 participants.

(Figure: Kaggle leaderboard screenshot for Titanic: Machine Learning from Disaster)

Optimization by changing data preprocessing

When I published the code on Kaggle, I received a comment suggesting that the values used to fill in missing data would be better computed from the test data than from the training data. I tried it. The modified code is shown below.

Execution code


def correct_data(train_data, test_data):
    
    # Fill missing values in the training data using the test-data medians as well
    train_data.Age = train_data.Age.fillna(test_data.Age.median())
    train_data.Fare = train_data.Fare.fillna(test_data.Fare.median())
    
    test_data.Age = test_data.Age.fillna(test_data.Age.median())
    test_data.Fare = test_data.Fare.fillna(test_data.Fare.median())    
    
    train_data = correct_data_common(train_data)
    test_data = correct_data_common(test_data)    

    return train_data,  test_data

def correct_data_common(titanic_data):
    titanic_data.Sex = titanic_data.Sex.replace(['male', 'female'], [0, 1])
    titanic_data.Embarked = titanic_data.Embarked.fillna("S")
    titanic_data.Embarked = titanic_data.Embarked.replace(['C', 'S', 'Q'], [0, 1, 2])
    
    return titanic_data

train_data,  test_data = correct_data(train, test)

The accuracy rose to 0.79426, which moved me up to 2189th out of 7922 participants.

(Figure: Kaggle leaderboard screenshot after changing the preprocessing)

Plan for further optimization

・ Analyze the Name column. (It contains titles such as Mr., Mrs., and Miss, which might help with prediction.)

(Figure kaggle_titanic1.png: sample rows showing the Name column)

・ Use a different learning method (e.g., XGBoost, LightGBM)
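The name-analysis idea could start by pulling the title out of each name. The following is a minimal sketch, assuming names follow the "Surname, Title. Given names" pattern seen in the Name column; `sample_names` and `titles` are hypothetical names, and the regular expression is one possible choice rather than an established recipe.

```python
import pandas as pd

# A few names in the format used by the Name column.
sample_names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period.
titles = sample_names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())
```

The extracted titles could then be encoded as numbers and added to the list of predictors, in the same way as Sex and Embarked.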

10. Machine learning flow

(Figure: overall machine learning flow)

Method selection: algorithm cheat sheet

(Figure: "Choosing the right estimator" cheat sheet from the scikit-learn 0.19.0 documentation)

Hyperparameter selection: grid search

Model evaluation: cross-validation

11. Introducing some corporate contests

Prudential Life Insurance Assessment
・ Can you make buying life insurance easier?
・ Calculate the risk level from the attributes of the life insurance applicant
・ Prize: $30,000
・ Already finished (the code can be referenced)
・ https://www.kaggle.com/c/prudential-life-insurance-assessment

Zillow Prize: Zillow’s Home Value Prediction (Zestimate)
・ Can you improve the algorithm that changed the world of real estate?
・ Predict the error between the Zestimate and the actual selling price, taking into account all the features of a home
・ Prize: $1.2 million
・ Ends in 4 months (as of the original presentation)
・ https://www.kaggle.com/c/zillow-prize-1
