[PYTHON] Read kaggle Courses --- Intermediate Machine Learning 5

Christmas is here again this year

Hello! My name is Nike! It's Christmas, and Christmas is always fun. I used to bake sweets around this time of year, but this year I have no one to send them to. Instead, I've fallen for this channel I found recently: https://www.youtube.com/channel/UCqzebzc9N19X3MVFnuFYtRw

Well then, thank you again this time!


Intermediate Machine Learning digs deeper into machine learning

~ Flow of Intermediate Machine Learning ~

  1. Introduction
  2. Missing values
  3. Categorical variables
  4. Pipeline
  5. Cross-validation
  6. XGBoost
  7. Data leakage

This time we cover the contents of part 5!

Machine learning is iterative

Machine learning is an iterative process. Which explanatory variables to use, which model to use, what arguments to pass to that model, and so on: we decide all of these while measuring the quality of the model through validation.

However, this approach has a drawback. Suppose you have a dataset with 5,000 rows (a relatively __small__ dataset). Holding out 20% for validation gives 1,000 rows. A model may happen to __work well__ on one set of 1,000 rows but __not so well__ on a different 1,000 rows.

As an extreme example, consider validation data consisting of a single row. When comparing multiple models, which one predicts that row best comes down to __luck__!

In general, the more validation data you have, the smaller the __measurement error__ (the "noise") in your quality estimate, and the more reliable it is. Unfortunately, more validation data can only be obtained by taking rows away from the training data. Doing so leaves the model with too little to learn from and hurts its quality!
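To make this noise concrete, here is a minimal sketch (not part of the course; it uses synthetic data from scikit-learn's make_regression as a stand-in) showing that the same model, scored on different 20% splits, reports different MAEs:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small dataset
X_demo, y_demo = make_regression(n_samples=5000, n_features=10,
                                 noise=10.0, random_state=0)

# The same model, validated on three different 20% splits,
# reports three different MAEs; that spread is the "noise"
for seed in range(3):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    print(seed, mean_absolute_error(y_val, model.predict(X_val)))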

What is cross-validation?

Cross-validation is a method for validating a model more reliably on small datasets.

For example, if the validation data is 20% of the total, the experiment can be repeated 5 times, each time holding out a different 20%. The data is then said to be divided into 5 __"folds"__.

(Figure: 5-fold cross-validation. Quoted from kaggle.)
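As a minimal sketch of how the folds are carved out (scikit-learn's KFold helper is used here for illustration; the course code later uses cross_val_score, which handles the folding internally):

import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # ten rows, just for illustration

# Each fold holds out a different 20% (2 of the 10 rows) for validation
kf = KFold(n_splits=5)
for i, (train_idx, val_idx) in enumerate(kf.split(rows), start=1):
    print(f"fold {i}: validate on rows {val_idx}, train on rows {train_idx}")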

Cross-validation takes longer because the model has to be trained once per fold. So when the dataset is large enough, you __do not have to perform cross-validation__.

There is no clear standard for how large is large enough, but if your model finishes its computations within a few minutes, it is probably worth performing cross-validation.

Conversely, if you run cross-validation and all folds give similar results, a single validation split will suffice.

Write code

The data used this time is the same as last time. It can be found here.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
train_data = pd.read_csv('train.csv', index_col='Id')
test_data = pd.read_csv('test.csv', index_col='Id')

# Drop rows where the target (SalePrice) is missing, then separate the target
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice              
train_data.drop(['SalePrice'], axis=1, inplace=True)

# Keep only the numeric columns
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()


First, build a pipeline. SimpleImputer fills in the missing values, and RandomForestRegressor is the model.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))])

Next, define a function that returns the average MAE over the cross-validation folds. The number of trees in the random forest is passed as n_estimators. Since __cross_val_score__ in __scikit-learn__ returns the MAE as a negative number, we multiply it by -1. (This is because scikit-learn follows the convention that a higher score is always better, so error metrics are negated.) The number of folds is adjusted with the cv argument.

from sklearn.model_selection import cross_val_score

def get_score(n_estimators):
    """Return the mean MAE over 3 cross-validation folds."""
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators,
                                        random_state=0))])

    # The negated MAE comes back negative, so multiply by -1
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    print(scores.mean())
    return scores.mean()
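As a quick check of that sign convention (not part of the course notebook, but it reuses the pipeline and data defined above):

# cross_val_score with scoring='neg_mean_absolute_error' returns
# negative values, because scikit-learn treats larger scores as better
raw_scores = cross_val_score(my_pipeline, X, y,
                             cv=3, scoring='neg_mean_absolute_error')
print(raw_scores)          # three negative numbers, one per fold
print(-raw_scores.mean())  # flip the sign to get the usual positive MAE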

Finally, we feed values into the function defined above, visualize how the MAE changes with each value in a graph, and find the value that minimizes it. Each n_estimators value and its mean MAE are stored in the results dictionary.

results = {}
# Try n_estimators = 50, 100, ..., 400
for i in range(1, 9):
    results[50*i] = get_score(50*i)

import matplotlib.pyplot as plt

# The n_estimators value with the smallest mean MAE
n_estimators_best = min(results, key=results.get)
print(n_estimators_best)

plt.plot(list(results.keys()), list(results.values()))
plt.show()

Execution result

18353.8393511688
18395.2151680032
18288.730020956387
18248.345889801505
18255.26922247291
18275.241922621914
18270.29183308043
18270.197974402367
200

(Figure: mean MAE plotted against n_estimators)
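For reference, scikit-learn can automate this kind of sweep with GridSearchCV, which runs the same cross-validation for every candidate value. Here is a minimal sketch of that alternative to the manual loop above:

from sklearn.model_selection import GridSearchCV

# 'model__n_estimators' targets the n_estimators parameter of the
# pipeline step named 'model'
param_grid = {'model__n_estimators': [50 * i for i in range(1, 9)]}
search = GridSearchCV(my_pipeline, param_grid,
                      cv=3, scoring='neg_mean_absolute_error')
search.fit(X, y)
print(search.best_params_)  # the best n_estimators found
print(-search.best_score_)  # its mean MAE (sign flipped back)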


This year is finally over. What did you do this year? Looking back, I didn't do much of anything, unexpectedly... And it's not something I can excuse with "because there was a mood of self-restraint". Not from next year, but from today, I will not repeat such a life!

Thank you for reading until the end!
