Christmas is here again this year

Hello! My name is Nike! It's Christmas, and what a fun Christmas it is. Yeah, it's fun. I used to make sweets around this time of year, but this year I have no one to send them to. Recently I found this person and fell in love: https://www.youtube.com/channel/UCqzebzc9N19X3MVFnuFYtRw

Well then, thank you again this time!

Intermediate Machine Learning digs deeper into machine learning

～ Flow of Intermediate Machine Learning ～

1. Introduction
2. Missing values
3. Categorical variables
4. Pipelines
5. Cross-validation
6. XGBoost
7. Data leakage

This time we cover the fifth topic, cross-validation!

Machine learning is iterative

Machine learning is an iterative process. You have to decide which explanatory variables to use, which model to use, what arguments to pass to that model, and so on. So far we have made these choices by measuring the quality of the model on a validation set.

However, this approach has drawbacks. Suppose you have a dataset with 5000 rows (which means you have a fairly small dataset). Holding out 20% for validation gives 1000 rows. The model you've created may work well on one set of 1000 rows but not so well on a different 1000 rows.

As an extreme example, consider the case where the validation data is a single row. When you compare multiple models, which one makes the best prediction for that row comes down mostly to luck!

In general, the more validation data you have, the smaller the measurement error (called "noise") in your estimate of model quality, and the more reliable that estimate is. Unfortunately, the only way to get more validation data is to take more rows out of the training data. Doing so leaves less data to learn from, and the model quality suffers!
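To get a feel for this noise, here is a minimal simulation sketch. It does not use the housing data: as an assumption for illustration only, each row's absolute error is drawn from an exponential distribution with mean 1.0, so the "true" MAE is 1.0. The function names holdout_mae and spread_of_estimates are made up for this example.

```python
import random
import statistics

random.seed(0)

def holdout_mae(n_valid):
    """MAE of a fixed, hypothetical model measured on n_valid random rows."""
    # Assumption: absolute errors are exponentially distributed with mean 1.0
    errors = [random.expovariate(1.0) for _ in range(n_valid)]
    return statistics.mean(errors)

def spread_of_estimates(n_valid, n_repeats=200):
    """Standard deviation of the MAE estimate over repeated validation sets."""
    estimates = [holdout_mae(n_valid) for _ in range(n_repeats)]
    return statistics.stdev(estimates)

# The spread of the estimate shrinks as the validation set grows
for n_valid in (1, 10, 100, 1000):
    print(n_valid, round(spread_of_estimates(n_valid), 3))
```

With a single validation row the measured MAE jumps around a lot from one draw to the next, while with 1000 rows it stays close to the true value.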

What is cross-validation?

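Cross-validation runs the modeling process on different subsets of the data to get several measurements of model quality, which makes the evaluation more reliable on small datasets.

For example, if the validation data is 20% of the total, we can split the data into 5 pieces and repeat the experiment 5 times, holding out a different piece for validation each time, so that every row is used for validation exactly once. Each piece is called a "fold", so this is cross-validation with 5 folds. (Figure quoted from Kaggle.)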
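The splitting into folds can be sketched by hand as follows. This is only an illustration of the idea with a made-up helper, k_fold_indices; in practice scikit-learn does the splitting for you (for example through the cv argument of cross_val_score).

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) pairs, one pair per fold."""
    # Split the row indices into n_folds contiguous blocks of nearly equal size
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        # Every row outside the current block is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_rows))
        yield train_idx, valid_idx
        start += size

# 10 rows and 5 folds: each fold validates on 2 rows and trains on the other 8
for fold, (train_idx, valid_idx) in enumerate(k_fold_indices(10, 5), start=1):
    print(f"fold {fold}: validate on {valid_idx}, train on {len(train_idx)} rows")
```

Each row lands in the validation block of exactly one fold, which is why the whole dataset ends up being used for validation.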

Cross-validation takes longer because the model has to be trained once per fold. So you do not have to perform cross-validation when the dataset is large enough.

There is no clear threshold for when a dataset is large enough, but if your model finishes its calculations within a couple of minutes, cross-validation is probably worth doing.

Alternatively, run cross-validation once and compare the scores of the folds: if all folds give similar results, a single validation set is probably enough.

Write code

The data used this time is the same as last time. It is available here.

import pandas as pd

# Read the data
train_data = pd.read_csv('train.csv', index_col='Id')
test_data = pd.read_csv('test.csv', index_col='Id')

# Drop rows where the target is missing, then separate the target from the features
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice
train_data.drop(['SalePrice'], axis=1, inplace=True)

# Keep only the numeric columns
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()

First, build a pipeline. SimpleImputer fills in the missing values, and the model is RandomForestRegressor.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))])

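Next, define a function that returns the MAE averaged over the cross-validation folds. n_estimators is the number of trees in the random forest. scikit-learn's cross_val_score returns the MAE with a minus sign, so it is multiplied by -1. The reason is that scikit-learn scorers follow the convention that a higher value is always better, so error metrics are reported as negatives (hence the name 'neg_mean_absolute_error'). The number of folds is set with the cv argument.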

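from sklearn.model_selection import cross_val_score

def get_score(n_estimators):
    """Return the average MAE over 3 cross-validation folds.

    n_estimators -- number of trees in the random forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])

    # The scorer returns negative MAE, so multiply by -1 to get the usual MAE
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    print(scores.mean())
    return scores.mean()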
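To see the sign convention and the per-fold scores more directly, here is a small self-contained sketch. It uses synthetic data from make_regression as a stand-in (an assumption for illustration, not the housing data above), and the names X_demo, y_demo and mae_per_fold are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the housing data used in this post)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=0)

# One score per fold; each is the negated MAE for that fold
neg_scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')

# Flip the sign to get ordinary MAE values
mae_per_fold = -neg_scores

print(neg_scores)
print(mae_per_fold.mean(), mae_per_fold.std())
```

The standard deviation across folds gives a rough idea of how much a single validation split could mislead you: if it is small compared with the mean, one split would probably have told the same story.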

Finally, we call the function defined above with several values of n_estimators, plot how the MAE changes, and find the value that gives the minimum. get_score returns the average MAE as a single number, so each result is stored in the dictionary results, keyed by the number of trees.

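import matplotlib.pyplot as plt

# Average MAE for n_estimators = 50, 100, ..., 400
results = {}
for i in range(1, 9):
    results[50*i] = get_score(50*i)

# The key (n_estimators) with the smallest average MAE
n_estimators_best = min(results, key=results.get)
print(n_estimators_best)

plt.plot(list(results.keys()), list(results.values()))
plt.xlabel('n_estimators')
plt.ylabel('Average MAE')
plt.show()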

Execution result

18353.8393511688
18395.2151680032
18288.730020956387
18248.345889801505
18255.26922247291
18275.241922621914
18270.29183308043
18270.197974402367
200

Thank you for reading until the end!

This year is finally almost over. What did you do this year? Looking back, I did surprisingly little... And that is not something I can excuse with "because everyone was in a self-restraint mood". Not from next year, but starting today, I will not repeat that kind of life!

Thank you for reading until the end!

[PYTHON] Read kaggle Courses --- Intermediate Machine Learning 5
