Introduction
Only a few hours of 2019 remain, and interest in what next year will be is growing. So I had an AI predict what year next year will be.
Method
I trained on the data for the 2019 years up to and including this year (2019).
The model is Kernel Ridge regression with an RBF kernel.
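The contents of years.csv are not shown in the post. As an assumption for illustration, here is a minimal sketch of how such a file might be generated, supposing each row simply lists a year from 1 to 2019 and that the `result` column equals the year itself:

import pandas as pd

# Hypothetical sketch only: one row per year (1..2019), with the "result"
# column assumed to equal the year. The real years.csv may differ.
years = list(range(1, 2020))
pd.DataFrame({"years": years, "result": years}).to_csv(
    "years.csv", index=False, header=False)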
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the year data: one feature column ("years") and the target ("result")
df = pd.read_csv('years.csv', names=("years", "result"))
features = df.drop(["result"], axis=1)
target = df["result"]

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0)

from sklearn.model_selection import GridSearchCV
from sklearn.kernel_ridge import KernelRidge

# Grid-search the regularization strength (alpha) and the RBF kernel width (gamma)
param_grid = {'alpha': [i * 10**j for i in [1, 3] for j in [-9, -8, -7]],
              'gamma': [i * 10**j for i in [1, 2, 4, 7] for j in [-6, -5, -4]]}
gs = GridSearchCV(KernelRidge(kernel='rbf'), param_grid, cv=5, n_jobs=3)
gs.fit(train_x, train_y)
rgr = gs.best_estimator_
The data was randomly split into training and test sets, and the model was trained on the training set.
Kernel Ridge has the hyperparameters `alpha` and `gamma`, so they were optimized by grid search.
Result
`GridSearchCV` further splits the given training data and searches for the parameter combination that maximizes generalization performance. The performance of the predictor with the optimal parameters is then checked.
print(gs.best_estimator_)
print(gs.best_score_)
KernelRidge(alpha=1e-09, coef0=1, degree=3, gamma=2e-05, kernel='rbf',
            kernel_params=None)
0.9999999999996596
The generalization performance score was sufficiently high.
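As a supplementary check, the full grid of cross-validated scores can also be inspected through the `cv_results_` attribute of `GridSearchCV`; a minimal sketch, reusing the `gs` object and the `pandas` import from the code above:

# Sketch: list the evaluated (alpha, gamma) combinations with their
# mean cross-validation scores, best first.
cv_table = pd.DataFrame(gs.cv_results_)
print(cv_table[["param_alpha", "param_gamma", "mean_test_score"]]
      .sort_values("mean_test_score", ascending=False).head())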
yyplot
# y-y plot: predicted values (x-axis) vs. observed values (y-axis)
plt.scatter(rgr.predict(train_x), train_y, marker='.', label='train')
plt.scatter(rgr.predict(test_x), test_y, marker='.', label='test')
plt.legend()
plt.show()
A y-y plot (predicted versus observed values) was drawn to check whether valid predictions were made for the training and test data.
It can be seen that the predictions are accurate for most of the existing data.
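To put a number on this, the coefficient of determination on the held-out test data can also be computed; a minimal sketch using the `rgr` estimator and the test split from the code above:

# Sketch: R^2 of the tuned Kernel Ridge model on the held-out test set
print(rgr.score(test_x, test_y))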
Next, a learning curve was drawn to check whether the model was overfitting.
from sklearn.model_selection import (learning_curve, ShuffleSplit)

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5), verbose=0):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # Compute training and cross-validation scores for increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shade one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
t_size = np.linspace(0.01, 1.00, 20)
# Draw the learning curve for the tuned Kernel Ridge model (rgr)
plot_learning_curve(rgr, "Learning Curve", features, target,
                    cv=cv, ylim=[0.98, 1.005], train_sizes=t_size, verbose=10)
plt.show()
Since both the training score and the cross-validation score converge to high values, the risk of overfitting can be judged to be low.
print(rgr.predict([[2019+1]]))
I fed the model the year after this year (2019 + 1) and had it predict what next year will be.
[2019.99488853]
The result was $2.020 \times 10^3$ years. In other words, next year is expected to be 2020.
Discussion
Kernel Ridge regression with an RBF kernel finds the function that minimizes the loss within the (infinite-dimensional) function space induced by the Gaussian kernel, and it shows high generalization performance on problems where no explicit functional form is assumed. It is therefore highly probable that next year will be 2020.
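For reference, the standard textbook form of the objective minimized by Kernel Ridge regression, and of the RBF (Gaussian) kernel used here, can be written as

$$
\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \alpha \lVert f \rVert_{\mathcal{H}}^2,
\qquad
k(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^2\bigr),
$$

where $\alpha$ is the regularization strength and $\gamma$ the width parameter of the Gaussian kernel, which are exactly the two hyperparameters tuned by the grid search above.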