Introduction
Only a few hours of 2019 remain, and interest in what next year will be is growing. So I had an AI predict what year next year will be.
Method
I trained on the data for the 2019 years up to and including this year (2019).
The model is Kernel Ridge regression with an RBF kernel.
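The contents of years.csv are not shown in the post. As an assumption for illustration, here is a minimal sketch of how such a file might be generated, supposing each row simply lists a year from 1 to 2019 and that the `result` column equals the year itself:

import pandas as pd

# Hypothetical sketch only: one row per year (1..2019), with the "result"
# column assumed to equal the year. The real years.csv may differ.
years = list(range(1, 2020))
pd.DataFrame({"years": years, "result": years}).to_csv(
    "years.csv", index=False, header=False)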
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Load the year data: one feature column ("years") and the target ("result")
df = pd.read_csv('years.csv', names=("years", "result"))
features = df.drop(["result"], axis=1)
target = df["result"]

from sklearn.model_selection import train_test_split

# Hold out 20% of the data as a test set
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0)

from sklearn.model_selection import GridSearchCV
from sklearn.kernel_ridge import KernelRidge

# Grid-search the regularization strength (alpha) and the RBF kernel width (gamma)
param_grid = {'alpha': [i * 10**j for i in [1, 3] for j in [-9, -8, -7]],
              'gamma': [i * 10**j for i in [1, 2, 4, 7] for j in [-6, -5, -4]]}
gs = GridSearchCV(KernelRidge(kernel='rbf'), param_grid, cv=5, n_jobs=3)
gs.fit(train_x, train_y)
rgr = gs.best_estimator_
The data was randomly split into training and test sets, and the model was trained on the training set.
Kernel Ridge has the hyperparameters `alpha` and `gamma`, so they were optimized by grid search.
Result
`GridSearchCV` further splits the given training data and searches for the parameter combination that maximizes generalization performance. The performance of the predictor with the optimal parameters is then checked.
print(gs.best_estimator_)
print(gs.best_score_)
KernelRidge(alpha=1e-09, coef0=1, degree=3, gamma=2e-05, kernel='rbf',
            kernel_params=None)
0.9999999999996596
The generalization performance score was sufficiently high.
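As a supplementary check, the full grid of cross-validated scores can also be inspected through the `cv_results_` attribute of `GridSearchCV`; a minimal sketch, reusing the `gs` object and the `pandas` import from the code above:

# Sketch: list the evaluated (alpha, gamma) combinations with their
# mean cross-validation scores, best first.
cv_table = pd.DataFrame(gs.cv_results_)
print(cv_table[["param_alpha", "param_gamma", "mean_test_score"]]
      .sort_values("mean_test_score", ascending=False).head())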
yyplot
# y-y plot: predicted values (x-axis) vs. observed values (y-axis)
plt.scatter(rgr.predict(train_x), train_y, marker='.', label='train')
plt.scatter(rgr.predict(test_x), test_y, marker='.', label='test')
plt.legend()
plt.show()
A y-y plot (predicted versus observed values) was drawn to check whether valid predictions were made for the training and test data.
It can be seen that the predictions are accurate for most of the existing data.
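To put a number on this, the coefficient of determination on the held-out test data can also be computed; a minimal sketch using the `rgr` estimator and the test split from the code above:

# Sketch: R^2 of the tuned Kernel Ridge model on the held-out test set
print(rgr.score(test_x, test_y))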
Next, a learning curve was drawn to check whether the model was overfitting.
from sklearn.model_selection import (learning_curve, ShuffleSplit)

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5), verbose=0):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    # Compute training and cross-validation scores for increasing training-set sizes
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, verbose=verbose)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    # Shade one standard deviation around each mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
t_size = np.linspace(0.01, 1.00, 20)
# Draw the learning curve for the tuned Kernel Ridge model (rgr)
plot_learning_curve(rgr, "Learning Curve", features, target,
                    cv=cv, ylim=[0.98, 1.005], train_sizes=t_size, verbose=10)
plt.show()
Since both the training score and the cross-validation score converge to high values, the risk of overfitting can be judged to be low.
print(rgr.predict([[2019+1]]))
I fed the model the year after this year (2019 + 1) and had it predict what next year will be.
[2019.99488853]
The result was $2.020 \times 10^3$ years. In other words, next year is expected to be 2020.
Discussion
Kernel Ridge regression with an RBF kernel finds the function that minimizes the loss within the (infinite-dimensional) function space induced by the Gaussian kernel, and it shows high generalization performance on problems where no explicit functional form is assumed. It is therefore highly probable that next year will be 2020.
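For reference, the standard textbook form of the objective minimized by Kernel Ridge regression, and of the RBF (Gaussian) kernel used here, can be written as

$$
\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \bigl(y_i - f(x_i)\bigr)^2 + \alpha \lVert f \rVert_{\mathcal{H}}^2,
\qquad
k(x, x') = \exp\bigl(-\gamma \lVert x - x' \rVert^2\bigr),
$$

where $\alpha$ is the regularization strength and $\gamma$ the width parameter of the Gaussian kernel, which are exactly the two hyperparameters tuned by the grid search above.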