The YouTube commentary is in the 4th session (1), at around the 40-minute mark.
Create 30 training samples from $ y = \cos(1.5 \pi x) $ with added noise $ N(0,1) \times 0.1 $, and perform polynomial regression.
Cross-validation is introduced starting with this exercise.
The regression is run for each degree in turn, from degree 1 to degree 20.
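Written out, the setup can be stated roughly as follows. This is only a sketch: the symbols $ \varepsilon_i $, $ w_j $, and $ d $ are notation introduced here for illustration and are not taken from the video.

```latex
\begin{aligned}
y_i &= \cos(1.5 \pi x_i) + 0.1\,\varepsilon_i, \qquad \varepsilon_i \sim N(0,1), \quad i = 1, \dots, 30 \\
\hat{y}(x) &= w_0 + \sum_{j=1}^{d} w_j x^{j}, \qquad d = 1, \dots, 20
\end{aligned}
```

The first line is how the training data is generated, and the second is the degree-$ d $ polynomial that linear regression fits to it.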
This is the training data.
Source code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures as PF
from sklearn import linear_model
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
def true_f(x):
    return np.cos(1.5 * x * np.pi)
n_samples = 30
#X-axis data for drawing
x_plot = np.linspace(0,1,100)
#Training data
x_tr = np.sort(np.random.rand(n_samples))
y_tr = true_f(x_tr) + np.random.randn(n_samples) * 0.1
#Convert to Matrix
X_tr = x_tr.reshape(-1,1)
X_plot = x_plot.reshape(-1,1)
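#Highest polynomial degree to try (the text uses degrees 1 to 20)
DEGREE = 20
for degree in range(1,DEGREE+1):
    plt.scatter(x_tr,y_tr,label="Training Samples")
    filename = f"{degree}.png"
    pf = PF(degree=degree,include_bias=False)
    linear_reg = linear_model.LinearRegression()
    steps = [("Polynomial_Features",pf),("Linear_Regression",linear_reg)]
    pipeline = Pipeline(steps=steps)
    pipeline.fit(X_tr,y_tr)
    y_predict = pipeline.predict(X_tr)
    mse = mean_squared_error(y_tr,y_predict)
    #Scores are negated MSE values, so the sign is flipped when displayed
    scores = cross_val_score(pipeline,X_tr,y_tr,scoring="neg_mean_squared_error",cv=10)
    #Assumed drawing step: fitted curve and true function on the x_plot grid
    plt.plot(x_plot,pipeline.predict(X_plot),label="Model")
    plt.plot(x_plot,true_f(x_plot),label="True function")
    plt.title(f"Degree: {degree} TrainErr: {mse:.2e} TestErr: {-scores.mean():.2e}(+/- {scores.std():.2e})")
    plt.legend(loc="best")
    #Save one figure per degree, then start a fresh figure
    plt.savefig(filename)
    plt.clf()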
In the previous Exercise 3.1, I first prepared $ x, x^2, x^3 $, etc. with PolynomialFeatures and then ran LinearRegression as a separate step, but I learned that both can be done in one shot by using a pipeline.
In fact, when I looked at the source code in the explanation video for Exercise 3.1, it used a pipeline.
There is nothing difficult about it: just list the processing stages in steps, build the pipeline, and fit it.
steps = [("Polynomial_Features",pf),("Linear_Regression",linear_reg)]
pipeline = Pipeline(steps=steps)
pipeline.fit(X_tr,y_tr)
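To check that the pipeline really does the same thing as the two-step approach, a minimal sketch like the following can be used. The seeded random generator, the fixed degree of 3, and the names `manual_pred`, `pipeline_pred`, and `same` are my own choices for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data in the same style as the exercise (seed chosen here for repeatability)
rng = np.random.default_rng(0)
x = np.sort(rng.random(30))
y = np.cos(1.5 * np.pi * x) + rng.standard_normal(30) * 0.1
X = x.reshape(-1, 1)

# Two-step version: build the polynomial features, then fit the regression
pf = PolynomialFeatures(degree=3, include_bias=False)
X_poly = pf.fit_transform(X)  # columns are x, x^2, x^3
manual_pred = LinearRegression().fit(X_poly, y).predict(X_poly)

# One-shot version: the pipeline runs the same two stages in order
pipeline = Pipeline(steps=[
    ("Polynomial_Features", PolynomialFeatures(degree=3, include_bias=False)),
    ("Linear_Regression", LinearRegression()),
])
pipeline_pred = pipeline.fit(X, y).predict(X)

# Both approaches should give the same predictions
same = np.allclose(manual_pred, pipeline_pred)
print(same)
```

The practical benefit is that the pipeline can be passed around as a single estimator, which is what makes it convenient to hand to cross-validation.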
Apart from the pipeline, the difference from Exercise 3.1 is that cross-validation is included. It is this part of the program:
scores = cross_val_score(pipeline,X_tr,y_tr,scoring="neg_mean_squared_error",cv=10)
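With cv=10 the data is divided into 10 parts. Each part in turn is held out as the test data while the model is trained on the other nine, and the test error is evaluated on the held-out part. Because the scoring is neg_mean_squared_error, the returned scores are negated MSE values, so the program flips the sign before displaying them.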
Basically, the model with the smaller test error is the better one.
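To see what cross-validation does under the hood, here is a rough sketch that repeats the 10-fold evaluation by hand and compares it with cross_val_score. It assumes that an integer cv for a regressor means an unshuffled KFold split, which is how I understand the scikit-learn default; the seed, the degree of 3, and the names `fold_mse` and `matches` are mine.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data in the same style as the exercise (seed chosen here for repeatability)
rng = np.random.default_rng(0)
x = np.sort(rng.random(30))
y = np.cos(1.5 * np.pi * x) + rng.standard_normal(30) * 0.1
X = x.reshape(-1, 1)

pipeline = Pipeline(steps=[
    ("Polynomial_Features", PolynomialFeatures(degree=3, include_bias=False)),
    ("Linear_Regression", LinearRegression()),
])

# Manual 10-fold loop: train on nine parts, measure MSE on the held-out part
fold_mse = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    pipeline.fit(X[train_idx], y[train_idx])
    y_pred = pipeline.predict(X[test_idx])
    fold_mse.append(mean_squared_error(y[test_idx], y_pred))
fold_mse = np.array(fold_mse)

# The library call returns negated MSE values, one per fold
scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)

# Flipping the sign should reproduce the manual results
matches = np.allclose(fold_mse, -scores)
print(fold_mse.mean(), -scores.mean(), matches)
```

With 30 samples and 10 folds, each held-out part contains only 3 points, so the per-fold errors vary quite a bit; that is why the program reports the standard deviation next to the mean.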
When the program is executed, 20 graph files are created, from 1.png to 20.png.
- Minimum training error: degree 20
- Minimum test error: degree 3
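From this, we can see how harmful overfitting is: the degree-20 polynomial fits the training data most closely, yet the degree-3 polynomial has the smallest error on data it was not trained on.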
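Instead of reading the errors off 20 image titles, the comparison can also be collected in lists and the best degree picked automatically. This is a sketch under my own assumptions (seeded data, the names `train_errors`, `test_errors`, `best_train_degree`, and `best_test_degree`); since the data is random, the selected degrees will not necessarily match the ones I got above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data in the same style as the exercise (seed chosen here for repeatability)
rng = np.random.default_rng(0)
x = np.sort(rng.random(30))
y = np.cos(1.5 * np.pi * x) + rng.standard_normal(30) * 0.1
X = x.reshape(-1, 1)

degrees = list(range(1, 21))
train_errors = []
test_errors = []
for degree in degrees:
    pipeline = Pipeline(steps=[
        ("Polynomial_Features", PolynomialFeatures(degree=degree, include_bias=False)),
        ("Linear_Regression", LinearRegression()),
    ])
    # Training error: fit on all data and measure MSE on the same data
    pipeline.fit(X, y)
    train_errors.append(mean_squared_error(y, pipeline.predict(X)))
    # Test error: mean MSE over the 10 cross-validation folds (sign flipped)
    scores = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=10)
    test_errors.append(-scores.mean())

# Degree with the smallest error of each kind
best_train_degree = degrees[int(np.argmin(train_errors))]
best_test_degree = degrees[int(np.argmin(test_errors))]
print("Minimum training error at degree", best_train_degree)
print("Minimum test error at degree", best_test_degree)
```

Choosing the degree by `best_test_degree` rather than `best_train_degree` is the whole point of the exercise: the training error keeps rewarding extra flexibility, while the cross-validated error penalizes it once the model starts fitting noise.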
