[PYTHON] About machine learning overfitting

○ The main points of this article
A note on reproducing overfitting.
Overfitting: the model handles the training data well but cannot handle unknown data. In other words, it lacks the ability to generalize.

○ Source code (Python): overfitting a model and confirming the overfitting

How to overfit a model and how to check for overfitting


# Note: load_boston was removed in scikit-learn 1.2, so this code needs an older version
from sklearn.datasets import load_boston
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
# Jupyter Notebook magic; remove this line when running as a plain script
%matplotlib inline

# Data preparation: Boston house prices dataset
data = load_boston()
X = data.data[:, [5]]  # Use only the average number of rooms (RM) as the explanatory variable
y = data.target

# Split into training data (first 400 rows) and test data (remaining rows), without shuffling
train_X, test_X = X[:400], X[400:]
train_y, test_y = y[:400], y[400:]

# Train an SVR (support vector machine with the kernel method) with the hyperparameters set explicitly
model_s = SVR(C=1.0, kernel='rbf')  # RBF kernel with regularization parameter C=1
model_s.fit(train_X, train_y)
# Prediction on the training data
s_pred = model_s.predict(train_X)
# Prediction on the test data (prediction for unknown data)
s_pred_t = model_s.predict(test_X)

# Plot the training data and the prediction curves
fig, ax = plt.subplots()
ax.scatter(train_X, train_y, color='red', marker='s', label='data')
ax.plot(train_X, s_pred, color='blue', label='svr_rbf curve(train)')
ax.plot(test_X, s_pred_t, color='orange', label='svr_rbf curve(test)')
ax.legend()
plt.show()

print("○ Mean square error and coefficient of determination of training data")
print(mean_squared_error(train_y, s_pred))
print(r2_score(train_y, s_pred))
print("○ Mean square error and coefficient of determination of test data")
print(mean_squared_error(test_y, s_pred_t))
print(r2_score(test_y, s_pred_t))

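Result

[Figure: scatter plot of the training data with the SVR prediction curves for the training data (blue) and the test data (orange)]

○ Mean square error and coefficient of determination of training data
30.330756428515905
0.6380880725968641
○ Mean square error and coefficient of determination of test data
69.32813164021485
-1.4534559402985217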

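The prediction curve for the training data (blue) follows the data fairly well, but the curve for the test data (orange) does not. The mean squared error and the coefficient of determination confirm this: the error rises from about 30.3 on the training data to about 69.3 on the test data, and the coefficient of determination falls from about 0.64 to about -1.45. A negative coefficient of determination means the model predicts worse than simply using the mean of the test targets. This is overfitting.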
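The same train-versus-test comparison can be sketched on data that anyone can regenerate. This is only an illustration under assumptions that are not in the original article: the data is a synthetic noisy sine curve rather than the Boston dataset, the split is shuffled with `train_test_split`, and the helper `train_test_r2` and the specific `C` and `gamma` values are made up for the example.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration, not the Boston dataset):
# a noisy sine curve with a single explanatory variable.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)

# Shuffled split, so training and test data come from the same distribution.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def train_test_r2(model):
    """Hypothetical helper: fit on the training data, return (train R^2, test R^2)."""
    model.fit(train_X, train_y)
    return (
        r2_score(train_y, model.predict(train_X)),
        r2_score(test_y, model.predict(test_X)),
    )


# A very large gamma makes the RBF kernel so narrow that the model
# mostly memorizes individual training points.
overfit_train, overfit_test = train_test_r2(SVR(kernel="rbf", C=1000.0, gamma=1000.0))
# A moderate gamma gives a smoother curve.
smooth_train, smooth_test = train_test_r2(SVR(kernel="rbf", C=1.0, gamma=1.0))

print("narrow kernel: train R2 = %.3f, test R2 = %.3f" % (overfit_train, overfit_test))
print("smooth kernel: train R2 = %.3f, test R2 = %.3f" % (smooth_train, smooth_test))
```

A large gap between the training score and the test score, as expected for the narrow-kernel model here, is the same symptom the results above show. The smoother model would typically show a much smaller gap.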

There are various ways to prevent overfitting. I will explain them in detail another time, but in short:
・ Increase the amount of training data
・ Perform cross-validation
・ Adjust the hyperparameters (make the model simpler)
・ Reduce the number of features
・ Apply regularization
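Two of the measures listed above, cross-validation and hyperparameter adjustment, can be sketched with scikit-learn's `cross_val_score` and `GridSearchCV`. This is a minimal sketch and not the author's method: the synthetic sine data and the candidate values in `param_grid` are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVR

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(120, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=120)

# 5-fold cross-validation: each fold is held out once as validation data,
# so every score is measured on data the model was not trained on.
scores = cross_val_score(SVR(kernel="rbf", C=1.0), X, y, cv=5, scoring="r2")
print("R2 per fold:", scores)
print("mean R2:", scores.mean())

# Hyperparameter adjustment: try several candidate values and keep the
# combination with the best cross-validated score.
# In SVR a smaller C means stronger regularization; a smaller gamma means
# a wider RBF kernel and therefore a smoother (simpler) curve.
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best mean cross-validated R2:", search.best_score_)
```

Because the score is averaged over several held-out folds, it is less likely to be flattered by one lucky split than a single train/test comparison.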
