[PYTHON] Supervised machine learning (classification / regression)

"Supervised machine learning" is the task of finding a function $ f $ such that $ y = f (X) $ when the corresponding objective variable $ y $ is known for some explanatory variables $ X $. The simplest examples are "simple linear regression" and "multiple linear regression".

Classification problem

As an example, we will use the "Pima Indian Diabetes Diagnosis" data.

#Import a library that provides access to resources by URL.
import urllib.request 
#Specify resources on the web
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/pima-indians-diabetes.txt'
#Download the resource from the specified URL and give it a name.
urllib.request.urlretrieve(url, 'pima-indians-diabetes.txt') 
('pima-indians-diabetes.txt', <http.client.HTTPMessage at 0x7fd16c201550>)
#Import of spreadsheet-like data processing library
import pandas as pd 
#Read data and save as data frame format
df = pd.read_csv('pima-indians-diabetes.txt', delimiter="\t", index_col=0)
#Check the contents
df
     NumTimePreg  OralGluTol  BloodPres  SkinThick  SerumInsulin   BMI  PedigreeFunc  Age  Class
0              6         148         72         35             0  33.6         0.627   50      1
1              1          85         66         29             0  26.6         0.351   31      0
2              8         183         64          0             0  23.3         0.672   32      1
3              1          89         66         23            94  28.1         0.167   21      0
...          ...         ...        ...        ...           ...   ...           ...  ...    ...
765            5         121         72         23           112  26.2         0.245   30      0
766            1         126         60          0             0  30.1         0.349   47      1
767            1          93         70         31             0  30.4         0.315   23      0

768 rows × 9 columns

The "Class" column above is the objective variable $ y $: 1 means the subject was judged to be diabetic, and 0 means the subject was judged not to be diabetic. Let's create a model that predicts it.

#Explanatory variable
X = df.iloc[:, :8]
#Min-max normalization: rescale each column so that its minimum is 0 and its maximum is 1.
#With axis=0 the function is applied column by column; axis=1 would normalize row by row instead.
X = X.apply(lambda x: (x-x.min())/(x.max() - x.min()), axis=0)
#Objective variable
y = df.iloc[:, 8]
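To make the normalization step concrete, here is a minimal sketch on a tiny made-up table. The frame `toy` and its values are assumptions for illustration only and are not taken from the diabetes data.

```python
import pandas as pd

# Hypothetical toy data, not from the diabetes dataset
toy = pd.DataFrame({"a": [1.0, 2.0, 5.0], "b": [10.0, 30.0, 20.0]})

# axis=0 passes each column to the lambda in turn
toy_scaled = toy.apply(lambda col: (col - col.min()) / (col.max() - col.min()), axis=0)
print(toy_scaled)
```

Each column now runs from 0 to 1. Note that a column whose values are all identical would divide by zero here and produce NaN, so such columns need separate handling.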

Divide into training data and test data

In machine learning, known data is divided into training data (also called teacher data or the training set) and test data (also called the test set) in order to evaluate performance. A prediction model is built by training on the training data, and its performance is evaluated by how accurately it predicts the test data, which was not used to build the model. This kind of evaluation is the basis of "cross-validation".

Here, we aim to learn the relationship between X_train and y_train and predict y_test from X_test.

Python's machine learning library scikit-learn provides methods for splitting into training and test data.

#Import method to split into training data and test data
from sklearn.model_selection import train_test_split 
#Randomly split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4) 
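The split above is random, so the scores change on every run. As a sketch of one possible variation (not what this article's outputs were produced with), `random_state` fixes the split and `stratify` keeps the class ratio the same in both parts. The arrays `X_demo` and `y_demo` are made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)   # 100 hypothetical samples, 2 features
y_demo = np.array([0] * 70 + [1] * 30)    # imbalanced labels: 70% class 0, 30% class 1

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0, stratify=y_demo)

print(Xtr.shape, Xte.shape)       # sizes of the training and test parts
print(ytr.mean(), yte.mean())     # fraction of class 1 in each part
```

With stratification, both parts keep the 30% share of class 1, which matters for imbalanced data such as a diagnosis dataset.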

Logistic regression

Logistic regression is similar to multiple linear regression, but it handles an objective variable $ y $ that takes the discrete values 0 or 1. In the lecture the other day, I implemented logistic regression using the scipy library, but in practice it is more convenient to use scikit-learn.

Check the scikit-learn documentation for LogisticRegression to see what methods and parameters are available.

from sklearn.linear_model import LogisticRegression #Logistic regression
classifier = LogisticRegression() #Generate classifier
classifier.fit(X_train, y_train) #Learning
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

Time measurement

It is convenient to use timeit to measure the time required for learning.

import timeit #Library for measuring execution time
timeit.timeit(lambda: classifier.fit(X_train, y_train), number=1)
0.006864218999993454

Calculation of correct answer rate

It is necessary to distinguish between "classification accuracy on the data used for learning" and "classification accuracy on data not used for learning". A model for which the former is extremely high but the latter is low has poor generalization performance and is said to be "overfitted".

#Correct answer rate(train) :How accurately can the data used for training be predicted?
classifier.score(X_train,y_train)
0.7800289435600579
#Correct answer rate(test) :How accurately can you predict the data that was not used for training?
classifier.score(X_test,y_test)
0.7402597402597403

Prediction of data not used for training and confusion matrix

In addition to the accuracy rate, you can obtain the specific class predicted for each sample.

#Predict data not used for training
y_pred = classifier.predict(X_test)
y_pred
array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0])

If the correct answers are known, the predictions can be checked against them. The confusion matrix is a convenient way to tabulate the result.

from sklearn.metrics import confusion_matrix #Method to calculate confusion matrix
#A confusion matrix that shows how well the prediction result matches the correct answer (the true answer).
pd.DataFrame(confusion_matrix(y_pred, y_test), 
             index=['predicted 0', 'predicted 1'], columns=['real 0', 'real 1'])
             real 0  real 1
predicted 0      47      18
predicted 1       2      10
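As a rough check, the basic scores can be recomputed by hand from the four counts in the confusion matrix above. This is only a sketch: the counts are copied from that output, and the variable names are introduced here for illustration.

```python
import numpy as np

# Counts copied from the confusion matrix above (rows: predicted, columns: real)
cm = np.array([[47, 18],
               [2, 10]])
tn, fn = cm[0]   # predicted 0: real 0 (true negative), real 1 (false negative)
fp, tp = cm[1]   # predicted 1: real 0 (false positive), real 1 (true positive)

accuracy = (tp + tn) / cm.sum()   # fraction of all samples predicted correctly
precision = tp / (tp + fp)        # fraction of predicted positives that are real positives
recall = tp / (tp + fn)           # fraction of real positives that were found
print(accuracy, precision, recall)
```

The accuracy agrees with the test score printed earlier. Keep in mind that scikit-learn documents the call as `confusion_matrix(y_true, y_pred)`, with true labels on the rows; the call above passes the predictions first, which is why the rows here are the predictions.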

ROC curve / PR curve

Each sample is assigned to a class according to the "strength of confidence" (predicted probability) for each class. By sorting the samples by this confidence, you can draw ROC curves and PR curves and evaluate the performance of the prediction model.

#Calculate the strength of confidence in the forecast results
y_proba = classifier.predict_proba(X_test)
y_proba
array([[0.21417054, 0.78582946],
       [0.46404957, 0.53595043],
       [0.70401466, 0.29598534],
       [0.75314361, 0.24685639],
       [0.76452966, 0.23547034],
       [0.33685542, 0.66314458],
       [0.76393323, 0.23606677],
       [0.82487552, 0.17512448],
       [0.87720401, 0.12279599],
       [0.83530283, 0.16469717],
       [0.64980016, 0.35019984],
       [0.78574888, 0.21425112],
       [0.51054138, 0.48945862],
       [0.24870259, 0.75129741],
       [0.91082684, 0.08917316],
       [0.86200773, 0.13799227],
       [0.71562431, 0.28437569],
       [0.62886446, 0.37113554],
       [0.63181921, 0.36818079],
       [0.77975231, 0.22024769],
       [0.65396517, 0.34603483],
       [0.81535938, 0.18464062],
       [0.54607196, 0.45392804],
       [0.79688063, 0.20311937],
       [0.80333846, 0.19666154],
       [0.728435  , 0.271565  ],
       [0.36817034, 0.63182966],
       [0.54025915, 0.45974085],
       [0.6614052 , 0.3385948 ],
       [0.74309548, 0.25690452],
       [0.92572332, 0.07427668],
       [0.80406998, 0.19593002],
       [0.61165474, 0.38834526],
       [0.43564389, 0.56435611],
       [0.42922327, 0.57077673],
       [0.61369072, 0.38630928],
       [0.68195508, 0.31804492],
       [0.86971152, 0.13028848],
       [0.81006182, 0.18993818],
       [0.86324924, 0.13675076],
       [0.82269894, 0.17730106],
       [0.48717372, 0.51282628],
       [0.72772261, 0.27227739],
       [0.81581007, 0.18418993],
       [0.54651378, 0.45348622],
       [0.65486361, 0.34513639],
       [0.69695761, 0.30304239],
       [0.50397912, 0.49602088],
       [0.70579261, 0.29420739],
       [0.56812519, 0.43187481],
       [0.28702944, 0.71297056],
       [0.78684682, 0.21315318],
       [0.77913962, 0.22086038],
       [0.20665217, 0.79334783],
       [0.64020202, 0.35979798],
       [0.54394942, 0.45605058],
       [0.74972094, 0.25027906],
       [0.89307226, 0.10692774],
       [0.63129007, 0.36870993],
       [0.775181  , 0.224819  ],
       [0.88651222, 0.11348778],
       [0.83087546, 0.16912454],
       [0.52015754, 0.47984246],
       [0.17895175, 0.82104825],
       [0.68620306, 0.31379694],
       [0.6503939 , 0.3496061 ],
       [0.53702941, 0.46297059],
       [0.74395419, 0.25604581],
       [0.79430285, 0.20569715],
       [0.70717315, 0.29282685],
       [0.74036824, 0.25963176],
       [0.35031104, 0.64968896],
       [0.59128595, 0.40871405],
       [0.62945511, 0.37054489],
       [0.85812094, 0.14187906],
       [0.95492842, 0.04507158],
       [0.82726693, 0.17273307]])
#Import a library to illustrate diagrams and graphs.
import matplotlib.pyplot as plt
%matplotlib inline

The following methods compute the ROC curve and the area under it (the AUC score).

from sklearn.metrics import roc_curve
from sklearn.metrics import auc

#Give an AUC score
fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
roc_auc = auc(fpr, tpr)
print("AUC score: %f" % roc_auc)

#Draw a ROC curve
plt.figure(figsize=(4,4))
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve: AUC=%0.2f' % roc_auc)
plt.legend(loc="lower right")
plt.show()
AUC score: 0.756560

[Figure: ROC curve]
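One way to read the AUC score: it is the probability that a randomly chosen positive sample receives a higher confidence than a randomly chosen negative one. The sketch below checks this on four made-up samples (the labels and scores are assumptions, and ties, which would count as one half, do not occur here).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

y_true_demo = np.array([0, 0, 1, 1])            # hypothetical true labels
score_demo = np.array([0.1, 0.4, 0.35, 0.8])    # hypothetical confidence for class 1

# Area under the ROC curve, computed the same way as in the cell above
fpr_demo, tpr_demo, _ = roc_curve(y_true_demo, score_demo)
auc_from_curve = auc(fpr_demo, tpr_demo)

# Fraction of (positive, negative) pairs in which the positive is ranked higher
pos = score_demo[y_true_demo == 1]
neg = score_demo[y_true_demo == 0]
auc_from_ranking = np.mean([p > n for p in pos for n in neg])

print(auc_from_curve, auc_from_ranking)
```

Three of the four positive/negative pairs are ordered correctly, so both computations give 0.75.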

Similarly, here is the method for drawing a PR curve.

from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba[:, 1])
area = auc(recall, precision)
print ("AUPR score: %0.2f" % area)

#Draw a PR curve
plt.figure(figsize=(4,4))
plt.plot(recall, precision, label='Precision-Recall curve')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: AUPR=%0.2f' % area)
plt.legend(loc="lower left")
plt.show()
AUPR score: 0.68

[Figure: Precision-Recall curve]

Parameter tuning by grid search

Machine learning methods have many parameters, and the default values do not necessarily give good predictions. One way to find good parameters is "grid search". GridSearchCV further divides the training data (here into 3 folds for cross-validation, as set by cv=3), tries all combinations of the parameter candidates, and searches for the parameters that perform best on average.

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Parameters for grid search (Logistic Regression parameters)
parameters = [
    {'solver': ['liblinear', 'saga'], 'penalty':['l1', 'l2'], 'C': [0.1, 1, 10, 100]},
    {'solver': ['newton-cg', 'sag', 'lbfgs' ], 'penalty':['l2'], 'C': [0.1, 1, 10, 100]},
]

#Grid search execution
classifier = GridSearchCV(LogisticRegression(), parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Classifier with the best parameters
Accuracy score (train):  0.784370477568741
Accuracy score (test):  0.6883116883116883
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l1',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
CPU times: user 192 ms, sys: 25.4 ms, total: 218 ms
Wall time: 2.52 s
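Besides `best_estimator_`, a fitted GridSearchCV also keeps the cross-validation score of every candidate. The following is a small sketch on synthetic data (the dataset from `make_classification` and the one-parameter grid are assumptions chosen to keep the example short).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the diabetes dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

grid = {"C": [0.1, 1, 10]}   # three candidates for the regularization strength
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X_demo, y_demo)

# One row per candidate: mean and standard deviation of the 3 validation scores
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]]
print(results)
print(search.best_params_, search.best_score_)
```

Looking at the whole table, rather than only the winner, shows whether the best candidate is clearly better or whether several candidates are practically tied.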

Record and compare evaluation indicators

From now on, I would like to compare the performance of various machine learning methods. Therefore, let's prepare a variable for recording the evaluation index.

scores = []

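There are various indicators of the performance of a classification model. The main ones, all of which are computed by the function below, are accuracy, precision, recall, F1 score, the area under the ROC curve (ROC AUC), the area under the PR curve (AUPR), and the Matthews correlation coefficient (MCC). Let's check the meaning of each.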
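To get a feel for two of the less familiar indicators, the sketch below computes the F1 score and the Matthews correlation coefficient (MCC) by hand and compares them with scikit-learn. The four counts are the ones from the confusion matrix shown earlier; the label arrays are rebuilt from those counts purely for illustration.

```python
import numpy as np
from sklearn import metrics

# Counts taken from the confusion matrix shown earlier
tp, tn, fp, fn = 10, 47, 2, 18

# Rebuild label arrays that contain exactly these counts
y_true_demo = np.array([1] * tp + [0] * tn + [0] * fp + [1] * fn)
y_pred_demo = np.array([1] * tp + [0] * tn + [1] * fp + [0] * fn)

# F1: harmonic mean of precision and recall
f1_manual = 2 * tp / (2 * tp + fp + fn)
# MCC: correlation between true and predicted labels, from -1 to 1
mcc_manual = (tp * tn - fp * fn) / np.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(f1_manual, metrics.f1_score(y_true_demo, y_pred_demo))
print(mcc_manual, metrics.matthews_corrcoef(y_true_demo, y_pred_demo))
```

Here accuracy is about 0.74 while F1 is only 0.5, because many real positives were missed; this is why several indicators are recorded side by side.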

I created a function that calculates these evaluation indexes together and stores them in a variable for recording, as follows. It repeats training and evaluation on different random data splits and records each run's scores and training time, so that the mean performance and its standard deviation can be computed afterwards.

import timeit
from sklearn import metrics
def record_classification_scores(classifier_name, classifier, iter=5):
    records = []
    for run_id in range(iter):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4) 
        print('Run ', run_id + 1)
        seconds = timeit.timeit(lambda: classifier.fit(X_train, y_train), number=1)
        print('    Learning Time (s):', seconds)
        y_pred = classifier.predict(X_test)
        y_proba = classifier.predict_proba(X_test)

        accuracy_score = metrics.accuracy_score(y_test, y_pred)
        precision_score = metrics.precision_score(y_test, y_pred)
        recall_score = metrics.recall_score(y_test, y_pred)
        f1_score = metrics.f1_score(y_test, y_pred)

        fpr, tpr, thresholds = roc_curve(y_test, y_proba[:, 1])
        roc_auc = auc(fpr, tpr)

        pre, rec, thresholds = precision_recall_curve(y_test, y_proba[:, 1])
        aupr = auc(rec, pre)
        
        mcc = metrics.matthews_corrcoef(y_test, y_pred)

        records.append([classifier_name, accuracy_score, precision_score, recall_score, 
                        f1_score, roc_auc, aupr, mcc, seconds])
    return records
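scikit-learn can also run this kind of repeated random splitting for you. The sketch below is an alternative to the hand-written loop, not a replacement used elsewhere in this article; the synthetic data and the choice of three scorers are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Five random 6:4 splits that preserve the class ratio
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.4, random_state=0)
cv_result = cross_validate(LogisticRegression(max_iter=1000), X_demo, y_demo,
                           cv=splitter, scoring=["accuracy", "f1", "roc_auc"])

# Mean and standard deviation over the five runs, plus the fitting time
for key in ["test_accuracy", "test_f1", "test_roc_auc", "fit_time"]:
    print(key, cv_result[key].mean(), cv_result[key].std())
```

The hand-written function remains useful when you need indicators or bookkeeping that the built-in scorers do not cover.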

Now, let's learn using the "classifier with the best parameters" created earlier and record the performance index.

%%time
scores += record_classification_scores('LR', classifier.best_estimator_)
Run  1
    Learning Time (s): 0.004809510999990607
Run  2
    Learning Time (s): 0.004076423000000773
Run  3
    Learning Time (s): 0.004598837999992611
Run  4
    Learning Time (s): 0.004291107000000238
Run  5
    Learning Time (s): 0.003665049000005638
CPU times: user 65.8 ms, sys: 3.33 ms, total: 69.1 ms
Wall time: 67.6 ms

The average performance and its standard deviation are as follows.

df_scores = pd.DataFrame(scores, columns = ['Classifier', 'Accuracy', 'Precision', 'Recall', 
                                            'F1 score', 'ROC AUC', 'AUPR', 'MCC', 'Time'])
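#Skip the first column (classifier name) and the last column (time) so that only numeric scores are averaged
df_scores_mean = df_scores.iloc[:, 1:-1].mean()
df_scores_errors = df_scores.iloc[:, 1:-1].std()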
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors)
<matplotlib.axes._subplots.AxesSubplot at 0x7fd15943ca90>

[Figure: bar chart of the mean of each score, with standard deviations as error bars]

Gradient boosting

Gradient Boosting is a technique that has been gaining attention lately. I won't go into details here. Check the scikit-learn documentation for GradientBoostingClassifier for the required parameters.

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Gradient boosting
from sklearn.ensemble import GradientBoostingClassifier

#Parameters for grid search
parameters = [{
    #Note: newer scikit-learn versions (1.1 and later) call the 'deviance' loss 'log_loss'
    'loss': ['deviance', 'exponential'],
    'learning_rate':[0.1,0.2],
    'n_estimators':[20,100,200],
    'max_depth':[3,5,7,9]
}]

#Grid search execution
classifier = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Best parameters
Accuracy score (train):  1.0
Accuracy score (test):  0.7142857142857143
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.2, loss='deviance', max_depth=7,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)
CPU times: user 947 ms, sys: 25.1 ms, total: 973 ms
Wall time: 25.2 s

Learn and record performance using the resulting "classifier with the best parameters".

%%time
scores += record_classification_scores('GB', classifier.best_estimator_)
Run  1
    Learning Time (s): 0.3291641410000068
Run  2
    Learning Time (s): 0.31575948799999765
Run  3
    Learning Time (s): 0.3144692120000059
Run  4
    Learning Time (s): 0.3252903609999862
Run  5
    Learning Time (s): 0.3103595519999942
CPU times: user 1.64 s, sys: 7.28 ms, total: 1.65 s
Wall time: 1.65 s
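For readers who want an intuition for what gradient boosting does, here is a hand-rolled sketch of the core idea for the simplest case, regression with squared error: each new small tree is fitted to the errors left by the trees so far. It is a simplification under stated assumptions (made-up one-dimensional data, fixed tree depth, 50 rounds) and is not how scikit-learn's own classes are implemented in detail.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))                       # hypothetical 1-D inputs
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)  # noisy target

learning_rate = 0.1
prediction = np.full(y_demo.shape, y_demo.mean())   # start from a constant model
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(50):
    residual = y_demo - prediction                  # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)                      # small tree fitted to the residuals
    prediction += learning_rate * tree.predict(X_demo)   # take a small step toward the fix
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print("MSE at start:", mse_history[0])
print("MSE after 50 rounds:", mse_history[-1])
```

The training error keeps shrinking as rounds are added, which is also why a training accuracy of 1.0, as seen in the grid search output above, says little about performance on unseen data.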

Performance comparison result display of multiple methods

I made a function to visualize the performance comparison of multiple classification methods.

def visualize_classification_result(scores):
    df_scores = pd.DataFrame(scores, columns = ['Classifier', 'Accuracy', 'Precision', 'Recall', 
                                            'F1 score', 'ROC AUC', 'AUPR', 'MCC', 'Time'])
    df_scores_mean = df_scores.iloc[:, :-1].groupby('Classifier').mean()
    df_scores_errors = df_scores.iloc[:, :-1].groupby('Classifier').std()
    df_scores_mean.T.plot(kind='bar', grid=True, yerr=df_scores_errors.T, 
                          figsize=(12, 2), legend=False)
    plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
    df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors, 
                        figsize=(12, 2), legend=False)
    plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))

    df_time_mean = df_scores.iloc[:, [0, -1]].groupby('Classifier').mean()
    df_time_errors = df_scores.iloc[:, [0, -1]].groupby('Classifier').std()
    df_time_mean.plot(kind='bar', grid=True, yerr=df_time_errors, 
                        figsize=(12, 2), legend=False)
    plt.yscale('log')
visualize_classification_result(scores)

[Figures: mean score per metric, mean score per classifier, and mean training time per classifier (log scale)]

Multilayer perceptron

Multi-Layer Perceptron is the simplest model of deep learning and is also implemented in scikit-learn.

Check the scikit-learn documentation for MLPClassifier to see what methods and parameters are available.

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Multilayer perceptron
from sklearn.neural_network import MLPClassifier
#Parameters for grid search
parameters = [{'hidden_layer_sizes': [8, (8, 8), (8, 8, 8)],
               'solver': ['sgd', 'adam', 'lbfgs'],
               'activation': ['logistic', 'tanh', 'relu'],
               'learning_rate_init': [0.1, 0.01, 0.001]}]
#Grid search execution
classifier = GridSearchCV(MLPClassifier(max_iter=10000, early_stopping=True), 
                          parameters, cv=3, n_jobs=-1)
classifier.fit(X_train, y_train)
print("Accuracy score (train): ", classifier.score(X_train, y_train))
print("Accuracy score (test): ", classifier.score(X_test, y_test))
print(classifier.best_estimator_) #Classifier with the best parameters
Accuracy score (train):  0.7930535455861071
Accuracy score (test):  0.7272727272727273
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=True, epsilon=1e-08,
              hidden_layer_sizes=8, learning_rate='constant',
              learning_rate_init=0.1, max_iter=10000, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
CPU times: user 1.15 s, sys: 39.8 ms, total: 1.19 s
Wall time: 2min 29s

Learn and record performance using the resulting "classifier with the best parameters".

%%time
scores += record_classification_scores('MLP', classifier.best_estimator_)
Run  1
    Learning Time (s): 0.4756240830000138
Run  2
    Learning Time (s): 0.34581674499997916
Run  3
    Learning Time (s): 0.15651393699999971
Run  4
    Learning Time (s): 0.14490434999999025
Run  5
    Learning Time (s): 0.005184319999955278
CPU times: user 1.16 s, sys: 3.54 ms, total: 1.17 s
Wall time: 1.17 s

Compare performance.

visualize_classification_result(scores)

[Figures: mean score per metric, mean score per classifier, and mean training time per classifier (log scale)]
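To see what a trained multilayer perceptron actually computes, the sketch below reproduces the prediction of a one-hidden-layer MLPClassifier with plain NumPy. The synthetic data and the network size are assumptions; the sketch relies on the binary case, where the output layer is a single logistic unit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data for a two-class problem
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X_demo, y_demo)

W1, W2 = mlp.coefs_        # weights: input -> hidden, hidden -> output
b1, b2 = mlp.intercepts_   # biases of the hidden and output layers

hidden = np.maximum(0, X_demo @ W1 + b1)          # ReLU hidden layer
output = 1 / (1 + np.exp(-(hidden @ W2 + b2)))    # logistic output unit
manual_proba = output.ravel()                      # probability of class 1

print(np.allclose(manual_proba, mlp.predict_proba(X_demo)[:, 1]))
```

Larger values in `hidden_layer_sizes` simply add more columns to `W1`, and extra entries in the tuple add more weight matrices between input and output.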

Exercise 1

scikit-learn provides the breast_cancer dataset as training data for machine learning. Divide the dataset into explanatory variables and objective variables as follows, classify breast_cancer data with MLPClassifier while tuning parameters with GridSearchCV, and evaluate the performance.

# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
X = pd.DataFrame(breast_cancer.data)
y = pd.DataFrame(breast_cancer.target.ravel())

Exercise 2

scikit-learn provides a wine dataset as data for machine learning practice. Divide the dataset into explanatory variables and objective variables as follows, classify the wine data with MLPClassifier while tuning the parameters with GridSearchCV, and evaluate the performance.

# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html
from sklearn.datasets import load_wine
wine = load_wine()
X = pd.DataFrame(wine.data)
y = pd.DataFrame(wine.target)

However, since the wine dataset is a three-class classification rather than a two-class classification such as the breast_cancer dataset, the objective variable must be preprocessed as follows.

# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
import numpy as np
from sklearn.preprocessing import OneHotEncoder
#Note: in scikit-learn 1.2 and later this argument is named sparse_output instead of sparse
encoder = OneHotEncoder(categories="auto", sparse=False, dtype=np.float32)
y = pd.DataFrame(encoder.fit_transform(y))

Also, if the number of MLP nodes (number of neurons) is too small, the classification performance will be low, so increase the number of nodes (number of neurons) appropriately.
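As a quick illustration of what the one-hot preprocessing in this exercise produces, here is a NumPy-only sketch. The `labels` array is made up, and indexing an identity matrix is just one convenient way to build the same kind of encoding by hand.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])                 # hypothetical labels of a 3-class problem
one_hot = np.eye(3, dtype=np.float32)[labels]   # row i is all zeros except column labels[i]
print(one_hot)

# The original labels can be recovered from the position of the 1 in each row
print(one_hot.argmax(axis=1))
```

Each sample becomes a row with a single 1, so a three-class objective variable turns into three 0/1 columns.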

Regression problem

Now that we've solved the classification problem, let's solve the regression problem. As a subject, we will discuss the relationship between the composition and strength of concrete. First, get the concrete data.

The data source is Concrete Slump Test Data Set https://archive.ics.uci.edu/ml/datasets/Concrete+Slump+Test.

#Import a library that provides access to resources by URL.
import urllib.request 
#Specify resources on the web
url = 'https://raw.githubusercontent.com/maskot1977/ipython_notebook/master/toydata/slump_test.data'
#Download the resource from the specified URL and give it a name.
urllib.request.urlretrieve(url, 'slump_test.data') 
('slump_test.data', <http.client.HTTPMessage at 0x7ff02ed82518>)
import pandas as pd
df = pd.read_csv('slump_test.data', index_col=0)
df
No   Cement   Slag  Fly ash  Water    SP  Coarse Aggr.  Fine Aggr.  SLUMP(cm)  FLOW(cm)  Compressive Strength (28-day)(Mpa)
1     273.0   82.0    105.0  210.0   9.0         904.0       680.0       23.0      62.0                               34.99
2     163.0  149.0    191.0  180.0  12.0         843.0       746.0        0.0      20.0                               41.14
3     162.0  148.0    191.0  179.0  16.0         840.0       743.0        1.0      20.0                               41.81
...     ...    ...      ...    ...   ...           ...         ...        ...       ...                                 ...
101   258.8   88.0    239.6  175.3   7.6         938.9       646.0        0.0      20.0                               50.50
102   297.1   40.9    239.9  194.0   7.5         908.9       651.8       27.5      67.0                               49.17
103   348.7    0.1    223.1  208.5   9.6         786.2       758.1       29.0      78.0                               48.77

103 rows × 10 columns

Let's read the explanation of the source of the data. Here, the left 7 columns are regarded as explanatory variables, and the rightmost column is regarded as the objective variable.

X = df.iloc[:, :-3].apply(lambda x: (x-x.min())/(x.max() - x.min()), axis=0)
y = df.iloc[:, -1]

Divide into training data and test data

#Import method to split into training data and test data
from sklearn.model_selection import train_test_split 
#Randomly split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

Linear multiple regression

Let's try the simplest regression model, Multiple Linear Regression.

from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() #Linear multiple regression
regressor.fit(X_train, y_train) #Learning
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Performance evaluation of regression model

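In scikit-learn's regression models, the .score() method calculates the coefficient of determination (R2 value). For true values $ y_i $, predicted values $ \hat{y}_i $, and the mean of the true values $ \bar{y} $, the R2 value is

$ R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} $

It is calculated like this. Therefore, the smaller the prediction error, the closer the value is to 1 (its maximum), and a model that predicts worse than always returning the mean gives a negative value.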
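The coefficient of determination is 1 minus the ratio of the squared prediction error to the squared deviation of the true values from their mean. The sketch below computes it by hand on four made-up values and compares the result with scikit-learn's `r2_score`; the numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical true values
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)          # squared prediction errors
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)   # squared deviations from the mean
r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true_demo, y_pred_demo))

# Always predicting the mean gives R2 = 0; a worse model gives a negative value
mean_only = np.full_like(y_true_demo, y_true_demo.mean())
r2_mean_only = r2_score(y_true_demo, mean_only)
r2_bad = r2_score(y_true_demo, -y_pred_demo)
print(r2_mean_only, r2_bad)
```

This also explains why R2 on held-out data can drop below zero even though it looks like a squared quantity.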

Now, let's check the performance on the test set after learning with the training set.

regressor.score(X_train, y_train), regressor.score(X_test, y_test)
(0.9224703183565424, 0.8177828980042425)

Prediction of data not used for training

Prediction can be made by substituting data not used for training into the obtained regression model.

A plot of the objective variable against its predicted value is commonly referred to as the y-y plot (although it does not seem to be the official name). The closer the points lie to the diagonal, the better the regression model.

import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test) #Substitute data not used for learning

print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.8177828980042425
RMSE= 3.0139396734524633
MAE= 2.4622169354183447

[Figure: y-y plot of real versus predicted values (linear multiple regression)]

Performance evaluation of regression model

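scikit-learn provides methods to calculate the following indicators for performance evaluation of regression models: the coefficient of determination R2 (metrics.r2_score), the mean squared error (metrics.mean_squared_error, whose square root is the RMSE), and the mean absolute error MAE (metrics.mean_absolute_error).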
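For reference, the two error-based indicators printed in the cells of this section, RMSE and MAE, are simple enough to compute by hand. The sketch below does so on four made-up values (an assumption for illustration) and compares the results with scikit-learn.

```python
import numpy as np
from sklearn import metrics

y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical true values
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])    # hypothetical predictions

errors = y_true_demo - y_pred_demo
mae_manual = np.mean(np.abs(errors))            # mean absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))     # root mean squared error

print(mae_manual, metrics.mean_absolute_error(y_true_demo, y_pred_demo))
print(rmse_manual, np.sqrt(metrics.mean_squared_error(y_true_demo, y_pred_demo)))
```

Both are expressed in the unit of the objective variable. RMSE penalizes large individual errors more heavily than MAE, so it is the larger of the two here.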

PLS (Partial Least Squares)

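Partial least squares regression (PLS) is a type of linear regression method that uses variables (latent variables) obtained by linearly transforming the explanatory variables so that they are uncorrelated with each other. Compared to ordinary multiple linear regression, it has the advantage of being less affected by correlation among the explanatory variables (multicollinearity), and it can still be fitted when there are more explanatory variables than samples.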

#Import method to split into training data and test data
from sklearn.model_selection import train_test_split 
#Randomly split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.cross_decomposition import PLSRegression #Method to perform PLS regression
regressor = PLSRegression() #Regressor generation
regressor.fit(X_train, y_train) #Learning
PLSRegression(copy=True, max_iter=500, n_components=2, scale=True, tol=1e-06)
import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7649827980366806
RMSE= 3.523505754210429
MAE= 2.8524226793359198

[Figure: y-y plot of real versus predicted values (PLS with default parameters)]

Parameter tuning by grid search

Let's tune the parameters and make a better model.

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Parameters for grid search
parameters = [
    {'n_components': [2, 3, 4, 5, 6], 'scale':[True, False], 'max_iter': [1000]},
]

#Grid search execution
regressor = GridSearchCV(PLSRegression(), parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train):  0.9225247360485105
R2 (test):  0.8162623239997147
PLSRegression(copy=True, max_iter=1000, n_components=3, scale=False, tol=1e-06)
CPU times: user 114 ms, sys: 30.4 ms, total: 144 ms
Wall time: 2.18 s

Let's make a prediction using the obtained regression model. Has the prediction performance improved?

import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.8162623239997147
RMSE= 3.115474987155132
MAE= 2.3005236909984426

[Figure: y-y plot of real versus predicted values (PLS after grid search)]

Record and compare evaluation indicators

From now on, I would like to compare the performance of various regression models. Therefore, let's prepare a variable for recording the evaluation index.

scores = []

I made the following function to calculate the evaluation indexes of a regression model and store them in the variable for recording. It repeats training and evaluation on different random data splits and records each run's scores and training time, so that the mean performance and its standard deviation can be computed afterwards.

import timeit
import numpy as np
from sklearn import metrics

def record_regression_scores(regressor_name, regressor, iter=5):
    records = []
    run_id = 0
    successful = 0
    max_trial = 100
    while successful < iter:
        run_id += 1
        if run_id >= max_trial:
            break

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4) 
        print('Run ', run_id)
        seconds = timeit.timeit(lambda: regressor.fit(X_train, y_train), number=1)
        print('    Learning Time (s):', seconds)
        y_pred = regressor.predict(X_test)
        r2_score = metrics.r2_score(y_test, y_pred)
        rmse_score = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
        mae_score = metrics.mean_absolute_error(y_test, y_pred)

        if r2_score < 0:
            print("\t\t encountered negative r2_score")
            continue
        else:
            successful += 1

        records.append([regressor_name, r2_score, rmse_score, mae_score, seconds])
    return records

Now, let's learn using the "regression model with the best parameters" created earlier and record the performance evaluation index.

%%time
scores += record_regression_scores("PLS", regressor)
Run  1
    Learning Time (s): 2.0181297670001186
Run  2
    Learning Time (s): 1.9526900320001914
Run  3
    Learning Time (s): 1.9921050099997046
Run  4
    Learning Time (s): 2.0573012720001316
Run  5
    Learning Time (s): 1.979584856999736
CPU times: user 552 ms, sys: 101 ms, total: 653 ms
Wall time: 10 s

The average performance and its standard deviation are as follows.

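df_scores = pd.DataFrame(scores, columns = ['Regressor', 'R2', 'RMSE', 'MAE', 'Time'])
#Skip the first column (regressor name) and the last column (time) so that only numeric scores are averaged
df_scores_mean = df_scores.iloc[:, 1:-1].mean()
df_scores_errors = df_scores.iloc[:, 1:-1].std()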
df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors)
<matplotlib.axes._subplots.AxesSubplot at 0x7ff0177f0550>

[Figure: bar chart of the mean of each score, with standard deviations as error bars]

Gradient boosting

We performed classification with gradient boosting earlier, but gradient boosting can also be used for regression.

#Import method to split into training data and test data
from sklearn.model_selection import train_test_split 
#Randomly split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.ensemble import GradientBoostingRegressor
regressor = GradientBoostingRegressor() #Gradient boosting
regressor.fit(X_train, y_train) #Learning
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
regressor.score(X_train, y_train), regressor.score(X_test, y_test)
(0.9996743754326906, 0.7386973055974495)
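The training score (about 0.9997) is much higher than the test score (about 0.74), which suggests that the default model fits the training data more closely than it generalizes. One way to look at this is `staged_predict`, which yields the predictions after each boosting stage. The following is only a minimal sketch: the data from `make_regression` and the names `X_demo`, `y_demo`, and `gbr` are assumptions for illustration, not the dataset or model used in this article.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the dataset used in the article)
X_demo, y_demo = make_regression(n_samples=300, n_features=10,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.4, random_state=0)

gbr = GradientBoostingRegressor(n_estimators=100, random_state=0)
gbr.fit(X_tr, y_tr)

# R2 on the training and test data after each boosting stage
train_r2 = [r2_score(y_tr, p) for p in gbr.staged_predict(X_tr)]
test_r2 = [r2_score(y_te, p) for p in gbr.staged_predict(X_te)]

# Stage (number of trees) with the highest R2 on the test split
best_stage = int(np.argmax(test_r2)) + 1
print("Number of trees with the best test R2:", best_stage)
print("Train R2 at the final stage:", train_r2[-1])
print("Test R2 at the final stage:", test_r2[-1])
```

If the test R2 stops improving long before the last stage while the training R2 keeps rising, a smaller `n_estimators` or `max_depth` may be worth trying.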
import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)

print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7386973055974495
RMSE= 4.012901982806575
MAE= 3.0486670616108

output_34_1.png
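To make the three metrics printed above concrete, the sketch below computes R2, RMSE, and MAE by hand with NumPy and compares them with the values from `sklearn.metrics`. The arrays `y_true` and `y_hat` are small hypothetical values chosen for illustration, not results from this article.

```python
import numpy as np
import sklearn.metrics as metrics

# Small hypothetical example values (not taken from the article's data)
y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_hat = np.array([2.5, 5.5, 7.0, 11.0])

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1.0 - ss_res / ss_tot                # coefficient of determination
rmse = np.sqrt(np.mean(residuals ** 2))   # root mean squared error
mae = np.mean(np.abs(residuals))          # mean absolute error

print("R2=", r2, "sklearn:", metrics.r2_score(y_true, y_hat))
print("RMSE=", rmse, "sklearn:", np.sqrt(metrics.mean_squared_error(y_true, y_hat)))
print("MAE=", mae, "sklearn:", metrics.mean_absolute_error(y_true, y_hat))
```

Because R2 is defined relative to always predicting the mean of y, it becomes negative when the predictions are worse than that baseline.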

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Gradient boosting
from sklearn.ensemble import GradientBoostingRegressor

#Parameters for grid search
parameters = [{
    'learning_rate':[0.1,0.2],
    'n_estimators':[20,100],
    'max_depth':[3,5]
}]

#Grid search execution
regressor = GridSearchCV(GradientBoostingRegressor(), parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train):  0.9996743754326906
R2 (test):  0.7195388936429337
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.1, loss='ls', max_depth=3,
                          max_features=None, max_leaf_nodes=None,
                          min_impurity_decrease=0.0, min_impurity_split=None,
                          min_samples_leaf=1, min_samples_split=2,
                          min_weight_fraction_leaf=0.0, n_estimators=100,
                          n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
CPU times: user 130 ms, sys: 15.4 ms, total: 145 ms
Wall time: 2.26 s
import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.7195388936429337
RMSE= 4.15741070004397
MAE= 3.0704656592653317

output_36_1.png
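`best_estimator_` shows only the winning combination. To see how the other combinations in the grid performed, the `cv_results_` attribute of `GridSearchCV` can be loaded into a data frame. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `search` are names assumed for illustration, and the scores will differ from those of the article's dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the dataset used in the article)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=5.0, random_state=0)

# Same grid as in the article
parameters = [{
    'learning_rate': [0.1, 0.2],
    'n_estimators': [20, 100],
    'max_depth': [3, 5]
}]

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      parameters, cv=3)
search.fit(X_demo, y_demo)

# One row per parameter combination, sorted from best to worst
results = pd.DataFrame(search.cv_results_)
columns = ['param_learning_rate', 'param_n_estimators', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = results[columns].sort_values('rank_test_score')
print(results)
print("Best parameters:", search.best_params_)
```

Here `mean_test_score` is the mean R2 over the 3 cross-validation folds, so the table shows whether the best combination clearly beats the others or only by a small margin.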

%%time
scores += record_regression_scores("GB", regressor.best_estimator_)
Run  1
    Learning Time (s): 0.027196151000225655
Run  2
    Learning Time (s): 0.01961764999987281
Run  3
    Learning Time (s): 0.01894888400011041
Run  4
    Learning Time (s): 0.019140249999964
Run  5
    Learning Time (s): 0.020592135999777383
CPU times: user 123 ms, sys: 2.75 ms, total: 126 ms
Wall time: 128 ms

Displaying the performance comparison of multiple methods

I made a function to visualize the performance comparison of multiple regression methods.

def visualize_regression_result(scores):
    df_scores = pd.DataFrame(scores, columns =['Regressor', 'R2', 'RMSE', 'MAE', 'Time'])
    #Mean and standard deviation of each metric per method (Time excluded)
    df_scores_mean = df_scores.iloc[:, :-1].groupby('Regressor').mean()
    df_scores_errors = df_scores.iloc[:, :-1].groupby('Regressor').std()
    #Plot 1: metrics on the x-axis, one bar per method
    df_scores_mean.T.plot(kind='bar', grid=True, yerr=df_scores_errors.T, 
                          figsize=(12, 2), legend=False)
    #plt.yscale('log')

    plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
    #Plot 2: methods on the x-axis, one bar per metric
    df_scores_mean.plot(kind='bar', grid=True, yerr=df_scores_errors, 
                        figsize=(12, 2), legend=False)
    #plt.yscale('log')

    plt.legend(loc = 'right', bbox_to_anchor = (0.7, 0.5, 0.5, 0.0))
    #Plot 3: mean training time per method on a log scale
    df_time_mean = df_scores.iloc[:, [0, -1]].groupby('Regressor').mean()
    df_time_errors = df_scores.iloc[:, [0, -1]].groupby('Regressor').std()
    df_time_mean.plot(kind='bar', grid=True, yerr=df_time_errors, 
                        figsize=(12, 2), legend=False)
    plt.yscale('log')
visualize_regression_result(scores)

output_40_0.png

output_40_1.png

output_40_2.png
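The bar charts are convenient for a quick look, but the same comparison can also be read as a table. The sketch below groups records by method and computes the mean and standard deviation in one step. The `demo_scores` values are hypothetical numbers in the same layout as `scores`, not measurements from this article.

```python
import pandas as pd

# Hypothetical records in the same layout as `scores`:
# [method name, R2, RMSE, MAE, training time in seconds]
demo_scores = [
    ["PLS", 0.70, 4.9, 3.4, 2.00],
    ["PLS", 0.72, 4.7, 3.3, 1.95],
    ["GB", 0.85, 3.4, 2.4, 0.02],
    ["GB", 0.87, 3.2, 2.3, 0.03],
]

df_demo = pd.DataFrame(demo_scores,
                       columns=['Regressor', 'R2', 'RMSE', 'MAE', 'Time'])

# Mean and standard deviation of every numeric column, per method
summary = df_demo.groupby('Regressor').agg(['mean', 'std'])
print(summary.round(3))
```

The result has one row per method and a (metric, statistic) column pair for each value, for example `summary.loc['GB', ('R2', 'mean')]`.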

Multilayer perceptron

We used a multilayer perceptron for classification earlier, but it can also be used for regression.

#Import the function that splits data into training data and test data
from sklearn.model_selection import train_test_split 
#Randomly split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
from sklearn.neural_network import MLPRegressor
regressor = MLPRegressor() #Create the regressor
regressor.fit(X_train, y_train) #Learning
/usr/local/lib/python3.6/dist-packages/sklearn/neural_network/multilayer_perceptron.py:566: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(100,), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)
import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= -4.0467411800805415
RMSE= 19.21649631146132
MAE= 17.449687389239205

output_44_1.png
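The negative R2 above, together with the ConvergenceWarning, indicates that the default MLPRegressor did not train well here. Multilayer perceptrons are generally sensitive to the scale of the input features, so one thing that may help (an assumption here, not something verified on this article's data) is to standardize the features in a pipeline before the regressor. The data and the names `X_demo`, `y_demo`, and `scaled_mlp` below are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the dataset used in the article)
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=5.0, random_state=0)
# Put the features on very different scales on purpose
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.4, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data
scaled_mlp = make_pipeline(StandardScaler(),
                           MLPRegressor(max_iter=2000, random_state=0))
scaled_mlp.fit(X_tr, y_tr)

test_r2 = scaled_mlp.score(X_te, y_te)
print("R2 (test, standardized features):", test_r2)
```

A pipeline like this can also be passed to GridSearchCV in place of the bare regressor; the parameter names then take a step prefix such as `mlpregressor__hidden_layer_sizes`.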

%%time
#Grid search to find the best parameters
from sklearn.model_selection import GridSearchCV

#Parameters for grid search
parameters = [{
    'hidden_layer_sizes': [10, (10, 10)],
    'solver': ['sgd', 'adam', 'lbfgs'],
    #'solver': ['lbfgs'],
    #'activation': ['logistic', 'tanh', 'relu']
    'activation': ['relu']
}]

#Grid search execution
regressor = GridSearchCV(MLPRegressor(max_iter=10000, early_stopping=True), 
                         parameters, cv=3, n_jobs=-1)
regressor.fit(X_train, y_train)
print("R2 (train): ", regressor.score(X_train, y_train))
print("R2 (test): ", regressor.score(X_test, y_test))
print(regressor.best_estimator_) #Regression model with best parameters
R2 (train):  0.9742637037080083
R2 (test):  0.9562295568855493
MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=True, epsilon=1e-08,
             hidden_layer_sizes=(10, 10), learning_rate='constant',
             learning_rate_init=0.001, max_iter=10000, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)
CPU times: user 222 ms, sys: 17.5 ms, total: 239 ms
Wall time: 7.86 s
import numpy as np
import sklearn.metrics as metrics

y_pred = regressor.predict(X_test)
print("R2=", metrics.r2_score(y_test, y_pred))
print("RMSE=", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print("MAE=", metrics.mean_absolute_error(y_test, y_pred))

%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize=(4,4))
plt.scatter(y_test, y_pred, alpha=0.2, c="blue")
plt.plot([y.min(), y.max()], [y.min(), y.max()], c="black")
plt.grid()
plt.xlabel("Real Y")
plt.ylabel("Predicted Y")
plt.show()
R2= 0.9562295568855493
RMSE= 1.789613149534058
MAE= 1.3873465536350154

output_46_1.png

%%time
scores += record_regression_scores("MLP", regressor.best_estimator_)
Run  1
    Learning Time (s): 0.06779548599979535
Run  2
    Learning Time (s): 0.1298420270004499
Run  3
    Learning Time (s): 0.1824235089998183
Run  4
    Learning Time (s): 0.43246253200004503
Run  5
    Learning Time (s): 0.22879209799975797
CPU times: user 1.06 s, sys: 3.13 ms, total: 1.06 s
Wall time: 1.07 s
visualize_regression_result(scores)

output_48_0.png

output_48_1.png

output_48_2.png

Exercise 3

scikit-learn provides the diabetes dataset as sample data for machine learning. Split the dataset into explanatory variables and an objective variable as follows, then perform regression on the diabetes data with regressors such as MLPRegressor and GradientBoostingRegressor, tuning the parameters with GridSearchCV, and compare their performance.

# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
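As a possible starting point for the exercise, the sketch below runs a small grid search with GradientBoostingRegressor on the diabetes data. The grid values, `random_state=0`, and the name `search` are arbitrary choices for illustration, not a recommended answer; the MLPRegressor part and the comparison are left to the reader.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split into training data and test data at a ratio of 6:4
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Small example grid (arbitrary values chosen for illustration)
parameters = [{
    'learning_rate': [0.05, 0.1],
    'n_estimators': [20, 100],
    'max_depth': [2, 3]
}]

search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      parameters, cv=3)
search.fit(X_train, y_train)

test_r2 = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("R2 (train):", search.score(X_train, y_train))
print("R2 (test):", test_r2)
```

The fitted `search.best_estimator_` can then be passed to `record_regression_scores` in the same way as the models above.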
