University of Tsukuba Machine Learning Course: Study sklearn while making the Python script part of the task (8) Make your own stochastic steepest descent method

Last time: University of Tsukuba Machine Learning Course: Study sklearn while making the Python script part of the task (7) Make your own steepest descent method https://github.com/legacyworld/sklearn-basic

Exercise 4.3 The steepest descent method and the stochastic steepest descent method

The explanation is in the 5th lecture video (1), around the 24 minute 30 second mark. Last time I only implemented the steepest descent method, so this time I implement the stochastic steepest descent method. The program itself does not change much. Mathematically it looks like this:

\lambda = \text{regularization parameter},
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_m \end{pmatrix},
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix},
X = \begin{pmatrix}
1&x_{11}&x_{12}&\cdots&x_{1m}\\
1&x_{21}&x_{22}&\cdots&x_{2m}\\
\vdots&\vdots&\vdots&&\vdots\\
1&x_{N1}&x_{N2}&\cdots&x_{Nm}
\end{pmatrix}\\ \\
\beta^{t+1} = \beta^{t}(1-2\lambda\eta) - \eta\frac{1}{N}x_i^T(x_i\beta^t-y_i)
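To see what one update looks like in code, here is a minimal NumPy sketch of a single stochastic step (the names Xi, yi, beta, eta, lam are illustrative only, not the actual script). Like the update line in the full script below, it applies the ridge shrinkage factor and then steps against the gradient computed from one sample; the explicit $\frac{1}{N}$ factor is not written out in the code.

import numpy as np

# Toy single SGD update with illustrative values; not the actual script
rng = np.random.default_rng(0)
Xi = rng.normal(size=4)             # one randomly chosen row of X (1-D, shape (4,))
yi = 1.0                            # the corresponding target value
beta = np.ones(4).reshape(-1, 1)    # current coefficients as a column vector, shape (4, 1)
eta, lam = 0.01, 1e-4               # learning rate and regularization parameter

residual = np.dot(Xi, beta) - yi    # prediction error for this single sample
# Shrink beta by the ridge term, then step against the single-sample gradient
beta = beta * (1 - 2 * lam * eta) - eta * Xi.reshape(-1, 1) * residual
print(beta.shape)                   # (4, 1)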

Until now, all of the data was used to compute the gradient; here the gradient is computed from a single randomly chosen sample $x_i, y_i$. One thing that tripped me up was NumPy's transpose behavior: for a one-dimensional array, .T returns the array unchanged, so you need .reshape(-1,1) to get a column vector. See here: https://note.nkmk.me/python-numpy-transpose/

a_1d = np.arange(3)
print(a_1d)
# [0 1 2]

print(a_1d.T)
# [0 1 2]

a_col = a_1d.reshape(-1, 1)
print(a_col)
# [[0]
#  [1]
#  [2]]
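As a concrete check of why this matters for the update step, here is a small example with illustrative values: using the raw one-dimensional sample row lets broadcasting silently produce a matrix, while the reshaped column keeps the update the same shape as beta.

import numpy as np

Xi = np.arange(3, dtype=float)        # one sample, 1-D, shape (3,)
beta = np.ones((3, 1))                # coefficients as a column vector, shape (3, 1)
residual = np.dot(Xi, beta) - 2.0     # shape (1,)

# With the raw 1-D row, broadcasting silently produces a (3, 3) matrix
print((beta - 0.01 * Xi * residual).shape)                  # (3, 3)
# With the reshaped column, the update keeps the same shape as beta
print((beta - 0.01 * Xi.reshape(-1, 1) * residual).shape)   # (3, 1)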

Click here for the source code.

python:Homework_4.3SGD.py


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_validate
import statsmodels.api as sm

class MyEstimator(BaseEstimator):
    def __init__(self,ep,eta,l):
        self.ep = ep
        self.eta = eta
        self.l = l
        self.loss = []
    # Implement fit()
    def fit(self, X, y):
        self.coef_ = self.stochastic_grad_desc(X,y)
        #fit returns self
        return self

    # Implement predict()
    def predict(self, X):
        return np.dot(X, self.coef_)

    def shuffle(self,X,y):
        r = np.random.permutation(len(y))
        return X[r],y[r]

    def stochastic_grad_desc(self,X,y):
        m = len(y)
        loss = []
        # Number of features (columns of X)
        dim = X.shape[1]
        #Initial value of beta
        beta = np.ones(dim).reshape(-1,1)
        eta = self.eta
        l = self.l
        X_shuffle, y_shuffle = self.shuffle(X,y)
        # Stop if there is no improvement T times in a row
        T = 100
        # Counter of consecutive non-improving updates
        not_improve = 0
        # Initial value for the minimum of the objective function
        min = 10 ** 9
        while True:
            for Xi,yi in zip(X_shuffle,y_shuffle):
                loss.append((1/(2*m))*np.sum(np.square(np.dot(X,beta)-y)))
                beta = beta*(1-2*l*eta) - eta*Xi.reshape(-1,1)*(np.dot(Xi,beta)-yi)
                if loss[len(loss)-1] < min:
                    min = loss[len(loss)-1]
                    min_beta = beta
                    not_improve = 0
                else:
                    # The minimum of the objective function was not updated
                    not_improve += 1
                    if not_improve >= T:
                        break
            # If all samples have been used but the minimum kept being updated within T tries, go around the shuffled data again
            if not_improve >= T:
                self.loss = loss
                break
        return min_beta

# Import the wine data (winequality-red.csv)
df= pd.read_csv('winequality-red.csv',sep=';')
# The target value quality is included, so create a dataframe with it dropped
df1 = df.drop(columns='quality')
y = df['quality'].values.reshape(-1,1)
X = df1.values
scaler = preprocessing.StandardScaler()
X_fit = scaler.fit_transform(X)
X_fit = sm.add_constant(X_fit) # Add a column of 1s as the first column (intercept)
epsilon = 10 ** (-7)
eta_list = [0.03,0.01,0.003]
loss = []
coef = []
for eta in eta_list:
    l = 10**(-5)
    test_min = 10**(9)
    while l <= 1/(2*eta):
        myest = MyEstimator(epsilon,eta,l)
        myest.fit(X_fit,y)
        scores = cross_validate(myest,X_fit,y,scoring="neg_mean_squared_error",cv=10)
        if abs(scores['test_score'].mean()) < test_min:
            test_min = abs(scores['test_score'].mean())
            loss = myest.loss
            l_min = l
            coef = myest.coef_
        l = l * 10**(0.5)
    plt.plot(loss,label=rf"$\eta$={eta}")
    print(f"eta = {eta} : iter = {len(loss)}, loss = {loss[-1]}, lambda = {l_min}, TestErr = {test_min}")
    # Output the coefficients: the intercept is the first element, so print the features from the second element and print the intercept last
    i = 1
    for column in df1.columns:
        print(column,coef[i][0])
        i+=1
    print('intercept',coef[0][0])
plt.legend()
plt.savefig("sgd.png ")

In the lecture explanation the stopping condition was written as "stop if there is no improvement 100 times in a row", so the implementation follows that. For the smaller values of $\eta$, this condition is not met even after using all 1599 samples, so the loop may go around a second time. The result looks like this (sgd.png): at $\eta = 0.03$ you can see the error increase slightly toward the end. The coefficients and other values obtained at the end are shown below.

eta = 0.03 : iter = 298, loss = 0.29072324272824085, lambda = 0.0031622776601683803, TestErr = 0.47051639691326796
fixed acidity 0.1904239451124434
volatile acidity -0.11242984344193296
citric acid -0.00703125780915424
residual sugar 0.2092352618792849
chlorides -0.044795495356479025
free sulfur dioxide -0.018863685196341816
total sulfur dioxide 0.07447982325062003
density -0.17305138620126106
pH 0.05808006453308803
sulphates 0.13876262568557934
alcohol 0.2947134691111974
intercept 5.6501294014064145
eta = 0.01 : iter = 728, loss = 0.24203354045966255, lambda = 0.00010000000000000002, TestErr = 0.45525344581852156
fixed acidity 0.25152952212309976
volatile acidity -0.03876889927769888
citric acid 0.14059421863669852
residual sugar 0.06793602828251821
chlorides -0.0607861479963043
free sulfur dioxide 0.08441853171277111
total sulfur dioxide -0.09642176480191654
density -0.2345690991118163
pH 0.1396740265674562
sulphates 0.1449843342292861
alcohol 0.19737851967044345
intercept 5.657998427200384
eta = 0.003 : iter = 1758, loss = 0.22475918775097103, lambda = 0.00010000000000000002, TestErr = 0.44693442950748147
fixed acidity 0.2953542653448508
volatile acidity -0.12934364893075953
citric acid 0.04629080083382285
residual sugar 0.013753852832452122
chlorides -0.03688613363045954
free sulfur dioxide 0.045541235818534614
total sulfur dioxide -0.049594638329345575
density -0.17427360277645224
pH 0.13897225246491407
sulphates 0.15425590075925466
alcohol 0.26518804857692096
intercept 5.597149258230254

Because the data is shuffled randomly, the results change from run to run.
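If you want runs to be comparable, one simple option (not used in the script above) is to fix NumPy's global random seed once before training, since the shuffle relies on np.random.permutation. A minimal sketch:

import numpy as np

# Optional tweak, not part of the original script: fixing the global seed
# makes np.random.permutation, and therefore the shuffle, reproducible
# between executions.
np.random.seed(42)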

Past posts

University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (1)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (2)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (3)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (4)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (5)
University of Tsukuba Machine Learning Course: Study sklearn while creating the Python script part of the assignment (6)
https://github.com/legacyworld/sklearn-basic
