[PYTHON] Least squares method and maximum likelihood estimation method (comparison by model fitting)

The world is full of phenomena, and understanding a phenomenon means building a model that explains its behavior. If the model can predict the data we observe from the phenomenon, we can say that we understand it.

There are two main ways to fit a model to your data.

  1. **least squares method**
  2. **maximum likelihood estimation**

Which one should you use, and when? In this post, I will walk through **curve fitting** with a simple example.

Task

Let's take a classic study of conscious visual perception as an example. A subject sits in front of a monitor, and visual stimuli are presented on it. Each stimulus contains a signal embedded in noise, and on every trial the subject reports whether they saw the signal.

Intuitively, the stronger the signal relative to the noise, the easier it is to detect. Indeed, if you plot signal strength on the horizontal axis and the subject's detection performance on the vertical axis, you get something like the following.

figure_1.png

This curve is called a psychometric function; the name doesn't matter much here, but the relationship is clearly **non-linear**. Let's build a model that captures this non-linear relationship and fit it to the data. Once we can do that, we can predict detection performance even at signal strengths that were never used in the experiment, which is what it means to understand this non-linear phenomenon.

The data used this time is summarized below.

| signal strength | performance | number of trials |
|---|---|---|
| 0 | 0.5 | 50 |
| 0.1 | 0.73 | 45 |
| 0.2 | 0.84 | 40 |
| 0.4 | 0.92 | 35 |
| 0.8 | 0.98 | 30 |

The information required for fitting is the performance at each signal strength and the number of trials at that strength; the subject performed 200 trials in total.

Since we are using Python, we import the required libraries and put the data into lists.

curvefitting.py



# import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize

# experimental data ========
# signal strength
signal = [0, 0.1, 0.2, 0.4, 0.8]

# performance (proportion correct)
performance = [0.5, 0.73, 0.84, 0.92, 0.98]

# number of trials
ntr = [50, 45, 40, 35, 30]


Model

There are many non-linear models, but the **cumulative Weibull** function is known to work well for detection tasks like this one.

When a stimulus with signal strength $s$ is presented, the probability $p$ that the subject responds correctly is expressed as follows.


p = 0.5 + (0.5 - \lambda)f((\frac{s}{\theta})^\beta)

Here $f$ is $f(x) = 1 - e^{-x}$. Three free parameters determine the shape of the cumulative Weibull: $\lambda$, $\theta$, and $\beta$. Since the experiment gives us the signal strength $s$ and the subject's performance $p$, we use them to find the **free parameters that best fit the data**.

In the code, all free parameters are packed into a single list variable $x$, with $x[0] = \lambda$, $x[1] = \theta$, and $x[2] = \beta$.

curvefitting.py



# model for performance ========
def probabilityCorrect(x, signal):
    # x: model parameters [lambda, theta, beta]
    # signal: signal strength
    
    # probability of correct 
    return 0.5 + (0.5 - x[0])*(1 - np.exp(-(signal/x[1])**x[2]))


Model fitting (least squares)

Let's try the **least squares method** first. The least squares method finds the free parameters such that **the squared difference between the model's predictions and the actual data is minimized**. The square is used simply so that we don't have to worry about the sign of the difference.

To find the free parameters, we first write a **cost function** (loss function). The workflow is to hand this function to an optimization algorithm and let the algorithm find the free parameters that minimize it.

With the least squares method, the loss function is very intuitive: compute the squared difference between the model prediction and the actual data at each signal strength, and add them up into a single number. In the implementation below, each term is also weighted by the number of trials, so that conditions with more trials count more.

The following is an implementation example.

curvefitting.py



def cost_LSE(x, signal, performance, ntr):       
    # compute cost
    c = 0
    for n in np.arange(len(ntr)):
        # model output 
        p = probabilityCorrect(x, signal[n])
        
        # cost as the summed squared error
        c += ntr[n]*(performance[n] - p)**2
        
    return c 

Now that we have a loss function, we hand it to an algorithm that finds the free parameters minimizing it. In Python, SciPy provides the optimize package, so we use that.

curvefitting.py



# initial guess and boundary
x0 = [0, 0.2, 1]
bound = [(0, None),(0.0001, None),(0,None)]

# least squared error
params_LSE = optimize.minimize(cost_LSE,x0,args=(signal,performance,ntr,),method='l-bfgs-b',\
                        jac=None, bounds=bound, tol=None, callback=None,\
                        options={'disp': None, 'maxls': 20, 'iprint': -1,\
                                 'gtol': 1e-05, 'eps': 1e-08, 'maxiter': 15000,\
                                 'ftol': 2.220446049250313e-09, 'maxcor': 10,\
                                 'maxfun': 15000})

The arguments may look confusing, but they are basically the defaults listed in the official SciPy documentation. Note that **you have to choose the initial values of the free parameters, $x0$, yourself**.

The algorithm used here is L-BFGS-B, a memory-efficient general-purpose optimizer. Any such algorithm searches for the minimum by iterating numerical computations, so ideally the initial values should already be close to the answer. If the initial values are far off, the computation may **fail to converge** and the fitting will not work as intended. There is no way around trying a few values; here the initial guess was chosen from experience as roughly where the answer should end up (a small multi-start sketch is shown below).
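
As a rough sketch of one common workaround (not part of the original script), you can restart the optimizer from several random initial guesses and keep the best result; the ranges for the random guesses below are an arbitrary choice for illustration.

# multi-start sketch: restart L-BFGS-B from several random initial guesses
# and keep the solution with the lowest cost (ranges below are illustrative)
best = None
for _ in range(10):
    x0_rand = [np.random.uniform(0.0, 0.1),   # lambda
               np.random.uniform(0.01, 1.0),  # theta
               np.random.uniform(0.5, 3.0)]   # beta
    res = optimize.minimize(cost_LSE, x0_rand, args=(signal, performance, ntr,),
                            method='l-bfgs-b', bounds=bound)
    if best is None or res.fun < best.fun:
        best = res
print(best.x, best.fun)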

Also, **you can set bounds on the free parameters $\lambda$, $\theta$, $\beta$ in advance**. With sensible bounds, the algorithm avoids wandering through a vast parameter space. Here all free parameters are positive, and we don't want $x[1] = \theta$ to be $0$ because it appears in the denominator of the model, so the bounds are set as shown above.

The L-BFGS-B algorithm now finds the free-parameter values that minimize the loss function.
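
As a quick sanity check (not in the original post), you can inspect the OptimizeResult returned by SciPy to confirm that the fit converged:

# quick sanity check on the optimization result (OptimizeResult attributes)
print(params_LSE.success)  # True if the optimizer reports convergence
print(params_LSE.fun)      # value of the loss function at the optimum
print(params_LSE.x)        # fitted parameters [lambda, theta, beta]
print(params_LSE.message)  # status message from L-BFGS-B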

Model fitting (maximum likelihood estimation method)

For comparison, let's try **maximum likelihood estimation**. Maximum likelihood estimation finds the free parameters such that **the probability (likelihood) that the model produces the actually observed data, given those free parameters, is maximized**.

We ran an experiment and obtained some data. We assume the data come from the probability distribution defined by the model, which in turn is determined by its free parameters. The idea, then, is to find the free parameters that maximize the probability (the plausibility, or **likelihood**) that the observed data would come out of the model.

More specifically, **the likelihood is defined as the joint probability of $P(data_i | x)$ over the observed events $data_i$, given the model's free parameters $x$**. As we learned in high school, if the observations are independent, the joint probability is the product of the probabilities of the individual events.


likelihood = P(data_0 | x)P(data_1 | x)P(data_2 | x)...P(data_n | x) 

We want to maximize this quantity, but optimization algorithms are set up to find the minimum of a loss function. Since we want a maximum this time, we simply pass the likelihood with a minus sign to the algorithm.

Another issue is that the likelihood is a product of many probabilities, so with many observations its value becomes vanishingly close to 0, which makes it numerically difficult for the algorithm to handle. The standard trick is to take the logarithm: as we learned in high school, the log turns a product into a sum.
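
As a small numerical illustration of why this matters (not part of the original script), the raw product of many per-trial probabilities underflows to zero in floating point, while the sum of their logs remains a perfectly usable number:

# product of 2000 hypothetical per-trial probabilities of 0.5
probs = np.full(2000, 0.5)
print(np.prod(probs))         # 0.0 -- underflows in double precision
print(np.sum(np.log(probs)))  # about -1386.3 -- still finite and usable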

As a result, the loss function passed to the algorithm for maximum likelihood estimation is the log-likelihood with its sign flipped:


- log(likelihood) = - (log(P(data_0 | x)) + log(P(data_1 | x)) + log(P(data_2 | x)) + ... + log(P(data_n | x)))

The following is an implementation example.

curvefitting.py



def cost_MLE(x, signal, performance, ntr):       
    # compute cost
    c = 0
    for n in np.arange(len(ntr)):
        # model output 
        p = probabilityCorrect(x, signal[n])
        
        # cost as the negative log-likelihood
        if 0 < p < 1:
            c += -ntr[n]*(performance[n]*np.log(p) + (1 - performance[n])*np.log(1 - p))
        
    return c 

Because we take logs of both $p$ and $1 - p$, the model's prediction must lie strictly between 0 and 1; those conditions are enforced by the $if$ statement. Note that in this example the probability of a correct response is modeled as $p$, so the probability of an incorrect response is $1 - p$.

Now that we have a loss function, we throw it into the algorithm.

curvefitting.py



# initial guess and boundary
x0 = [0, 0.2, 1]
bound = [(0, None),(0.0001, None),(0,None)]

# maximum likelihood
params_MLE = optimize.minimize(cost_MLE,x0,args=(signal,performance,ntr,),method='l-bfgs-b',\
                        jac=None, bounds=bound, tol=None, callback=None,\
                        options={'disp': None, 'maxls': 20, 'iprint': -1,\
                                 'gtol': 1e-05, 'eps': 1e-08, 'maxiter': 15000,\
                                 'ftol': 2.220446049250313e-09, 'maxcor': 10,\
                                 'maxfun': 15000})

Comparison of fitting results

What about the fitting result? Let's make a graph.

curvefitting.py



# visualize
signals = np.linspace(0,0.8,100)
acc_mle = np.zeros(100)
acc_lse = np.zeros(100)

for i in np.arange(100):
    acc_mle[i] = probabilityCorrect(params_MLE.x,signals[i])
    acc_lse[i] = probabilityCorrect(params_LSE.x,signals[i])

fig = plt.figure()
fig.suptitle('psychometric function')

ax = fig.add_subplot(111)

ax.set_xlabel('signal strength')
ax.set_ylabel('performance')    
plt.plot(signal, performance, 'ko')
plt.plot(signals, acc_mle, '-b')
plt.plot(signals, acc_lse, '-r')
    

The fit obtained by the least squares method is plotted in red, and the fit obtained by maximum likelihood estimation in blue.

figure_2.png

...They are almost identical. Both fit the data nicely.
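
As promised earlier, the fitted model can also predict performance at a signal strength that was never tested in the experiment; for example (a small sketch using the fits above, with 0.3 chosen as an arbitrary untested strength):

# predicted performance at an untested signal strength of 0.3
print(probabilityCorrect(params_MLE.x, 0.3))
print(probabilityCorrect(params_LSE.x, 0.3))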

How good the fit is can be quantified with the **variance explained** metric. The formula is:


varianceExplained = 1 - var(data - prediction)/var(data)

This quantifies how much of the variation in the actual data is explained by the model's predictions. Let's compute it.

curvefitting.py



# compute variance explained
fitted_acc_mle = np.zeros(len(signal))
fitted_acc_lse = np.zeros(len(signal))
for s in np.arange(len(signal)):
    fitted_acc_mle[s] = probabilityCorrect(params_MLE.x,signal[s])
    fitted_acc_lse[s] = probabilityCorrect(params_LSE.x,signal[s])
    
varexp_mle = 1 - (np.var(fitted_acc_mle - performance)/np.var(performance))
varexp_lse = 1 - (np.var(fitted_acc_lse - performance)/np.var(performance))

print('variance explained (MLE): ' + str(varexp_mle))
print('variance explained (LSE): ' + str(varexp_lse))    
    

The calculation gives:

variance explained (MLE): 0.999339315626
variance explained (LSE): 0.999394193462

The least squares fit looks marginally better, but the difference is negligible. In practice, you can simply try both and pick whichever fits the data better.

Summary

- Models are fitted with either the least squares method or maximum likelihood estimation.
- The workflow is: build a model, write a loss function, and hand it to an algorithm that finds its minimum.
- The quality of the fit can be quantified with variance explained.

In closing

Modeling is an essential idea in many fields, including machine learning. The workflow is always the same, so even if it looks difficult at first, you may get used to it surprisingly quickly once you try it yourself. I hope this post helps anyone who wants to try modeling but feels a little intimidated. The full source code is listed below.

curvefitting.py



# -*- coding: utf-8 -*-

# import libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import optimize

# model for performance ========
def probabilityCorrect(x, signal):
    # x: model parameters [lambda, theta, beta]
    # signal: signal strength
    
    # probability of correct 
    return 0.5 + (0.5 - x[0])*(1 - np.exp(-(signal/x[1])**x[2]))

def cost_MLE(x, signal, performance, ntr):       
    # compute cost
    c = 0
    for n in np.arange(len(ntr)):
        # model output 
        p = probabilityCorrect(x, signal[n])
        
        # cost as the negative log-likelihood
        if 0 < p < 1:
            c += -ntr[n]*(performance[n]*np.log(p) + (1 - performance[n])*np.log(1 - p))
        
    return c 

def cost_LSE(x, signal, performance, ntr):       
    # compute cost
    c = 0
    for n in np.arange(len(ntr)):
        # model output 
        p = probabilityCorrect(x, signal[n])
        
        # cost as the summed squared error
        c += ntr[n]*(performance[n] - p)**2
        
    return c 

# signal strength
signal = [0, 0.1, 0.2, 0.4, 0.8]

# performance
performance = [0.5, 0.73, 0.84, 0.92, 0.98]

# number of trials
ntr = [50, 45, 40, 35, 30]

# initial guess and boundary
x0 = [0, 0.2, 1]
bound = [(0, None),(0.0001, None),(0,None)]

# maximum likelihood
params_MLE = optimize.minimize(cost_MLE,x0,args=(signal,performance,ntr,),method='l-bfgs-b',\
                        jac=None, bounds=bound, tol=None, callback=None,\
                        options={'disp': None, 'maxls': 20, 'iprint': -1,\
                                 'gtol': 1e-05, 'eps': 1e-08, 'maxiter': 15000,\
                                 'ftol': 2.220446049250313e-09, 'maxcor': 10,\
                                 'maxfun': 15000})

# least squared error
params_LSE = optimize.minimize(cost_LSE,x0,args=(signal,performance,ntr,),method='l-bfgs-b',\
                        jac=None, bounds=bound, tol=None, callback=None,\
                        options={'disp': None, 'maxls': 20, 'iprint': -1,\
                                 'gtol': 1e-05, 'eps': 1e-08, 'maxiter': 15000,\
                                 'ftol': 2.220446049250313e-09, 'maxcor': 10,\
                                 'maxfun': 15000})

# compute variance explained
fitted_acc_mle = np.zeros(len(signal))
fitted_acc_lse = np.zeros(len(signal))
for s in np.arange(len(signal)):
    fitted_acc_mle[s] = probabilityCorrect(params_MLE.x,signal[s])
    fitted_acc_lse[s] = probabilityCorrect(params_LSE.x,signal[s])
    
varexp_mle = 1 - (np.var(fitted_acc_mle - performance)/np.var(performance))
varexp_lse = 1 - (np.var(fitted_acc_lse - performance)/np.var(performance))

print('variance explained (MLE): ' + str(varexp_mle))
print('variance explained (LSE): ' + str(varexp_lse))    
    
# visualize
signals = np.linspace(0,0.8,100)
acc_mle = np.zeros(100)
acc_lse = np.zeros(100)

for i in np.arange(100):
    acc_mle[i] = probabilityCorrect(params_MLE.x,signals[i])
    acc_lse[i] = probabilityCorrect(params_LSE.x,signals[i])

fig = plt.figure()
fig.suptitle('psychometric function')

ax = fig.add_subplot(111)

ax.set_xlabel('signal strength')
ax.set_ylabel('performance')    
plt.plot(signal, performance, 'ko')
plt.plot(signals, acc_mle, '-b')
plt.plot(signals, acc_lse, '-r')
    
