Linear regression in Python (statsmodels, scikit-learn, PyMC3)

Introduction

There are several Python libraries you can use for linear regression. I recently needed to run a linear regression myself and looked into how to do it, so I am sharing the result here as a memo. The libraries used are the following three: statsmodels, scikit-learn, and PyMC3.

Data preparation

First, prepare some suitable data. Here we use artificial data generated from the following formula with added noise $\epsilon$.

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon

The values to be estimated are $\beta_0, \beta_1, \beta_2$.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Generate random data
# true coefficients (beta_1, beta_2); the intercept beta_0 is effectively 0
beta = [1.2, 0.5]
# prep data
x1 = np.random.random(size=1000)*5 
x2 = np.random.random(size=1000)*10 
x = np.transpose([x1, x2])
y = np.dot(x, beta) + np.random.normal(scale=0.5, size=1000)  # add Gaussian noise epsilon
#data = dict(x=x, y=y)
data = dict(x1=x1, x2=x2, y=y)
df = pd.DataFrame(data)
# visualisation
plt.scatter(x1, y, color='b')
plt.scatter(x2, y, color='orange')
# 3D
fig = plt.figure()
ax = Axes3D(fig)
ax.scatter(x1, x2, y)
plt.show()

(Figures: output_2_0.png, output_2_1.png — scatter plots of y against x1 and x2, and a 3D scatter plot of the data)
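Before reaching for any of the libraries, it may help to check the data with plain NumPy. The following is a minimal sketch (not part of the original comparison) that solves the same least-squares problem directly; the recovered coefficients should come out close to 0, 1.2 and 0.5.

# sanity check with plain NumPy least squares (illustrative sketch)
X_check = np.column_stack([np.ones_like(x1), x1, x2])  # prepend an intercept column
coef, residuals, rank, sv = np.linalg.lstsq(X_check, y)
print(coef)  # approximately [0, 1.2, 0.5]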

When using Statsmodels

import statsmodels.api as sm

X = sm.add_constant(x)  # add an intercept column (keep x unchanged for the later sections)
results = sm.OLS(endog=y, exog=X).fit()
results.summary()
OLS Regression Results
Dep. Variable:                  y   R-squared:            0.951
Model:                        OLS   Adj. R-squared:       0.951
Method:             Least Squares   F-statistic:          9745.
Date:            Fri, 10 Mar 2017   Prob (F-statistic):   0.00
Time:                    09:58:59   Log-Likelihood:      -724.14
No. Observations:            1000   AIC:                  1454.
Df Residuals:                 997   BIC:                  1469.
Df Model:                       2
Covariance Type:        nonrobust

            coef   std err        t    P>|t|   [95.0% Conf. Int.]
const     0.0499     0.042    1.181    0.238      -0.033    0.133
x1        1.1823     0.011  107.081    0.000       1.161    1.204
x2        0.4983     0.005   91.004    0.000       0.488    0.509

Omnibus:          0.654   Durbin-Watson:      2.079
Prob(Omnibus):    0.721   Jarque-Bera (JB):   0.599
Skew:            -0.059   Prob(JB):           0.741
Kurtosis:         3.023   Cond. No.           17.2
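The fitted results object also exposes the estimates programmatically, which is handy when the full summary table is not needed. A small sketch using standard attributes of the statsmodels OLS results:

print(results.params)      # estimated coefficients (const, x1, x2)
print(results.conf_int())  # 95% confidence intervals
print(results.rsquared)    # coefficient of determination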

When using scikit-learn

from sklearn import linear_model
# compare different regressions
lreg = linear_model.LinearRegression()
lreg.fit(x, y)
print("Linear regression: \t", lreg.coef_)
breg = linear_model.BayesianRidge()
breg.fit(x, y)
print("Bayesian regression: \t", breg.coef_)
ereg = linear_model.ElasticNetCV()
ereg.fit(x, y)
print("Elastic net:  \t\t", ereg.coef_)
print("true parameter values: \t", beta)
Linear regression: 	 [ 1.18232244  0.49832431]
Bayesian regression: 	 [ 1.18214701  0.49830501]
Elastic net:  		 [ 1.17795855  0.49756084]
true parameter values: 	 [1.2, 0.5]
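For completeness, the fitted scikit-learn model can also be used for prediction; the sample values below are made up purely for illustration.

x_new = np.array([[2.0, 5.0]])   # one new observation (x1, x2); illustrative values
print(lreg.intercept_)           # fitted intercept (close to 0 here)
print(lreg.predict(x_new))       # predicted y, roughly 1.2*2.0 + 0.5*5.0 = 4.9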

When using PyMC3

In the two cases above, the parameters were obtained as point estimates (by maximum likelihood / least squares). PyMC3 instead computes the probability distribution of the parameters with Markov chain Monte Carlo (MCMC), based on Bayes' theorem.

import pymc3 as pm

with pm.Model() as model_robust:
    family = pm.glm.families.Normal()          # Normal likelihood = ordinary linear regression
    pm.glm.glm('y ~ x1+x2', df, family=family)
    start = pm.find_MAP()                      # use the MAP estimate as the starting point
    step = pm.NUTS(scaling=start)              # No-U-Turn Sampler
    #step = pm.Metropolis()
    trace_robust = pm.sample(25000, step)
Optimization terminated successfully.
         Current function value: 764.008811
         Iterations: 19
         Function evaluations: 30
         Gradient evaluations: 30


 100%|██████████| 25000/25000 [00:32<00:00, 761.52it/s]
# show results (discard the first 2000 samples as burn-in)
pm.traceplot(trace_robust[2000:])
plt.show()
pm.df_summary(trace_robust[2000:])

(Figure: output_9_0.png — trace plots and posterior distributions of the parameters)

            mean       sd        mc_error   hpd_2.5    hpd_97.5
Intercept  -0.022296   0.040946  0.000564  -0.100767   0.057460
x1          1.199371   0.011235  0.000126   1.177191   1.221067
x2          0.500502   0.005346  0.000057   0.490258   0.511003
sd          0.490271   0.010949  0.000072   0.469599   0.512102
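Note that the code above uses the GLM interface of an older PyMC3 release. In later 3.x versions the formula interface moved to pm.glm.GLM.from_formula, and pm.summary returns the same kind of DataFrame that pm.df_summary produced here. A rough sketch of the equivalent call (exact names and defaults may differ between versions):

with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x1 + x2', df)   # Normal likelihood by default
    trace = pm.sample(2000, tune=1000)           # NUTS is selected automatically
pm.traceplot(trace)
print(pm.summary(trace))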

Try it in a slightly more complicated case

To see how far PyMC3 can go, let's try a slightly more complicated case. The underlying formula is:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \epsilon

# Generate random data
beta = [1.2, 0.5, 9.8, 0.2, 1.08]
# prep data
x1 = np.random.random(size=1000)*5 
x2 = np.random.random(size=1000)*10 
x3 = np.random.random(size=1000)
x4 = np.random.normal(size=1000)*2 
x5 = np.random.random(size=1000)*60 
x = np.transpose([x1, x2, x3, x4, x5])
y = np.dot(x, beta)  + np.random.normal(scale=0.5, size=1000)
#data = dict(x=x, y=y)
data = dict(x1=x1, x2=x2, x3=x3, x4=x4, x5=x5, y=y)
df = pd.DataFrame(data)

with pm.Model() as model_robust:
    family = pm.glm.families.Normal()
    pm.glm.glm('y ~ x1+x2+x3+x4+x5', df, family=family)
    start = pm.find_MAP()
    step = pm.NUTS(scaling=start)
    #step = pm.Metropolis()
    trace_robust = pm.sample(25000, step)
# show results
pm.traceplot(trace_robust[2000:])
plt.show()
print("true parameter values are:", beta)
pm.df_summary(trace_robust[2000:])

(Figure: output_13_0.png — trace plots and posterior distributions of the parameters)

true parameter values are: [1.2, 0.5, 9.8, 0.2, 1.08]
            mean       sd        mc_error   hpd_2.5    hpd_97.5
Intercept   0.041924   0.059770  0.000737  -0.080421   0.154130
x1          1.199973   0.011466  0.000106   1.177061   1.222395
x2          0.494488   0.005656  0.000053   0.483624   0.505661
x3          9.699889   0.056527  0.000484   9.587273   9.809352
x4          0.197271   0.008424  0.000052   0.181196   0.214320
x5          1.081120   0.000922  0.000008   1.079339   1.082917
sd          0.514316   0.011438  0.000067   0.492296   0.536947

In conclusion

I have not gone into much detail, since each approach fits in just a few lines of code, but you can see that all of them estimate the parameters well. If you have any questions, please feel free to comment.

**(Addendum)** Corrected the post because the output images were wrong.
