[PYTHON] I tried using GLM (generalized linear model) for stock price data

Introduction

When analyzing the relationship between two stock price series, we usually start from the assumption that the log returns of both are normally distributed. Actual stock prices, however, rarely show a clean normal distribution, so the statistics reported by a linear regression need to be read with care.

GLM (generalized linear model) is a framework for handling relationships in which the response does not follow a normal distribution. Applying it properly requires an understanding of statistical modeling, and I feel I still have some technical gaps to fill there. However, since GLM is supported by the Python library statsmodels, I decided to simply try it out this time without worrying too much about rigor.

First, we picked three automobile-related stocks listed on the First Section of the Tokyo Stock Exchange as targets for analysis. The figure below shows scatter plots of the log returns for the three possible pairs of these companies.

All three pairs show a positive but not very strong correlation. We decided to take the middle pair (stock2 vs. stock3) and perform regression analysis on it. Here, stock2 is the stock with securities code 7203, and stock3 is the stock with securities code 7267.
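For reference, the log return used throughout this article is ln(P_t / P_{t-1}). The following is a minimal sketch of that calculation in plain Python; the function name `log_returns` and the closing prices are made up for illustration and are not the article's data.

```python
import math


def log_returns(prices):
    """Return ln(P_t / P_{t-1}) for each pair of consecutive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]


# Hypothetical closing prices, for illustration only
closes = [100.0, 102.0, 101.0, 101.0]
rets = log_returns(closes)

# A rise gives a positive value, a fall a negative one, no change gives 0
print(rets)
```

Log returns add up over time: the sum of the daily values equals the log return over the whole period, which is one reason they are preferred over simple returns in this kind of analysis.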

Applying a Linear Model

First, regression analysis was performed with an ordinary linear model. The following code reads the data and fits the linear model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm

    
def my_tof(s):
	f1 = float(s.replace(',', ''))
	return f1

# pandas read_csv()
my_colmn = ['Date', 'Open', 'High', 'Low', 'Close', 'Diff', 'Volume', 'cH', 'cI', 'cJ', 'cK', 'cL', 'cM', 'cN', 'cO']

index = pd.date_range(start='2014/1/1', end='2014/12/31', freq='B')
stock_raw = pd.DataFrame(index=index)
mydf = pd.DataFrame(index=index)

stock1 = pd.read_csv('./x7201-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock1'] = stock1[::-1].loc[:, 'Close']
stock2 = pd.read_csv('./x7203-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock2'] = stock2[::-1].loc[:, 'Close']
stock3 = pd.read_csv('./x7267-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock3'] = stock3[::-1].loc[:, 'Close']

stock_raw.dropna(inplace=True)
stock_base_label = ['stock1', 'stock2', 'stock3']

for st in stock_base_label:
	st_price = st + '_p'
	st_return = st + '_ret'
	st_log_return = st + '_lgret'
	
	mydf[st_price] = stock_raw[st].apply(my_tof)
	mydf[st_price] = mydf[st_price].ffill()    # carry the last price over non-trading days
	mydf[st_return] = mydf[st_price] / mydf[st_price].shift(1)
	mydf[st_log_return] = np.log(mydf[st_return])

# scatter plotting
# (omitted)

# apply OLS model 
mydf.dropna(inplace=True)

x1 = mydf['stock2_lgret'].values    # stock2 log-return
x1a = sm.add_constant(x1)
y1 = mydf['stock3_lgret'].values    # stock3 log-return

# OLS (linear model)
md0 = sm.OLS(y1, x1a)
res0 = md0.fit()
print(res0.summary())

plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='b', alpha=0.6)
plt.plot(x1, res0.fittedvalues, 'r-', label='Linear Model')
plt.grid(True)

As shown above, statsmodels.api.OLS() is used. The resulting graph is shown below.

Fig. stock2 vs. stock3 (Log Return) Linear Model
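With a single explanatory variable, the least-squares line can also be written in closed form: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below only illustrates that formula; `simple_ols` is a hypothetical helper and the data is synthetic and noise-free, not the article's stock data.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one explanatory variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-product sum of x and y
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic, noise-free data lying exactly on y = -0.0013 + 0.75 * x
x_demo = [-0.02, -0.01, 0.0, 0.01, 0.02]
y_demo = [-0.0013 + 0.75 * xi for xi in x_demo]

intercept, slope = simple_ols(x_demo, y_demo)
print(intercept, slope)
```

Because the demo data has no noise, the function recovers the intercept and slope used to generate it (up to floating-point error).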

GLM (Gaussian distribution)

Next, regression analysis by GLM is performed. The GLM (Generalized Linear Models) implementation in statsmodels supports the following probability distributions, called families (excerpt from the documentation):

Families for GLM(Generalized Linear Model)

Class	Description (from the documentation)	Remark
Family	The parent class for one-parameter exponential families.	(base class)
Binomial	Binomial exponential family distribution.	Binomial distribution
Gamma	Gamma exponential family distribution.	Gamma distribution
Gaussian	Gaussian exponential family distribution.	Gaussian distribution
InverseGaussian	InverseGaussian exponential family.	Inverse Gaussian distribution
NegativeBinomial	Negative Binomial exponential family.	Negative binomial distribution
Poisson	Poisson exponential family.	Poisson distribution

In addition, the link functions that can be combined with each family are fixed, as in the table below (excerpt from the documentation). The link function can be specified as an option; if it is not specified, the family's default seems to be used.

	ident	log	logit	probit	cloglog	pow	opow	nbinom	loglog	logc
Gaussian	x	x				x
inv Gaussian	x	x				x
binomial	x	x	x	x	x	x	x		x	x
Poisson	x	x				x
neg binomial	x	x				x		x
gamma	x	x				x
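To make the role of a link function concrete: a link g maps the mean mu of the response to the linear predictor eta = g(mu), and its inverse maps eta back to mu. The sketch below writes out three links by hand in plain Python. It does not use the statsmodels link classes, and the dictionary `LINKS` is only an illustrative name; "inverse_power" corresponds to the power link with exponent -1.

```python
import math

# Each entry: name -> (link g(mu), inverse link g^-1(eta))
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "inverse_power": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

mu = 1.0062  # an arbitrary positive mean, for illustration only
for name, (link, inverse) in LINKS.items():
    eta = link(mu)
    # Applying the inverse link to eta returns the original mean
    print(name, eta, inverse(eta))
```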

First, the calculation was performed with the Gaussian family. The code is as follows.


# apply GLM(Gaussian) model
md1 = sm.GLM(y1, x1a, family=sm.families.Gaussian())    # Gaussian()
res1 = md1.fit()
print(res1.summary())

plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='g', alpha=0.6)
plt.plot(x1, res1.fittedvalues, 'r-', label='GLM(Gaussian)')
plt.grid(True)

Fig. stock2 vs. stock3 (GLM(gaussian dist.))

The line fitted by GLM looks identical to the one in the previous figure. Let's compare the summary() output of the two fits.

OLS summary

In [71]: print res0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.486
Model:                            OLS   Adj. R-squared:                  0.484
Method:                 Least Squares   F-statistic:                     241.1
Date:                Sun, 26 Jul 2015   Prob (F-statistic):           1.02e-38
Time:                        16:18:16   Log-Likelihood:                 803.92
No. Observations:                 257   AIC:                            -1604.
Df Residuals:                     255   BIC:                            -1597.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0013      0.001     -1.930      0.055        -0.003  2.64e-05
x1             0.7523      0.048     15.526      0.000         0.657     0.848
==============================================================================
Omnibus:                       10.243   Durbin-Watson:                   1.997
Prob(Omnibus):                  0.006   Jarque-Bera (JB):               16.017
Skew:                          -0.235   Prob(JB):                     0.000333
Kurtosis:                       4.129   Cond. No.                         73.0
==============================================================================

GLM (Gaussian dist.) summary

In [72]: print res1.summary()

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  257
Model:                            GLM   Df Residuals:                      255
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                0.00011321157031
Method:                          IRLS   Log-Likelihood:                 803.92
Date:                Sun, 26 Jul 2015   Deviance:                     0.028869
Time:                        16:12:11   Pearson chi2:                   0.0289
No. Iterations:                     4
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0013      0.001     -1.930      0.054        -0.003  2.02e-05
x1             0.7523      0.048     15.526      0.000         0.657     0.847
==============================================================================

The two summaries report quite different sets of statistics.

The OLS summary reports R-squared, AIC, and BIC, whereas the GLM summary omits them and reports the Deviance and the Pearson chi2 statistic instead. Both report the Log-Likelihood.

The GLM output shows that the Link Function is set to "identity". A Gaussian family with the identity link is the same model as ordinary least squares, and indeed the regression coefficients (-0.0013, 0.7523) and the Log-Likelihood (803.92) are identical for OLS and GLM, which confirms that the two regression results are the same.
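Although the GLM summary does not print AIC or BIC, both can be derived from the Log-Likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below plugs in the values printed above (Log-Likelihood 803.92, 257 observations). Counting k = 2 parameters (constant and slope) is an assumption on my part; it is the choice that reproduces the figures in the OLS summary.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


llf = 803.92  # Log-Likelihood from both summaries above
k = 2         # assumed parameter count: const and x1
n = 257       # No. Observations

aic_value = aic(llf, k)
bic_value = bic(llf, k, n)
print(round(aic_value), round(bic_value))
```

Rounded to integers, these come out to -1604 and -1597, the same as the AIC and BIC lines in the OLS summary, so the Gaussian GLM carries the same information even though it is not printed.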

GLM (Gamma distribution)

Next, I tried a GLM with the Gamma distribution as the family. It is debatable whether the Gamma distribution represents returns well, but the aim here was simply to try a more GLM-like calculation.

One problem is that the log return takes a negative value whenever the stock price falls, which is outside the support of the Gamma distribution (positive values only). I therefore used the price ratio before taking the logarithm (stock3_ret) as the y value. (I can't deny that this feels a little forced ...)

# apply GLM(gamma) model

x2 = x1 ; x2a = x1a
y2 = mydf['stock3_ret'].values    # replaced

md2 = sm.GLM(y2, x2a, family=sm.families.Gamma())
res2 = md2.fit()

# print summary and plot fitting curve
print(res2.summary())
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_ret'], c='c', alpha=0.6)
plt.plot(x2, res2.fittedvalues, 'r-', label='GLM(Gamma)')
plt.grid(True)

y2_fit_log = np.log(res2.fittedvalues)
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='c', alpha=0.6)
plt.plot(x2, y2_fit_log, 'r-', label='GLM(Gamma)')

Fig. stock2 vs. stock3 (GLM(gamma dist.)) (log - ident): x is the stock2 log return, y is the stock3 price ratio

Fig. stock2 vs. stock3 (GLM(gamma dist.)) (log - log): y value converted to the log return

As a graph, the fitted line looks the same as before. Let's look at summary().

In [73]: print res2.summary()

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  257
Model:                            GLM   Df Residuals:                      255
Model Family:                   Gamma   Df Model:                            1
Link Function:          inverse_power   Scale:               0.000113369003649
Method:                          IRLS   Log-Likelihood:                 803.72
Date:                Sun, 26 Jul 2015   Deviance:                     0.028956
Time:                        16:12:16   Pearson chi2:                   0.0289
No. Iterations:                     5
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.0013      0.001   1502.765      0.000         1.000     1.003
x1            -0.7491      0.048    -15.470      0.000        -0.844    -0.654
==============================================================================

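Switching from GLM (Gaussian dist.) to GLM (Gamma dist.) changed the Log-Likelihood and the Deviance only slightly. Note that the two fits use different y values, so these figures are not strictly comparable, but in any case the change is far too small to call the model improved. The regression coefficients differ for two reasons: y was changed from the log return to the price ratio, and the Gamma fit used the inverse_power link shown in the summary, so the fitted relationship is 1/mu = 1.0013 - 0.7491 x.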
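The summary lists IRLS (iteratively reweighted least squares) as the fitting method. As a rough sketch of what that means for a Gamma response with the inverse link, the code below fits 1/mu = b0 + b1 * x in plain Python. This is a hand-written illustration, not the statsmodels implementation; `fit_gamma_inverse_link` is a hypothetical name and the data is synthetic. For this link the working response is z = eta - (y - mu) / mu^2 and the weight is mu^2.

```python
import math


def fit_gamma_inverse_link(x, y, n_iter=25):
    """Fit 1/mu = b0 + b1*x by IRLS for a Gamma response (illustrative)."""
    mu = list(y)  # start from the observed values (all must be positive)
    b0 = b1 = 0.0
    for _ in range(n_iter):
        eta = [1.0 / m for m in mu]
        # Working response and weights for the inverse link
        z = [e - (yi - m) / (m * m) for e, yi, m in zip(eta, y, mu)]
        w = [m * m for m in mu]
        # Weighted least squares of z on (1, x) via the 2x2 normal equations
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
        mu = [1.0 / (b0 + b1 * xi) for xi in x]
    return b0, b1


# Synthetic data: true 1/mu = 1.0 - 0.75*x with a small deterministic wobble
x_demo = [0.001 * (i - 30) for i in range(61)]
y_demo = [(1.0 / (1.0 - 0.75 * xi)) * (1.0 + 0.001 * math.sin(i))
          for i, xi in enumerate(x_demo)]

b0_hat, b1_hat = fit_gamma_inverse_link(x_demo, y_demo)
print(b0_hat, b1_hat)
```

The estimates land close to the generating values 1.0 and -0.75. Note the negative slope: with the inverse link, a larger x lowers 1/mu and therefore raises mu, which is why the sign of x1 flips between the Gaussian and Gamma summaries above.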

I also drew histograms of the log returns of stock2 and stock3 to check the normality of the data. The shapes were as shown in the figure below.
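A histogram can be complemented with numbers. The OLS summary earlier already reports Skew, Kurtosis, and the Jarque-Bera statistic, but for the regression residuals rather than for the raw log returns. The sketch below shows how these moment-based measures are computed, assuming the usual definitions (a normal distribution has skewness 0 and kurtosis 3). The helper names and the sample are hypothetical.

```python
def skew_kurtosis(values):
    """Sample skewness and (non-excess) kurtosis from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def jarque_bera(n_obs, skew, kurt):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)."""
    return n_obs / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# A small symmetric sample, for illustration only
sample_skew, sample_kurt = skew_kurtosis([-2.0, -1.0, 0.0, 1.0, 2.0])

# Skew and Kurtosis as printed (rounded) in the OLS summary above
jb = jarque_bera(257, -0.235, 4.129)
print(sample_skew, sample_kurt, jb)
```

Using the rounded skewness and kurtosis from the OLS summary gives a value of about 16.0, in line with the printed Jarque-Bera (JB) of 16.017. A kurtosis above 3 indicates heavier tails than a normal distribution, which is typical for return data.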

Conclusion for the time being

In this analysis, applying GLM did not improve the accuracy of the model. This is probably because the relationship between stock prices in the same industry over a period of about one year was not complicated (non-linear). Still, having more tools available for analyzing various kinds of data is no bad thing, so I would like to deepen my understanding of GLM and other advanced regression methods.

GLM did not show its power this time (stock price of automobile manufacturer A vs. stock price of manufacturer B), but I expect it can be used effectively for pairs of a rather different nature, for example daily maximum temperature vs. the stock price of a beer company.
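To see how little the two fitted models differ, the coefficients printed in the two summaries can be evaluated side by side. The sketch below converts the Gamma prediction (a price ratio under the inverse link) to a log return and compares it with the linear prediction. The grid of stock2 log returns from -3% to +3% is an assumed range chosen for illustration, not something taken from the data.

```python
import math

# Coefficients as printed in the summaries above
OLS_CONST, OLS_SLOPE = -0.0013, 0.7523      # OLS / GLM (Gaussian)
GAMMA_CONST, GAMMA_SLOPE = 1.0013, -0.7491  # GLM (Gamma), inverse_power link


def ols_log_return(x):
    """Predicted stock3 log return from the linear model."""
    return OLS_CONST + OLS_SLOPE * x


def gamma_log_return(x):
    """Predicted stock3 log return from the Gamma fit (1/mu = b0 + b1*x)."""
    mu = 1.0 / (GAMMA_CONST + GAMMA_SLOPE * x)
    return math.log(mu)


# Assumed grid of stock2 log returns: -3% to +3% in 1% steps
grid = [i / 1000.0 for i in range(-30, 31, 10)]
gaps = [abs(ols_log_return(x) - gamma_log_return(x)) for x in grid]
max_gap = max(gaps)
print(max_gap)
```

Over this range the two predictions differ by less than 0.001 in log-return terms, because 1/(1 + small) and log(1 + small) are both nearly linear near zero. That is consistent with the two fitted lines looking the same in the figures.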

References

- statsmodels documentation http://statsmodels.sourceforge.net/stable/glm.html

- Introduction to Statistics (Department of Statistics, Faculty of Liberal Arts, University of Tokyo) http://www.utp.or.jp/bd/978-4-13-042065-5.html

- Introduction to Statistical Modeling for Data Analysis (Kubo, Iwanami Shoten) https://www.iwanami.co.jp/cgi-bin/isearch?isbn=ISBN978-4-00-006973-1