[PYTHON] I tried using GLM (generalized linear model) for stock price data

Introduction

When examining the relationship between two stock price series, we generally start from the assumption that the log returns of both are normally distributed. In practice, however, actual stock price data rarely show a clean normal distribution, so the statistics output by a linear regression need to be interpreted with care.

There is also the GLM (generalized linear model), a model that handles relationships that do not follow a normal distribution. Applying it properly requires learning the ideas of statistical modeling, and I feel there is a technical gap to bridge. However, since GLM is supported by the Python library statsmodels, I decided to try it out this time without worrying too much about rigor.

First, we picked three automobile-related stocks listed on the First Section of the Tokyo Stock Exchange as targets for analysis. The figure below shows scatter plots of the log returns for the three possible pairs of these companies.

scatter01.png

All three pairs show a positive but not very strong correlation. We decided to take the middle pair (stock2 vs. stock3) and perform regression analysis on it. For reference, stock2 is the stock with securities code 7203, and stock3 is the stock with securities code 7267.
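As a rough illustration of the quantities plotted above, the following sketch computes log returns from closing prices and the Pearson correlation between two return series. The prices are made-up numbers, not the data used in this article, and the helper name `log_returns` is hypothetical.

```python
import numpy as np

def log_returns(prices):
    """Return log(P_t / P_{t-1}) for a 1-D sequence of prices."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

# Made-up closing prices for two hypothetical stocks (not real data)
price_a = [100.0, 101.0, 99.5, 100.5, 102.0, 101.0]
price_b = [200.0, 203.0, 199.0, 200.0, 204.5, 203.0]

ret_a = log_returns(price_a)
ret_b = log_returns(price_b)

# Pearson correlation coefficient between the two return series
corr = np.corrcoef(ret_a, ret_b)[0, 1]
print(corr)
```

A series of n prices yields n - 1 returns, which is why the first row is dropped before regression in the code that follows.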

Applying a Linear Model

First, regression analysis was performed using the Linear Model. The following code was used for data reading and regression analysis of the linear model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.api as sm

    
def my_tof(s):
	f1 = float(s.replace(',', ''))
	return f1

# pandas read_csv()
my_colmn = ['Date', 'Open', 'High', 'Low', 'Close', 'Diff', 'Volume', 'cH', 'cI', 'cJ', 'cK', 'cL', 'cM', 'cN', 'cO']

index = pd.date_range(start='2014/1/1', end='2014/12/31', freq='B')
stock_raw = pd.DataFrame(index=index)
mydf = pd.DataFrame(index=index)

stock1 = pd.read_csv('./x7201-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock1'] = stock1[::-1].loc[:, 'Close']
stock2 = pd.read_csv('./x7203-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock2'] = stock2[::-1].loc[:, 'Close']
stock3 = pd.read_csv('./x7267-2014.csv', index_col=0, parse_dates=True, skiprows=1, names=my_colmn, header=None)
stock_raw['stock3'] = stock3[::-1].loc[:, 'Close']

stock_raw.dropna(inplace=True)
stock_base_label = ['stock1', 'stock2', 'stock3']

for st in stock_base_label:
	st_price = st + '_p'
	st_return = st + '_ret'
	st_log_return = st + '_lgret'
	
	mydf[st_price] = stock_raw[st].apply(my_tof)
	mydf[st_price] = mydf[st_price].ffill()    # carry the last price over non-trading days
	mydf[st_return] = mydf[st_price] / mydf[st_price].shift(1)
	mydf[st_log_return] = np.log(mydf[st_return])

# scatter plotting
# (omitted)

# apply OLS model 
mydf.dropna(inplace=True)

x1 = mydf['stock2_lgret'].values    # stock2 log-return
x1a = sm.add_constant(x1)
y1 = mydf['stock3_lgret'].values    # stock3 log-return

# OLS (linear model)
md0 = sm.OLS(y1, x1a)
res0 = md0.fit()
print(res0.summary())

plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='b', alpha=0.6)
plt.plot(x1, res0.fittedvalues, 'r-', label='Linear Model')
plt.grid(True)

As shown above, statsmodels.api.OLS() is used. The following graph was obtained as a result.

Fig. stock2 vs. stock3 (log return), linear model (scatter_md0.png)
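To make clear what the linear model is computing, here is a minimal sketch of the same least-squares fit using only NumPy. It runs on synthetic data (the coefficients -0.001 and 0.75 and the noise levels are assumptions chosen for illustration), so the numbers will not match the article's results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "log returns" (illustrative only): y depends linearly on x plus noise
x = rng.normal(0.0, 0.015, size=250)
y = -0.001 + 0.75 * x + rng.normal(0.0, 0.01, size=250)

# Design matrix with an intercept column, like sm.add_constant(x)
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution of y ~ X @ beta
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# With a single regressor, the slope equals cov(x, y) / var(x)
slope_check = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

# R-squared: share of the variance of y explained by the fitted line
resid = y - X @ beta
r_squared = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

print(intercept, slope, slope_check, r_squared)
```

The `slope_check` line shows why a weak correlation and a clearly nonzero slope can coexist: the slope depends on the covariance scaled by the variance of x, while R-squared measures how tightly the points follow the line.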

GLM (Gaussian distribution)

Next, regression analysis is performed with GLM. The GLM (Generalized Linear Models) implementation in statsmodels supports the following probability distributions (called families). (Excerpt from the documentation.)

Families for GLM (Generalized Linear Model)

Family class      Description                                               Remark
Family            The parent class for one-parameter exponential families.
Binomial          Binomial exponential family distribution.                 Binomial distribution
Gamma             Gamma exponential family distribution.                    Gamma distribution
Gaussian          Gaussian exponential family distribution.                 Gaussian distribution
InverseGaussian   InverseGaussian exponential family.                       Inverse Gaussian distribution
NegativeBinomial  Negative Binomial exponential family.                     Negative binomial distribution
Poisson           Poisson exponential family.                               Poisson distribution

In addition, the link functions that can be combined with each family are fixed. (Excerpt from the documentation.) The link function can be specified as an option; if it is not specified, the family's default appears to be used.

Family        ident  log  logit  probit  cloglog  pow  opow  nbinom  loglog  logc
Gaussian      x      x                            x
inv Gaussian  x      x                            x
binomial      x      x    x      x       x        x    x             x       x
Poisson       x      x                            x
neg binomial  x      x                            x          x
gamma         x      x                            x
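For readers new to the term, a link function g connects the mean of the response, mu, to the linear predictor eta = X @ beta through g(mu) = eta. The sketch below writes out three of the links from the table in plain NumPy together with their inverses. The dictionary `LINKS` is a hypothetical illustration, not the statsmodels API.

```python
import numpy as np

# Each entry: (link g mapping the mean to the linear predictor, its inverse)
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (np.log, np.exp),
    "inverse_power": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

mu = np.array([0.98, 1.00, 1.02])   # example means, e.g. price ratios near 1

for name, (g, g_inv) in LINKS.items():
    eta = g(mu)              # scale on which X @ beta is modeled
    recovered = g_inv(eta)   # back to the scale of the mean
    print(name, eta, recovered)
```

The regression coefficients always live on the eta scale, so changing the link changes the meaning (and sign, in the inverse case) of the coefficients even when the fitted means are nearly identical.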

First, the calculation was performed with the Gaussian family. The code is as follows.


# apply GLM(Gaussian) model
md1 = sm.GLM(y1, x1a, family=sm.families.Gaussian())    # Gaussian()
res1 = md1.fit()
print(res1.summary())

plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='g', alpha=0.6)
plt.plot(x1, res1.fittedvalues, 'r-', label='GLM(Gaussian)')
plt.grid(True)

Fig. stock2 vs. stock3, GLM (Gaussian dist.) (scatter_md1.png)

The line fitted by GLM appears identical to the one in the previous figure. Let's compare the summary() outputs.

OLS summary

In [71]: print res0.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.486
Model:                            OLS   Adj. R-squared:                  0.484
Method:                 Least Squares   F-statistic:                     241.1
Date:                Sun, 26 Jul 2015   Prob (F-statistic):           1.02e-38
Time:                        16:18:16   Log-Likelihood:                 803.92
No. Observations:                 257   AIC:                            -1604.
Df Residuals:                     255   BIC:                            -1597.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0013      0.001     -1.930      0.055        -0.003  2.64e-05
x1             0.7523      0.048     15.526      0.000         0.657     0.848
==============================================================================
Omnibus:                       10.243   Durbin-Watson:                   1.997
Prob(Omnibus):                  0.006   Jarque-Bera (JB):               16.017
Skew:                          -0.235   Prob(JB):                     0.000333
Kurtosis:                       4.129   Cond. No.                         73.0
==============================================================================

GLM (Gaussian dist.) summary

In [72]: print res1.summary()

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  257
Model:                            GLM   Df Residuals:                      255
Model Family:                Gaussian   Df Model:                            1
Link Function:               identity   Scale:                0.00011321157031
Method:                          IRLS   Log-Likelihood:                 803.92
Date:                Sun, 26 Jul 2015   Deviance:                     0.028869
Time:                        16:12:11   Pearson chi2:                   0.0289
No. Iterations:                     4
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.0013      0.001     -1.930      0.054        -0.003  2.02e-05
x1             0.7523      0.048     15.526      0.000         0.657     0.847
==============================================================================

The two outputs report quite different sets of statistics.

The OLS summary reports R-squared, AIC, and BIC, whereas the GLM summary omits these and instead reports the deviance, the Pearson chi2 statistic, and so on. Both report the log-likelihood.
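Although the GLM summary does not print AIC or BIC, they can be recovered from the values it does print. The arithmetic below uses the usual definitions AIC = -2 lnL + 2k and BIC = -2 lnL + k ln(n), with k = 2 counting the constant and the slope (an assumption about how the parameters are counted), and all inputs are copied from the two summaries above.

```python
import math

# Values copied from the two summaries in this article
llf = 803.92          # Log-Likelihood (same in the OLS and Gaussian GLM summaries)
n_obs = 257           # No. Observations
k_params = 2          # const and x1
deviance = 0.028869   # GLM Deviance
df_resid = 255        # Df Residuals

aic = -2.0 * llf + 2.0 * k_params
bic = -2.0 * llf + k_params * math.log(n_obs)
scale = deviance / df_resid   # compare with "Scale" in the GLM summary

print(round(aic), round(bic))   # -1604 -1597
print(scale)
```

The rounded AIC and BIC agree with the values printed in the OLS summary, and deviance / Df Residuals reproduces the Scale value of the GLM summary to the precision of the printed deviance.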

The GLM output shows that the link function is set to "identity" (the identity link function). In addition, since the partial regression coefficients are identical for OLS and GLM (-0.0013 and 0.7523), this confirms that the two regressions give the same result.

GLM (Gamma distribution)

Next, I tried fitting a GLM with the Gamma distribution as the family. It is debatable whether the gamma distribution represents stock returns well, but I tried it for the sake of doing a more GLM-like calculation.

The problem in running this calculation is that the log return takes a negative value when the stock price falls, which is outside the support of the gamma distribution. Therefore, the return before taking the logarithm (the price ratio, which is always positive) was used as the y value. (I can't deny that this feels a little forced ...)
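The point about the sign can be checked in a few lines. This sketch uses made-up prices (not the article's data) to show that the price ratio P_t / P_{t-1} stays positive while its logarithm goes negative on down days.

```python
import numpy as np

# Made-up closing prices that rise and fall (not real data)
prices = np.array([100.0, 98.0, 101.0, 99.0, 103.0])

ratio = prices[1:] / prices[:-1]   # price ratio, always > 0
log_ret = np.log(ratio)            # negative whenever the price falls

print(ratio.min() > 0)       # True: usable as a Gamma response
print((log_ret < 0).any())   # True: outside the support of the Gamma distribution
```

Because exp(log return) equals the price ratio, the two responses carry the same information; only the scale on which the model is fitted changes.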

# apply GLM(gamma) model

x2 = x1 ; x2a = x1a
y2 = mydf['stock3_ret'].values    # replaced

md2 = sm.GLM(y2, x2a, family=sm.families.Gamma())
res2 = md2.fit()

# print summary and plot fitting curve
print(res2.summary())
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_ret'], c='c', alpha=0.6)
plt.plot(x2, res2.fittedvalues, 'r-', label='GLM(Gamma)')
plt.grid(True)

y2_fit_log = np.log(res2.fittedvalues)
plt.figure(figsize=(5,4))
plt.scatter(mydf['stock2_lgret'], mydf['stock3_lgret'], c='c', alpha=0.6)
plt.plot(x2, y2_fit_log, 'r-', label='GLM(Gamma)')

Fig. stock2 vs. stock3, GLM (gamma dist.), log - ident (scatter_md2(log-ident).png)

Fig. stock2 vs. stock3, GLM (gamma dist.), log - log, converted y value (scatter_md2(log-log).png)

The fitted line in these graphs looks the same as before. Let's look at summary().

In [73]: print res2.summary()

                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:                      y   No. Observations:                  257
Model:                            GLM   Df Residuals:                      255
Model Family:                   Gamma   Df Model:                            1
Link Function:          inverse_power   Scale:               0.000113369003649
Method:                          IRLS   Log-Likelihood:                 803.72
Date:                Sun, 26 Jul 2015   Deviance:                     0.028956
Time:                        16:12:16   Pearson chi2:                   0.0289
No. Iterations:                     5
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          1.0013      0.001   1502.765      0.000         1.000     1.003
x1            -0.7491      0.048    -15.470      0.000        -0.844    -0.654
==============================================================================
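The summary above lists "Method: IRLS" and "Link Function: inverse_power". As a rough sketch of what that means, the following NumPy code implements iteratively reweighted least squares for a Gamma GLM with the inverse link, 1/mu = X @ beta. It is a textbook-style illustration on synthetic data, not the statsmodels implementation; the function name `fit_gamma_inverse_irls`, the true coefficients (1.0, -0.75), and the shape parameter are all assumptions.

```python
import numpy as np

def fit_gamma_inverse_irls(X, y, n_iter=25, tol=1e-10):
    """Fit a Gamma GLM with the inverse link (1/mu = X @ beta) by IRLS."""
    mu = y.astype(float).copy()   # start from the data itself (y must be > 0)
    eta = 1.0 / mu
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        # Working response and weights for g(mu) = 1/mu with Var(y) proportional to mu^2
        z = eta - (y - mu) / mu ** 2
        w = mu ** 2
        # Weighted least-squares step: solve (X' W X) beta = X' W z
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        eta = X @ beta
        mu = 1.0 / eta
        if converged:
            break
    return beta

# Synthetic positive responses near 1 (illustrative only, not real data)
rng = np.random.default_rng(0)
x = rng.normal(0.0, 0.015, size=500)
X = np.column_stack([np.ones_like(x), x])
mu_true = 1.0 / (1.0 - 0.75 * x)
shape = 8000.0                                # large shape: small relative noise
y = rng.gamma(shape, mu_true / shape)         # Gamma with mean mu_true

beta_hat = fit_gamma_inverse_irls(X, y)
print(beta_hat)
```

Each pass is an ordinary weighted regression, which is why a GLM fit typically converges in a handful of iterations, as the "No. Iterations" line of the summary suggests.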

Changing from GLM (Gaussian dist.) to GLM (gamma dist.) changed the log-likelihood and deviance slightly, but the change is far too small to say that the model has improved. The partial regression coefficients differ because the y value was converted before fitting and because the link function here is inverse_power, so the linear predictor models the reciprocal of the mean of y rather than the mean itself.
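Although the coefficients look very different, the two fits describe almost the same line once they are put on a common scale. The arithmetic below takes the coefficients printed in the two GLM summaries, converts the Gamma fit (1/mu = const + slope * x) to a log return via log(mu), and compares it with the Gaussian fit. Treating log(mu) as comparable to the fitted log return is an approximation that is reasonable only because daily returns are small.

```python
import math

# Coefficients copied from the summaries in this article
gauss_const, gauss_slope = -0.0013, 0.7523   # Gaussian GLM: log return = a + b * x
gamma_const, gamma_slope = 1.0013, -0.7491   # Gamma GLM: 1 / mu = a + b * x

rows = []
for x in (-0.02, 0.0, 0.02):                     # plausible daily log returns of stock2
    mu = 1.0 / (gamma_const + gamma_slope * x)   # fitted price ratio
    from_gamma = math.log(mu)                    # converted to a log return
    from_gauss = gauss_const + gauss_slope * x
    rows.append((x, from_gamma, from_gauss))
    print(x, from_gamma, from_gauss)
```

For small x, 1 / (1.0013 - 0.7491 x) is close to 1 - 0.0013 + 0.7491 x, which explains both the flipped sign of the slope and the constant near 1.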

In addition, histograms of the log returns of stock2 and stock3 were drawn to check the normality of the data. Their shapes are shown in the figure below.

histogram2.png
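A histogram gives a visual impression; a numerical companion is the Jarque-Bera statistic, which the OLS summary above reports for the regression residuals. The sketch below writes out the formula JB = n/6 * (S^2 + (K - 3)^2 / 4), checks it against the skew and kurtosis printed in that summary, and then applies it to a synthetic fat-tailed sample. The helper names are hypothetical and the Student t sample is an assumption standing in for real returns.

```python
import numpy as np

def jarque_bera_from_moments(n, skew, kurtosis):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), with K the (non-excess) kurtosis."""
    return n / 6.0 * (skew ** 2 + (kurtosis - 3.0) ** 2 / 4.0)

def sample_skew_kurtosis(values):
    """Moment-based skewness and kurtosis (a normal sample gives about 0 and 3)."""
    v = np.asarray(values, dtype=float)
    d = v - v.mean()
    m2 = np.mean(d ** 2)
    return np.mean(d ** 3) / m2 ** 1.5, np.mean(d ** 4) / m2 ** 2

# Check the formula against the residual statistics printed in the OLS summary
jb_summary = jarque_bera_from_moments(257, -0.235, 4.129)
print(jb_summary)   # close to the reported Jarque-Bera (JB) value of 16.017

# Synthetic fat-tailed "returns" (Student t, 4 degrees of freedom); not real data
rng = np.random.default_rng(0)
sample = 0.01 * rng.standard_t(df=4, size=2000)
s, k = sample_skew_kurtosis(sample)
jb_sample = jarque_bera_from_moments(sample.size, s, k)
print(s, k, jb_sample)
```

A kurtosis above 3, as in the summary (4.129), indicates heavier tails than a normal distribution, which is the typical way stock returns depart from normality.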

Tentative conclusion

In this analysis, applying GLM did not improve the accuracy of the model. This is probably because the relationship between stock prices in the same industry (over a period of about one year) is not complicated (non-linear). Still, having more tools available for analyzing various kinds of data is no bad thing, so I would like to deepen my understanding of GLM and other advanced regression methods.

GLM did not show its power this time (stock price of automobile manufacturer A vs. that of manufacturer B), but it may prove effective for pairs of a rather different character, for example maximum temperature vs. the stock price of a beer company.

References

- statsmodels documentation http://statsmodels.sourceforge.net/stable/glm.html

- Introduction to Statistics (Department of Statistics, Faculty of Liberal Arts, University of Tokyo) http://www.utp.or.jp/bd/978-4-13-042065-5.html

- Introduction to Statistical Modeling for Data Analysis (Kubo, Iwanami Shoten) https://www.iwanami.co.jp/cgi-bin/isearch?isbn=ISBN978-4-00-006973-1
