[PYTHON] Predict hot summers with a linear regression model

Introduction

Weather forecasters make seasonal temperature forecasts using weather-map information and sea-surface-temperature maps related to the El Niño phenomenon. Here, I investigated whether a "hot summer" can be predicted from past temperature data alone, without such expert knowledge.

First, for the base temperature data set, the Japan Meteorological Agency publishes "long-term changes in temperature and precipitation in the world and Japan" as part of its global-warming information, so I used that. In addition to Japan's annual mean temperature, it includes seasonal mean temperatures, which are what I used.

The formatted CSV file has the following contents. The seasonal mean temperatures consist of data for spring (March-May), summer (June-August), autumn (September-November), and winter (December-February).

# Data source : www.data.go.jp
# column1 - Annual mean temperature deviation in Japan (℃)
# column2 - Annual mean precipitation deviation in Japan (mm)
# column3..6 - Japan's seasonal mean temperature deviation (℃): Mar-May, Jun-Aug, Sep-Nov, Dec-Feb
Year,temp_yr,rain_yr,spring,summer,autumn,winter 
1898,-0.75,15.1,-0.98,-1.7,-0.53,-0.67
1899,-0.81,199.2,-0.27,-0.21,-0.6,-2.12
1900,-1.06,-43.3,-1.35,-1.09,-1.11,-0.41
1901,-1.03,48.6,-0.74,-0.89,-1.26,-0.87
1902,-1.03,154.7,-1.65,-0.87,-2.2,-0.43
1903,-0.77,266.2,0.73,-0.19,-1.5,-0.93
1904,-0.86,-48.6,-1.37,-0.75,-0.16,-1.68
1905,-0.95,256.9,-0.55,-1.45,-1.39,-0.86
(Omitted)

Data has been accumulated and published for more than 100 years, going back to 1898. The temperature (and precipitation) values are published as deviations from a long-term average, not as raw measurements.

http://www.data.jma.go.jp/cpdinfo/temp/index.html
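For reference, a file like this can be loaded with pandas, since read_csv can skip the "#" comment lines. A minimal sketch using the first rows shown above (inlined via io.StringIO for illustration; in practice you would pass the path of the saved CSV file):

```python
import io
import pandas as pd

# First rows of the formatted file; comment lines start with '#'
csv_text = """\
# Data source : www.data.go.jp
Year,temp_yr,rain_yr,spring,summer,autumn,winter
1898,-0.75,15.1,-0.98,-1.7,-0.53,-0.67
1899,-0.81,199.2,-0.27,-0.21,-0.6,-2.12
1900,-1.06,-43.3,-1.35,-1.09,-1.11,-0.41
"""

# comment='#' drops the header comments; index_col='Year' indexes by year
tempdf = pd.read_csv(io.StringIO(csv_text), comment='#', index_col='Year')
print(tempdf.loc[1898, 'summer'])  # -1.7
```

Indexing by year makes lookups such as tempdf.loc[1898, 'summer'] straightforward in the analysis below.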

First, I plotted each season's temperature deviation in chronological order.

**Fig. Seasonal Temperature Trend** temp_TL1.png

The values fluctuate considerably from year to year and from season to season. For clarity, I calculated a 10-year moving average with pandas.rolling_mean() and plotted it (spring and summer only).

**Fig. Seasonal Temperature Trend (moving average)** temp_TL2.png
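pandas.rolling_mean() was the API at the time; in current pandas the same 10-year moving average is written with Series.rolling(). A small sketch on a toy series (the values are illustrative, not the full dataset):

```python
import pandas as pd

# Toy stand-in for one seasonal temperature-deviation column
s = pd.Series([-0.98, -0.27, -1.35, -0.74, -1.65, 0.73,
               -1.37, -0.55, -0.20, 0.10, 0.30, -0.40])

# 10-year moving average; pd.rolling_mean(s, 10) was the old equivalent
ma = s.rolling(window=10).mean()
print(ma.iloc[9])  # first non-NaN value: mean of the first 10 entries
```

The first window - 1 entries are NaN, which is why the smoothed curves in the figure start later than the raw series.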

The rising lines make the climate-change trend easy to see. Also, since the two lines move in a similar way, I felt there was at least some hope of "anticipating a hot summer".

Linear Regression Model (OLS, 1 variable)

First, I considered the following model:

**(summer temperature deviation) ~ (previous spring's temperature deviation)**

Regression analysis was performed with statsmodels' OLS().

# Regression analysis (OLS): summer ~ spring

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

x1 = tempdf['spring'].values
x1a = sm.add_constant(x1)    # prepend the intercept column
y1 = tempdf['summer'].values

model1 = sm.OLS(y1, x1a)
res1 = model1.fit()
print(res1.summary())

plt.figure(figsize=(5, 4))
plt.scatter(tempdf['spring'], tempdf['summer'], c='b', alpha=0.6)
plt.plot(x1, res1.fittedvalues, 'r-', label='OLS (model1)')
plt.grid(True)

**Fig. Spring Temperature Deviation (x-axis) vs. Summer Temperature Deviation (y-axis)** temp_trend_scatter_model1.png

The correlation is weaker than I expected, although the fitted line does show a positive relationship. The summary() output was as follows.

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.282
Model:                            OLS   Adj. R-squared:                  0.276
Method:                 Least Squares   F-statistic:                     45.26
Date:                Mon, 10 Aug 2015   Prob (F-statistic):           7.07e-10
Time:                        16:39:14   Log-Likelihood:                -107.55
No. Observations:                 117   AIC:                             219.1
Df Residuals:                     115   BIC:                             224.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.3512      0.068     -5.148      0.000        -0.486    -0.216
x1             0.4489      0.067      6.727      0.000         0.317     0.581
==============================================================================
Omnibus:                        0.159   Durbin-Watson:                   1.571
Prob(Omnibus):                  0.924   Jarque-Bera (JB):                0.075
Skew:                           0.062   Prob(JB):                        0.963
Kurtosis:                       2.990   Cond. No.                         1.88
==============================================================================

As the table shows, the correlation is weak, with R-squared = 0.282.

Linear Regression Model (OLS, 2 vars, multiple)

Next, I thought the temperature of the preceding winter, not just spring, might also matter, so I considered the following model:

**(summer temperature) ~ (winter temperature six months earlier) & (spring temperature)**

This is a multiple regression model. Since each DataFrame row holds the deviations for the same year, shift() is applied to the winter column so that each row refers to the previous year's winter.
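As a quick illustration of what shift() does here (toy values, not the real analysis):

```python
import pandas as pd

# Each row holds the deviations observed in that year
df = pd.DataFrame({'winter': [-0.67, -2.12, -0.41]},
                  index=[1898, 1899, 1900])

# shift(1) moves values down one row: row 1899 now carries winter 1898
df['winter_last'] = df['winter'].shift(1)
print(df)
```

The first row becomes NaN after the shift, which is why the regression drops it with iloc[1:, :].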

# Regression analysis (OLS): summer ~ b0 + b1*winter.shift(1) + b2*spring
tempdf['winter_last'] = tempdf['winter'].shift(1)
X2 = tempdf[['winter_last', 'spring']].iloc[1:, :].values   # drop the NaN first row
X2a = sm.add_constant(X2)
y2 = tempdf['summer'].iloc[1:].values

model2 = sm.OLS(y2, X2a)
res2 = model2.fit()
print(res2.summary())

# Combine the two predictors into a single axis for plotting
x2_model2 = res2.params[1] * tempdf['winter_last'] + res2.params[2] * tempdf['spring']
plt.figure(figsize=(5, 4))
plt.scatter(x2_model2[1:], y2, c='g', alpha=0.6)
plt.plot(x2_model2[1:], res2.fittedvalues, 'r-', label='OLS (model2)')
plt.grid(True)

**Fig. Winter and spring temperature anomalies (x-axis) vs. Summer temperature anomalies (y-axis)** temp_trend_scatter_model2.png

(OLS: summer ~ b0 + b1 x winter.shift(1) + b2 x spring)

With R-squared = 0.303, the improvement over the previous model is negligible. The regression parameters are as follows (excerpt from the summary() output). Only the p-value of x1, the winter term (0.065), is somewhat large.

==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -0.2927      0.073     -4.007      0.000        -0.437    -0.148
x1             0.1609      0.086      1.863      0.065        -0.010     0.332
x2             0.3984      0.070      5.674      0.000         0.259     0.537
==============================================================================


Judging from the R-squared values, it is quite difficult to predict the temperature trend with high accuracy by this method.

PyMC3 Trial (Linear Regression)

At this point the game of "summer temperature forecasting" is all but decided, but the conclusion "the correlation is weak" is boring, so I tried Bayesian estimation of the regression parameters with PyMC3, a Python implementation of MCMC. (It was a while ago now, but installing PyMC3 gave me a lot of trouble.)

The problem setup is as follows.

- Bayesian estimation of a linear regression model. The model is the second one above: summer ~ beta0 (intercept) + beta1 x winter + beta2 x spring.
- Following the idea of non-informative priors, the variance parameter sigma is given a suitably broad prior distribution.

Referring to the PyMC3 tutorial page, I prepared the following code and ran the simulation.

# using PyMC3

import pymc3 as pm

model3 = pm.Model()

myX1 = tempdf['winter_last'].iloc[1:].values
myX2 = tempdf['spring'].iloc[1:].values
myY = y2

# model definition (step.1)
with model3:
    # Priors for unknown model parameters
    intercept = pm.Normal('intercept', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)
    # Expected value of outcome
    mu = intercept + beta[0]*myX1 + beta[1]*myX2
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=myY)

# run simulation (step.2)
from scipy import optimize
with model3:
    # obtain starting values via MAP
    start = pm.find_MAP(fmin=optimize.fmin_powell)
    # instantiate sampler
    step = pm.NUTS(scaling=start)
    # draw 500 posterior samples
    trace = pm.sample(500, step, start=start)

pm.summary(trace)
pm.traceplot(trace)

Regarding PyMC3, the source is published on GitHub, but unfortunately the documentation was decisively lacking at the time of writing (August 2015). Combined with my own shaky understanding of Bayesian statistical theory, it was a case of double and triple pain. What I learned this time:

- PyMC3 is not compatible with the previous version (PyMC 2.x), so the PyMC documentation is not much help.
- Search for example Python code that does what you want. There is an examples folder on GitHub (not very abundant, but it covers typical problems).
- To learn details such as class and function arguments, you have to read the source. (It helps that explanations are included in the file headers and that Python's structure is easy to follow.)

The results of the obtained simulation are as follows.

**Fig. Traceplot** temp_trend_traceplot1.png

(I'm not sure why the status of "sigma_log" is output ...)

**Summary(trace) output**

intercept:
 
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  -0.286           0.080            0.005            [-0.431, -0.127]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.435         -0.341         -0.286         -0.235         -0.129


beta:
 
  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.161            0.095            0.005            [-0.024, 0.347]
  0.407            0.077            0.004            [0.266, 0.568]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  -0.017         0.095          0.159          0.222          0.361
  0.252          0.362          0.405          0.459          0.559

(Omitted)

Here, a 95% highest posterior density (HPD) interval is reported instead of the 95% confidence interval that OLS printed. Personally, I would prefer the output in a tighter format, with less line spacing.
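The difference is that an HPD interval is the narrowest interval containing the given posterior mass, while a central interval cuts off equal tails. A rough sketch of computing an HPD interval directly from posterior samples (my own illustration, not PyMC3's implementation):

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Narrowest interval containing `prob` of the samples (HPD sketch)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.ceil(prob * n))           # number of points the interval must cover
    widths = x[k - 1:] - x[:n - k + 1]   # width of every window of k points
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=20000)
print(hpd_interval(samples))  # close to (-1.96, 1.96) for a standard normal
```

For a symmetric, unimodal posterior the HPD and central intervals nearly coincide, which is why the figures above look similar to the OLS confidence intervals.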

From this simulation result, a scatter plot was drawn.

# x-axis: linear combination of the predictors using the last posterior sample
myX = trace['beta'][-1][0] * myX1 + trace['beta'][-1][1] * myX2

plt.figure(figsize=(5, 4))
plt.scatter(myX, myY, c='g', alpha=0.6)
plt.grid(True)

# overlay one faint regression line per posterior sample
for i in range(len(trace)):
    plt.plot(myX, trace['beta'][i][0] * myX1 + trace['beta'][i][1] * myX2,
             color='blue', alpha=0.005)

**Fig. Winter and spring temperature anomalies (x-axis) vs. Summer temperature anomalies (y-axis)** temp_trend_scatter_model3.png

Comparison of results between OLS and MCMC (PyMC3)

|                    | intercept | beta1 | beta2 |
|--------------------|-----------|-------|-------|
| OLS                | -0.293    | 0.161 | 0.398 |
| MCMC (500 steps)   | -0.286    | 0.161 | 0.407 |
| MCMC (10000 steps) | -0.291    | 0.163 | 0.397 |

The regression parameters were almost the same.

Summary and future issues

The result was somewhat disappointing for the original goal of "anticipating a hot summer". The model was probably too simple; this might be improved by introducing a model that incorporates elements of time-series analysis (an AR model, etc.).

I have only just started using PyMC3, and I still need to study both Bayesian statistical theory and how to use the module. There are many things left to do, such as tuning the sampling parameters and checking convergence. (In a good sense, it is a toy worth tinkering with.)

As for the shortage of PyMC3 information, while waiting for the maintainers to flesh out the documentation, I would like to deepen my understanding of modeling and analysis methods by studying proven examples from BUGS and Stan.

