Weather forecasters produce seasonal temperature outlooks from weather maps and from sea-surface temperature maps related to the El Nino phenomenon. Here, instead of relying on such expert knowledge, I looked into whether a "hot summer" can be predicted from past temperature data alone.
For the base temperature data set, I referred to the Japan Meteorological Agency's "long-term changes in temperature and precipitation in the world and Japan", published as part of its "global warming" information. In addition to the annual mean temperature for Japan, it includes seasonal mean temperatures, so I decided to use those.
The formatted csv file has the following contents. The seasonal average temperature consists of data for spring (March-May), summer (June-August), autumn (September-November), and winter (December-February).
# Data source : www.data.go.jp,,,,,,
# column1 -Annual mean temperature deviation in Japan (℃),,,,,,
# column2 -Annual mean precipitation deviation in Japan (mm),,,,,,
# column3..6 -Japan's seasonal average temperature deviation (℃) Mar-May, Jun-Aug, Sep-Nov, Dec-Feb,,,,,,
Year,temp_yr,rain_yr,spring,summer,autumn,winter
1898,-0.75,15.1,-0.98,-1.7,-0.53,-0.67
1899,-0.81,199.2,-0.27,-0.21,-0.6,-2.12
1900,-1.06,-43.3,-1.35,-1.09,-1.11,-0.41
1901,-1.03,48.6,-0.74,-0.89,-1.26,-0.87
1902,-1.03,154.7,-1.65,-0.87,-2.2,-0.43
1903,-0.77,266.2,0.73,-0.19,-1.5,-0.93
1904,-0.86,-48.6,-1.37,-0.75,-0.16,-1.68
1905,-0.95,256.9,-0.55,-1.45,-1.39,-0.86
(Omitted)
The data have been accumulated and published for over 100 years, since 1898. For both temperature and precipitation, what is published is the deviation from the long-term average rather than the raw values.
http://www.data.jma.go.jp/cpdinfo/temp/index.html
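As a minimal sketch (the filename japan_temp.csv is my assumption, not from the original post), the formatted CSV above can be loaded into the tempdf DataFrame used throughout the rest of this post as follows.
# Load the formatted CSV (sketch; 'japan_temp.csv' is an assumed filename)
import pandas as pd
import matplotlib.pyplot as plt
tempdf = pd.read_csv('japan_temp.csv', comment='#', index_col='Year')
print(tempdf.head())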
First, the temperature deviation of each season was plotted in chronological order.
** Fig. Seasonal Temperature Trend **
The values fluctuate a lot from year to year and from season to season. To make the trend clearer, a 10-year moving average was calculated with pandas.rolling_mean() and plotted (spring and summer only), as sketched below.
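A minimal sketch of that calculation, using the tempdf loaded above. Note that pandas.rolling_mean() has since been deprecated; in recent pandas the equivalent is Series.rolling(window=10).mean(), which is what this sketch uses.
# 10-year moving average of the seasonal temperature deviations (sketch)
plt.figure(figsize=(7, 4))
for season in ['spring', 'summer']:
    # rolling(10).mean() is the current equivalent of pandas.rolling_mean(s, 10)
    tempdf[season].rolling(window=10).mean().plot(label=season)
plt.legend(loc='best')
plt.grid(True)
plt.ylabel('temperature deviation (deg C)')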
** Fig. Seasonal Temperature Trend (moving average) **
The rising lines give a sense of the ongoing climate change. Also, since the two lines move in similar ways, it seems there is at least some hope of "anticipating a hot summer".
Linear Regression Model (OLS, 1 variable)
First, I considered the following model: ** (temperature deviation of a given summer) ~ (temperature deviation of the preceding spring) **. Regression analysis was performed with statsmodels' OLS().
# Regression Analysis (OLS): summer ~ const + spring
import statsmodels.api as sm
x1 = tempdf['spring'].values
x1a = sm.add_constant(x1)          # add intercept column
y1 = tempdf['summer'].values
model1 = sm.OLS(y1, x1a)
res1 = model1.fit()
print(res1.summary())
# scatter of the data with the fitted regression line
plt.figure(figsize=(5,4))
plt.scatter(tempdf['spring'], tempdf['summer'], c='b', alpha=0.6)
plt.plot(x1, res1.fittedvalues, 'r-', label='OLS(model1)')
plt.legend(loc='best')
plt.grid(True)
** Fig. Spring Temperature Deviation (x-axis) vs. Summer Temperature Deviation (y-axis) **
The correlation is weaker than expected, although the fitted line does show a positive relationship. The summary() information was as follows.
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.282
Model: OLS Adj. R-squared: 0.276
Method: Least Squares F-statistic: 45.26
Date: Mon, 10 Aug 2015 Prob (F-statistic): 7.07e-10
Time: 16:39:14 Log-Likelihood: -107.55
No. Observations: 117 AIC: 219.1
Df Residuals: 115 BIC: 224.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.3512 0.068 -5.148 0.000 -0.486 -0.216
x1 0.4489 0.067 6.727 0.000 0.317 0.581
==============================================================================
Omnibus: 0.159 Durbin-Watson: 1.571
Prob(Omnibus): 0.924 Jarque-Bera (JB): 0.075
Skew: 0.062 Prob(JB): 0.963
Kurtosis: 2.990 Cond. No. 1.88
==============================================================================
As shown above, the correlation is weak, with R-squared = 0.282 (i.e., a correlation coefficient of roughly 0.53).
Linear Regression Model (OLS, 2 vars, multiple)
I then thought that not only the spring but also the preceding winter temperature might matter, so I considered the following model: ** (summer temperature) ~ (winter temperature (6 months earlier)) & (spring temperature) **
This is a multiple regression model. Since each row of the DataFrame holds the deviations of the same calendar year, shift() is applied to the winter temperature so that each row refers to the previous year's winter.
# Regression Analysis (OLS): summer ~ b0 + b1*winter.shift(1) + b2*spring
tempdf['winter_last'] = tempdf['winter'].shift(1)              # previous year's winter
X2 = tempdf[['winter_last', 'spring']].iloc[1:, :].values      # drop first row (NaN after shift)
X2a = sm.add_constant(X2)
y2 = tempdf['summer'].iloc[1:].values
model2 = sm.OLS(y2, X2a)
res2 = model2.fit()
print(res2.summary())
# combine the two predictors with the fitted coefficients for a 1-D plot
x2_model2 = res2.params[1] * tempdf['winter_last'] + res2.params[2] * tempdf['spring']
plt.figure(figsize=(5,4))
plt.scatter(x2_model2[1:], y2, c='g', alpha=0.6)
plt.plot(x2_model2[1:], res2.fittedvalues, 'r-', label='OLS(model2)')
plt.legend(loc='best')
plt.grid(True)
** Fig. Winter and spring temperature anomalies (x-axis) vs. Summer temperature anomalies (y-axis) **
(OLS: summer ~ b0 + b1 x winter.shift(1) + b2 x spring)
With R-squared = 0.303, the improvement over the previous model is negligible. The regression parameters are as follows (excerpt from the summary() output); only the p-value of x1 (0.065) is somewhat large.
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
const -0.2927 0.073 -4.007 0.000 -0.437 -0.148
x1 0.1609 0.086 1.863 0.065 -0.010 0.332
x2 0.3984 0.070 5.674 0.000 0.259 0.537
==============================================================================
The R-squared values show that predicting the summer temperature trend with high accuracy is quite difficult with this approach.
PyMC3 Trial (Linear Regression)
The analysis above pretty much settles the outcome of the "summer temperature forecast", but concluding with "the correlation is weak" is boring, so I tried Bayesian estimation of the regression parameters with PyMC3, a Python implementation of MCMC. (It was a while ago, but installing PyMC3 gave me quite a bit of trouble.)
The problem setting is as follows.
- Bayesian estimation of a linear regression model. The model is the second one above: (summer ~ beta0 (intercept) + beta1 x winter + beta2 x spring)
- Following the idea of non-informative priors, the variance parameter sigma is given an appropriately broad prior distribution.
Referring to the PyMC3 tutorial page, I prepared the following code and ran the simulation.
# using PyMC3
import pymc3 as pm
model3 = pm.Model()
myX1 = tempdf['winter_last'].iloc[1:].values
myX2 = tempdf['spring'].iloc[1:].values
myY = y2
# model definition (step.1)
with model3:
    # Priors for unknown model parameters
    intercept = pm.Normal('intercept', mu=0, sd=10)
    beta = pm.Normal('beta', mu=0, sd=10, shape=2)
    sigma = pm.HalfNormal('sigma', sd=1)
    # Expected value of outcome
    mu = intercept + beta[0]*myX1 + beta[1]*myX2
    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=myY)
# run simulation (step.2)
from scipy import optimize
with model3:
    # obtain starting values via MAP
    start = pm.find_MAP(fmin=optimize.fmin_powell)
    # instantiate sampler
    step = pm.NUTS(scaling=start)
    # draw 500 posterior samples
    trace = pm.sample(500, step, start=start)
pm.summary(trace)
pm.traceplot(trace)
As for PyMC3, the source code is published on GitHub, but unfortunately the documentation was decidedly lacking at this time (August 2015). Combined with my own shaky grasp of Bayesian statistical theory, it made for double and triple the pain. What I learned this time:
- PyMC3 is not compatible with its previous version (PyMC 2.x.x), so the PyMC documentation is of limited help.
- Search for example Python code of what you want to do. There is an examples folder on GitHub (not very abundant, but it provides code for typical problems).
- If you want details such as the arguments of a class or function, you have to read the source. (Thanks to Python's readable structure and the explanations in each file's header, this is less painful than it sounds.)
The results of the obtained simulation are as follows.
** Fig. Traceplot **
(I'm not sure why the status of "sigma_log" is output ...)
** Summary (trace) output **
intercept:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
-0.286 0.080 0.005 [-0.431, -0.127]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-0.435 -0.341 -0.286 -0.235 -0.129
beta:
Mean SD MC Error 95% HPD interval
-------------------------------------------------------------------
0.161 0.095 0.005 [-0.024, 0.347]
0.407 0.077 0.004 [0.266, 0.568]
Posterior quantiles:
2.5 25 50 75 97.5
|--------------|==============|==============|--------------|
-0.017 0.095 0.159 0.222 0.361
0.252 0.362 0.405 0.459 0.559
(Omitted)
Here, a 95% highest posterior density (HPD) interval is reported instead of the 95% confidence interval (C.I.) that OLS reported. Personally, I would prefer the output in a more compact format (tighter line spacing).
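As a quick check (a sketch of mine, not part of the original analysis), the posterior percentiles can also be pulled directly from the trace with numpy; they should roughly match the "Posterior quantiles" rows above.
# Posterior 2.5% / 97.5% percentiles from the trace (sketch)
import numpy as np
beta_samples = trace['beta']                        # shape: (n_samples, 2)
print(np.percentile(beta_samples, [2.5, 97.5], axis=0))
print(np.percentile(trace['intercept'], [2.5, 97.5]))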
From this simulation result, a scatter plot was drawn.
# Scatter plot of the data plus one faint regression line per posterior sample
myX = trace['beta'][-1][0] * myX1 + trace['beta'][-1][1] * myX2   # x-axis built from the last sample's betas
plt.figure(figsize=(5,4))
plt.scatter(myX, myY, c='g', alpha=0.6)
plt.grid(True)
for i in range(len(trace)):
    # each posterior sample gives a slightly different line
    plt.plot(myX, trace['beta'][i][0] * myX1 + trace['beta'][i][1] * myX2, color='blue', alpha=0.005)
** Fig. Winter and spring temperature anomalies (x-axis) vs. Summer temperature anomalies (y-axis) **
| | intercept | beta1 | beta2 |
|---|---|---|---|
| OLS | -0.293 | 0.161 | 0.398 |
| MCMC (500 steps) | -0.286 | 0.161 | 0.407 |
| MCMC (10000 steps) | -0.291 | 0.163 | 0.397 |
The regression parameters were almost the same.
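For reference, a sketch of how the values in the table can be read off: the MCMC rows are posterior means from the trace, and the OLS row comes from res2.params.
# Posterior means vs. OLS coefficients (sketch)
print(res2.params)                  # OLS: const, x1, x2
print(trace['intercept'].mean())    # MCMC intercept
print(trace['beta'].mean(axis=0))   # MCMC beta1, beta2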
The result is a little disappointing given the original aim of "anticipating a hot summer". The model was probably too simple; introducing elements of time-series analysis (an AR model, etc.) might improve it, as sketched below.
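Purely as an illustration of that direction (my own assumption, not something done in this post), one simple extension would be to add the previous summer's deviation as a lagged, AR(1)-like regressor to the OLS model.
# Sketch: add the previous summer's deviation as a lagged (AR-like) regressor
tempdf['summer_last'] = tempdf['summer'].shift(1)
X3 = tempdf[['summer_last', 'winter_last', 'spring']].iloc[1:, :].values
X3a = sm.add_constant(X3)
y3 = tempdf['summer'].iloc[1:].values
res3 = sm.OLS(y3, X3a).fit()
print(res3.summary())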
I have only just started using PyMC3, and I still need to study both Bayesian statistical theory and how to use the module. There is plenty left to do, such as tuning the computation parameters and checking convergence. (In a good sense, it is a tool (toy) worth tinkering with.)
As for the lack of information on PyMC3, I will wait for the developers to flesh out the documentation; in the meantime, I would like to deepen my understanding of modeling and analysis methods by studying examples from already proven tools such as BUGS and Stan.