Introduction

I mostly study meteorology, but sometimes I want a financial incentive as well. If the Nikkei 225 could be predicted from meteorological factors, it might even be profitable, since few people seem to be paying attention to that angle. With that simple idea in mind, I decided to try it.

Data acquisition

Meteorological data can be downloaded from the Japan Meteorological Agency website. The data used this time are mean sea-level pressure, maximum temperature, mean temperature, mean relative humidity, and precipitation. The Nikkei 225 data was downloaded from macrotrends.

Preprocessing

First, load the libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal

from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

Next, read and preprocess the meteorological data. Mainly, I filled the missing precipitation values with 0 and created a label for the presence or absence of precipitation based on the precipitation amount. Note that `df.fillna(0)` fills missing values in every column, not only precipitation.


tokyo1 = pd.read_csv("data/tokyo_1961-1980.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
tokyo2 = pd.read_csv("data/tokyo_1981-2000.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
tokyo3 = pd.read_csv("data/tokyo_2001-.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
df = pd.concat([tokyo1, tokyo2, tokyo3], ignore_index=True)
df = df.sort_values("date", ignore_index=True)

df=df.drop(["wdr_pr"],axis=1)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

df = df.fillna(0)
df["rain_yn"] = 0
df.loc[df["smpre"] >= 1, "rain_yn"] = 1
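The fill-and-label step above can be tried on a toy frame. This is a minimal sketch, assuming the precipitation column is named `smpre` as in the code above; the `temp` column and all values are made up for illustration. Unlike `df.fillna(0)`, this variant fills only the precipitation column, so gaps in other variables stay visible.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values). "smpre" mirrors the precipitation column above;
# "temp" is a hypothetical second column added only for this illustration.
toy = pd.DataFrame({
    "smpre": [0.0, np.nan, 0.5, 12.0],
    "temp": [5.1, np.nan, 7.3, 6.8],
})

# Fill missing precipitation only, leaving the other columns untouched
toy["smpre"] = toy["smpre"].fillna(0)

# 1 if precipitation is at least 1 (same threshold as above), else 0
toy["rain_yn"] = (toy["smpre"] >= 1).astype(int)

print(toy)
```

Whether a missing temperature or pressure value should really become 0 is a separate question; keeping it as NaN at least makes the gap easy to spot later.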

Next, read the Nikkei 225 data and plot it. Here I use data from 1995 onward, by which time the aftermath of the bubble's collapse seems to have begun to settle down.

nikkei = pd.read_csv("data/nikkei-225-index-historical-chart-data.csv", header=8, parse_dates=["date"],names=["date","value"])
nikkei["year"]=nikkei["date"].dt.year
nikkei=nikkei[nikkei["year"]>=1995]
plt.plot(nikkei["date"],nikkei["value"])
plt.xlabel("year")
plt.ylabel("nikkei")
plt.savefig("transition")

You can see that there are long-term fluctuations. These are probably not caused by weather factors, so there is little point in trying to predict them from weather; I therefore apply a high-pass filter to remove them.


samplerate = 6270  #Waveform sampling rate
x = np.arange(0, round(samplerate/2)) / samplerate  #Creating a time axis for waveform generation

fp = 34  #Passband edge frequency[Hz]
fs = 17  #Blocking frequency[Hz]
gpass = 3  #Maximum loss at the end of the passband[dB]
gstop = 40  #Minimum loss at the blocking edge[dB]

def highpass(x, samplerate, fp, fs, gpass, gstop):
    fn = samplerate / 2                           #Nyquist frequency
    wp = fp / fn                                  #Normalize passband edge frequency with Nyquist frequency
    ws = fs / fn                                  #Normalize the blocking edge frequency with the Nyquist frequency
    N, Wn = signal.buttord(wp, ws, gpass, gstop)  #Calculate order and Butterworth normalized frequencies
    b, a = signal.butter(N, Wn, "high")           #Calculate the numerator and denominator of the filter transfer function
    y = signal.filtfilt(b, a, x)                  #Filter the signal
    return y

nikkei["high"] = highpass(nikkei["value"], samplerate, fp, fs, gpass, gstop)

#Output graph
plt.plot(np.abs(np.fft.fft(nikkei["high"])[:3000]))
plt.xlabel("frequency[Hz]")
plt.ylabel("amplitude")
plt.yscale('log')
plt.savefig("high")

For the filter code, I referred to the article listed under References. The FFT spectra before and after filtering are as follows.

[Figure: FFT amplitude spectra of the raw data and after applying the high-pass filter]

The scale makes it a little hard to see, but the low-frequency components have been removed. The resulting Nikkei 225 series is as follows.

This remaining variation might be explainable by meteorological factors.
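To get a feel for what the filter settings above mean in calendar terms, the edge frequencies can be converted to periods. This is a rough sketch, assuming one sample per trading day and reading `samplerate` as the number of samples the frequencies are counted over; the helper `cycles_to_period` is a name introduced here for illustration. Since `wp = fp / (samplerate / 2)` is a fraction of the Nyquist frequency (0.5 cycles per sample), the passband edge sits at `fp / samplerate` cycles per sample.

```python
samplerate = 6270  # value used above
fp = 34            # passband edge used above
fs = 17            # stopband edge used above

def cycles_to_period(freq, samplerate):
    """Period in samples for a frequency given in cycles per `samplerate` samples."""
    return samplerate / freq

pass_period = cycles_to_period(fp, samplerate)
stop_period = cycles_to_period(fs, samplerate)

# Prints: 184 369
print(round(pass_period), round(stop_period))
```

Under these assumptions, fluctuations with periods shorter than about 184 trading days pass through, while those longer than about 369 trading days are strongly suppressed. Because `filtfilt` runs the filter forward and backward, the actual attenuation in dB is roughly twice the `gpass` and `gstop` figures.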

Merge the two datasets, then use the data up to 2017 as the training set and the rest as the test set.


df=df.fillna(0)
df_merge=pd.merge(df, nikkei, on="date")
df_merge

train = df_merge[df_merge["year_y"] <= 2017]
x_train = train.drop(["high","value", "date"], axis=1)
y_train = train["high"]
test = df_merge[df_merge["year_y"] > 2017]
x_test = test.drop(["high","value", "date"], axis=1)
y_test = test["high"]
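The `year_y` column used above is not in either input file; it appears because both frames carry a `year` column, and `pd.merge` renames clashing non-key columns with `_x` (left) and `_y` (right) suffixes. A minimal sketch on made-up data (the `temp` column and all values are hypothetical):

```python
import pandas as pd

dates = pd.to_datetime(["2017-12-29", "2018-01-04", "2018-01-05"])

# Made-up stand-ins for the weather and Nikkei frames
weather = pd.DataFrame({"date": dates, "temp": [6.0, 4.5, 3.9]})
prices = pd.DataFrame({"date": dates, "high": [10.0, -25.0, 40.0]})
weather["year"] = weather["date"].dt.year
prices["year"] = prices["date"].dt.year

# Both frames have "year", so the merge yields "year_x" and "year_y"
merged = pd.merge(weather, prices, on="date")
print(merged.columns.tolist())

# Time-based split: everything up to 2017 for training, the rest for testing
train_toy = merged[merged["year_y"] <= 2017]
test_toy = merged[merged["year_y"] > 2017]
print(len(train_toy), len(test_toy))
```

One consequence is that both `year_x` and `year_y` remain among the features unless one of them is dropped explicitly.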

Random forest was used for the prediction.


model_fore=RandomForestRegressor(n_estimators=50,max_depth=5).fit(x_train, y_train)
y_pred=model_fore.predict(x_test)
print("test")
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("r^2 :", r2_score(y_test, y_pred), "\n")

print("train")
y_pred=model_fore.predict(x_train)
print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred)))
print("r^2 :", r2_score(y_train, y_pred), "\n")


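# Pair each importance with its feature name. x_train.columns is used rather than
# df.columns because df still contains "date" and lacks the merged columns, so its
# names do not line up with the features the model was fitted on.
dic = dict(zip(x_train.columns, model_fore.feature_importances_))
sorted(dic.items(), key=lambda x: x[1], reverse=True)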

The results were as follows.

test RMSE: 938 r^2 : -0.031
train RMSE: 513 r^2 : 0.124
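A negative r^2 on the test set has a concrete meaning: the model does worse than simply predicting the mean of the test targets every time. A small sketch with made-up numbers, using the same `r2_score` function as above:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean gives r^2 = 0 by definition
baseline = np.full_like(y_true, y_true.mean())

# Predictions that run opposite to the truth are worse than the mean
worse = np.array([4.0, 3.0, 2.0, 1.0])

print(r2_score(y_true, baseline))  # 0.0
print(r2_score(y_true, worse))     # -3.0
```

A test score of -0.031 is therefore essentially the "predict the mean" baseline.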

It's completely useless. I also tried linear regression and a neural network, but the results were almost the same. After all, the Nikkei 225 and meteorological factors do not seem to be related. Sorry about that. For reference, the variable importances were as follows.

variable	importance
Year	0.376
Month	0.197
Average temperature	0.123
Day	0.088
Humidity	0.067
Barometric pressure	0.066
Highest temperature	0.063
Precipitation	0.019
Presence or absence of precipitation	0.001

You can see that the variables the forecast relied on most are the year and the month, which have nothing to do with the weather... Since the prediction failed completely this time, I'd like to try again by checking whether there is any statistical correlation at all.
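One way such a correlation check could start is a plain Pearson correlation between a weather variable and the filtered index. This is only a sketch on synthetic, independent data, not a result: the variable names and distributions are assumptions made up for illustration, and `scipy.stats.pearsonr` is just one possible choice of test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins, generated independently of each other
temperature = rng.normal(15.0, 8.0, size=n)
filtered_index = rng.normal(0.0, 500.0, size=n)

# Pearson correlation coefficient and two-sided p-value
r, p_value = stats.pearsonr(temperature, filtered_index)
print(f"r = {r:.3f}, p = {p_value:.3f}")
```

With the real data, `temperature` and `filtered_index` would be replaced by columns of `df_merge`. Daily series are autocorrelated, so a p-value computed this way would tend to be too optimistic.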

References

- Apply a low-pass filter with Python's SciPy!

[PYTHON] Does the Nikkei 225 Depend on Meteorological Factors?
