I mainly study meteorology, but now and then I want some financial motivation. It occurred to me that if the Nikkei 225 could be predicted from meteorological factors, few people would be paying attention to that, so it might be profitable. That was my simple line of thinking, so I decided to try it.
Meteorological data can be downloaded from the Japan Meteorological Agency website. The data used this time are mean sea-level pressure, maximum temperature, average temperature, average relative humidity, and precipitation. The Nikkei 225 data was downloaded from macrotrends.
First, load the libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import signal
from sklearn import linear_model
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
Next, read and preprocess the meteorological data. I filled the missing values (mainly in precipitation) with 0 and created a label for the presence or absence of precipitation based on the precipitation amount.
tokyo1 = pd.read_csv("data/tokyo_1961-1980.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
tokyo2 = pd.read_csv("data/tokyo_1981-2000.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
tokyo3 = pd.read_csv("data/tokyo_2001-.csv", header=2,skiprows=[4],encoding="shift-jis",parse_dates=["date"])
df = pd.concat([tokyo1, tokyo2, tokyo3], ignore_index=True)
df = df.sort_values("date", ignore_index=True)
df=df.drop(["wdr_pr"],axis=1)
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day
df = df.fillna(0)
df["rain_yn"] = 0
df.loc[df["smpre"] >= 1, "rain_yn"] = 1
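Note that `df.fillna(0)` above fills every column, not only precipitation. As a minimal sketch of the narrower variant, assuming you want to leave gaps in the other variables visible, the fill and the label can be restricted to the precipitation column. The toy frame and the `avtemp` column name are hypothetical; `smpre` and `rain_yn` are the names used above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather table ("avtemp" is a hypothetical column name)
toy = pd.DataFrame({
    "smpre": [0.0, np.nan, 0.5, 12.0],   # daily precipitation, with a gap
    "avtemp": [5.0, np.nan, 7.0, 8.0],   # another variable, with a gap
})

# Fill only the precipitation column; other gaps stay as NaN
toy["smpre"] = toy["smpre"].fillna(0)

# 1 if precipitation is at least 1 (same threshold as above), else 0
toy["rain_yn"] = (toy["smpre"] >= 1).astype(int)

print(toy)
```

Whether filling temperature or pressure gaps with 0 is acceptable depends on how many gaps there are, so checking `df.isna().sum()` before filling is a cheap sanity check.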
Next, read the Nikkei average and plot it. Here I use data from 1995 onward, when the aftermath of the bubble bursting seems to have begun to settle down.
nikkei = pd.read_csv("data/nikkei-225-index-historical-chart-data.csv", header=8, parse_dates=["date"],names=["date","value"])
nikkei["year"]=nikkei["date"].dt.year
nikkei=nikkei[nikkei["year"]>=1995]
plt.plot(nikkei["date"],nikkei["value"])
plt.xlabel("year")
plt.ylabel("nikkei")
plt.savefig("transition")
You can see that there are long-term fluctuations. They are probably not caused by weather factors, so rather than trying to predict them, I remove them with a high-pass filter.
samplerate = 6270 #Waveform sampling rate
x = np.arange(0, round(samplerate/2)) / samplerate #Creating a time axis for waveform generation
fp = 34 #Passband edge frequency[Hz]
fs = 17 #Blocking frequency[Hz]
gpass = 3 #Maximum loss at the end of the passband[dB]
gstop = 40 #Minimum loss at the blocking edge[dB]
def highpass(x, samplerate, fp, fs, gpass, gstop):
    fn = samplerate / 2 #Nyquist frequency
    wp = fp / fn #Normalize passband edge frequency with Nyquist frequency
    ws = fs / fn #Normalize the blocking edge frequency with the Nyquist frequency
    N, Wn = signal.buttord(wp, ws, gpass, gstop) #Calculate order and Butterworth normalized frequencies
    b, a = signal.butter(N, Wn, "high") #Calculate the numerator and denominator of the filter transfer function
    y = signal.filtfilt(b, a, x) #Filter the signal
    return y
nikkei["high"] = highpass(nikkei["value"], samplerate, fp, fs, gpass, gstop)
#Output graph
plt.plot(np.abs(np.fft.fft(nikkei["high"])[:3000]))
plt.xlabel("frequency[Hz]")
plt.ylabel("amplitude")
plt.yscale('log')
plt.savefig("high")
I referred to the article listed at the end of this post. The FFT spectra before and after filtering are as follows.
Raw data | After applying the high-pass filter |
---|---|
(figure: FFT amplitude spectrum) | (figure: FFT amplitude spectrum) |
The scale may make it hard to see, but the low-frequency components have been removed. The resulting Nikkei 225 series is shown below.
This filtered series is what might be explainable by meteorological factors.
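To make the filter parameters more concrete, here is a minimal sketch on synthetic data. It assumes that `samplerate = 6270` is roughly the number of daily samples in the series, so that "1 Hz" means one cycle over the whole record; the post does not say where that number comes from, so this is only one interpretation. Under that assumption, `fp = 34` corresponds to a period of about 184 samples. The sketch also swaps in second-order sections (`output="sos"` with `signal.sosfiltfilt`), which is a numerically more robust variant and not the `b, a` form used above.

```python
import numpy as np
from scipy import signal

samplerate = 6270          # assumed: about one sample per trading day
fp, fs = 34, 17            # passband / stopband edges, as above
gpass, gstop = 3, 40       # dB, as above

# Periods (in samples) that correspond to the two edge frequencies
period_pass = samplerate / fp   # shorter periods than this are kept
period_stop = samplerate / fs   # longer periods than this are removed
print(round(period_pass), round(period_stop))

# Design the same Butterworth high-pass, but in second-order sections
fn = samplerate / 2
N, Wn = signal.buttord(fp / fn, fs / fn, gpass, gstop)
sos = signal.butter(N, Wn, "high", output="sos")

# Synthetic series: a large slow swing plus a small fast oscillation
t = np.arange(samplerate) / samplerate
slow = 1000 * np.sin(2 * np.pi * 2 * t)     # 2 cycles over the record
fast = 100 * np.sin(2 * np.pi * 200 * t)    # 200 cycles over the record
filtered = signal.sosfiltfilt(sos, slow + fast)

# Away from the edges, only the fast component should remain
inner = slice(1500, -1500)
max_err = np.max(np.abs(filtered[inner] - fast[inner]))
print("max deviation from the fast component:", max_err)
```

The comparison skips the first and last 1500 samples because zero-phase filtering still leaves transients near both ends of the record.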
Merge the two datasets, then use the data up to 2017 for training and the rest for testing.
df=df.fillna(0)
df_merge=pd.merge(df, nikkei, on="date")
df_merge
train = df_merge[df_merge["year_y"] <= 2017]
x_train = train.drop(["high","value", "date"], axis=1)
y_train = train["high"]
test = df_merge[df_merge["year_y"] > 2017]
x_test = test.drop(["high","value", "date"], axis=1)
y_test = test["high"]
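The `year_y` column in the split comes from how `pd.merge` handles columns that exist in both frames but are not the join key: it appends the default suffixes `_x` (left) and `_y` (right). A minimal sketch with toy frames shows this; the values and the `avtemp` column name are hypothetical.

```python
import pandas as pd

dates = pd.to_datetime(["2017-12-29", "2018-01-04"])

# Both toy frames carry a "year" column, like df and nikkei above
weather = pd.DataFrame({"date": dates, "year": [2017, 2018], "avtemp": [5.1, 4.3]})
prices = pd.DataFrame({"date": dates, "year": [2017, 2018], "high": [12.0, -8.0]})

merged = pd.merge(weather, prices, on="date")
print(list(merged.columns))

# Same split rule as above, applied to the toy data
train_toy = merged[merged["year_y"] <= 2017]
test_toy = merged[merged["year_y"] > 2017]
```

One consequence is that both `year_x` and `year_y` stay in the feature matrix unless one of them is dropped, so the model sees the year twice.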
Random forest was used for the prediction.
model_fore=RandomForestRegressor(n_estimators=50,max_depth=5).fit(x_train, y_train)
y_pred=model_fore.predict(x_test)
print("test")
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("r^2 :", r2_score(y_test, y_pred), "\n")
print("train")
y_pred=model_fore.predict(x_train)
print("RMSE:", np.sqrt(mean_squared_error(y_train, y_pred)))
print("r^2 :", r2_score(y_train, y_pred), "\n")
dic={}
#Pair each importance with the columns the model was actually trained on
for key, value in zip(x_train.columns, model_fore.feature_importances_):
    dic[key]=value
sorted(dic.items(), key=lambda x: x[1], reverse=True)
The results were as follows.
test RMSE: 938 r^2 : -0.031
train RMSE: 513 r^2 : 0.124
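To put an r^2 near zero or below in context: a model that always predicts a constant can never score above 0 on `r2_score`, and it scores exactly 0 only when that constant equals the mean of the evaluated data. The sketch below illustrates this with synthetic numbers (not the Nikkei data), using the training mean as the constant prediction.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins for the filtered target (not the real data)
y_train_toy = rng.normal(0, 500, 1000)
y_test_toy = rng.normal(0, 900, 300)

# Baseline: always predict the mean of the training target
baseline_pred = np.full_like(y_test_toy, y_train_toy.mean())

r2_baseline = r2_score(y_test_toy, baseline_pred)
rmse_baseline = np.sqrt(mean_squared_error(y_test_toy, baseline_pred))
print("baseline r^2 :", r2_baseline)
print("baseline RMSE:", rmse_baseline)
```

A test r^2 slightly below zero therefore means the model did no better than this kind of constant guess.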
It's completely useless. I also tried linear regression and neural networks, but the results were almost the same. The Nikkei average and meteorological factors do not seem to be related after all. Sorry about that. For reference, the variable importances were as follows.
variable | importance |
---|---|
Year | 0.376 |
Month | 0.197 |
Average temperature | 0.123 |
Day | 0.088 |
Humidity | 0.067 |
Barometric pressure | 0.066 |
Highest temperature | 0.063 |
Precipitation | 0.019 |
Presence or absence of precipitation | 0.001 |
Even among the variables the model did use, the important ones are the year and month, which have nothing to do with the weather ... This time I couldn't predict anything at all, so next I'd like to get my revenge by checking whether there is any statistical correlation in the first place.
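As a rough idea of what such a correlation check might look like, here is a minimal sketch using `scipy.stats.pearsonr` on synthetic data. The variable names and numbers are hypothetical stand-ins; nothing here is computed from the actual weather or Nikkei data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500

# Hypothetical stand-ins: a weather variable and two candidate targets
temperature = rng.normal(15, 8, n)
unrelated = rng.normal(0, 500, n)                     # independent of temperature
related = 20 * temperature + rng.normal(0, 100, n)    # built to depend on it

# Pearson correlation coefficient and its two-sided p-value
r_none, p_none = stats.pearsonr(temperature, unrelated)
r_some, p_some = stats.pearsonr(temperature, related)

print("unrelated: r = %.3f, p = %.3f" % (r_none, p_none))
print("related  : r = %.3f, p = %.3g" % (r_some, p_some))
```

With real data, the same call could be applied to each weather column against the filtered series, keeping in mind that testing many variables raises the chance of a spurious hit.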
-Apply a low-pass filter with Python's SciPy!