[PYTHON] I tried to predict the change in snowfall for 2 years by machine learning

This entry is a sequel to my earlier post, "I tried to predict the presence or absence of snow by machine learning". That time I predicted only whether there was snow (1 or 0); this time I went a little further and tried to predict how the amount of snow changes.

Here are the results first. The horizontal axis is the number of days, and the vertical axis is the amount of snow (cm).

Result 1 (blue is the actual amount of snow, red is the predicted amount of snow) [image: スクリーンショット 2016-05-01 17.45.39.png]

Result 2 (blue is the actual amount of snow, red is the predicted amount of snow) [image: スクリーンショット 2016-05-01 17.59.35.png]

Read on to find out what "Result 1" and "Result 2" each are.

What I wanted to do

Previously, in "I tried to predict the presence or absence of snow by machine learning", I used scikit-learn to predict whether there would be snow. This time I got a little greedy and wanted to predict the actual amount of snow (cm) over a period, not just its presence or absence.

Specifically, I obtain meteorological data such as snow cover, wind speed, and temperature provided by the Japan Meteorological Agency, use roughly the first 7500 days for training, then predict the change in the amount of snow over the remaining 2 years (365 x 2 = 730 days) and compare it with the actual change.

Collect training data

The training data is the data published by the Japan Meteorological Agency. For how to obtain it, see my earlier post, "I tried to predict the presence or absence of snow by machine learning".

The CSV data obtained looks like this. The target location is Tonami City in Toyama Prefecture, which gets a lot of snow.

data_2013_2015.csv


Download time: 2016/03/20 20:31:19

,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami
Date and time,temperature(℃),temperature(℃),temperature(℃),Snow cover(cm),Snow cover(cm),Snow cover(cm),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),Precipitation(mm),Precipitation(mm),Precipitation(mm)
,,,,,,,,,Wind direction,Wind direction,,,,
,,quality information,Homogeneous number,,quality information,Homogeneous number,,quality information,,quality information,Homogeneous number,,quality information,Homogeneous number
2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
2013/2/1 3:00:00,-4.0,8,1,3,8,1,0.2,8,Quiet,8,1,0.0,8,1
2013/2/1 4:00:00,-4.8,8,1,3,8,1,0.9,8,South-southeast,8,1,0.0,8,1
...
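The file starts with several header lines (download time, station names, column names, quality flags) before the first data row. As a minimal sketch, assuming the column layout shown above (temperature in column 1, snow cover in column 4, wind speed in column 7), the header lines can be skipped by checking whether the first field parses as a timestamp. The helper name `parse_rows` is hypothetical, and the last row of `SAMPLE` is a made-up row with a missing snow value, added only to show that such rows are skipped.

```python
import csv
import io
from datetime import datetime

# First lines of the file as shown above, plus one made-up row (the last one)
# whose snow cover field is empty.
SAMPLE = """Download time: 2016/03/20 20:31:19

,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami
,,,,,,,,,Wind direction,Wind direction,,,,
2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
2013/2/1 3:00:00,-4.0,8,1,3,8,1,0.2,8,Quiet,8,1,0.0,8,1
2013/2/1 4:00:00,-4.8,8,1,,8,1,0.9,8,South-southeast,8,1,0.0,8,1
"""


def parse_rows(lines):
    """Yield (timestamp, temperature, snow_cm, wind_speed) for each usable data row."""
    for row in csv.reader(lines):
        if len(row) < 8:
            continue  # blank line or short header line
        try:
            timestamp = datetime.strptime(row[0], "%Y/%m/%d %H:%M:%S")
        except ValueError:
            continue  # header rows do not start with a timestamp
        if row[1] == "" or row[4] == "" or row[7] == "":
            continue  # skip rows with a missing measurement
        yield timestamp, float(row[1]), int(row[4]), float(row[7])


rows = list(parse_rows(io.StringIO(SAMPLE)))
for timestamp, temp, snow_cm, wind_speed in rows:
    print(timestamp, temp, snow_cm, wind_speed)
```

The length check matters because indexing `row[4]` directly would raise an `IndexError` on the blank line and on the one-field download-time line.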

Basic approach

The approach is probably the standard one for this kind of prediction: train a model on pairs of some peripheral data and the resulting amount of snow, then give the trained model only the peripheral data and get back a predicted amount of snow. This is so-called "supervised learning".
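As a rough illustration of this fit-then-predict pattern, here is a sketch using a single feature and an ordinary least-squares line written in plain Python. It is a stand-in for illustration only, not the scikit-learn model used later in this post, and the numbers are made up.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Made-up training pairs: yesterday's snow (cm) -> today's snow (cm)
yesterday = [0, 10, 20, 30, 40]
today = [1, 10, 19, 28, 37]

model = fit_line(yesterday, today)   # training: inputs and answers together
forecast = predict(model, 50)        # prediction: input only
print(model, forecast)
```

Training sees both the inputs and the answers; prediction sees only the inputs. The real model does the same with nine inputs instead of one.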

In this case, the following data was used as peripheral data.

  • Temperature
  • Wind speed
  • Yesterday's snowfall
  • Temperature 1 day ago, 2 days ago, 3 days ago
  • Wind speed 1 day ago, 2 days ago, 3 days ago

Schematically, it looks like this.

[temperature, wind speed, yesterday's snowfall, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snowfall on the day

[temperature, wind speed, yesterday's snowfall, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snowfall on the day

[temperature, wind speed, yesterday's snowfall, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snowfall on the day

....


[temperature, wind speed, yesterday's snowfall, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → snowfall on the day

Then, with a model trained on these pairs, give only the peripheral data and get the predicted value:

[temperature, wind speed, yesterday's snowfall, temperature 1 day ago, temperature 2 days ago, temperature 3 days ago, wind speed 1 day ago, wind speed 2 days ago, wind speed 3 days ago] → (predicted amount of snow on the day)

That is how I did it. Basically the data given is that of the forecast target date; only yesterday's snowfall is data from the day before the target date. Of everything given, it seemed to have the largest effect on the result, which is natural when you think about it.
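The feature layout above can be sketched in isolation as follows. This is a simplified illustration that assumes one value per day in three equal-length lists; the function name `build_samples` and the sample numbers are made up for this example.

```python
def build_samples(temps, winds, snows, lags=3):
    """Return (features, targets) from per-day lists of equal length.

    Each feature row is
    [temp, wind, snow 1 day ago, temp 1..lags days ago, wind 1..lags days ago].
    """
    features, targets = [], []
    # The first `lags` days have no complete history, so start after them.
    for day in range(lags, len(snows)):
        row = [temps[day], winds[day], snows[day - 1]]
        row += [temps[day - k] for k in range(1, lags + 1)]
        row += [winds[day - k] for k in range(1, lags + 1)]
        features.append(row)
        targets.append(snows[day])
    return features, targets


# Made-up values for five consecutive days
temps = [-1, -2, -3, -4, -5]
winds = [1.0, 2.0, 3.0, 4.0, 5.0]
snows = [0, 5, 10, 20, 15]

features, targets = build_samples(temps, winds, snows)
for row, target in zip(features, targets):
    print(row, "->", target)
```

Five days of data give only two samples here, because the first three days are used up as history for the lagged values.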

As I wrote at the beginning, of the data obtained from the Japan Meteorological Agency, I use about 7500 days for training, predict the change in snow cover for the remaining 2 years, and compare it with the actual change.
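The split arithmetic, following the same indexing as `training_data_count()` and `predict_start_num()` in the script below, can be sketched like this. The total of 8230 samples is an assumed figure (7500 + 730), used only to make the numbers concrete.

```python
HOLDOUT_DAYS = 365 * 2  # the last two years, 730 samples


def split_indices(record_count, holdout=HOLDOUT_DAYS):
    """Return (train_end, predict_start).

    Training uses samples [0, train_end); prediction starts at predict_start.
    """
    train_end = record_count - holdout
    predict_start = train_end + 1
    return train_end, predict_start


record_count = 8230  # assumed sample count, for illustration only
train_end, predict_start = split_indices(record_count)
evaluated = record_count - predict_start
print(train_end, predict_start, evaluated)
```

Because prediction starts one index after the end of the training slice, the sample at index `train_end` is neither trained on nor evaluated, so the evaluated period is one sample short of 730.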

Try to predict

The actual code looks like this:

snow_forecaster.py



import csv
import numpy as np
from matplotlib import pyplot
from sklearn import linear_model


class SnowForecast:

    def __init__(self):
        u"""Initialize each instance variable"""
        self.model = None    #Generated learning model
        self.data = []       #Array of training data
        self.target = []     #Array of actual snow cover
        self.predicts = []   #Array of predicted values of snowfall
        self.reals = []      #Array of actual snow cover
        self.day_counts = [] #Array of elapsed dates from the start date
        self.date_list = []
        self.record_count = 0

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday0 = 0
            date_yesterday = ""            
            temp_3days = []
            wind_speed_3days = []

            for row in reader:
                if row[4] == "":
                    continue

                daytime = row[0]               # "yyyy/mm/dd HH:MM:SS"
                date = daytime.split(" ")[0]   # "yyyy/mm/dd"
                temp = int(float(row[1]))      # temperature: has a slight effect
                wind_speed = float(row[7])     # wind speed: has a slight effect
                precipitation = float(row[12]) # precipitation: no effect, so not used as a feature
                accumulation = int(row[4])     # snow cover: yesterday's value has a large effect

                if len(wind_speed_3days) == 3:
                    # Training sample:
                    # [temperature, wind speed, yesterday's snowfall, temperature 1/2/3 days ago, wind speed 1/2/3 days ago]
                    sample = [temp, wind_speed, accumulation_yesterday0]
                    sample.extend(temp_3days)
                    sample.extend(wind_speed_3days)
                    self.data.append(sample)
                    self.target.append(accumulation)

                if date_yesterday != date:
                    accumulation_yesterday0 = accumulation
                    self.date_list.append(date)

                    wind_speed_3days.insert(0, wind_speed)
                    if len(wind_speed_3days) > 3:
                        wind_speed_3days.pop()

                    temp_3days.insert(0, temp)
                    if len(temp_3days) > 3:
                        temp_3days.pop()

                    date_yesterday = date

        self.record_count = len(self.data)
        return self.data

    def train(self):
        u"""Train the model on the data up to about day 7500 (everything except the last two years)"""
        x = self.data
        y = self.target
        print(len(x))
        # Of ElasticNetCV, LassoCV, and RidgeCV, ElasticNetCV gave the smallest error, so it is used here
        model = linear_model.ElasticNetCV(fit_intercept=True)
        model.fit(x[0:self.training_data_count()], y[0:self.training_data_count()])
        self.model = model

    def predict(self):
        u"""Predict using a learning model. Forecast for the last two years"""
        x = self.data
        y = self.target
        model = self.model

        for i, xi in enumerate(x):
            real_val = y[i]

            if i < self.training_data_count() + 1:
                self.predicts.append(0)
                self.reals.append(real_val)
                self.day_counts.append(i)
                continue

            predict_val = int(model.predict([xi])[0])

            #If the snowfall forecast is 0 or less, it is set to 0.
            if predict_val < 0:
                predict_val = 0

            self.predicts.append(predict_val)
            self.reals.append(real_val)
            self.day_counts.append(i)

    def show_graph(self):
        u"""Compare predicted and measured values with a graph"""
        pyplot.plot(self.day_counts[self.predict_start_num():], self.reals[self.predict_start_num():], "b")
        pyplot.plot(self.day_counts[self.predict_start_num():], self.predicts[self.predict_start_num():], "r")
        pyplot.show()

    def check(self):
        u"""Measure the error between the measured and predicted values over the forecast period"""
        start = self.predict_start_num()
        p = np.array(self.predicts[start:])
        e = p - np.array(self.reals[start:])
        error = np.sum(e * e)
        # RMSE over the held-out forecast period (no cross-validation is performed here)
        rmse = np.sqrt(error / len(e))
        print("RMSE (forecast period): {}".format(rmse))

    def training_data_count(self):
        u"""Leave the last two years and use the data before that as training data. Returns the number of training data"""
        return self.record_count - 365 * 2

    def predict_start_num(self):
        u"""The last two years are predicted and used to measure the error from the measured value. Returns the predicted start position"""
        return self.training_data_count() + 1

if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.load_csv()
    forecaster.train()
    forecaster.predict()
    forecaster.check()
    forecaster.show_graph()

The most tedious part was, as last time, creating the training data from the raw data. Still, it's easy because it's Python.

The execution result is as follows (blue is the actual amount of snow, red is the predicted amount of snow). This is "Result 1" shown at the beginning. [image: スクリーンショット 2016-05-01 17.45.39.png]

The prediction looks reasonably plausible.
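To put a number on how close the two curves are, the script's `check()` method computes the root mean squared error (RMSE) over the predicted period. The same calculation in isolation, as a plain-Python sketch with made-up values:

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("sequences must have the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up snow amounts in cm: predicted vs. measured
error = rmse([0, 12, 30], [0, 10, 34])
print(error)
```

The result is in the same unit as the data, so an RMSE of 5 would mean the prediction is typically off by about 5 cm.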

At this point, a thought suddenly occurred to me: "Wait, I'm predicting by giving the actual amount of snow from one day before, so if I tried to use this to forecast the future, I could only predict tomorrow's amount of snow...?"

Well, yes, I know. By that logic the same goes for temperature and wind speed. But, you see, there are weather forecasts for those... ahem, ahem.

Changing it to predict the next day's snowfall from the previous day's predicted snowfall

So I immediately modified the code accordingly. The model training part is unchanged. Of the data given when predicting, yesterday's snowfall is replaced: instead of the measured value, I give the value that the model itself predicted for the previous day.
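Stripped of the class, the idea can be sketched as follows. This is a simplified illustration: `toy_model` is a hypothetical stand-in for the trained model, the rows are made up, and only the first three features are shown. Unlike the script, which overwrites the stored feature row in place, this sketch copies each row before changing it.

```python
def forecast_recursively(model, feature_rows, snow_index=2):
    """Predict day by day, feeding each prediction into the next day's features.

    `model` is any callable mapping a feature row to a snow amount.
    The first row keeps its measured value for yesterday's snow.
    """
    predictions = []
    previous = None
    for row in feature_rows:
        row = list(row)  # copy so the caller's data is not modified
        if previous is not None:
            row[snow_index] = previous
        value = max(0, int(model(row)))  # truncate and clamp at 0, like the script
        predictions.append(value)
        previous = value
    return predictions


def toy_model(row):
    """Hypothetical model: 90% of yesterday's snow, plus 2 cm when below freezing."""
    temp, _wind, snow_yesterday = row[:3]
    return 0.9 * snow_yesterday + (2 if temp < 0 else 0)


rows = [
    [-1, 1.0, 10],  # measured snow yesterday: 10 cm
    [-2, 1.0, 99],  # 99 is a placeholder, overwritten by the previous prediction
    [3, 1.0, 99],
]
predictions = forecast_recursively(toy_model, rows)
print(predictions)
```

Only the first day uses a measured snow value; every later day depends on the model's own earlier output.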

The code is as follows. Only the predict function has changed.

snow_forecaster.py


import csv
import numpy as np
from matplotlib import pyplot
from sklearn import linear_model


class SnowForecast:

    def __init__(self):
        u"""Initialize each instance variable"""
        self.model = None    #Generated learning model
        self.data = []       #Array of training data
        self.target = []     #Array of actual snow cover
        self.predicts = []   #Array of predicted values of snowfall
        self.reals = []      #Array of actual snow cover
        self.day_counts = [] #Array of elapsed dates from the start date
        self.date_list = []
        self.record_count = 0

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday0 = 0
            date_yesterday = ""            
            temp_3days = []
            wind_speed_3days = []

            for row in reader:
                if row[4] == "":
                    continue

                daytime = row[0]               # "yyyy/mm/dd HH:MM:SS"
                date = daytime.split(" ")[0]   # "yyyy/mm/dd"
                temp = int(float(row[1]))      # temperature: has a slight effect
                wind_speed = float(row[7])     # wind speed: has a slight effect
                precipitation = float(row[12]) # precipitation: no effect, so not used as a feature
                accumulation = int(row[4])     # snow cover: yesterday's value has a large effect

                if len(wind_speed_3days) == 3:
                    # Training sample:
                    # [temperature, wind speed, yesterday's snowfall, temperature 1/2/3 days ago, wind speed 1/2/3 days ago]
                    sample = [temp, wind_speed, accumulation_yesterday0]
                    sample.extend(temp_3days)
                    sample.extend(wind_speed_3days)
                    self.data.append(sample)
                    self.target.append(accumulation)

                if date_yesterday != date:
                    accumulation_yesterday0 = accumulation
                    self.date_list.append(date)

                    wind_speed_3days.insert(0, wind_speed)
                    if len(wind_speed_3days) > 3:
                        wind_speed_3days.pop()

                    temp_3days.insert(0, temp)
                    if len(temp_3days) > 3:
                        temp_3days.pop()

                    date_yesterday = date

        self.record_count = len(self.data)
        return self.data

    def train(self):
        u"""Train the model on the data up to about day 7500 (everything except the last two years)"""
        x = self.data
        y = self.target
        print(len(x))
        # Of ElasticNetCV, LassoCV, and RidgeCV, ElasticNetCV gave the smallest error, so it is used here
        model = linear_model.ElasticNetCV(fit_intercept=True)
        model.fit(x[0:self.training_data_count()], y[0:self.training_data_count()])
        self.model = model

    def predict(self):
        u"""Predict the amount of snowfall using a learning model. Forecast for the last two years"""
        x = self.data
        y = self.target
        model = self.model
        yesterday_predict_val = None #Variable to store yesterday's forecast value

        for i, xi in enumerate(x):
            real_val = y[i]

            if i < self.training_data_count() + 1:
                self.predicts.append(0)
                self.reals.append(real_val)
                self.day_counts.append(i)
                continue

            # Replace yesterday's snowfall with yesterday's predicted value
            if yesterday_predict_val is not None:
                xi[2] = yesterday_predict_val

            predict_val = int(model.predict([xi])[0])

            #If the snowfall forecast is 0 or less, it is set to 0.
            if predict_val < 0:
                predict_val = 0

            self.predicts.append(predict_val)
            self.reals.append(real_val)
            self.day_counts.append(i)
            yesterday_predict_val = predict_val

    def show_graph(self):
        u"""Compare predicted and measured values with a graph"""
        pyplot.plot(self.day_counts[self.predict_start_num():], self.reals[self.predict_start_num():], "b")
        pyplot.plot(self.day_counts[self.predict_start_num():], self.predicts[self.predict_start_num():], "r")
        pyplot.show()

    def check(self):
        u"""Measure the error between the measured and predicted values over the forecast period"""
        start = self.predict_start_num()
        p = np.array(self.predicts[start:])
        e = p - np.array(self.reals[start:])
        error = np.sum(e * e)
        # RMSE over the held-out forecast period (no cross-validation is performed here)
        rmse = np.sqrt(error / len(e))
        print("RMSE (forecast period): {}".format(rmse))

    def training_data_count(self):
        u"""Leave the last two years and use the data before that as training data. Returns the number of training data"""
        return self.record_count - 365 * 2

    def predict_start_num(self):
        u"""The last two years are predicted and used to measure the error from the measured value. Returns the predicted start position"""
        return self.training_data_count() + 1

if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.load_csv()
    forecaster.train()
    forecaster.predict()
    forecaster.check()
    forecaster.show_graph()

The result is as follows (blue is the actual amount of snow, red is the predicted amount of snow). This is "Result 2" shown at the beginning. [image: スクリーンショット 2016-05-01 17.59.35.png]

Hmm. As expected, it is less accurate than when the previous day's actual amount of snow was given. Still, the shape of the curve is not that far off.

Impressions etc.

I had expected a much messier prediction, but it turned out reasonably well. That said, although I glossed over it along the way, the temperature and wind speed given at prediction time are the values actually measured on that day. To forecast a period in the future, you would have to use forecast values for them, or stop using them altogether, and if you use forecast values the accuracy will drop, the more so the further ahead you go. So with this kind of approach, where you make a prediction from predicted values and then make the next prediction from that, a slight error in an earlier step grows larger and larger the further you go. That is why the Japan Meteorological Agency works so hard at it.
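How quickly such errors pile up depends on how strongly the model relies on its own previous output. As a rough sketch under simplifying assumptions (the prediction is linear in yesterday's snowfall with a fixed weight, and each step adds the same fixed error, neither of which is measured from the model above), the error carried in the fed-back value evolves like this:

```python
def accumulated_error(weight, step_error, steps):
    """Error in the fed-back snow value after each step of a linear recursion.

    Assumes error_n = weight * error_(n-1) + step_error, starting from 0.
    """
    errors = []
    error = 0.0
    for _ in range(steps):
        error = weight * error + step_error
        errors.append(error)
    return errors


# weight = 1.0: every day's 0.5 cm error is carried forward in full
carried = accumulated_error(1.0, 0.5, 5)
# weight = 0.5: older errors fade, so the total levels off below 1.0 cm
damped = accumulated_error(0.5, 0.5, 5)
print(carried)
print(damped)
```

With a weight close to 1 the error keeps growing roughly in proportion to the number of days, which fits the observation that yesterday's snowfall was the most influential input.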
