[PYTHON] I tried to predict the presence or absence of snow by machine learning.

What I wanted to do

As a machine learning beginner, I wanted to study the subject using scikit-learn, Python's machine learning library. I spent a while wondering what to predict, and decided it would be interesting to predict whether there is snow on the ground under given conditions.

Incidentally, this is my first Python app. I considered using Ruby, which I know well, but Python seems to be strong in this field, and being lazy, I chose Python so that I would not stumble and struggle.

Collect data for learning

First, prepare actual snow cover data to train the model. I downloaded real observation data from the Japan Meteorological Agency's meteorological data download site. For the training data, I used the observations for Tonami City, Toyama Prefecture, which gets a lot of snow.

`data_2013_2015.csv`


Download time: 2016/03/20 20:31:19

,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami
Date and time,temperature(℃),temperature(℃),temperature(℃),Snow cover(cm),Snow cover(cm),Snow cover(cm),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),Precipitation(mm),Precipitation(mm),Precipitation(mm)
,,,,,,,,,Wind direction,Wind direction,,,,
,,quality information,Homogeneous number,,quality information,Homogeneous number,,quality information,,quality information,Homogeneous number,,quality information,Homogeneous number
2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
2013/2/1 3:00:00,-4.0,8,1,3,8,1,0.2,8,Quiet,8,1,0.0,8,1
2013/2/1 4:00:00,-4.8,8,1,3,8,1,0.9,8,South-southeast,8,1,0.0,8,1
...

The data looks like this. The Japan Meteorological Agency site lets you choose which items to include, and I selected temperature, snow cover, wind speed, wind direction, and precipitation. In the end, though, I did not use wind speed or wind direction.
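As a minimal sketch of how one data row maps to values, the snippet below parses two rows copied from the sample above, assuming Python 3. The column indices (1 for temperature, 4 for snow cover, 7 for wind speed, 9 for wind direction, 12 for precipitation) are read off the sample, and `parse_row` and its dictionary keys are hypothetical names chosen for illustration.

```python
import csv
import io

# Two data rows copied from the sample above (header lines already removed).
SAMPLE = """2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
"""


def parse_row(row):
    """Pick the measured values out of one CSV row.

    The indices are an assumption based on the sample layout; the columns
    in between hold quality information and homogeneous numbers.
    """
    return {
        "datetime": row[0],
        "temperature": float(row[1]),
        "snow_depth_cm": int(row[4]),
        "wind_speed": float(row[7]),
        "wind_direction": row[9],
        "precipitation_mm": float(row[12]),
    }


records = [parse_row(row) for row in csv.reader(io.StringIO(SAMPLE))]
print(records[0])
```

Rows whose snow cover column is empty would make `int(row[4])` fail, so real data probably needs a check for empty fields first.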

Also, because the Japan Meteorological Agency site limits how much data can be downloaded at once, I downloaded the February and March data for 2004 to 2015 in four batches.

I converted each file from Shift-JIS to UTF-8 like this:

iconv -f Shift-JIS -t UTF-8 sample_data_sjis/data_2004_2006.csv >> sample_data/data.csv

After the conversion, I deleted the unnecessary lines and combined everything into one file, available [here](https://github.com/hiroeorz/snow-forecast/blob/master/sample_data/data.csv).
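The same conversion could also be done in Python. The following is only a sketch, assuming Python 3: `append_as_utf8` and the `skip_lines` parameter are hypothetical names, and the number of header lines to skip depends on the downloaded file. The demo writes a tiny Shift-JIS file to a temporary directory instead of touching real data.

```python
import os
import tempfile


def append_as_utf8(src_path, dst_path, skip_lines=0):
    """Append a Shift-JIS text file to dst_path as UTF-8.

    The first skip_lines lines (for example header lines) are dropped.
    Some files may need the "cp932" codec instead of "shift_jis".
    """
    with open(src_path, "r", encoding="shift_jis", newline="") as src, \
         open(dst_path, "a", encoding="utf-8", newline="") as dst:
        for index, line in enumerate(src):
            if index >= skip_lines:
                dst.write(line)


# Demo with a throwaway file: one header line followed by one data row.
tmp_dir = tempfile.mkdtemp()
sjis_path = os.path.join(tmp_dir, "data_sjis.csv")
utf8_path = os.path.join(tmp_dir, "data.csv")

with open(sjis_path, "w", encoding="shift_jis", newline="") as f:
    f.write("年月日時,気温\n2013/2/1 1:00:00,-3.3\n")

append_as_utf8(sjis_path, utf8_path, skip_lines=1)

with open(utf8_path, encoding="utf-8") as f:
    converted = f.read()
print(converted)
```

Opening the destination in append mode mirrors the `>>` in the `iconv` command, so several downloaded files can be written into one combined file.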

From execution of learning to prediction

After that, we will train using scikit-learn and make predictions using the trained model. The script is as follows.

The app, including the Python script below and the training data, is available at hiroeorz/snow-forecast on GitHub.

`snow_forecaster.py`


import csv
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier,ExtraTreesClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn import datasets
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

class SnowForecast:

    CLF_NAMES = ["LinearSVC","AdaBoostClassifier","ExtraTreesClassifier" ,
                 "GradientBoostingClassifier","RandomForestClassifier"]

    def __init__(self):
        u"""Initialize each instance variable"""
        self.clf = None
        self.data = {"target" : [], "data" : []}
        self.weather_data = None
        self.days_data = {}
        self.days_snow = {}

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday = 0
            temp_yeaterday = 0
            date_yesterday = ""
            
            for row in reader:
                if row[4] == "":
                    continue

                daytime = row[0]
                date = daytime.split(" ")[0]
                temp = int(float(row[1]))
                accumulation = int(row[4])
                wind_speed = float(row[7])
                precipitation = float(row[12])

                if date_yesterday != "":
                    # [temperature, precipitation, yesterday's temperature, yesterday's snow depth]
                    sample = [temp, precipitation, temp_yeaterday, accumulation_yesterday]
                    exist = self.accumulation_exist(accumulation)
                    self.data["data"].append(sample)
                    self.data["target"].append(exist)
                    self.days_data[daytime] = sample
                    self.days_snow[daytime] = exist

                if date_yesterday != date:
                    accumulation_yesterday = accumulation
                    temp_yeaterday = temp
                    date_yesterday = date

        return self.data

    def is_snow_exist(self, daytime_str):
        u"""Returns 1 if snow is piled up, 0 if not piled up."""
        return self.days_snow[daytime_str]

    def predict_with_date(self, daytime_str):
        u"""Predict the presence or absence of snow using the data of a given date."""
        sample = self.days_data[daytime_str]
        temp = sample[0]
        precipitation = sample[1]
        temp_yeaterday = sample[2]
        accumulation_yesterday = sample[3]
        return self.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)

    def predict(self, temp, precipitation, temp_yeaterday, accumulation_yesterday):
        u"""Predict the presence or absence of snow using the given parameters."""
        return self.clf.predict([[temp, precipitation, temp_yeaterday, accumulation_yesterday]])[0]

    def train_data(self):
        u"""Returns the training data, reading it from the CSV file first if it has not been loaded yet."""
        if self.weather_data is None:
            self.weather_data = self.load_csv()

        return self.weather_data

    def accumulation_exist(self, accumulation):
        u"""Takes the snow depth (cm) and returns 1 if there is snow, 0 if not."""
        if accumulation > 0:
            return 1
        else:
            return 0

    def best_score_clf(self):
        u"""Calculate the score for each candidate model and return the one with the highest score."""
        features = self._features()
        labels = self._labels()

        # Only 4 features are used this time, so no dimensionality reduction is done; the lines below are therefore commented out.
        # lsa = TruncatedSVD(3)
        # reduced_features = lsa.fit_transform(features)

        best = LinearSVC()
        best_name = self.CLF_NAMES[0]
        best_score = 0

        for clf_name in self.CLF_NAMES:
            clf = globals()[clf_name]()  # Look up the imported class by name instead of using eval()
            # Use reduced_features here instead if the features were reduced above
            scores = cross_val_score(clf, features, labels, cv=5)
            score = sum(scores) / len(scores)  # Mean accuracy across the 5 folds
            print("%s score:%s" % (clf_name, score))
            if score >= best_score:
                best = clf
                best_name = clf_name
                best_score = score

        print("------\n Model to use: %s" % best_name)
        return best  # Return the best-scoring model, not the last one tried

    def train(self):
        u"""Perform learning. Determine which model to use before the actual learning and let it be selected automatically."""
        self.clf = self.best_score_clf()
        self.clf.fit(self._features(), self._labels())

    def _features(self):
        u"""Returns training data."""
        weather = self.train_data()
        return weather["data"]

    def _labels(self):
        u"""Returns the resulting label."""
        weather = self.train_data()
        return weather["target"]

    def judge(self, datetime_str):
        u"""Receives the date character string and determines the snow cover."""
        print("------")
        result = self.predict_with_date(datetime_str)
        print("%s:Expected:%s Actual:%s" % (datetime_str, result, self.is_snow_exist(datetime_str)))

        if result == 1:
            print("Snow piles up")
        else:
            print("Snow doesn't pile up")

if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.train()
    
    #####################################################
    #Judgment is made by specifying the date and giving the parameters used for learning.
    #####################################################
    forecaster.judge("2006/2/19 00:00:00")
    forecaster.judge("2012/2/2 00:00:00")
    forecaster.judge("2014/2/2 13:00:00")
    forecaster.judge("2015/2/28 00:00:00")
    
    #######################################
    #Let's give a parameter directly and make a prediction.
    #######################################
    print("------")
    temp = 0.0
    precipitation = 0
    temp_yeaterday = 3.0
    accumulation_yesterday = 2 
    result = forecaster.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)
    print("[temperature:%s] [Precipitation:%s] [Yesterday's temperature:%s] [Yesterday's snowfall:%s]" %
          (temp, precipitation, temp_yeaterday, accumulation_yesterday))
    
    print("judgment result: %s" % result)
    
    if result == 1:
        print("Snow piles up")
    else:
        print("Snow doesn't pile up")
        
    #########################################################
    #Give parameters directly and predict again (change yesterday's temperature to -3.0 °C).
    #########################################################
    print("------")
    temp_yeaterday = -3.0
    result = forecaster.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)
    print("[temperature:%s] [Precipitation:%s] [Yesterday's temperature:%s] [Yesterday's snowfall:%s]" %
          (temp, precipitation, temp_yeaterday, accumulation_yesterday))
    
    print("judgment result: %s" % result)
    
    if result == 1:
        print("Snow piles up")
    else:
        print("Snow doesn't pile up")
        
    print("------")

Since this is my first time using Python, I would appreciate it if you could point out any strange points.

Run it as follows:

$ python snow_forecaster.py

The execution result is as follows.

LinearSVC score:0.965627801273
AdaBoostClassifier score:0.969820996581
ExtraTreesClassifier score:0.961194223678
GradientBoostingClassifier score:0.966826266875
RandomForestClassifier score:0.958078728911
------
Model to use: AdaBoostClassifier
------
2006/2/19 00:00:00:Expected:1 Actual:1
Snow piles up
------
2012/2/2 00:00:00:Expected:1 Actual:1
Snow piles up
------
2014/2/2 13:00:00:Expected:0 Actual:0
Snow doesn't pile up
------
2015/2/28 00:00:00:Expected:0 Actual:0
Snow doesn't pile up
------
[temperature:0.0] [Precipitation:0] [Yesterday's temperature:3.0] [Yesterday's snowfall:2]
judgment result: 0
Snow doesn't pile up
------
[temperature:0.0] [Precipitation:0] [Yesterday's temperature:-3.0] [Yesterday's snowfall:2]
judgment result: 1
Snow piles up
------

First, scores are calculated for several models, and training is performed with the best-scoring one. Here `AdaBoostClassifier` was selected.
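The selection step can be sketched on its own as follows. This is not the script's code: it uses a dictionary lookup in place of `eval`, imports `cross_val_score` from `sklearn.model_selection`, and runs on synthetic data from `make_classification` as a stand-in for the weather features. `CANDIDATES` and `pick_best` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic stand-in for the weather data: 4 feature columns, binary labels.
features, labels = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

# Map names to classes so no eval() is needed to build a model from its name.
CANDIDATES = {
    "LinearSVC": LinearSVC,
    "AdaBoostClassifier": AdaBoostClassifier,
    "RandomForestClassifier": RandomForestClassifier,
}


def pick_best(features, labels, cv=5):
    """Return (name, mean accuracy) of the candidate with the best CV score."""
    best_name, best_score = None, -1.0
    for name, factory in CANDIDATES.items():
        scores = cross_val_score(factory(), features, labels, cv=cv)
        mean_score = scores.mean()
        print("%s score:%s" % (name, mean_score))
        if mean_score > best_score:
            best_name, best_score = name, mean_score
    return best_name, best_score


best_name, best_score = pick_best(features, labels)

# cross_val_score fits clones internally, so the winner is fitted afterwards.
best_clf = CANDIDATES[best_name]().fit(features, labels)
print("Model to use: %s" % best_name)
```

Which model wins here says nothing about the snow data; the point is only the shape of the loop: score every candidate with the same folds, keep the best name, then fit that model on all the data.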

By the way, the code that selects the model is borrowed from the article "What I learned in 2 months before releasing a product as a complete machine learning amateur".

Then training is run and the predictions are made. In the second half, I keep today's temperature, precipitation, and yesterday's snow depth fixed and change only yesterday's temperature. With 2 cm of snow yesterday and a temperature of 3.0 °C yesterday, the model predicts no snow today, but when I change yesterday's temperature to -3.0 °C, the prediction flips to "there is snow". This seems intuitively right: there was snow yesterday, and it is likely to remain unmelted if the temperature was low.

Impressions etc.

I don't understand any of the theory, but even as a beginner I was able to try out a machine learning app in a few hours. For the snow cover forecast, I first tried to predict using only today's data, which did not work well; once I added yesterday's data, the prediction accuracy jumped to nearly 97%. That makes sense: if there was snow on the ground yesterday, it is very likely still there the next day. It made me think that accuracy improves when you choose the input parameters based on how things work in the real world. It's interesting, so I'll experiment a little more, and once I get used to it, I'll study the theory a bit.
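The idea of adding yesterday's data as extra features can be sketched as below. This is a simplified illustration rather than the script's logic: it assumes one record per day in chronological order, whereas the script works on hourly rows. `build_samples` and the record values are made up for the example.

```python
def build_samples(daily_records):
    """Pair each day with the previous day to build features and labels.

    Each sample is [temperature, precipitation, yesterday's temperature,
    yesterday's snow depth]; the label is 1 if there is snow today, else 0.
    """
    samples, targets = [], []
    for yesterday, today in zip(daily_records, daily_records[1:]):
        samples.append([
            today["temp"],
            today["precipitation"],
            yesterday["temp"],
            yesterday["snow_depth"],
        ])
        targets.append(1 if today["snow_depth"] > 0 else 0)
    return samples, targets


# Illustrative values only, not real observations.
records = [
    {"temp": -3, "precipitation": 0.0, "snow_depth": 3},
    {"temp": 1, "precipitation": 0.5, "snow_depth": 2},
    {"temp": 4, "precipitation": 0.0, "snow_depth": 0},
]

samples, targets = build_samples(records)
print(samples)  # [[1, 0.5, -3, 3], [4, 0.0, 1, 2]]
print(targets)  # [1, 0]
```

The first day has no "yesterday", so it only serves as context for the second day and produces no sample of its own.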