[PYTHON] I tried to predict the presence or absence of snow by machine learning.

What I wanted to do

As a machine learning beginner, I wanted to study the subject using scikit-learn, Python's machine learning library. I spent a while wondering what to predict, and then decided it would be interesting to predict whether or not snow is lying on the ground under given conditions.

By the way, this is my first time writing a Python app. I considered doing this in Ruby, which I know well, but Python seems to be strong in this field, and since I am a lazy person who does not want to stumble and struggle, I started with Python.

Collect data for learning

First of all, prepare actual snow cover data to train the model. I downloaded real observation data from the meteorological data download site of the Japan Meteorological Agency. This time, I used the weather data for Tonami City, Toyama Prefecture, which gets a lot of snow, as training data.

data_2013_2015.csv


Download time: 2016/03/20 20:31:19

,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami,Tonami
Date and time,temperature(℃),temperature(℃),temperature(℃),Snow cover(cm),Snow cover(cm),Snow cover(cm),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),wind speed(m/s),Precipitation(mm),Precipitation(mm),Precipitation(mm)
,,,,,,,,,Wind direction,Wind direction,,,,
,,quality information,Homogeneous number,,quality information,Homogeneous number,,quality information,,quality information,Homogeneous number,,quality information,Homogeneous number
2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1
2013/2/1 2:00:00,-3.7,8,1,3,8,1,0.3,8,North,8,1,0.0,8,1
2013/2/1 3:00:00,-4.0,8,1,3,8,1,0.2,8,Quiet,8,1,0.0,8,1
2013/2/1 4:00:00,-4.8,8,1,3,8,1,0.9,8,South-southeast,8,1,0.0,8,1
...

The data looks like this. On the Japan Meteorological Agency site you can select the items you need; I selected temperature, snow cover, wind speed, wind direction, and precipitation. In the end, though, I didn't use the wind speed or wind direction.
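As a rough illustration of how those columns line up, the sketch below parses one data row copied from the sample above. The column positions (0, 1, 4, 7, 9, 12) follow the sample's header rows; `parse_row` and its dictionary keys are names made up for this example, not part of the original script.

```python
import csv
import io

# One hourly record copied from the sample above.
SAMPLE = "2013/2/1 1:00:00,-3.3,8,1,3,8,1,0.4,8,West,8,1,0.0,8,1\n"


def parse_row(row):
    """Pick out the measurement columns and skip the quality/homogeneity ones.

    Hypothetical helper for illustration only.
    """
    return {
        "datetime": row[0],
        "temperature_c": float(row[1]),
        "snow_cover_cm": int(row[4]),
        "wind_speed_ms": float(row[7]),
        "wind_direction": row[9],
        "precipitation_mm": float(row[12]),
    }


record = parse_row(next(csv.reader(io.StringIO(SAMPLE))))
print(record)
```

Each measurement is followed by a quality flag and a homogeneity number, which is why the useful columns are not adjacent.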

Also, because the Japan Meteorological Agency site limits how much data can be downloaded at one time, I downloaded the data for February and March from 2004 to 2015 in four parts.

Each downloaded file was converted from Shift-JIS to UTF-8 and appended to a single file like this:

iconv -f Shift-JIS -t UTF-8 sample_data_sjis/data_2004_2006.csv >> sample_data/data.csv

After converting the files this way, I deleted the unnecessary lines. The resulting single file is [here](https://github.com/hiroeorz/snow-forecast/blob/master/sample_data/data.csv).
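If `iconv` is not available, the same conversion can be sketched in Python. This is only an assumed equivalent of the command above: `append_as_utf8` is a hypothetical helper, and the demo runs on a throwaway temporary file rather than the real downloads.

```python
import tempfile
from pathlib import Path


def append_as_utf8(src_paths, dest_path, src_encoding="shift_jis"):
    """Decode each source file and append its text to dest_path as UTF-8."""
    dest = Path(dest_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # newline="" keeps the original line endings untouched on read and write.
    with dest.open("a", encoding="utf-8", newline="") as out:
        for src in src_paths:
            with open(src, "r", encoding=src_encoding, newline="") as f:
                out.write(f.read())


# Demo on a temporary Shift-JIS file so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "data_sjis.csv"
    dest = Path(tmp) / "out" / "data.csv"
    src.write_bytes("日時,気温\n2013/2/1 1:00:00,-3.3\n".encode("shift_jis"))
    append_as_utf8([src], dest)
    converted = dest.read_text(encoding="utf-8")
    print(converted)
```

Opening the destination in append mode mirrors the `>>` redirection, so calling the helper once per downloaded file builds up one combined file.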

From execution of learning to prediction

After that, we will train using scikit-learn and make predictions using the trained model. The script is as follows.

snow_forecaster.py


import csv
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier,ExtraTreesClassifier,GradientBoostingClassifier,RandomForestClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn import datasets
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20

class SnowForecast:

    CLF_NAMES = ["LinearSVC","AdaBoostClassifier","ExtraTreesClassifier" ,
                 "GradientBoostingClassifier","RandomForestClassifier"]

    def __init__(self):
        u"""Initialize each instance variable"""
        self.clf = None
        self.data = {"target" : [], "data" : []}
        self.weather_data = None
        self.days_data = {}
        self.days_snow = {}

    def load_csv(self):
        u"""Read a CSV file for learning"""
        with open("sample_data/data.csv", "r") as f:
            reader = csv.reader(f)
            accumulation_yesterday = 0
            temp_yeaterday = 0
            date_yesterday = ""
            
            for row in reader:
                if row[4] == "":
                    continue

                daytime = row[0]
                date = daytime.split(" ")[0]
                temp = int(float(row[1]))
                accumulation = int(row[4])
                wind_speed = float(row[7])
                precipitation = float(row[12])

                if date_yesterday != "":
                    # [temperature, precipitation, yesterday's temperature, yesterday's snow cover]
                    sample = [temp, precipitation, temp_yeaterday, accumulation_yesterday]
                    exist = self.accumulation_exist(accumulation)
                    self.data["data"].append(sample)
                    self.data["target"].append(exist)
                    self.days_data[daytime] = sample
                    self.days_snow[daytime] = exist

                if date_yesterday != date:
                    accumulation_yesterday = accumulation
                    temp_yeaterday = temp
                    date_yesterday = date

        return self.data

    def is_snow_exist(self, daytime_str):
        u"""Returns 1 if snow is piled up, 0 if not piled up."""
        return self.days_snow[daytime_str]

    def predict_with_date(self, daytime_str):
        u"""Predict the presence or absence of snow using the data of a given date."""
        sample = self.days_data[daytime_str]
        temp = sample[0]
        precipitation = sample[1]
        temp_yeaterday = sample[2]
        accumulation_yesterday = sample[3]
        return self.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)

    def predict(self, temp, precipitation, temp_yeaterday, accumulation_yesterday):
        u"""Predict the presence or absence of snow using the given parameters."""
        return self.clf.predict([[temp, precipitation, temp_yeaterday, accumulation_yesterday]])[0]

    def train_data(self):
        u"""Returns the training data, loading it from the CSV file on first use."""
        if self.weather_data is None:
            self.weather_data = self.load_csv()

        return self.weather_data

    def accumulation_exist(self, accumulation):
        u"""Takes the snow cover (cm) and returns 1 if there is snow, 0 if not."""
        if accumulation > 0:
            return 1
        else:
            return 0

    def best_score_clf(self):
        u"""Calculate the score for each type of model and return the object with the highest score."""
        features = self._features()
        labels = self._labels()

        # Only 4 features are used this time, so no dimensionality reduction is applied. The following lines are therefore commented out.
        # lsa = TruncatedSVD(3)
        # reduced_features = lsa.fit_transform(features)

        best = LinearSVC()
        best_name = self.CLF_NAMES[0]
        best_score = 0

        for clf_name in self.CLF_NAMES:
            clf    = globals()[clf_name]()  # Look up the imported class by name and instantiate it
            scores = cross_val_score(clf, features, labels, cv=5)  # Pass reduced_features here if the features were reduced
            score  = sum(scores) / len(scores)  # Mean cross-validation accuracy of the model
            print("%s score:%s" % (clf_name,score))
            if score >= best_score:
                best = clf
                best_name = clf_name
                best_score = score

        print("------\n Model to use: %s" % best_name)
        return best  # Return the best-scoring model, not the last one tried

    def train(self):
        u"""Perform learning. Determine which model to use before the actual learning and let it be selected automatically."""
        self.clf = self.best_score_clf()
        self.clf.fit(self._features(), self._labels())

    def _features(self):
        u"""Returns training data."""
        weather = self.train_data()
        return weather["data"]

    def _labels(self):
        u"""Returns the resulting label."""
        weather = self.train_data()
        return weather["target"]

    def judge(self, datetime_str):
        u"""Receives the date character string and determines the snow cover."""
        print("------")
        result = self.predict_with_date(datetime_str)
        print("%s:Expected:%s Actual:%s" % (datetime_str, result, self.is_snow_exist(datetime_str)))

        if result == 1:
            print("Snow piles up")
        else:
            print("Snow doesn't pile up")

if __name__ == "__main__":
    forecaster = SnowForecast()
    forecaster.train()
    
    #####################################################
    #Judgment is made by specifying the date and giving the parameters used for learning.
    #####################################################
    forecaster.judge("2006/2/19 00:00:00")
    forecaster.judge("2012/2/2 00:00:00")
    forecaster.judge("2014/2/2 13:00:00")
    forecaster.judge("2015/2/28 00:00:00")
    
    #######################################
    #Let's give a parameter directly and make a prediction.
    #######################################
    print("------")
    temp = 0.0
    precipitation = 0
    temp_yeaterday = 3.0
    accumulation_yesterday = 2 
    result = forecaster.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)
    print("[temperature:%s] [Precipitation:%s] [Yesterday's temperature:%s] [Yesterday's snowfall:%s]" %
          (temp, precipitation, temp_yeaterday, accumulation_yesterday))
    
    print("judgment result: %s" % result)
    
    if result == 1:
        print("Snow piles up")
    else:
        print("Snow doesn't pile up")
        
    #########################################################
    # Give the parameters directly and predict again (yesterday's temperature changed to -3.0 °C).
    #########################################################
    print("------")
    temp_yeaterday = -3.0
    result = forecaster.predict(temp, precipitation, temp_yeaterday, accumulation_yesterday)
    print("[temperature:%s] [Precipitation:%s] [Yesterday's temperature:%s] [Yesterday's snowfall:%s]" %
          (temp, precipitation, temp_yeaterday, accumulation_yesterday))
    
    print("judgment result: %s" % result)
    
    if result == 1:
        print("Snow piles up")
    else:
        print("Snow doesn't pile up")
        
    print("------")

Since this is my first time using Python, I would appreciate it if you could point out any strange points.

Execution is as follows

$ python snow_forecaster.py

The execution result is as follows.

LinearSVC score:0.965627801273
AdaBoostClassifier score:0.969820996581
ExtraTreesClassifier score:0.961194223678
GradientBoostingClassifier score:0.966826266875
RandomForestClassifier score:0.958078728911
------
Model to use: AdaBoostClassifier
------
2006/2/19 00:00:00:Expected:1 Actual:1
Snow piles up
------
2012/2/2 00:00:00:Expected:1 Actual:1
Snow piles up
------
2014/2/2 13:00:00:Expected:0 Actual:0
Snow doesn't pile up
------
2015/2/28 00:00:00:Expected:0 Actual:0
Snow doesn't pile up
------
[temperature:0.0] [Precipitation:0] [Yesterday's temperature:3.0] [Yesterday's snowfall:2]
judgment result: 0
Snow doesn't pile up
------
[temperature:0.0] [Precipitation:0] [Yesterday's temperature:-3.0] [Yesterday's snowfall:2]
judgment result: 1
Snow piles up
------

First, a score is calculated for each of several models, and training is performed with the best-scoring one. Here, `AdaBoostClassifier` was selected.
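The selection step can also be written without looking classes up by name. The following is a minimal sketch, assuming a recent scikit-learn where `cross_val_score` lives in `sklearn.model_selection`. The data is synthetic (generated with `make_classification`, not the weather data), and `pick_best` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def pick_best(candidates, features, labels, cv=5):
    """Return (name, unfitted estimator, mean CV accuracy) of the best candidate."""
    best = None
    for name, clf_class in candidates.items():
        clf = clf_class()  # default hyperparameters, as in the script above
        score = cross_val_score(clf, features, labels, cv=cv).mean()
        print("%s score:%s" % (name, score))
        if best is None or score > best[2]:
            best = (name, clf, score)
    return best


# Synthetic stand-in data: 4 features and binary labels (NOT the Tonami data).
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

candidates = {
    "LinearSVC": LinearSVC,
    "AdaBoostClassifier": AdaBoostClassifier,
    "RandomForestClassifier": RandomForestClassifier,
}

name, clf, score = pick_best(candidates, X, y)
clf.fit(X, y)  # cross_val_score fits clones, so the chosen estimator still needs fitting
print("Model to use: %s" % name)
```

Mapping names to classes in a dictionary keeps the printed labels and the instantiated objects in one place, and the function returns the best estimator explicitly.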

By the way, the code that selects the model is taken from the article "What I learned in the 2 months until I released a product as a complete machine learning amateur".

Then training is run and the predictions are made. In the second half, I keep today's temperature, precipitation, and yesterday's snow cover the same and change only yesterday's temperature. With 2 cm of snow yesterday and a temperature of 3.0 °C yesterday, the model judges that there is no snow today, but when I changed yesterday's temperature to -3.0 °C, the judgment changed to "there is snow". This seems intuitively correct: there was snow yesterday, and it is likely to remain unmelted if the temperature was low.

Impressions etc.

I don't understand any of the theory, but even as a beginner I was able to try out a machine learning app in a few hours. For the snow cover prediction, I first tried using only today's data, which didn't work well. Once I added yesterday's data, the prediction accuracy jumped to nearly 97%. That makes sense: if snow was on the ground yesterday, it is very likely to still be there the next day. I came away thinking that choosing the input parameters based on how things work in the real world is what improves accuracy. It's interesting, so I'll keep experimenting a little more, and once I'm used to it, I'll study some of the theory.
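The observation that yesterday's snow says a lot about today's can be checked with a simple baseline before training anything: predict that snow is present exactly when it was present the day before. The sketch below does this on a short made-up series of daily snow depths (illustrative numbers, not taken from the Tonami data). `persistence_accuracy` is a hypothetical helper name.

```python
def persistence_accuracy(daily_snow_cm):
    """Accuracy of the rule 'snow today if and only if snow yesterday'."""
    has_snow = [depth > 0 for depth in daily_snow_cm]
    pairs = list(zip(has_snow[:-1], has_snow[1:]))  # (yesterday, today)
    hits = sum(1 for yesterday, today in pairs if yesterday == today)
    return hits / len(pairs)


# Made-up daily snow depths in cm, for illustration only.
depths = [5, 4, 2, 0, 0, 0, 3, 6, 6, 1, 0]
print(persistence_accuracy(depths))
```

A baseline like this gives a reference point: a trained model is only adding something if its cross-validation score is clearly higher than what the previous day's state alone achieves.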
