[PYTHON] Try to predict the value of the water level gauge by machine learning using the open data of Data City Sabae

Introduction

While looking for good material to use as an example of working with open data, I found that water level data is published on the Data City Sabae site, so I tried applying machine learning to it.

http://data.city.sabae.lg.jp/top_page/

Download data

On the "Open Data" page of the above site, the "Disaster Prevention" group has the following entry.

Water level data(Sabae City, Fukui Prefecture)
Rontegawa drainage pump station[CSV]
It is the data of the water level gauge in Sabae city. Water level unit:cm data:1000 cases

[Image: Screenshot 2016-10-28 12.41.43.png]

The page says there are 1,000 records by default, but I was able to retrieve somewhat more than that, so I will use the larger set.

In addition, past weather data can be downloaded from the Japan Meteorological Agency, so download the precipitation data of nearby Fukui City.

http://www.data.jma.go.jp/gmd/risk/obsdl/index.php

Loading the library

Use Jupyter Notebook to load the following libraries.

python


from ipywidgets import FloatProgress
from IPython.display import display

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import datetime

Reading water level data

python


filename = "sparql.csv"
df = pd.read_csv(filename, header=None)

Let's display it as a graph.

python


#The rows are newest-first, so reverse them to plot in chronological order
#(.ix was removed from pandas; .iloc gives the same positional access)
tmp = df.iloc[::-1, 2].to_numpy()

pd.DataFrame({'level': tmp}).plot(figsize=(15,5))

[Figure: water level plotted in chronological order]

The water level is recorded every 5 minutes, so the data is reduced to hourly values (the reading taken on the hour) to match the time series of the Japan Meteorological Agency data.

python


#Get data start and end dates
dt1 = datetime.datetime.strptime(df[1][len(df)-1],"%Y-%m-%dT%H:%M:%S+09:00")
dt1 = datetime.datetime(dt1.year,dt1.month,dt1.day,0,0)
dt2 = datetime.datetime.strptime(df[1][0],"%Y-%m-%dT%H:%M:%S+09:00")

print("dt1:",dt1)
print("dt2:",dt2)

#Get the number of days of data
dt = (dt2-dt1).days + 1

#Prepare an array to store hourly data
level = [0] * dt * 24
dt_al = [0] * dt * 24

#Progress bar settings
fp = FloatProgress(min=0, max=len(df))
display(fp)

for i in range(len(df)):
    wk = datetime.datetime.strptime(df[1][len(df)-i-1],"%Y-%m-%dT%H:%M:%S+09:00")
    pos = (wk - dt1).days * 24 + wk.hour
    dt_al[pos] = datetime.datetime(wk.year,wk.month,wk.day,wk.hour,0)

    if wk.minute == 0:
        level[pos] = df[2][len(df)-1-i]
    
    fp.value = i
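The same hourly alignment can also be sketched with pandas. This is only a minimal illustration on made-up rows, not the article's data: it assumes the same layout as sparql.csv (timestamp in column 1, level in column 2, newest row first), and the sample values and the names `raw`, `on_hour` and `hourly` are invented. Unlike the arrays above, the grid here stops at the last on-the-hour reading rather than at the end of the last day.

```python
from datetime import datetime

import pandas as pd

FMT = "%Y-%m-%dT%H:%M:%S+09:00"  # same timestamp format as in the article

# Made-up sample in the assumed layout: column 1 = timestamp, column 2 = level (cm),
# newest row first.
raw = pd.DataFrame({
    0: ["a", "b", "c", "d"],
    1: ["2016-10-02T02:00:00+09:00", "2016-10-02T00:05:00+09:00",
        "2016-10-02T00:00:00+09:00", "2016-10-01T23:00:00+09:00"],
    2: [14, 13, 12, 10],
})

# Parse the timestamps the same way the article does
ts = pd.Series([datetime.strptime(s, FMT) for s in raw[1]])

# Keep only readings taken exactly on the hour, indexed by their timestamp
mask = ts.dt.minute == 0
on_hour = pd.Series(raw.loc[mask, 2].to_numpy(), index=pd.DatetimeIndex(ts[mask]))

# Reindex onto a complete hourly grid starting at midnight of the first day;
# hours without a reading become 0, the same "missing" marker the article uses
grid = pd.date_range(ts.min().normalize(), on_hour.index.max(), freq="h")
hourly = on_hour.reindex(grid, fill_value=0)

print(hourly)
```

Using 0 as the marker for a missing hour mirrors the article; it only works because a real reading of exactly 0 cm is not expected here.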

Reading precipitation data

When reading the data, note that the CSV begins with rows that are not part of the data (skipped here with skiprows) and that its character encoding is Shift JIS. Also, try displaying the loaded data as a graph.

python


filename = "data.csv"
df = pd.read_csv(filename,encoding="SHIFT-JIS",skiprows=4)
df.plot(figsize=(15,5))
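To see what the `encoding` and `skiprows` arguments do, here is a minimal sketch that builds a tiny Shift JIS file in memory and reads it back. The file contents are invented and only loosely imitate a CSV with preamble lines; the real layout of the Japan Meteorological Agency download may differ, which is why the number of skipped rows here (2) is not the article's 4.

```python
import io

import pandas as pd

# Invented sample: two preamble lines, then a header row and two data rows
text = (
    "ダウンロードした時刻:2016/10/28 12:00:00\n"
    "サンプル\n"
    "年月日時,降水量(mm)\n"
    "2016/10/1 1:00:00,0.5\n"
    "2016/10/1 2:00:00,1.0\n"
)
data = text.encode("shift_jis")

# Without encoding="shift_jis" the Japanese header could not be decoded correctly;
# skiprows=2 drops the preamble so the third line becomes the header
sample = pd.read_csv(io.BytesIO(data), encoding="shift_jis", skiprows=2)

print(sample)
```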

[Figure: plot of the precipitation data]

Store water level and precipitation data in the same format array

To make the data easier to handle, store it in an array and then display it as a graph.

python


#Array preparation
rain = [0]*len(level)

for i in range(len(df)):
    #.ix was removed from pandas; .iloc gives the same positional access
    wk = datetime.datetime.strptime(df.iloc[i, 0],"%Y/%m/%d %H:%M:%S")
    if (wk < dt2) and (wk - dt1).days >= 0:
        pos = (wk - dt1).days * 24 + wk.hour
        rain[pos] = df.iloc[i, 1]

#Check the data on the graph
pp = pd.DataFrame({'level': np.array(level), 'rain': np.array(rain)*15})
pp.plot(figsize=(15,5))

[Figure: hourly water level and precipitation (precipitation scaled by 15)]

There seems to be a lot of missing data ... (sweat)

Examination of learning data

Looking at the graph, the water level tends to rise after it rains, so let's use the precipitation for the 48 hours up to each point in time as the input and the water level at that time as the output (teacher data).

python


#Get 48 hours of precipitation in a two-dimensional array
row = len(level)
tmp = np.zeros((row,48))

fp = FloatProgress(min=0, max=row)
display(fp)

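for i in range(row):
    for j in range(len(tmp[0])):
        pos = row - 1 - i - j
        #Leave 0 for hours before the start of the data
        #(a negative index would wrap around to the end of rain)
        if pos >= 0:
            tmp[row-1-i][j] = rain[pos]
    fp.value = i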
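The same kind of lag matrix can be built one column at a time with NumPy slicing instead of a double loop. The following is a minimal sketch, assuming that hours before the start of the series should be filled with 0; the helper name `lag_matrix` and the sample values are made up for illustration.

```python
import numpy as np

def lag_matrix(series, n_lags):
    """Return an array whose row t holds series[t], series[t-1], ..., series[t-n_lags+1].

    Positions before the start of the series are filled with 0.
    """
    values = np.asarray(series, dtype=float)
    out = np.zeros((len(values), n_lags))
    for j in range(min(n_lags, len(values))):
        # Column j is the series shifted down by j steps
        out[j:, j] = values[:len(values) - j]
    return out

rain_demo = [0.0, 1.0, 2.0, 0.0, 3.0]   # made-up hourly precipitation
lags = lag_matrix(rain_demo, 3)
print(lags)
```

With the article's data this would correspond to something like `lag_matrix(rain, 48)`, where column 0 is the precipitation at the current hour and column 47 the precipitation 47 hours earlier.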

Trimming missing data

Rows for which no water level was obtained are not needed, so remove them.

python


#Check the number of missing data (hours where the level is still the initial 0)
level_arr = np.array(level)
num = int((level_arr == 0).sum())
print(num)

#Keep only the rows that have a recorded water level
mask = level_arr > 0
X = tmp[mask]
y = list(level_arr[mask])

#Check the data on the graph
pp = pd.DataFrame({'level': np.array(y), 'rain': X[:,0]*20})
pp.plot(figsize=(15,5))

[Figure: water level and precipitation after removing the missing rows (precipitation scaled by 20)]

Looking at the graph, you can see that the data has become much cleaner.

Machine learning

Train a model on the cleaned data and check the score of its predictions.

python


#Load the function for splitting the data
#(sklearn.cross_validation was removed from scikit-learn; train_test_split now lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split

#Split the labeled data into a training set (X_train, y_train) and a test set (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

#Standardization (zero mean, unit variance)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Model settings (random forest)
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200, max_depth=50, random_state=42)

#Learning and prediction
model.fit(X_train, y_train)
result = model.predict(X_test)
result.shape

#Score (for a classifier this is the mean accuracy)
print(model.score(X_test,y_test))

Result is...

python


0.185628742515

... no!
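One thing to keep in mind when reading this number: for a scikit-learn classifier, score returns the mean accuracy, that is, the fraction of test samples whose predicted label exactly equals the true one. With water levels in cm treated as class labels, a prediction that is off by only 1 cm still counts as a miss. The sketch below uses invented values to contrast exact-match accuracy with the mean absolute error; it is an illustration, not a re-analysis of the data above.

```python
import numpy as np

actual = np.array([30, 31, 33, 40, 38])   # invented water levels in cm
pred = np.array([30, 32, 32, 41, 37])     # invented predictions, mostly off by 1 cm

# Fraction of predictions that match exactly (what a classifier's score reports)
accuracy = np.mean(pred == actual)

# Average size of the error in cm
mae = np.mean(np.abs(pred - actual))

print(accuracy, mae)
```

Here only one prediction in five matches exactly, even though every prediction is within 1 cm of the actual value.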

Verification of results

The score is low, but let's check the result with a graph.

python


pp = pd.DataFrame({'act': np.array(y_test), "pred": np.array(result)})
pp.plot(figsize=(15,5))

[Figure: actual vs. predicted water level on the test set]

... Hmm, not great.

As a small tweak, split the data in time order instead of randomly, so that the earlier part is used for learning and the later part for prediction, as shown below.

python


num = int(len(X) * 0.8)
print(len(X), num, len(X)-num)

X_train = X[:num]
X_test = X[num:]
y_train = y[:num]
y_test = y[num:]
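The remaining steps are not shown here; presumably the scaling, fitting and plotting cells from the previous section are simply run again on the new split. A minimal sketch of that flow follows, assuming the same scaler and model type as above. It uses synthetic stand-in data (`X_demo` and `y_demo` are made up) and a smaller n_estimators to keep it quick, so its score says nothing about the real data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, ordered in time: 48 "rain" features per row
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 48)
y_demo = (X_demo[:, :3].sum(axis=1) * 10).astype(int)  # stand-in integer levels

# Chronological split: first 80% for training, last 20% for testing
num = int(len(X_demo) * 0.8)
X_train, X_test = X_demo[:num], X_demo[num:]
y_train, y_test = y_demo[:num], y_demo[num:]

# Fit the scaler on the training part only, then apply it to both parts
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Same model type as in the article, with fewer trees for speed
model = RandomForestClassifier(n_estimators=20, max_depth=50, random_state=42)
model.fit(X_train, y_train)
result = model.predict(X_test)

print(len(X_train), len(X_test), model.score(X_test, y_test))
```

Because the test rows all come after the training rows, this split is closer to how a forecast would actually be used: the model never sees data from the period it is asked to predict.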

[Figure: actual vs. predicted water level with the chronological split]

... Oh! That looks a bit better (^-^)

Thinking about what this result could be used for: by continuously predicting the water level from precipitation, it might be possible to detect a sudden rise in the water level and issue an evacuation warning.
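As a rough illustration of that idea, the sketch below flags the hours at which a series of predicted levels rises by more than a threshold compared with the previous hour. The function name `find_sudden_rises`, the 10 cm threshold and the sample values are all hypothetical; real warning criteria would have to come from the people responsible for disaster prevention.

```python
import numpy as np

def find_sudden_rises(levels, threshold_cm):
    """Return the indices where the level rose by more than threshold_cm since the previous step."""
    levels = np.asarray(levels, dtype=float)
    rises = np.diff(levels)  # rises[i] = levels[i + 1] - levels[i]
    return [int(i) + 1 for i in np.where(rises > threshold_cm)[0]]

predicted = [30, 31, 45, 47, 60, 58]   # invented hourly predictions in cm
print(find_sudden_rises(predicted, threshold_cm=10))
```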

With that in mind, I hope more local governments will release such data.

What should I do next?

Postscript

I improved the accuracy with a learning method different from the one in this article and was able to predict the water level one hour ahead, so I wrote a follow-up. If you are interested, please also see the following article.

Using open data from Data City Sabae to predict water level gauge values by machine learning Part 2
