[PYTHON] Using open data from Data City Sabae to predict water level gauge values by machine learning Part 2

Introduction

Previously, I explained how to predict the water level from precipitation. After experimenting further, I was able to predict the water level one hour ahead with an accuracy of about 95%, so I am reorganizing the material into this article.

Try to predict the value of the water level gauge by machine learning using the open data of Data City Sabae

Operating environment

Machine: MacBook Air (13-inch, Early 2015)
Processor: 2.2 GHz Intel Core i7
Memory: 8 GB 1600 MHz DDR3
Python: 3.6.0 :: Anaconda 4.3.1 (x86_64)
Jupyter Notebook: 4.2.1

Environment construction procedure

For the environment setup, please refer to my own earlier article below (a bit of self-promotion).

Procedure to quickly create a deep learning environment on Mac with TensorFlow and OpenCV

Download data

Open Data List | Data City Sabae Portal Site

If you select the "Disaster prevention" group on the site above, the listing shown below is displayed. Click the "CSV" button and download the CSV from the link that appears.

(Screenshot: the "Disaster prevention" open data list with the CSV download button)

In addition, past weather data can be downloaded from the Japan Meteorological Agency, so we download the hourly precipitation data for Fukui City.

Japan Meteorological Agency | Past Meteorological Data Download

Loading the library

Use Jupyter Notebook to load the following libraries.

python


from ipywidgets import FloatProgress
from IPython.display import display

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
import datetime

Reading water level data

python


#Read file
filename = "sparql.csv"
df_level = pd.read_csv(filename, header=None, skiprows=1)

#Rename column
df_level.columns = ["url","datetime","level"]

#Convert date and time to timestamp
df_level["datetime"] = df_level.datetime.map(lambda _: pd.to_datetime(_))

#Set date and time as index
df_level.index = df_level.pop("datetime")

#Sort by date and time (probably not strictly necessary, but kept just in case)
df_level = df_level.sort_index()

#graph display
df_level["level"].plot(figsize=(15,5))

When executed, the following graph will be displayed.

(Graph: water level over time)

Reading precipitation data

Read the data and display it on a graph, keeping in mind that the CSV contains rows other than the measurements themselves and that the character encoding is Shift-JIS.

python


#Read file
filename = "data.csv"
df_rain = pd.read_csv(filename, encoding="SHIFT-JIS", skiprows=4)

#Rename column
df_rain.columns = ["datetime", "rain", "Information without phenomenon","quality information","Homogeneous number"]

#Convert date and time to timestamp
df_rain["datetime"] = df_rain.datetime.map(lambda _: pd.to_datetime(_))

#Set date and time as index
df_rain.index = df_rain.pop("datetime")

#graph display
df_level.level.plot(figsize=(15,5))
df_rain.rain.plot(figsize=(15,5))

When executed, the following graph will be displayed. By the way, orange is the amount of precipitation.

(Graph: water level and hourly precipitation)

Data processing

This time, since we are predicting the water level one hour ahead, we predict the maximum water level over the next hour from the change in water level over the previous hour and the amount of precipitation.

For that, the training data is as follows.

Input: the precipitation during the previous hour, and the water level every 5 minutes during the previous hour (10 points)
Output: the maximum water level during the following hour

Since the water level data is recorded at 5-minute intervals, there should be 12 points per 60-minute window, but the data has gaps and, depending on the timing, some windows contain fewer than 12 points. After some trial and error, I settled on using 10 points.
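
To see how many 5-minute samples each hourly window actually contains, a quick count like the one below can be run. This is my own check (not in the original article); it reuses df_level and df_rain from the code above.

python

#A quick check (my own addition): count how many 5-minute water level samples
#fall inside each hourly window of the precipitation data.
window_counts = []
rain_index = df_rain.index
for i in range(len(rain_index) - 1):
    window_counts.append(len(df_level[rain_index[i]:rain_index[i + 1]]))

#Distribution of the number of samples per hourly window
print(pd.Series(window_counts).value_counts().sort_index())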

In addition, since the precipitation data is labeled on the Japan Meteorological Agency's site as the value for the preceding hour, each value is treated as the precipitation for the hour before the timestamp in the index.

Based on this, the data processing method is as follows.

python


#Get Precipitation Index
ixs = df_rain.index

#Creating an array for data acquisition
df = []
y = []

for i in range(len(ixs)-2):
    
    #Get date and time from index
    dt1 = ixs[i]
    dt2 = ixs[i + 1]
    dt3 = ixs[i + 2]
    
    #Get water level data from date and time data
    d1 = df_level[dt1:dt2].level.tolist()
    d2 = df_level[dt2:dt3].level.tolist()

    if len(d1) > 10 and len(d2) > 10:
        #Get the maximum water level after 1 hour
        y.append(max(d2))

        #Sort the water level data one hour ago in descending order
        d1.sort()
        d1.reverse()
        #Get 10 points of data
        d1 = d1[:10]
        #Get precipitation data
        d1.append(df_rain.ix[i].rain)
        #Get an array of input data
        df.append(d1)
        
#Convert to data frame
df = pd.DataFrame(df)
df["y"] = y

#Check the number of data
print(df.shape)

When I executed it, (6863, 12) was displayed and I was able to get 6863 rows of data.

Machine learning

We train on the first 90% of the data and validate the result on the remaining 10%.

python


#Divide data into input and output
y = df.pop("y").as_matrix().astype("int").flatten()
X = df.as_matrix().astype("float")

#Divided to use 90% for learning and 10% for verification
num = int(len(X) * 0.9)
print(len(X), num, len(X)-num)

X_train = X[:num]
X_test = X[num:]
y_train = y[:num]
y_test = y[num:]

#Set a random forest as a learning model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(random_state=42)

#Learning and verification
model.fit(X_train, y_train)
result = model.predict(X_test)

#Score
print(model.score(X_test,y_test))

When I ran it, the score was "0.952915078747" (for a regressor, score() returns the coefficient of determination R²).

It is hard to get a feel for the number alone, so let's draw a graph.

python


#Put the actual values, predicted values and precipitation into one data frame
pp = pd.DataFrame({'act': np.array(y_test), "pred": np.array(result), "rain": X_test[:,-1]})
#Scale the precipitation so it is visible on the same axis as the water level
pp.rain = pp.rain * 5
plt.figure(figsize=(15,5))
plt.ylim(0,250)
plt.plot(pp)

(Graph: actual water level, predicted water level, and scaled precipitation for the validation period)

Blue is the actual water level and orange is the predicted water level; they overlap so much that the blue line can hardly be seen (^-^)

Wow!

Forecast

Now, let's take the water level at a certain point in time, vary the precipitation, and predict the water level one hour later.

python


import random

#Randomly select a row index
i = random.randint(0, len(df) - 1)
d = df.ix[i].as_matrix().tolist()
print(d)

#Get a test array
df_test = []

#Create test data by changing precipitation from 0 to 20
for i in range(21):
    temp = d[:10]
    temp.append(i)
    df_test.append(temp)
    
#Forecast
test = model.predict(np.array(df_test).astype("float"))

#graph display
plt.plot(test)

The data used were the following values.

python


[150.0, 149.0, 149.0, 148.0, 147.0, 147.0, 147.0, 146.0, 146.0, 146.0, 8.0, 147.0]

The graph of the forecast results is as follows.

(Graph: predicted water level vs. precipitation from 0 to 20 mm)

The X-axis is the precipitation and the Y-axis is the predicted water level. Looking at this graph, the water level rises gradually in proportion to the precipitation, but then jumps sharply above 10 mm and drops again at 13 mm...

I tried a few other test cases, and all of them produced somewhat distorted curves. Even if the prediction accuracy on the time series is high, this is not very useful... (-_-;)

Consideration

I expected the water level to rise as the amount of precipitation increases, but the predictions on the test data were a little different from what I expected and did not increase monotonically. This is probably because the model cannot correctly predict situations that are not included in the training data.
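
One way to see this is to check how much heavy rainfall the data actually contains. The quick check below is my own addition (not in the original article) and reuses df_rain from above; if hours with 10 mm or more are rare, the model has very few examples to learn that behaviour from.

python

#A quick check (my own addition): how is hourly precipitation distributed,
#and how many hours of heavy rain does the data actually contain?
print(df_rain.rain.describe())
print("hours with >= 10 mm:", (df_rain.rain >= 10).sum(), "out of", len(df_rain))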

Alright, let's consider the next method with this in mind!

Use of neural networks

Now let's try a recently popular algorithm. The steps up to and including data processing are the same; only the machine learning part changes, as shown below.

By the way, this kind of neural network is also known as a multi-layer perceptron. In addition, since neural networks work best when the input values are on a small, comparable scale, we normalize the training data.

python


#Divide data into input and output
y = df.pop("y").as_matrix().astype("int").flatten()
X = df.as_matrix().astype("float")

#Divided to use 90% for learning and 10% for verification
num = int(len(X) * 0.9)
print(len(X), num, len(X)-num)

X_train = X[:num]
X_test = X[num:]
y_train = y[:num]
y_test = y[num:]

#Data normalization
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

#Set a neural network as a learning model
from sklearn.neural_network import MLPRegressor
model = MLPRegressor(random_state=42)

#Learning and verification
model.fit(X_train, y_train)
result = model.predict(X_test)

#Score
print(model.score(X_test,y_test))

When executed, the score is "0.947163962045", which is slightly worse than the random forest (-_-;)
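
This run uses scikit-learn's default MLPRegressor settings (a single hidden layer and a fairly small number of training iterations). As an optional sketch of my own (not from the original article), a larger hidden layer and more iterations could also be tried; the values below are only illustrative.

python

#Optional sketch (my own addition): a larger network and more training
#iterations. The hyperparameter values are illustrative, not tuned.
from sklearn.neural_network import MLPRegressor

model_tuned = MLPRegressor(hidden_layer_sizes=(100, 50), max_iter=1000, random_state=42)
model_tuned.fit(X_train, y_train)
print(model_tuned.score(X_test, y_test))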

But for now, let's continue to the end with the default model.

python


import random

#Randomly select a row index
i = random.randint(0, len(df) - 1)
d = df.ix[i].as_matrix().tolist()
print(d)

df_test = []

#Create test data by changing precipitation from 0 to 20
for i in range(21):
    temp = d[:10]
    temp.append(i)
    df_test.append(temp)
    
#Input data normalization
d = scaler.transform(np.array(df_test).astype("float"))

#Forecast
test = model.predict(d)

plt.plot(test)

Let's run it. The data used this time were the following values.

[54.0, 54.0, 54.0, 53.0, 53.0, 53.0, 53.0, 53.0, 53.0, 53.0, 0.0, 53.0]

(Graph: predicted water level vs. precipitation, using the neural network)

There it is --------!!

Neural networks are amazing!!

Acknowledgments

Thank you to everyone involved with open data in Sabae City for providing this valuable data. I look forward to continuing to work with you.

Postscript

I have also published a document that summarizes the Jupyter Notebook used to run the code above, so please refer to it as well.

Water level prediction using open data in Sabae City, Fukui Prefecture-2017 version
