[PYTHON] [Note] Let's try to predict the amount of electricity used! (Part 1)

Introduction

I have been studying deep learning for about half a year, and recently I have become interested in analyzing time-series data. Following the site below, I would like to work through what Mr. Iwahashi (a former Tableau employee) does there, while also studying Python and Chainer.

URL: Deep learning starting from data understanding by Tableau

Purpose: Create a model that predicts power consumption from meteorological data

--Explanatory variables: meteorological data + α
--Explained variable: power consumption

This is still very much a beginner's attempt, so I'm sure it's far from the smartest approach. Comments like "it's better to do it this way" or "you can have more fun doing this!" are very welcome (laughs). It's a little long, but I hope you'll read it.

1. Data acquisition

The data can be downloaded as CSV files from the TEPCO and Japan Meteorological Agency sites below.

TEPCO: Download past power usage record data
Japan Meteorological Agency: Past weather data download

From here, I obtained the data for 2018 (one year's worth). For the meteorological data, I decided to use the same variables as Mr. Iwahashi:

--Temperature
--Precipitation
--Daylight hours
--Insolation
--Relative humidity
--Local barometric pressure

After downloading the data, let's load it with pandas. I used Google Colaboratory for the analysis. (I started with JupyterLab, but it took an unusually long time to draw graphs, so I switched to Google Colaboratory partway through.)

Google Drive connection

To read the files, upload the downloaded CSV files to Google Drive and access them from Google Colaboratory.

First, import the libraries. After this, I will import additional libraries as needed (you could import everything at once, but since I was feeling my way through, I imported each one when I needed it).


#Library import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

You can use the code below to connect Google Colaboratory to Google Drive. A quick search will turn up detailed instructions, so please refer to those.

Reference: The story that mounting on Google Drive in Colaboratory was easy


#Connect with Google Drive
from google.colab import drive
drive.mount('/content/drive')

Once you've gotten this far, it's time to read the file. The "xxx" below can be any folder/file name on your Google Drive.


# Read the file and display the DataFrame
df = pd.read_csv('drive/My Drive/xxx/xxx.csv')
df

2. Preprocessing 1: Read >> Check data >> Combine data

(a) Meteorological data

**Preprocessing with Python**

Mr. Iwahashi, mentioned above, does his preprocessing and visualization with Tableau (Desktop/Prep), but since I am studying Python this time, I would like to do the same (or similar) work using only Python.

First, let's take a look at the meteorological data (data.csv) downloaded from the Japan Meteorological Agency website. Opened in Excel, it looks like the following; there are rows and columns we don't need.

2020-05-12.png

Data shaping

Extract the rows and columns you want to use from this dataset. (With a dataset of this size it would honestly be easier to do this in Excel, but here I deliberately do it in Python, since practicing Python is the point.)

df_weather = pd.read_csv('xxx/weather_data_1_2.csv', header=[0], skiprows=3, engine='python') #index_col=0
df_weather = df_weather.iloc[1:,[0,1,4,8,12,18,21]]
df_weather.head()
The output looks like this:

Date and time     Temperature(℃)  Precipitation(mm)  Daylight hours(h)  Insolation(MJ/㎡)  Relative humidity(%)  Local barometric pressure(hPa)
2018/1/1 1:00:00  1.5             0.0                0.0                0.0               82.0                  1009.6
2018/1/1 2:00:00  1.0             0.0                0.0                0.0               83.0                  1009.5
2018/1/1 3:00:00  1.2             0.0                0.0                0.0               80.0                  1009.0
2018/1/1 4:00:00  0.6             0.0                0.0                0.0               85.0                  1008.6
2018/1/1 5:00:00  1.9             0.0                0.0                0.0               80.0                  1008.8

On the first line, reading the data uses **index specification (index_col=)**, **header specification (header=)**, and **skipping unnecessary rows (skiprows=)** to get roughly the shape we want. On the second line, **.iloc[]** selects the required rows and columns.

Next, let's change the column names, since Japanese column names are awkward to work with. They can all be converted at once by creating a dictionary variable of the form {'old name': 'new name'} and passing it to rename(), as follows.

#Convert column names to English
columns = {'Date and time':'Date_Time',
           'temperature(℃)':'Temperature',
          'Precipitation(mm)':'Precipitation',
          'Daylight hours(time)':'Daylighthours',
          'Insolation(MJ/㎡)':'SolarRadiation',
          'Relative humidity(%)':'Humidity',
          'Local barometric pressure(hPa)':'Airpressure'}

df_weather.rename(columns=columns, inplace=True)
df_weather.head(3)

Next, let's check the data. I always check the following (please let me know if there is anything else worth checking).

# Check the data
print(df_weather.shape)
print(df_weather.dtypes)
df_weather.info()
print(df_weather.describe())

Combining multiple files vertically (pd.concat)

When downloading data from the Meteorological Agency, there is a limit on how much can be downloaded at once, so one year of data has to be downloaded in several batches (in my case, I could only get about two months of data per file). So the multiple files need to be joined.

Here, the six DataFrames are named df_weather1 through df_weather6.


df_weather_integrate = pd.concat([df_weather1, df_weather2, df_weather3, df_weather4, df_weather5, df_weather6],ignore_index=True)

I named the combined DataFrame df_weather_integrate (a bit long...) and used **pd.concat()** to stack the six files vertically.
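The role of ignore_index=True above is easy to miss: without it, the stacked frame keeps each file's original row numbers, so the index repeats. A minimal sketch with two tiny stand-in frames (not the real weather files):

```python
import pandas as pd

# Two small stand-in frames playing the role of df_weather1 and df_weather2
df_a = pd.DataFrame({'Temperature': [1.5, 1.0]})
df_b = pd.DataFrame({'Temperature': [0.6, 1.9]})

# Without ignore_index, the original indices 0,1 repeat after stacking
print(pd.concat([df_a, df_b]).index.tolist())  # [0, 1, 0, 1]

# With ignore_index=True, rows are renumbered 0..3
df_ab = pd.concat([df_a, df_b], ignore_index=True)
print(df_ab.index.tolist())  # [0, 1, 2, 3]
```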

Checking for missing values (.isnull().any() / .isnull().sum())

Next, check for missing values with the familiar **dataframe.isnull().any()**, which returns a boolean (True or False) for each column.

df_weather_integrate.isnull().any()

Date_Time         False
Temperature       False
Precipitation     False
Daylighthours     True
SolarRadiation    True
Humidity          True
Airpressure       False

Now that we know there are missing values in Daylighthours, SolarRadiation, and Humidity, let's check how many there are with **.isnull().sum()**.

df_weather_integrate.isnull().sum()

Date_Time          0
Temperature        0
Precipitation      0
Daylighthours      1
SolarRadiation     1
Humidity          27
Airpressure        0

Now we know the number of missing values in the meteorological data. We will handle these later, together with the power consumption data.
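As a preview of that step (an illustration only; the actual handling may differ), one common option for gaps in a numeric time series is linear interpolation:

```python
import numpy as np
import pandas as pd

# Toy series with one gap, standing in for e.g. the Humidity column
s = pd.Series([80.0, np.nan, 84.0])

# Linear interpolation fills the gap from its neighbours
filled = s.interpolate()
print(filled.tolist())  # [80.0, 82.0, 84.0]
```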

(b) Power consumption data

Next, let's take a look at the power consumption data downloaded from the TEPCO website. Opened in Excel, it looks like the following.

2020-05-13.png

df_Elec = pd.read_csv('xxx/elec_2018.csv', header=[0], skiprows=1, engine='python')

Here, as before, header= and skiprows= are used to remove unnecessary rows. For the power consumption data, I also renamed the columns (Japanese >> English), checked the data types, and checked for missing values, just as with the meteorological data (details omitted).
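The omitted steps mirror the weather-data ones. A sketch with made-up column names (the actual TEPCO CSV has Japanese headers, which will differ):

```python
import pandas as pd

# Stand-in for the loaded TEPCO data; the real Japanese headers differ
df_elec_demo = pd.DataFrame({'date': ['2018/1/1'],
                             'time': ['0:00'],
                             'usage': [2962]})

# Rename columns via a dict, exactly as for the weather data
df_elec_demo = df_elec_demo.rename(columns={'date': 'DATE',
                                            'time': 'TIME',
                                            'usage': 'kW'})

print(df_elec_demo.columns.tolist())           # ['DATE', 'TIME', 'kW']
print(int(df_elec_demo.isnull().sum().sum()))  # 0 missing values
```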

[After conversion]

   DATE      TIME   kW
0  2018/1/1  0:00   2962
1  2018/1/1  1:00   2797
2  2018/1/1  2:00   2669
...

As you may have noticed, row 0 is 2018/1/1 0:00, while the weather data starts from 1:00 on January 1, 2018. So let's delete row 0 and reset the index.


df_Elec = df_Elec.drop(df_Elec.index[0])
df_Elec =df_Elec.reset_index(drop=True)
df_Elec.head(3)

[After reconversion]

   DATE      TIME   kW
0  2018/1/1  1:00   2962
1  2018/1/1  2:00   2797
2  2018/1/1  3:00   2669
...

Next, since the date and time of the meteorological data are combined into one column, combine the first and second columns of the power consumption data so that they have the same shape. Here, **pd.to_datetime** creates a new column (DATE_TIME) of datetime type.

# Combine DATE and TIME to create a new DATE_TIME column
df_Elec['DATE_TIME'] = pd.to_datetime(df_Elec['DATE'] +' '+ df_Elec['TIME'])
df_Elec = df_Elec[['DATE_TIME', 'kW']]

We are now ready to combine the weather and power consumption data. Let's merge the two using **pd.merge()**.

Horizontal merge of the two files (pd.merge)

#Combine two files
df = pd.merge(df_weather_integrate, df_Elec ,left_on='Date_Time', right_on='DATE_TIME', how='left')

The two files are now merged (horizontally)! Looking at the resulting DataFrame, an unnecessary column (DATE_TIME) remains, and kW is NaN in the last row, so let's deal with these last.

# Keep only the columns we use
df = df[['Date_Time','Temperature', 'Precipitation', 'Daylighthours', 'SolarRadiation', 'Humidity', 'Airpressure', 'kW']]
# Drop row 8759, where the power and weather data no longer line up
df.drop(8759, inplace=True)
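Why the last row ends up as NaN can be seen in miniature: with how='left', a weather timestamp that has no matching power row gets NaN for kW, and that is exactly the row drop() removes. A tiny sketch with synthetic timestamps (not the real data):

```python
import pandas as pd

# Weather has one timestamp more than the power data
weather = pd.DataFrame({'Date_Time': ['1:00', '2:00', '3:00'],
                        'Temperature': [1.5, 1.0, 1.2]})
power = pd.DataFrame({'DATE_TIME': ['1:00', '2:00'],
                      'kW': [2962, 2797]})

# Left join keeps every weather row; the unmatched 3:00 row gets kW = NaN
merged = pd.merge(weather, power, left_on='Date_Time',
                  right_on='DATE_TIME', how='left')
print(int(merged['kW'].isnull().sum()))  # 1

# Drop the unmatched last row, as in the main text
merged = merged.drop(merged.index[-1])
print(int(merged['kW'].isnull().sum()))  # 0
```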

Now we can finally analyze the data! (That took a while...)

Next time, I would like to analyze the preprocessed data with Chainer!

Summary (Note)

This time I tried the following 4 points.

--Data acquisition
--Data reading & confirmation
--Data shaping
--Data combination

Next time, I would like to handle missing values.
