[Python] Let's visualize the rainfall data released by Shimane Prefecture

Introduction

Continuing from the day before yesterday's post, I was wondering whether I could do something more with the data released by Shimane Prefecture. It turns out that rainfall data covering a wide area has also been published, so I tried visualizing it.

[Shimane Prefecture] Daily rainfall data (for 40 days)

Check the procedure

Look at the structure of the published pages

Catalog page

First, there is the catalog page.

https://shimane-opendata.jp/db/organization/main

Rainfall page

The catalog page links to a "rainfall data" page.

https://shimane-opendata.jp/db/dataset/010009

Daily data page

The rainfall data, recorded every 10 minutes, appears to be published as one CSV file per day. For example, to download the data for June 30, you access the following URL.

https://shimane-opendata.jp/db/dataset/010009/resource/1a8248dd-cd5e-4985-b01f-6ac79fe72140

July 1st ...

https://shimane-opendata.jp/db/dataset/010009/resource/0c9ba4db-b8eb-4b90-8e38-10abf0fd01ee

Huh? The URLs are completely different from day to day.

Daily CSV

Furthermore, the CSV URL is ...

https://shimane-opendata.jp/storage/download/1ddaef55-cc94-490c-bd3f-7efeec17fcf9/uryo_10min_20200701.csv

Yes, this is hard to work with!

Procedure

So, let's try the visualization with the following procedure.

  1. Get the URLs of the daily pages from the rainfall data page
  2. Get the CSV URL from each daily page
  3. Get the data from the obtained CSV URLs
  4. Process the data
  5. Visualize it

As before, we will use Colaboratory.

Get the URLs of the daily pages

Get the URLs of the daily pages with the following script.

python


import requests
from bs4 import BeautifulSoup

urlBase = "https://shimane-opendata.jp"
urlName = urlBase + "/db/dataset/010009"

def get_tag_from_html(urlName, tag):
  # Fetch the page and return every element with the given tag
  url = requests.get(urlName)
  soup = BeautifulSoup(url.content, "html.parser")
  return soup.find_all(tag)

def get_page_urls_from_catalogpage(urlName):
  # Collect links to the daily resource pages: <a> elements whose
  # class is "heading" and whose href contains "resource"
  urlNames = []
  elems = get_tag_from_html(urlName, "a")
  for elem in elems:
    try:
      string = elem.get("class")[0]
      if string == "heading":
        href = elem.get("href")
        if href.find("resource") > 0:
          urlNames.append(urlBase + href)
    except (TypeError, IndexError, AttributeError):
      # skip <a> elements with no class or href attribute
      pass
  return urlNames

urlNames = get_page_urls_from_catalogpage(urlName)
print(urlNames)

Get the CSV URLs

Get the CSV URLs with the following script.

python


def get_csv_urls_from_url(urlName):
  # Collect the CSV links on a daily resource page
  # (only the first one found is returned)
  urlNames = []
  elems = get_tag_from_html(urlName, "a")
  for elem in elems:
    try:
      href = elem.get("href")
      if href.find(".csv") > 0:
        urlNames.append(href)
    except AttributeError:
      # skip <a> elements that have no href attribute
      pass
  return urlNames[0]

urls = []

for urlName in urlNames:
  urls.append(get_csv_urls_from_url(urlName))

print(urls)

Get data from URL and create data frame

Read the data directly from the URLs obtained above. However, since 10-minute and hourly CSVs are mixed together, only the 10-minute files are targeted here. Also note that the character encoding is Shift_JIS, and that the first two rows contain information other than data, so we exclude them.

python


import pandas as pd

df = pd.DataFrame()

for url in urls:
    if url.find("10min") > 0:  # keep only the 10-minute CSVs
        # Shift_JIS-encoded; the first two rows are metadata, not data
        df = pd.concat([df, pd.read_csv(url, encoding="Shift_JIS").iloc[2:]])

df.shape
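
As a side note, the same thing can be written more compactly by letting `read_csv` skip the metadata rows itself. A minimal sketch, not the original approach, assuming (per the description above) that the two non-data rows are the second and third lines of each file and the first line is the header:

python


# A sketch: skip file lines 2 and 3 at read time instead of slicing after
df = pd.concat(
    pd.read_csv(url, encoding="Shift_JIS", skiprows=[1, 2])
    for url in urls
    if "10min" in url
)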

Data confirmation and processing

python


df.info()

Running the above shows the column information.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2880 entries, 2 to 145
Columns: 345 entries, Observatory to Unnamed: 344
dtypes: object(345)
memory usage: 7.6+ MB

... there are as many as 345 columns.

If you look at the downloaded data in Excel, you can see that each observatory has both a 10-minute rainfall column and a cumulative rainfall column, and that the cumulative rainfall column's header is blank, so I decided to exclude the cumulative rainfall columns.

(Screenshot: the downloaded CSV opened in Excel, 2020-07-15)
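
Because those cumulative-rainfall columns have blank headers, pandas auto-names them "Unnamed: N" (as the `df.info()` output above shows). A quick sketch to confirm how many there are:

python


# The blank-headered cumulative columns show up as "Unnamed: N"
unnamed = [col for col in df.columns if "Unnamed" in col]
print(len(unnamed), unnamed[:3])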

By the way, the explanation of cumulative rainfall is as follows.

Cumulative rainfall is the total amount of rain from the time rainfall begins to the time it ends. Rainfall is defined as having begun when the reading rises from 0.0 mm to 0.5 mm or more, and as having ended once more than 6 hours pass with no rainfall recorded; the cumulative rainfall is reset when the rainfall ends.
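
Just to make that rule concrete, here is a minimal sketch of the reset logic (purely illustrative; the published data already contains this column, and the function name and the 36-interval threshold for 6 hours of 10-minute readings are my own assumptions):

python


import pandas as pd

def cumulative_rainfall(rain_10min: pd.Series) -> pd.Series:
    # rain_10min: 10-minute rainfall in mm, in chronological order
    total, dry_intervals, values = 0.0, 0, []
    for r in rain_10min:
        if r > 0.0:
            total += r
            dry_intervals = 0  # the rain event continues
        else:
            dry_intervals += 1
            # 6 hours (36 ten-minute intervals) without rain ends the event
            if dry_intervals >= 36:
                total = 0.0  # reset the cumulative rainfall
        values.append(total)
    return pd.Series(values, index=rain_10min.index)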

Since every column's dtype is object, the numeric data is apparently stored as strings ...

Also, if you take a look inside, the strings "Not collected", "Missing", and "Maintenance" appear in the data. That textual information has to be removed before the values can be converted to real numbers. The date and time are also strings, so they have to be converted to datetime values as well.

So, execute the following script.

python


# Drop the cumulative-rainfall columns: their headers are blank in the
# CSV, so pandas names them "Unnamed: N", which contains "name"
for col in df.columns:
  if col.find("name") > 0:
    df.pop(col)

# After dropping the two metadata rows, the first column ("Observatory")
# actually holds each row's timestamp; use it as the index
df.index = pd.to_datetime(df["Observatory"])
df = df.sort_index()

# Replace the textual status markers with a sentinel value
df = df.replace({'Not collected': '-1', 'Missing': '-1', 'Maintenance': '-1'})

# Convert every observatory column from string to float
cols = df.columns[1:]

for col in cols:
  df[col] = df[col].astype("float")
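
As an alternative sketch of my own (not the approach above): treating those markers as missing values instead of -1 keeps them from showing up as negative spikes in the plots.

python


# A sketch: coerce the textual markers to NaN instead of a -1 sentinel
for col in cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")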

Visualization

Set up the environment so that Japanese text displays correctly, then try drawing the graphs.

python


# Install a helper that adds Japanese fonts to matplotlib (Colab lacks them)
!pip install japanize_matplotlib

import matplotlib.pyplot as plt
import japanize_matplotlib
import seaborn as sns

# Use a Japanese-capable font so observatory names render correctly
sns.set(font="IPAexGothic")

# 10-minute rainfall for the first five observatories, whole period
df[cols[:5]].plot(figsize=(15,5))
plt.show()

# The same observatories, from July 12 onward
df["2020-07-12":][cols[:5]].plot(figsize=(15,5))
plt.show()

(Plot: 10-minute rainfall for the first five observatories over the whole period)

(Plot: the same observatories from July 12 onward)

You can see the rain in the last few days at a glance.
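
If you want a smoother view (a sketch of my own, not part of the original walkthrough), you could also aggregate the 10-minute readings into hourly totals before plotting; this assumes the status markers have been converted to NaN or otherwise handled first.

python


# A sketch: resample the 10-minute readings into hourly rainfall totals
hourly = df[cols[:5]].resample("1H").sum()
hourly.plot(figsize=(15,5))
plt.show()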

Now, what shall we do with this next?
