Meteorology x Python ~ From weather data acquisition to spectrum analysis ~

In previous articles I covered several ways of handling Japan Meteorological Agency data, but there is yet another acquisition method (at present, this one seems the most efficient), so I will introduce it here and then run a light analysis on the acquired data.


Past articles

・ Meteorology x Python ~ Automatic acquisition of AMeDAS location data ~ https://qiita.com/OSAKO/items/264c77b70843045bc12b
・ Meteorology x Python ~ Automatic acquisition of AMeDAS point data (extra edition) ~ https://qiita.com/OSAKO/items/505ecee67df424963e53
・ Meteorology x Ruby ~ Ruby scraping using Mechanize ~ https://qiita.com/OSAKO/items/3c1cac0b5448be9ab243

1. Crawling

▶ I want to grab all the numbers and strings embedded in tables like the following at once (example: data for Tokyo on January 1, 2019): https://www.data.jma.go.jp/obd/stats/etrn/view/daily_s1.php?prec_no=44&block_no=47662&year=2019&month=01&day=1&view=
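
Incidentally (this is not the method used below, just a sketch), pandas can often grab such tables directly, assuming pandas plus an HTML parser such as lxml is installed and the page encoding is detected correctly:

import pandas as pd

url = ('https://www.data.jma.go.jp/obd/stats/etrn/view/daily_s1.php'
       '?prec_no=44&block_no=47662&year=2019&month=01&day=1&view=')
tables = pd.read_html(url)  # parses every <table> on the page into a DataFrame
print(len(tables))          # the observation table is one of them; inspect to find it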

▶ First, implement a crawler to collect and list the urls of the 1000+ observation points.
https://www.data.jma.go.jp/obd/stats/etrn/index.php?prec_no=44&block_no=47662&year=&month=&day=&view=
↑ Example of the url we want to collect (**the prec_no and block_no ids differ by location**)

procedure

① First, from [this url](https://www.data.jma.go.jp/obd/stats/etrn/select/prefecture00.php?prec_no=&block_no=&year=&month=&day=&view=), collect the area urls we need to visit next: search for the href attributes under the area tags and list them (the get_area_link method in the code below).
② Using each area url listed in ① as a key, search the href attributes of the individual points in the same way to obtain the target urls, which contain the prec_no and block_no ids (the get_station_link method in the code below).
③ Finally, tidy the listed urls a little and save them to csv (the data_arange method in the code below).

program

get_amedas_station_url.py


# -*- coding: utf-8 -*-
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup

class Get_amedas_station:
  def __init__(self):
    url = 'https://www.data.jma.go.jp/obd/stats/etrn/select/prefecture00.php?prec_no=&block_no=&year=&month=&day=&view='
    html = urllib.request.urlopen(url)
    self.soup = BeautifulSoup(html, 'html.parser')
  
  def get_area_link(self):
    elements = self.soup.find_all('area')
    self.area_list = [element['alt'] for element in elements]
    self.area_link_list = [element['href'] for element in elements]

  def get_station_link(self):
    out = pd.DataFrame(columns=['station','url','area'])
    for area, area_link in zip(self.area_list, self.area_link_list):
      url = 'https://www.data.jma.go.jp/obd/stats/etrn/select/'+ area_link
      html = urllib.request.urlopen(url)
      soup = BeautifulSoup(html, 'html.parser')
      elements = soup.find_all('area')
      station_list = [element['alt'] for element in elements]
      station_link_list = [element['href'].strip('../') for element in elements]  # drop the leading '../' so the link can be joined to the base url
      df1 = pd.DataFrame(station_list,columns=['station'])
      df2 = pd.DataFrame(station_link_list,columns=['url'])
      df = pd.concat([df1, df2],axis=1).assign(area=area)
      out = pd.concat([out,df])
      print(area)
    self.out = out
  
  def data_arange(self):
    out = self.out[~self.out.duplicated()].assign(append='https://www.data.jma.go.jp/obd/stats/etrn/')
    out['amedas_url'] = out['append'] + out['url']
    out = out.loc[:,['area','station','amedas_url']]
    out.to_csv('amedas_url_list.csv',index=None, encoding='SJIS')

if __name__ == '__main__':
  amedas = Get_amedas_station()
  amedas.get_area_link()
  amedas.get_station_link()
  amedas.data_arange()
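
A quick way to sanity-check the crawler output (a sketch; the area and station names are stored exactly as they appear on the JMA pages):

import pandas as pd

urls = pd.read_csv('amedas_url_list.csv', encoding='SJIS')
print(len(urls))    # should list 1000+ points
print(urls.head())  # columns: area, station, amedas_url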

2. Scraping

▶ Now for the main subject, acquiring the AMeDAS data. This time I wrote a program that handles three time scales: 1-day, 1-hour, and 10-minute data (the code is relatively long and somewhat redundant, so it needs improvement).
▶ One caveat: the available observation items differ depending on which AMeDAS point is selected (**they differ depending on whether or not the point is a weather station**).


For example, let's compare the observation items between Tokyo and Hachioji at each time scale (see the sketch after this list).

**・ 1-day data**: Tokyo vs. Hachioji
**・ 1-hour data**: Tokyo vs. Hachioji
**・ 10-minute data**: Tokyo vs. Hachioji
(table screenshots omitted; the weather station, Tokyo, lists more observation items at every scale)
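
The program below tells the two cases apart by the length of the block_no id; a minimal sketch of that rule (the 4-digit id here is hypothetical):

# Weather stations have 5-digit block_no ids and use the s1-type pages;
# other AMeDAS points have 4-digit ids and use the a1-type pages.
def is_weather_station(block_no: str) -> bool:
    return len(block_no) == 5

print(is_weather_station('47662'))  # Tokyo -> True (s1 pages, more items)
print(is_weather_station('1001'))   # hypothetical 4-digit id -> False (a1 pages)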

program

get_amedas_data.py


# -*- coding: utf-8 -*-
import re
import time
import numpy as np
import pandas as pd
import urllib.request
from bs4 import BeautifulSoup
from datetime import timedelta
from datetime import datetime as dt
from dateutil.relativedelta import relativedelta

# Fetch the url of an AMeDAS point and return the rows of its data table as lists of cell strings
def search_data(url):
  html = urllib.request.urlopen(url)
  time.sleep(1)
  soup = BeautifulSoup(html, 'html.parser')
  element = soup.find_all('tr', attrs={'class':'mtx', 'style':'text-align:right;'})
  out = [list(map(lambda x: x.text, ele)) for ele in element]
  return out

class Get_amedas_data:
  def __init__(self,area_name,station_name):
    # Look up the target url for the specified area and station
    self.st_name = station_name
    self.a_name = area_name
    amedas_url_list = pd.read_csv('amedas_url_list.csv',encoding='SJIS')
    df = amedas_url_list[(amedas_url_list['area']==self.a_name) & (amedas_url_list['station']==self.st_name)]
    amedas_url = df.iat[0,2]
    # Extract the prec_no and block_no ids from the url with a regular expression
    pattern=r'([+-]?[0-9]+\.?[0-9]*)'
    id_list=re.findall(pattern, amedas_url)
    self.pre_id = id_list[0]
    self.s_id = id_list[1]

  # Daily data can be fetched a whole month at a time, so specify the start and end months (yyyymm)
  def set_date1(self, startmonth, endmonth):
    self.start = startmonth
    self.end = endmonth
    strdt = dt.strptime(self.start, '%Y%m')
    enddt = dt.strptime(self.end, '%Y%m')
    months_num = (enddt.year - strdt.year)*12 + enddt.month - strdt.month + 1
    #List the months from the start month to the end month
    self.datelist = map(lambda x, y=strdt: y + relativedelta(months=x), range(months_num))
  
  # For hourly or 10-minute data, specify the start and end dates (yyyymmdd)
  def set_date2(self,startdate,enddate):
    self.start = startdate
    self.end = enddate
    strdt = dt.strptime(self.start, '%Y%m%d')
    enddt = dt.strptime(self.end, '%Y%m%d')
    days_num = (enddt - strdt).days + 1
    #List dates from start date to end date
    self.datelist = map(lambda x, y=strdt: y + timedelta(days=x), range(days_num))

  # Create the empty data frames in advance; weather-station points provide more observation items
  # type specifies the time scale to fetch ('daily', 'hourly' or '10min')
  def dl_data(self, type):
    # Empty data frames for daily data
    data1  = pd.DataFrame(columns=['Year / month','Day','Average local pressure','Average sea level pressure','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average humidity','Minimum humidity','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Sunshine duration','Snowfall','Deepest snow','Weather overview (noon)','Weather overview (night)'])
    data1_ = pd.DataFrame(columns=['Year / month','Day','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Most frequent wind direction','Sunshine duration','Snowfall','Deepest snow'])
    # Empty data frames for hourly data
    data2  = pd.DataFrame(columns=['date','Time','Local barometric pressure','Sea level pressure','Precipitation','temperature','Dew point temperature','Vapor pressure','Humidity','wind speed','Wind direction','Sunshine duration','Total solar radiation','snowfall','Snow cover','weather','Cloud cover','Visibility'])
    data2_ = pd.DataFrame(columns=['date','Time','Precipitation','temperature','wind speed','Wind direction','Sunshine duration','snowfall','Snow cover'])
    # Empty data frames for 10-minute data
    data3  = pd.DataFrame(columns=['date','Hour and minute','Local barometric pressure','Sea level pressure','Precipitation','temperature','Relative humidity','Average wind speed','Average wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Daylight hours'])
    data3_ = pd.DataFrame(columns=['date','Hour and minute','Precipitation','temperature','Average wind speed','Average wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Daylight hours'])

    # Loop over the listed months or dates, fetching the data and concatenating it vertically
    for date in self.datelist:  # renamed from dt, which shadowed the datetime import
      d = date.strftime("%Y%m%d")
      yyyy = d[0:4]
      mm = d[4:6]
      dd = d[6:8]

      if type=='daily':
        # Weather stations have a 5-digit block_no
        if len(self.s_id) == 5:
          pattern = 's1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          df = pd.DataFrame(out, columns=['Day','Average local pressure','Average sea level pressure','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average humidity','Minimum humidity','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Sunshine duration','Snowfall','Deepest snow','Weather overview (noon)','Weather overview (night)'])
          df['Year / month'] = f'{yyyy}{mm}'  # the column name contains spaces, so assign directly rather than via .assign()
          df = df.loc[:,['Year / month','Day','Average local pressure','Average sea level pressure','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average humidity','Minimum humidity','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Sunshine duration','Snowfall','Deepest snow','Weather overview (noon)','Weather overview (night)']]
          data1 = pd.concat([data1, df])
          data1.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')
        else:
          pattern = 'a1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          df = pd.DataFrame(out, columns=['Day','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Most frequent wind direction','Sunshine duration','Snowfall','Deepest snow'])
          df['Year / month'] = f'{yyyy}{mm}'
          df = df.loc[:,['Year / month','Day','Daily precipitation','Max 1-hour precipitation','Max 10-min precipitation','Average temperature','Highest temperature','Lowest temperature','Average wind speed','Maximum wind speed','Maximum wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Most frequent wind direction','Sunshine duration','Snowfall','Deepest snow']]
          data1_ = pd.concat([data1_, df])
          data1_.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')

      elif type=='hourly':
        if len(self.s_id) == 5:
          pattern = 's1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          # Add a date column (24 rows, one per hour)
          date = pd.DataFrame((np.full([24,1], f'{yyyy}{mm}{dd}')),columns=['date'])
          df = pd.DataFrame(out, columns=['Time','Local barometric pressure','Sea level pressure','Precipitation','temperature','Dew point temperature','Vapor pressure','Humidity','wind speed','Wind direction','Sunshine duration','Total solar radiation','snowfall','Snow cover','weather','Cloud cover','Visibility'])
          df = pd.concat([date, df],axis=1)
          data2 = pd.concat([data2, df])
          data2.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')
        else:
          pattern = 'a1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          date = pd.DataFrame((np.full([24,1], f'{yyyy}{mm}{dd}')),columns=['date'])
          df = pd.DataFrame(out, columns=['Time','Precipitation','temperature','wind speed','Wind direction','Sunshine duration','snowfall','Snow cover'])
          df = pd.concat([date, df],axis=1)
          data2_ = pd.concat([data2_, df])
          data2_.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')

      elif type=='10min':
        if len(self.s_id) == 5:
          pattern = 's1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          # Add a date column (144 rows = 24 hours x 6 per hour)
          date = pd.DataFrame((np.full([144,1], f'{yyyy}{mm}{dd}')),columns=['date'])
          df = pd.DataFrame(out, columns=['Hour and minute','Local barometric pressure','Sea level pressure','Precipitation','temperature','Relative humidity','Average wind speed','Average wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Daylight hours'])
          df = pd.concat([date, df],axis=1)
          data3 = pd.concat([data3, df])
          data3.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')
        else:
          pattern = 'a1'
          url = f'https://www.data.jma.go.jp/obd/stats/etrn/view/{type}_{pattern}.php?prec_no={self.pre_id}&block_no={self.s_id}&year={yyyy}&month={mm}&day={dd}&view=p1'
          out = search_data(url)
          date = pd.DataFrame((np.full([144,1], f'{yyyy}{mm}{dd}')),columns=['date'])
          df = pd.DataFrame(out, columns=['Hour and minute','Precipitation','temperature','Average wind speed','Average wind direction','Maximum instantaneous wind speed','Maximum instantaneous wind direction','Daylight hours'])
          df = pd.concat([date, df],axis=1)
          data3_ = pd.concat([data3_, df])
          data3_.to_csv(f'{self.a_name}_{self.st_name}_{self.start}-{self.end}_{type}.csv',index=None, encoding='SJIS')

      print(f'{self.a_name}_{self.st_name}_{yyyy}-{mm}-{dd}_{type}')
    print(f'Downloaded the {type} data for {self.a_name}_{self.st_name} from {self.start} to {self.end}.')


if __name__ == '__main__':
  a_name = input('Enter the region you want to download:')
  st_name = input('Enter the location you want to download:')
  amedas = Get_amedas_data(a_name,st_name)
  type = input('Select a time scale (daily or hourly or 10min):')
  if type == 'daily':
    start= input('Enter the start month (yyyymm):')
    end = input('Enter the end month (yyyymm):')
    amedas.set_date1(start,end)
  else:
    start= input('Enter the start date (yyyymmdd):')
    end = input('Enter the end date (yyyymmdd):')
    amedas.set_date2(start,end)
  
  amedas.dl_data(type)
  

Region: Tokyo Point: Tokyo Time scale: 10min Start date: 20190401 End date: 20190731

I'll try downloading the data under the conditions above.
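
For reference, the same run can be scripted instead of typed at the prompts; a minimal sketch (the area and station strings must match the entries in amedas_url_list.csv):

amedas = Get_amedas_data('Tokyo', 'Tokyo')  # area and station as stored in the csv
amedas.set_date2('20190401', '20190731')    # hourly/10min scales use set_date2
amedas.dl_data('10min')                     # writes Tokyo_Tokyo_20190401-20190731_10min.csv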

The contents of the saved "Tokyo_Tokyo_20190401-20190731_10min.csv" file look like this: ![image.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/354874/10817920-dcbe-ba75-a2f8-34071b89d19d.png)

3. Spectrum analysis

▶ Since we now have detailed data at a 10-minute time resolution, let's analyze the time series a little.
▶ This time, spectrum analysis was applied to the temperature fluctuations over the acquired period to compute the power spectrum. If you want to get a feel for the Fourier transform, [this video](https://www.youtube.com/watch?v=bjBZEKdlLD0&t=339s) may be easy to understand.
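
Before reading the program, here is a minimal sketch of how fftfreq maps FFT bins to periods, assuming a complete record of 122 days sampled every 10 minutes:

import numpy as np
from numpy.fft import fftfreq

n = 144 * 122                             # 10-minute samples in a 122-day record
freq = fftfreq(n, d=600.0)                # bin frequencies in Hz (interval d = 600 s)
period_hours = 1 / freq[1:n // 2] / 3600  # positive bins converted to periods in hours
print(period_hours[0] / 24)               # -> 122.0: the lowest bin spans the whole record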

fft_temp.py


# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.fft import fftfreq

class FFT:
  def __init__(self):
    df = pd.read_csv('Tokyo_Tokyo_20190401-20190731_10min.csv', encoding='SJIS').replace(['--','×'],-99)
    df['temperature'] = df['temperature'].astype(float)
    # Fill missing values (-99) with the mean of the valid temperatures
    mean_temp = df.loc[df['temperature'] > -99, 'temperature'].mean()
    y = df['temperature'].replace(-99, mean_temp).values
    # Subtract the mean and apply a Hamming window before the FFT
    y = (y - np.mean(y)) * np.hamming(len(y))
    '''
    The second argument d of fftfreq is the sampling interval in seconds (d = 1/fs).
    The data are at 10-minute intervals, so fs = 1/600 Hz and d = 600 s.
    '''
    self.z = np.fft.fft(y)
    self.freq = fftfreq(len(y), d=600.0)


  def plot(self, n_samples):
    fig, axes = plt.subplots(figsize=(8, 4), ncols=2, sharey=True)
    ax = axes[0]
    ax.plot(self.freq[1:int(n_samples/2)], abs(self.z[1:int(n_samples/2)]))
    ax.set_yscale('log')
    ax.set_xscale('log')
    ax.set_ylabel('Power')
    ax.set_xlabel('Frequency (Hz)')
    ax = axes[1]
    ax.plot((1 / self.freq[1:int(n_samples / 2)])/(60*60), abs(self.z[1:int(n_samples / 2)]))
    ax.set_yscale('log')
    ax.set_xscale('log')
    ax.set_xlabel('Cycle (time)')
    plt.show()
  
  def periodic_characteristics(self, n_samples):
    fft_pow_df = pd.DataFrame([(1 / self.freq[1:int(n_samples / 2)])/(60*60), np.log10(abs(self.z[1:int(n_samples / 2)]))], index=['Cycle (time)', 'log10_power']).T
    fft_pow_df = fft_pow_df.sort_values('log10_power', ascending=False).head(10).reset_index()
    print(fft_pow_df.loc[:, ['Cycle (time)', 'log10_power']])

if __name__ == '__main__':
  fft = FFT()
  fft.periodic_characteristics(512)
  fft.plot(512)

The 122-day cycle, which has the largest power, doesn't mean much by itself (122 days is simply the length of the whole record), but as expected the peaks around the 23-24-hour cycle also stand out, so the result seems reasonable: temperature has a strong diurnal character, rising from morning to midday and falling again through the night. I would appreciate it if you could point out any mistakes.
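
As a quick sanity check of the pipeline (my own sketch, not part of the original analysis): feeding a pure 24-hour sine wave sampled every 10 minutes through the same steps puts the peak exactly at a 24-hour period.

import numpy as np
from numpy.fft import fft, fftfreq

n = 144 * 122                            # 122 days of 10-minute samples
t = np.arange(n) * 600.0                 # time in seconds
y = np.sin(2 * np.pi * t / (24 * 3600))  # pure 24-hour oscillation
y = (y - y.mean()) * np.hamming(n)       # same detrend + window as above
z = fft(y)
freq = fftfreq(n, d=600.0)
peak = np.argmax(abs(z[1:n // 2])) + 1   # skip the zero-frequency bin
print((1 / freq[peak]) / 3600)           # -> 24.0 (hours)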

That's all!
