The other day, the schedule for the 2020 Meiji Yasuda Life J.League was announced (see the release). The release provides the J1, J2, and J3 league dates in PDF format. In addition, the J.League publishes a variety of data on several sites, organized by match, team, and player.
This post uses `read_html` from `pandas` to parse the page reached from the "Schedule / Results" menu on the site above, instead of scraping the PDF. The data is easy to obtain this way.
https://data.j-league.or.jp/SFMS01/search?competition_years=2020&competition_frame_ids=1&competition_ids=477&tv_relay_station_name=
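Before running against the live URL, here is a minimal sketch of how `read_html` picks out tables by attribute, using a toy inline HTML snippet (the table contents here are made up for illustration):

```python
from io import StringIO

import pandas as pd

# A toy page with two tables; only one carries the class the J.League page uses
html = """
<table class="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table class="table-base00 search-table">
  <tr><th>home</th><th>Away</th></tr>
  <tr><td>FC Tokyo</td><td>Kawasaki</td></tr>
</table>
"""

# attrs filters to tables whose attributes match, so the "other" table is skipped
tables = pd.read_html(StringIO(html), attrs={'class': 'table-base00 search-table'})
print(tables[0])
```

`read_html` always returns a *list* of DataFrames, one per matching table, which is why the main script below indexes the result with `df[0]`.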
game_schedule.py
# coding: utf-8
import pandas as pd
yyyy = 2020
url = 'https://data.j-league.or.jp/SFMS01/search?'
category = {'1': 477, '2': 478, '3': 479}
schedule = pd.DataFrame(index=None, columns=['year', 'Tournament', 'section', 'Match day', 'K/O time', 'home', 'Score', 'Away', 'Stadium', 'Number of visitors', 'Internet broadcasting / TV broadcasting'])
Create a dict mapping the J1/J2/J3 category to that year's competition ID, and create an empty DataFrame to accumulate the results.
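As an aside, the query string that the loop below builds by manual concatenation can equivalently be assembled with `urllib.parse.urlencode`; a sketch, using the parameter names visible in the URL above:

```python
from urllib.parse import urlencode

yyyy = 2020
base = 'https://data.j-league.or.jp/SFMS01/search?'
category = {'1': 477, '2': 478, '3': 479}

urls = []
for key, value in category.items():
    params = {
        'competition_years': yyyy,
        'competition_frame_ids': key,
        'competition_ids': value,
        'tv_relay_station_name': '',
    }
    # urlencode joins the parameters with '&' and handles escaping
    urls.append(base + urlencode(params))

print(urls[0])
```

This produces exactly the URL shown earlier for J1, and avoids keeping track of where each `&` goes.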
game_schedule.py
for key, value in category.items():
    para = 'competition_years=' + str(yyyy)
    para1 = '&competition_frame_ids=' + str(key)
    para2 = '&competition_ids=' + str(value)
    para3 = '&tv_relay_station_name='
    full_url = url + para + para1 + para2 + para3
    # print(full_url)
    df = pd.read_html(full_url, attrs={'class': 'table-base00 search-table'}, skiprows=0)
    schedule = pd.concat([schedule, df[0]], sort=False)
The point is `pd.read_html(full_url, attrs={'class': 'table-base00 search-table'}, ...)`, which specifies the target URL and the attributes of the `<table>` element to extract. Each retrieved table is then combined into `schedule`.
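The accumulation step behaves like this on toy frames (a sketch with made-up rows, not the real schedule data):

```python
import pandas as pd

# Empty frame with the target columns, as in the script
schedule = pd.DataFrame(columns=['home', 'Away'])

frames = [
    pd.DataFrame({'home': ['FC Tokyo'], 'Away': ['Kawasaki']}),
    pd.DataFrame({'home': ['Urawa'], 'Away': ['Gamba Osaka']}),
]

for df in frames:
    # sort=False keeps the existing column order instead of sorting alphabetically
    schedule = pd.concat([schedule, df], sort=False)

print(len(schedule))
```

Each pass through the loop appends one division's rows, so after the J1/J2/J3 iterations `schedule` holds the full season.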
game_schedule.py
# If you want to replace NaN
# schedule = schedule.fillna({'K/O time': '● Undecided ●', 'Number of visitors': 0})
schedule.to_csv('./csv/Game_Schedule_' + str(yyyy) + '.csv', index=False, sep=',')
The result is saved in CSV format in the specified folder.
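A quick round-trip check, with a toy frame standing in for the scraped schedule and a temporary directory standing in for the `./csv/` folder:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the scraped schedule
schedule = pd.DataFrame({'home': ['FC Tokyo'], 'Away': ['Kawasaki']})

path = os.path.join(tempfile.mkdtemp(), 'Game_Schedule_2020.csv')
# index=False drops the row index so it doesn't appear as an extra column
schedule.to_csv(path, index=False, sep=',')

# Reading the file back restores the same columns and rows
restored = pd.read_csv(path)
print(restored.columns.tolist())
```

Without `index=False`, `read_csv` would see an unnamed leading column holding the old index values.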
A `<table>` can be retrieved this easily with pandas' `read_html`.