The other day, the schedule for the 2020 Meiji Yasuda Life J.League was announced (see the release). The release provides the J1, J2, and J3 league dates in PDF format. In addition, the J.League publishes a variety of data on several sites, organized by match, team, and player.
This post uses `read_html` from `pandas` to parse the page reached from the "Schedule / Results" menu on the site above, instead of scraping the PDF. The data is easy to obtain this way.
https://data.j-league.or.jp/SFMS01/search?competition_years=2020&competition_frame_ids=1&competition_ids=477&tv_relay_station_name=
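Before running against the live URL, here is a minimal sketch of how `read_html` picks out tables by attribute, using a toy inline HTML snippet (the table contents here are made up for illustration):

```python
from io import StringIO

import pandas as pd

# A toy page with two tables; only one carries the class the J.League page uses
html = """
<table class="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
<table class="table-base00 search-table">
  <tr><th>home</th><th>Away</th></tr>
  <tr><td>FC Tokyo</td><td>Kawasaki</td></tr>
</table>
"""

# attrs filters to tables whose attributes match, so the "other" table is skipped
tables = pd.read_html(StringIO(html), attrs={'class': 'table-base00 search-table'})
print(tables[0])
```

`read_html` always returns a *list* of DataFrames, one per matching table, which is why the main script below indexes the result with `df[0]`.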
game_schedule.py
# coding: utf-8
import pandas as pd
yyyy = 2020
url = 'https://data.j-league.or.jp/SFMS01/search?'
category = {'1': 477, '2': 478, '3': 479}
schedule = pd.DataFrame(index=None, columns=['year', 'Tournament', 'section', 'Match day', 'K/O time', 'home', 'Score', 'Away', 'Stadium', 'Number of visitors', 'Internet broadcasting / TV broadcasting'])
Create a dict mapping the J1/J2/J3 category to that year's competition ID, and create an empty DataFrame to accumulate the results.
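As an aside, the query string that the loop below builds by manual concatenation can equivalently be assembled with `urllib.parse.urlencode`; a sketch, using the parameter names visible in the URL above:

```python
from urllib.parse import urlencode

yyyy = 2020
base = 'https://data.j-league.or.jp/SFMS01/search?'
category = {'1': 477, '2': 478, '3': 479}

urls = []
for key, value in category.items():
    params = {
        'competition_years': yyyy,
        'competition_frame_ids': key,
        'competition_ids': value,
        'tv_relay_station_name': '',
    }
    # urlencode joins the parameters with '&' and handles escaping
    urls.append(base + urlencode(params))

print(urls[0])
```

This produces exactly the URL shown earlier for J1, and avoids keeping track of where each `&` goes.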
game_schedule.py
for key, value in category.items():
    para = 'competition_years=' + str(yyyy)
    para1 = '&competition_frame_ids=' + str(key)
    para2 = '&competition_ids=' + str(value)
    para3 = '&tv_relay_station_name='
    full_url = url + para + para1 + para2 + para3
    # print(full_url)
    df = pd.read_html(full_url, attrs={'class': 'table-base00 search-table'}, skiprows=0)
    schedule = pd.concat([schedule, df[0]], sort=False)
The point is `pd.read_html(full_url, attrs={'class': 'table-base00 search-table'}, ...)`, which specifies the target URL and the attributes of the `<table>` element to extract. Each retrieved table is then combined into `schedule`.
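The accumulation step behaves like this on toy frames (a sketch with made-up rows, not the real schedule data):

```python
import pandas as pd

# Empty frame with the target columns, as in the script
schedule = pd.DataFrame(columns=['home', 'Away'])

frames = [
    pd.DataFrame({'home': ['FC Tokyo'], 'Away': ['Kawasaki']}),
    pd.DataFrame({'home': ['Urawa'], 'Away': ['Gamba Osaka']}),
]

for df in frames:
    # sort=False keeps the existing column order instead of sorting alphabetically
    schedule = pd.concat([schedule, df], sort=False)

print(len(schedule))
```

Each pass through the loop appends one division's rows, so after the J1/J2/J3 iterations `schedule` holds the full season.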
game_schedule.py
# If you want to replace NaN
# schedule = schedule.fillna({'K/O time': '● Undecided ●', 'Number of visitors': 0})
schedule.to_csv('./csv/Game_Schedule_' + str(yyyy) + '.csv', index=False, sep=',')
The result is saved in CSV format in the specified folder.
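A quick round-trip check, with a toy frame standing in for the scraped schedule and a temporary directory standing in for the `./csv/` folder:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the scraped schedule
schedule = pd.DataFrame({'home': ['FC Tokyo'], 'Away': ['Kawasaki']})

path = os.path.join(tempfile.mkdtemp(), 'Game_Schedule_2020.csv')
# index=False drops the row index so it doesn't appear as an extra column
schedule.to_csv(path, index=False, sep=',')

# Reading the file back restores the same columns and rows
restored = pd.read_csv(path)
print(restored.columns.tolist())
```

Without `index=False`, `read_csv` would see an unnamed leading column holding the old index values.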
A `<table>` can be retrieved this easily with pandas' `read_html`.