[PYTHON] How to scrape horse racing data using pandas read_html

Purpose

Predict horse racing results with machine learning, aiming for a recovery rate (payout divided by the amount bet) of 100%.

What to do this time

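This time we scrape all 2019 race results from netkeiba.com. Data held in a table tag can be scraped in one line with pandas read_html, which is convenient. read_html returns a list of DataFrames, one per table found on the page, so [0] selects the first table, which here is the race result.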

pd.read_html("https://db.netkeiba.com/race/201902010101")[0]
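
To see what read_html returns without touching the site, the sketch below parses a small hand-written HTML string instead of a netkeiba.com page. The table contents and column names are made up for illustration, and read_html needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables; this is not netkeiba.com markup.
html = """
<table>
  <thead><tr><th>rank</th><th>horse</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>A</td></tr>
    <tr><td>2</td><td>B</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>key</th><th>value</th></tr></thead>
  <tbody><tr><td>x</td><td>10</td></tr></tbody>
</table>
"""

# read_html returns one DataFrame per <table>, in document order.
tables = pd.read_html(StringIO(html))
first_table = tables[0]
print(len(tables))
print(first_table)
```

Wrapping the string in StringIO keeps the call working on recent pandas versions, which warn when a literal HTML string is passed directly.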


Source code

netkeiba.com assigns a race_id to each race, so we write a function that takes a list of race_ids, scrapes the result table of each race, and returns them together as a dictionary keyed by race_id.

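import time

import pandas as pd
from tqdm.notebook import tqdm  # progress bar for Jupyter notebooks

def scrape_race_results(race_id_list, pre_race_results=None):
    """Scrape the result table of each race and return {race_id: DataFrame}."""
    # Copy the earlier results rather than using a mutable default argument,
    # so repeated calls do not silently share one dictionary.
    race_results = dict(pre_race_results) if pre_race_results else {}
    for race_id in tqdm(race_id_list):
        if race_id in race_results:
            continue  # already scraped, e.g. in an earlier interrupted run
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            race_results[race_id] = pd.read_html(url)[0]
            time.sleep(1)  # wait one second between requests to limit load on the site
        except IndexError:
            continue  # skip this race_id (presumably no result table) and move on
        except (Exception, KeyboardInterrupt):
            break  # on any other error or a manual stop, return what was collected
    return race_results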
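
The skip-and-resume behaviour of this loop can be checked offline. The sketch below is not the function above: it repeats the same loop shape with the page fetch passed in as a parameter, so a stand-in can replace pd.read_html. The names scrape_with and fake_fetch are hypothetical and exist only for this illustration.

```python
import pandas as pd

def scrape_with(fetch, race_id_list, pre_race_results=None):
    """Same loop shape as scrape_race_results, with the page fetch injected."""
    race_results = dict(pre_race_results or {})
    for race_id in race_id_list:
        if race_id in race_results:
            continue  # already scraped in an earlier run
        try:
            race_results[race_id] = fetch(race_id)
        except IndexError:
            continue  # treated as "no result table for this race_id"
    return race_results

calls = []  # records which race_ids were actually fetched

def fake_fetch(race_id):
    # Hypothetical stand-in for pd.read_html(url)[0]; no network access.
    calls.append(race_id)
    if race_id.endswith("99"):
        raise IndexError("no table on page")
    return pd.DataFrame({"rank": [1, 2]})

first_run = scrape_with(fake_fetch, ["201901010101", "201901010199"])
# Passing the earlier results back in skips races that are already present.
second_run = scrape_with(fake_fetch, ["201901010101", "201901010102"], first_run)
print(sorted(second_run))
print(calls)
```

In the second call only 201901010102 is fetched, which is the point of the pre_race_results argument: an interrupted run can be continued without requesting the same pages again.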

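This time we want the results of every race in 2019, so we build a list of all candidate 2019 race_ids. As the loops below show, a race_id is the year followed by four zero-padded two-digit fields: place (1 to 10), kai (1 to 5), day (1 to 8), and race number (1 to 12), which gives 4,800 candidates. Not every combination necessarily corresponds to a real race; the function above moves on to the next ID when read_html raises IndexError.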

race_id_list = []
for place in range(1, 11, 1):
    for kai in range(1, 6, 1):
        for day in range(1, 9, 1):
            for r in range(1, 13, 1):
                race_id = (
                    "2019"
                    + str(place).zfill(2)
                    + str(kai).zfill(2)
                    + str(day).zfill(2)
                    + str(r).zfill(2)
                )
                race_id_list.append(race_id)
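
As a cross-check on the nested loops, the same candidate list can be sketched with itertools.product, together with a helper that splits a race_id back into its fields. The function names build_race_ids and split_race_id are made up for this example, and the field layout is simply the one implied by the loops above.

```python
from itertools import product

def build_race_ids(year, places=range(1, 11), kais=range(1, 6),
                   days=range(1, 9), races=range(1, 13)):
    """Build candidate race_ids: year + place + kai + day + race, each zero-padded."""
    return [
        f"{year}{place:02d}{kai:02d}{day:02d}{r:02d}"
        for place, kai, day, r in product(places, kais, days, races)
    ]

def split_race_id(race_id):
    """Split a 12-character race_id into its five fields."""
    return {
        "year": race_id[:4],
        "place": race_id[4:6],
        "kai": race_id[6:8],
        "day": race_id[8:10],
        "race": race_id[10:12],
    }

candidate_ids = build_race_ids(2019)
print(len(candidate_ids))          # 10 * 5 * 8 * 12 = 4800
print(candidate_ids[0])            # 201901010101
print(split_race_id(candidate_ids[-1]))
```

product iterates in the same order as the nested for loops, so the resulting list matches race_id_list element for element.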

After scraping, set the index of each table to its race_id, concatenate the tables into a single pandas DataFrame, and save it as a pickle file.

results = scrape_race_results(race_id_list)
for key in results:
    results[key].index = [key] * len(results[key])
results = pd.concat([results[key] for key in results], sort=False)
results.to_pickle('results.pickle')
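
The effect of the index assignment is easier to see on toy data. The sketch below uses two made-up result tables in place of scraped ones, combines them the same way, looks up one race by its race_id, and round-trips the result through a pickle file in a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Made-up tables standing in for scraped race results.
toy = {
    "201901010101": pd.DataFrame({"rank": [1, 2], "horse": ["A", "B"]}),
    "201901010102": pd.DataFrame({"rank": [1, 2, 3], "horse": ["C", "D", "E"]}),
}

# Every row of a table gets that table's race_id as its index label.
for race_id, df in toy.items():
    df.index = [race_id] * len(df)
combined = pd.concat(list(toy.values()), sort=False)

# All rows of one race can then be selected by race_id.
one_race = combined.loc["201901010102"]
print(one_race)

# Save and reload to confirm the pickle preserves the DataFrame.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.pickle")
    combined.to_pickle(path)
    restored = pd.read_pickle(path)
print(restored.equals(combined))
```

Because the index labels repeat within a race, combined.loc[race_id] returns a DataFrame with one row per horse rather than a single row.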

The next article uses BeautifulSoup to scrape detailed data such as race dates and weather. This is also explained in detail in the video: Data analysis and machine learning starting with horse racing prediction
