[PYTHON] How to scrape horse racing data with BeautifulSoup


This series uses machine learning to predict horse races, aiming for a recovery rate (payout relative to stake) of 100%.

What to do this time

In the previous article, I scraped the results of every 2019 race from netkeiba.com. This time, I would like to additionally scrape data such as the race date and track condition.

Source code

As last time, we create a function that takes a list of race_id values and returns the scraping result for each race as a dictionary.

import requests
from bs4 import BeautifulSoup
import time
from tqdm.notebook import tqdm
import re

def scrape_race_info(race_id_list):
    """Return {race_id: info_dict} for each race page on db.netkeiba.com."""
    race_infos = {}
    for race_id in tqdm(race_id_list):
        try:
            url = "https://db.netkeiba.com/race/" + race_id
            html = requests.get(url)
            html.encoding = "EUC-JP"
            soup = BeautifulSoup(html.text, "html.parser")

            # The first two <p> tags in div.data_intro hold the course and date text
            paragraphs = soup.find("div", attrs={"class": "data_intro"}).find_all("p")
            texts = paragraphs[0].text + paragraphs[1].text
            # Split into word tokens; labels are matched in Japanese, as the site displays them
            info = re.findall(r"\w+", texts)
            info_dict = {}
            for text in info:
                if text in ["芝", "ダート"]:  # turf, dirt
                    info_dict["race_type"] = text
                if "障" in text:  # obstacle (jump) race
                    info_dict["race_type"] = "障害"
                if "m" in text:  # course length, e.g. a token ending in "1600m"
                    info_dict["course_len"] = int(re.findall(r"\d+", text)[0])
                if text in ["良", "稍重", "重", "不良"]:  # track condition
                    info_dict["ground_state"] = text
                if text in ["曇", "晴", "雨", "小雨", "小雪", "雪"]:  # weather
                    info_dict["weather"] = text
                if "年" in text:  # date token containing the "year" character
                    info_dict["date"] = text
            race_infos[race_id] = info_dict
            time.sleep(1)  # pause between requests to avoid overloading the server
        except IndexError:
            # The expected <p> tags or digits are missing; skip this race
            continue
        except Exception as e:
            # Unexpected error: report it and stop, keeping what was scraped so far
            print(e)
            break
    return race_infos
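
To see how the token matching behaves without hitting the site, the parsing step can be tried on a plain string. The sketch below is only an illustration: `parse_race_info` is a hypothetical helper that repeats the loop from the function above, the labels are written in Japanese on the assumption that this is how the page displays them, and the sample header text is made up rather than copied from a real race page.

```python
import re

def parse_race_info(texts):
    """Apply the article's token rules to a block of header text (hypothetical helper)."""
    info_dict = {}
    for text in re.findall(r"\w+", texts):
        if text in ["芝", "ダート"]:  # turf, dirt
            info_dict["race_type"] = text
        if "障" in text:  # obstacle race
            info_dict["race_type"] = "障害"
        if "m" in text:  # course length token
            info_dict["course_len"] = int(re.findall(r"\d+", text)[0])
        if text in ["良", "稍重", "重", "不良"]:  # track condition
            info_dict["ground_state"] = text
        if text in ["曇", "晴", "雨", "小雨", "小雪", "雪"]:  # weather
            info_dict["weather"] = text
        if "年" in text:  # date token
            info_dict["date"] = text
    return info_dict

# Made-up header text for illustration only (assumed layout, not real page output)
sample = "芝右1600m / 天候 : 晴 / 芝 : 良 / 発走 : 15:40 2019年12月28日 5回中山9日目"

tokens = re.findall(r"\w+", sample)  # punctuation and spaces act as separators
parsed = parse_race_info(sample)
print(tokens)
print(parsed)
```

Because `\w+` treats the slashes, colons, and spaces as separators, each label arrives as its own token, which is why simple membership tests are enough here.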

Create race_id_list from the previously scraped data, convert the result to a DataFrame as we did last time, and merge it into the original data.

import pandas as pd

race_id_list = results.index.unique()
race_infos = scrape_race_info(race_id_list)
# Build a one-row DataFrame per race, indexed by race_id, and stack them
race_infos = pd.concat([pd.DataFrame(race_infos[key], index=[key]) for key in race_infos])
# Left-join on the race_id index so every result row gains its race information
results = results.merge(race_infos, left_index=True, right_index=True, how='left')
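
The conversion and merge can be checked on toy data before running the full scrape. This is a minimal sketch with invented values: the race_id strings, the `horse_name` column, and the `info_df` and `merged` names are hypothetical. It uses `pd.DataFrame.from_dict(..., orient="index")` as one possible alternative to concatenating one-row frames, not as the article's method.

```python
import pandas as pd

# Toy stand-in for the race results: one row per horse, indexed by race_id
results = pd.DataFrame(
    {"horse_name": ["A", "B", "C"]},
    index=["201901010101", "201901010101", "201901010102"],
)

# Toy stand-in for the scraper output: {race_id: info_dict}
race_infos = {
    "201901010101": {"race_type": "芝", "course_len": 1600},
    "201901010102": {"race_type": "ダート", "course_len": 1200},
}

# Outer keys become the index, inner keys become the columns
info_df = pd.DataFrame.from_dict(race_infos, orient="index")

# A left join on the index repeats each race's information for every horse in it
merged = results.merge(info_df, left_index=True, right_index=True, how="left")
print(merged)
```

With `how="left"`, rows in `results` whose race_id is missing from the scraped data are kept and simply get NaN in the new columns.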

The completed data looks like this. [Screenshot: the merged results DataFrame]

A detailed explanation is available in the video series "Data analysis and machine learning starting with horse racing prediction".

