[PYTHON] Get boat race match information by web scraping

Introduction

This article explains the internal code of "Today, do you have a good prediction?", a boat race triple (trifecta) prediction site that I built and released on the web. This time I summarize the web scraping part.

What kind of information do you want and where do you get it?

I want to build a triple prediction site for boat races using machine learning, so I need to obtain past race results as training data. The minimum information I want is probably:

- Boat racecourse

Other information I would like includes:

- Race number
- Weather on the day
- Motor information

and so on. The official Boat Race website now has well-maintained data, and past race results can be looked up there as well.

This time, I will get the race results that become the source of the training data from there!

Understand the URL structure

As background on boat racing: races are held essentially 365 days a year at some of the 24 racecourses. So, once I understood the URL structure, I decided to fetch race information for (the desired number of days) x (24 racecourses). (If no race was held, that combination is skipped.)

I worked out the URL structure and prepared a list to hold the URLs as follows. The code only covers 2020/6/22, but the idea is that by adding entries to the year, month, and day lists you can build URLs for other dates as well.

import pandas as pd
import numpy as np

list = []  # Holds the URLs to fetch (note: this name shadows Python's built-in list)
year = ['2020']
month = ['06']
day = ['22']
# Codes '01' to '24', one per racecourse
site = [f'{n:02d}' for n in range(1, 25)]

for i in year:
    for j in month:
        for k in day:
            for l in site:
                # Build the URL from i (year), j (month), k (day), and l (racecourse code)
                list.append("url name is described here")
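The idea of extending the year, month, and day lists can also be sketched with real calendar dates. The following is only a minimal sketch: URL_TEMPLATE is a hypothetical placeholder (the real URL pattern is not given in this article), and date_range and build_urls are helper names made up for illustration. Iterating over actual dates avoids combinations such as 02/31, which nested year/month/day lists would happily produce.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical placeholder: NOT the real URL pattern, which this article leaves out.
URL_TEMPLATE = "https://example.com/results?date={ymd}&site={site}"


def date_range(start, end):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_urls(start, end, template=URL_TEMPLATE):
    """Return one URL per (day, racecourse) pair."""
    sites = [f"{n:02d}" for n in range(1, 25)]  # '01' .. '24'
    return [
        template.format(ymd=d.strftime("%Y%m%d"), site=s)
        for d, s in product(date_range(start, end), sites)
    ]


urls = build_urls(date(2020, 6, 22), date(2020, 6, 23))
print(len(urls))  # 2 days x 24 racecourses
```

The resulting list can be used in place of the hand-built one, as long as the template is replaced with the actual URL structure.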

Get boat race results!

Here is the code. When scraping, be sure to wait a fixed interval between requests so that you do not overload the target web server.

import requests
from time import sleep
from bs4 import BeautifulSoup

path_w = 'bs_2020_0622.txt'
list_errorlog = []  # Note the racecourses with no race that day (or a failed request), for reference.

for url in list:
    try:
        res = requests.get(url, timeout=30)
        res.raise_for_status()  # Raise an exception on 4xx/5xx responses
        res.encoding = res.apparent_encoding  # Guess the encoding to avoid garbled text
        soup = BeautifulSoup(res.text, 'html.parser')  # Name the parser explicitly
        with open(path_w, mode='a', encoding='utf-8') as f:
            f.write(soup.get_text())  # Append only the text, with the HTML tags stripped
    except requests.exceptions.RequestException:
        list_errorlog.append(url + " is not existing")
    finally:
        sleep(5)  # Do not remove! Wait between requests.

print(list_errorlog)
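The pattern of "try each URL, record failures, and always pause" can be isolated so that it is testable without touching the network. This is only a sketch under assumptions: scrape_all and fake_fetch are hypothetical names, and the fetch argument stands in for whatever actually downloads the page (for example, a small wrapper around requests.get).

```python
import time


def scrape_all(urls, fetch, pause_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing after every attempt.

    fetch is any callable that takes a URL and returns text, or raises on failure.
    Returns (pages, errors): pages maps URL -> text, errors lists the failures.
    """
    pages, errors = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except Exception as exc:
            errors.append(f"{url}: {exc}")
        finally:
            sleep(pause_seconds)  # Pause on success and on failure alike
    return pages, errors


# Demo with a fake fetcher, so nothing is sent over the network.
def fake_fetch(url):
    if url.endswith("02"):
        raise ValueError("no race held")
    return f"results for {url}"


pages, errors = scrape_all(["site01", "site02"], fake_fetch, pause_seconds=0)
print(pages)
print(errors)
```

Putting the pause in a finally clause means the wait cannot be skipped by accident when a new except branch is added later.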

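The scraping code requests each URL in turn, extracts the page text with BeautifulSoup, and appends it to a single text file, waiting 5 seconds between requests and recording any URL that could not be fetched.

This is enough here because the target pages have a fairly simple structure, but for more elaborate pages I think you would need to pick out the specific HTML tags you want.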
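To illustrate what picking out specific tags might look like, here is a minimal sketch that collects table cells row by row. It uses only the standard library so that it runs without extra installs. TableCellParser is a hypothetical class name, and SAMPLE_HTML is made-up markup, not the structure of the official site.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # Completed rows
        self._row = None   # Row currently being read, or None
        self._cell = None  # Text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample markup, NOT copied from the official site.
SAMPLE_HTML = """
<table>
  <tr><td>1</td><td>3</td><td>Racer A</td></tr>
  <tr><td>2</td><td>1</td><td>Racer B</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.rows)
```

With BeautifulSoup, the same idea would start from calls such as soup.find_all('tr') and row.find_all('td') instead of get_text() on the whole page.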

Here is the acquired result.

[Image: image.png]

It looks good. Next time, I would like to convert this text data into a DataFrame that can be used for machine learning. Scraping really is impressive. (Although I feel I am late to the trend.)

Finally

I may just be pleased with myself, but I was able to automate the downloads instead of doing them by hand each time, and I learned something along the way, so I am happy with the result!
