Introduction

This article explains the internal code of "Today, do you have a good prediction?", a boat race triple prediction site that I built myself and published on the web. This installment covers web scraping.

I wrote the code in my own self-taught style, so I would appreciate any advice.

What information do I want, and where do I get it?

I want to build a triple prediction site for boat races using machine learning, so I need to obtain past race results as training data. At a minimum, I want the following:

-- Racecourse (kyoteiba)
-- Player name
-- Lane information
-- Race results
-- Date of the race

Other information I would like to have:

-- Race number (which race of the day)
-- Weather on the day
-- Motor information

And so on. These days the official Boat Race website keeps its data well maintained, and past race results can be looked up there too.

This time, I will get the race results that serve as the source of the training data from there!

Understand the URL structure

As background on boat racing: races are held at some of the 24 racecourses on basically every day of the year. So, once I understood the URL structure, I decided to fetch race information for (the desired number of days) x (24 racecourses). If no race was held, that combination is simply skipped.

Having worked out the URL structure, I prepared a list to hold the URLs as follows. The code only covers 2020/6/22, but the idea is that adding entries to the year, month, and day lists produces URLs for other dates as well.

import pandas as pd
import numpy as np

list = []  # URLs to fetch (note: this name shadows the built-in `list`)
year = ['2020']
month = ['06']
day = ['22']
# Racecourse codes '01' to '24'
site = [f'{n:02d}' for n in range(1, 25)]

# Build one URL per (year, month, day, racecourse) combination
for i in year:
    for j in month:
        for k in day:
            for l in site:
                list.append("url name is described here")
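To cover more dates without typing each one into the lists, the nested loops above could be driven by a date range instead. The following is only a minimal sketch: `URL_TEMPLATE` is a made-up placeholder (not the real address of the official site) and `build_urls` is a hypothetical helper name.

```python
from datetime import date, timedelta
from itertools import product

# Placeholder only: substitute the real URL pattern of the results page.
URL_TEMPLATE = "https://example.com/results?site={site}&date={ymd}"

def build_urls(start, end, n_sites=24):
    """Return one URL per (day, racecourse) pair, for start..end inclusive."""
    n_days = (end - start).days + 1
    days = [start + timedelta(days=i) for i in range(n_days)]
    sites = [f"{n:02d}" for n in range(1, n_sites + 1)]  # '01' .. '24'
    return [
        URL_TEMPLATE.format(site=s, ymd=d.strftime("%Y%m%d"))
        for d, s in product(days, sites)
    ]

urls = build_urls(date(2020, 6, 22), date(2020, 6, 23))
print(len(urls))  # 2 days x 24 racecourses = 48
```

Stepping through real calendar dates also avoids generating dates that do not exist (such as February 30), which separate year, month, and day lists would produce.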

Get boat race results!

Here is the code. When scraping, always wait a fixed interval between requests to avoid overloading the target web server.

import requests
from time import sleep
from bs4 import BeautifulSoup

path_w = 'bs_2020_0622.txt'
list_errorlog = []  # URLs that could not be fetched (e.g. no race at that racecourse that day)

for url in list:
    try:
        res = requests.get(url)
        res.raise_for_status()  # raise on 404 etc. so the page is skipped
        res.encoding = res.apparent_encoding
        soup = BeautifulSoup(res.text, 'html.parser')
        # Append the page's plain text to the output file
        with open(path_w, mode='a', encoding='utf-8') as f:
            f.write(soup.get_text())
    except requests.exceptions.RequestException:
        # Page missing or request failed: record it and move on
        list_errorlog.append(url + " does not exist")
    finally:
        sleep(5)  # Do not remove! Wait between requests to avoid overloading the server

print(list_errorlog)

This code does the following:

-- Access a page → extract the text and append it to the .txt file → wait 5 seconds and repeat
-- Skip the page via try/except if it does not exist (= no race)

This is good enough here because the target pages have a fairly simple structure, but for more elaborate pages I think you would need to pick out specific HTML tags rather than dump all the text.
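As an illustration of targeting specific tags instead of taking all the text, here is a minimal sketch that collects only the `<td>` cells of each table row. The HTML is made up for the example and the class name `CellCollector` is hypothetical; the real page layout will differ. It uses the standard library's `html.parser` only so that it runs without extra packages; with BeautifulSoup, which this article already uses, `soup.find_all("td")` would pick out the same cells.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th>, so they stay empty
                self.rows.append(self._row)
            self._row = None

# Made-up HTML for illustration only.
sample_html = """
<table>
  <tr><th>Lane</th><th>Player</th></tr>
  <tr><td>1</td><td>Player A</td></tr>
  <tr><td>2</td><td>Player B</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(sample_html)
print(parser.rows)
```

Rows collected this way are already split into fields, which should make a later conversion to a table format easier than working from one block of plain text.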

Click here for acquisition results

The results look good. Next time, I would like to convert this text data into a DataFrame suitable for machine learning. Scraping really is amazing. (Though I feel I am late to the trend..)

Finally

I read the official website's site policy before doing any web scraping.
It also turns out that the text data above can be downloaded from the official website without scraping (lol).

Oh well.. Still, I was able to automate the download instead of doing it by hand each time, and it was a good learning experience, so it all worked out!