Continuing from last time, I am building a prediction program using horse-racing data from netkeiba.com. I learned a lot about scraping while writing it, so I am summarizing the key points here as a reminder.
I tried to get a horse racing database using Pandas https://qiita.com/Fumio-eisan/items/1c1c429746a3a0add055
For the prediction program itself, please see this video series. It is explained very carefully, so even beginners can follow it.
Data analysis and machine learning starting with horse racing prediction https://www.youtube.com/channel/UCDzwXAWu1zIfJuPTTZyWthw
Last time, I summarized how `pandas` can pull information such as the race schedule, horse names, and jockeys straight from the HTML. That alone may not be enough, though: for the parts of the page rendered by JavaScript, some extra scraping work is needed.
`Selenium` is a framework for automating web browsers. It can drive `Chrome`, `Firefox`, `IE`, and others; this time I will use `Chrome`.
http://chromedriver.chromium.org/downloads

Download the `chromedriver` that matches your version of `Chrome` from here.
```python
from selenium.webdriver import Chrome, ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
```
Here is how to simply open a URL.

```python
options = ChromeOptions()
# Specify the path to chromedriver.exe here
driver = Chrome(executable_path=r'(path to chromedriver.exe)', options=options)
driver.get(url)
```
Note that the path to the `chromedriver` is written out here. If you add the `chromedriver` path to your system's environment variables, you supposedly don't need to write it in the program, but it did not work for me even after setting it. So I specify `executable_path` explicitly.
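As a side note, you can check from Python itself whether `chromedriver` is actually visible on your `PATH`; this is a minimal sketch using only the standard library.

```python
import shutil

# Returns the full path to the chromedriver binary if it is on PATH,
# or None if it is not (in which case executable_path is needed).
path = shutil.which('chromedriver')
print(path)
```

If this prints `None`, pointing `Chrome(executable_path=...)` at the binary, as above, is the straightforward workaround.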
The class that scrapes the race information is defined below. The results are stored in a `pandas` `DataFrame`.
```python
import re
import time

import pandas as pd
from tqdm import tqdm_notebook as tqdm


class ShutubaTable:
    def __init__(self):
        self.shutuba_table = pd.DataFrame()

    def scrape_shutuba_table(self, race_id_list):
        options = ChromeOptions()
        driver = Chrome(executable_path=r'C:\Users\lllni\Documents\Python\20200528_keiba\chromedriver_win32\chromedriver.exe',
                        options=options)
        for race_id in tqdm(race_id_list):
            url = 'https://race.netkeiba.com/race/shutuba.html?race_id=' + race_id
            driver.get(url)
            elements = driver.find_elements_by_class_name('HorseList')
            for element in elements:
                tds = element.find_elements_by_tag_name('td')
                row = []
                for td in tds:
                    row.append(td.text)
                    if td.get_attribute('class') in ['HorseInfo', 'Jockey']:
                        href = td.find_element_by_tag_name('a').get_attribute('href')
                        row.append(re.findall(r'\d+', href)[0])
                self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))
            time.sleep(1)
        driver.close()
```
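The `re.findall(r'\d+', href)[0]` line in the class pulls the numeric id out of a cell's link. Here is a standalone sketch of that step; the sample href below is my own assumption about the link format, not taken from the page itself.

```python
import re

# Hypothetical href of the kind found inside a 'HorseInfo' cell
horse_href = 'https://db.netkeiba.com/horse/2017104462'

# findall returns every run of digits; the first run is the id
horse_id = re.findall(r'\d+', horse_href)[0]
print(horse_id)  # 2017104462
```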
Some key points:

1) The `HorseList` class contains the data we want, so retrieve it like this:

```python
elements = driver.find_elements_by_class_name('HorseList')
```

2) The actual values sit in the `td` tags, so loop over them:

```python
for element in elements:
    tds = element.find_elements_by_tag_name('td')
    row = []
    for td in tds:
        row.append(td.text)
    self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))
```

By reading each of these `td` tags, the information rendered by JavaScript can also be retrieved. Also, each `element` corresponds to one horse; in other words, every time the horse changes, `row` is emptied and the `td` contents are collected from scratch.
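The reset behaviour can be seen with plain lists standing in for the Selenium elements (the dummy data below is made up purely for illustration):

```python
# Each inner list plays the role of one horse's row of <td> cells.
elements = [['1', 'Horse A', 'Jockey X'],
            ['2', 'Horse B', 'Jockey Y']]

rows = []
for element in elements:   # one element per horse
    row = []               # row is emptied when the horse changes
    for td in element:     # one entry per td tag
        row.append(td)
    rows.append(row)

print(rows)
```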
Now run it:

```python
st = ShutubaTable()
st.scrape_shutuba_table(['202005030211'])  # enter the race_id of the race you want to predict
st.shutuba_table
```
On netkeiba.com, the race id is the number at the end of the race card's URL, so copy out just that number for the race you want to look at. I was able to extract the data without trouble.
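Since the race id is just the `race_id=` query parameter, you can also pull it out of a pasted URL programmatically. A small sketch (the helper name is mine, not from the original code):

```python
import re

def extract_race_id(url):
    # Hypothetical helper: grab the digits of the race_id query parameter
    m = re.search(r'race_id=(\d+)', url)
    return m.group(1) if m else None

url = 'https://race.netkeiba.com/race/shutuba.html?race_id=202005030211'
print(extract_race_id(url))  # 202005030211
```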
This gave me a deeper understanding of scraping.