[PYTHON] I learned scraping using selenium to make a horse racing prediction model.

Introduction

Continuing from the last time, I am making a prediction program using horse racing data from netkeiba.com. I learned the contents of scraping by making a prediction program, so I will summarize it in a reminder.

I tried to get a horse racing database using Pandas https://qiita.com/Fumio-eisan/items/1c1c429746a3a0add055

Please see this video for the horse racing prediction program itself. It is explained very carefully, and even beginners can fully understand it.

Data analysis and machine learning starting with horse racing prediction https://www.youtube.com/channel/UCDzwXAWu1zIfJuPTTZyWthw

Scraping of places written in javascript (using Selenimum)

I summarized last time that you can use pandas to get information such as race schedule, horse name, jockey from html. This may not be enough. For the part described in javascript, it is necessary to take some time to scrape.

What is Selenium

Selenium is a framework for automating the operation of web browsers. It seems that it can be used with Chrome, FireFox, ʻIE, etc. This time I will use Chrome`.

http://chromedriver.chromium.org/downloads

image.png

Download your Chrome version of selenium here.

from selenium.webdriver import Chrome, ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

Open URL

Here's how to simply open the URL.


options= ChromeOptions()
driver = Chrome(executable_path=r'(chromedriver.exe)Please specify the path of',options=options)
driver.get(url)

Here, the path of the chorome driver is included. If you specify path of chorome driver in System Preferences, it seems that you don't need to write it in this program, but it didn't work even if you set it. Therefore, I purposely specify ʻexecutable_path`.

Class definition for taking race information

The class that scrapes the race information is defined below. I will put it in the data frame of pandas.


from tqdm import tqdm_notebook as tqdm
import pandas as pd
import time

class ShutubaTable:
    def __init__(self):
        self.shutuba_table = pd.DataFrame()
    
    def scrape_shutuba_table(self, race_id_list):
        options= ChromeOptions()
        driver = Chrome(executable_path=r'C:\Users\lllni\Documents\Python\20200528_keiba\chromedriver_win32\chromedriver.exe',options=options)
        for race_id in tqdm(race_id_list):
            url = 'https://race.netkeiba.com/race/shutuba.html?race_id=' + race_id
            driver.get(url)
            elements = sample_driver.find_elements_by_class_name('HorseList')
            for element in elements:
                tds = element.find_elements_by_tag_name('td')
                row = []
                for td in tds:
                    row.append(td.text)
                    if td.get_attribute('class') in ['HorseInfo', 'Jockey']:
                        href = td.find_element_by_tag_name('a').get_attribute('href')
                        row.append(re.findall(r'\d+', href)[0])
                self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))
        time.sleep(1)
        driver.close()

As a point

1)'HorseList` class contains the desired data, so retrieve it as follows.

elements = sample_driver.find_elements_by_class_name('HorseList')

image.png

  1. Extract information such as horse name and odds for each td tag.
for element in elements:
  tds = element.find_elements_by_tag_name('td')
  row = []
  for td in tds:
     row.append(td.text)
    self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))

image.png

By describing each of these td tags, the information described in javascript can also be retrieved. Also, ʻelementis for each horse. In other words, every time the horse changes, therow will be emptied and the td` tag information can be entered from scratch.

Instantiate and load

st = ShutubaTable1()
sample_driver = Chrome(executable_path=r'C:\Users\lllni\Documents\Python\20200528_keiba\chromedriver_win32\chromedriver.exe',options=options)
sample_driver.get(url
st.scrape_shutuba_table(['202005030211'])#Race you want to expect_Enter id
st.shutuba_table

If the race id is netkeiba.com, the number is included at the end of the URL of the race table, so please take out and paste only the number of the race id you want to see.

image.png

I was able to take it out safely.

At the end

I got another understanding of scraping.

Recommended Posts

I learned scraping using selenium to make a horse racing prediction model.
Machine learning beginners tried to make a horse racing prediction model with python
I tried to get a database of horse racing using Pandas
I tried to make a ○ ✕ game using TensorFlow
I tried crawling and scraping a horse racing site Part 2
I tried to make a stopwatch using tkinter in python
I tried to make a simple text editor using PyQt
I wrote a code that exceeds 100% recovery rate in horse racing prediction using LightGBM (Part 2)
I tried to make a regular expression of "amount" using Python
I tried to make a regular expression of "time" using Python
I tried to make a regular expression of "date" using Python
I tried to make a periodical process with Selenium and Python
I tried to make a todo application using bottle with python
I tried to make a Web API
I came up with a way to make a 3D model from a photo.
I tried to make PyTorch model API in Azure environment using TorchServe
I want to make a web application using React and Python flask
I want to make matplotlib a dark theme
I tried to make a periodical process with CentOS7, Selenium, Python and Chrome
I want to easily create a Noise Model
I tried web scraping using python and selenium
I tried to automate the 100 yen deposit of Rakuten horse racing (python / selenium)
I read "How to make a hacking lab"
I tried to make a translation BOT that works on Discord using googletrans
I tried to implement various methods for machine learning (prediction model) using scikit-learn.
I tried to make a suspicious person MAP quickly using Geolonia address data
I came up with a way to make a 3D model from a photo. 0 Projection to 3D space
I tried to make a "fucking big literary converter"
I tried hosting a Pytorch sample model using TorchServe
I made a code to convert illustration2vec to keras model
I stumbled upon using MoviePy, so make a note
I was addicted to scraping with Selenium (+ Python) in 2020
How to make a Python package using VS Code
[Python] I want to make a nested list a tuple
PyTorch Learning Note 2 (I tried using a pre-trained model)
I tried to draw a configuration diagram using Diagrams
[Git] I tried to make it easier to understand how to use git stash using a concrete example
I tried to make a document search slack command using Kendra announced at re: Invent 2019.
I tried to make a motion detection surveillance camera with OpenCV using a WEB camera with Raspberry Pi
I tried using PI Fu to generate a 3D model of a person from one image
I want to make a voice changer using Python and SPTK with reference to a famous site
I dare to fill out the form without using selenium
I tried to implement a basic Recurrent Neural Network model
[Horse Racing] I tried to quantify the strength of racehorses
I want to make a blog editor with django admin
I want to make a click macro with pyautogui (desire)
I tried hosting a TensorFlow deep learning model using TensorFlow Serving
I made a function to check the model of DCGAN
I want to make a click macro with pyautogui (outlook)
[Python scraping] I tried google search top10 using Beautifulsoup & selenium
I made a VGG16 model using TensorFlow (on the way)
I tried to automate [a certain task] using Raspberry Pi
Try to model a multimodal distribution using the EM algorithm
[Introduction to Tensorflow] Understand Tensorflow properly and try to make a model
I want to make input () a nice complement in python
I tried to divide with a deep learning language model
[Python] I tried to make a simple program that works on the command line using argparse.
Start to Selenium using python
Horse Racing Data Scraping Flow
Web scraping using Selenium (Python)
[5th] I tried to make a certain authenticator-like tool with python