Continuing from last time, I am building a prediction program using horse-racing data from netkeiba.com. I learned a lot about scraping while writing it, so I am summarizing the key points here as a reminder.
I tried to get a horse racing database using Pandas https://qiita.com/Fumio-eisan/items/1c1c429746a3a0add055
For the prediction program itself, please see this video series. It is explained very carefully, so even beginners can follow it.
Data analysis and machine learning starting with horse racing prediction https://www.youtube.com/channel/UCDzwXAWu1zIfJuPTTZyWthw
Last time, I summarized how `pandas` can pull information such as the race schedule, horse names, and jockeys straight from the HTML. That alone may not be enough, though: for the parts of the page rendered by JavaScript, some extra scraping work is needed.
`Selenium` is a framework for automating web browsers. It can drive `Chrome`, `Firefox`, `IE`, and others; this time I will use `Chrome`.
http://chromedriver.chromium.org/downloads

Download the `chromedriver` that matches your version of `Chrome` from here.
```python
from selenium.webdriver import Chrome, ChromeOptions
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
```
Here is how to simply open a URL.

```python
options = ChromeOptions()
# Specify the path to chromedriver.exe here
driver = Chrome(executable_path=r'(path to chromedriver.exe)', options=options)
driver.get(url)
```
Note that the path to the `chromedriver` is written out here. If you add the `chromedriver` path to your system's environment variables, you supposedly don't need to write it in the program, but it did not work for me even after setting it. So I specify `executable_path` explicitly.
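As a side note, you can check from Python itself whether `chromedriver` is actually visible on your `PATH`; this is a minimal sketch using only the standard library.

```python
import shutil

# Returns the full path to the chromedriver binary if it is on PATH,
# or None if it is not (in which case executable_path is needed).
path = shutil.which('chromedriver')
print(path)
```

If this prints `None`, pointing `Chrome(executable_path=...)` at the binary, as above, is the straightforward workaround.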
The class that scrapes the race information is defined below. The results are stored in a `pandas` `DataFrame`.
```python
import re
import time

import pandas as pd
from tqdm import tqdm_notebook as tqdm


class ShutubaTable:
    def __init__(self):
        self.shutuba_table = pd.DataFrame()

    def scrape_shutuba_table(self, race_id_list):
        options = ChromeOptions()
        driver = Chrome(executable_path=r'C:\Users\lllni\Documents\Python\20200528_keiba\chromedriver_win32\chromedriver.exe',
                        options=options)
        for race_id in tqdm(race_id_list):
            url = 'https://race.netkeiba.com/race/shutuba.html?race_id=' + race_id
            driver.get(url)
            elements = driver.find_elements_by_class_name('HorseList')
            for element in elements:
                tds = element.find_elements_by_tag_name('td')
                row = []
                for td in tds:
                    row.append(td.text)
                    if td.get_attribute('class') in ['HorseInfo', 'Jockey']:
                        href = td.find_element_by_tag_name('a').get_attribute('href')
                        row.append(re.findall(r'\d+', href)[0])
                self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))
            time.sleep(1)
        driver.close()
```
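The `re.findall(r'\d+', href)[0]` line in the class pulls the numeric id out of a cell's link. Here is a standalone sketch of that step; the sample href below is my own assumption about the link format, not taken from the page itself.

```python
import re

# Hypothetical href of the kind found inside a 'HorseInfo' cell
horse_href = 'https://db.netkeiba.com/horse/2017104462'

# findall returns every run of digits; the first run is the id
horse_id = re.findall(r'\d+', horse_href)[0]
print(horse_id)  # 2017104462
```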
Some key points:

1) The `HorseList` class contains the data we want, so retrieve it like this:

```python
elements = driver.find_elements_by_class_name('HorseList')
```

2) The actual values sit in the `td` tags, so loop over them:

```python
for element in elements:
    tds = element.find_elements_by_tag_name('td')
    row = []
    for td in tds:
        row.append(td.text)
    self.shutuba_table = self.shutuba_table.append(pd.Series(row, name=race_id))
```

By reading each of these `td` tags, the information rendered by JavaScript can also be retrieved. Also, each `element` corresponds to one horse; in other words, every time the horse changes, `row` is emptied and the `td` contents are collected from scratch.
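The reset behaviour can be seen with plain lists standing in for the Selenium elements (the dummy data below is made up purely for illustration):

```python
# Each inner list plays the role of one horse's row of <td> cells.
elements = [['1', 'Horse A', 'Jockey X'],
            ['2', 'Horse B', 'Jockey Y']]

rows = []
for element in elements:   # one element per horse
    row = []               # row is emptied when the horse changes
    for td in element:     # one entry per td tag
        row.append(td)
    rows.append(row)

print(rows)
```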
Now run it:

```python
st = ShutubaTable()
st.scrape_shutuba_table(['202005030211'])  # enter the race_id of the race you want to predict
st.shutuba_table
```
On netkeiba.com, the race id is the number at the end of the race card's URL, so copy out just that number for the race you want to look at. I was able to extract the data without trouble.
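Since the race id is just the `race_id=` query parameter, you can also pull it out of a pasted URL programmatically. A small sketch (the helper name is mine, not from the original code):

```python
import re

def extract_race_id(url):
    # Hypothetical helper: grab the digits of the race_id query parameter
    m = re.search(r'race_id=(\d+)', url)
    return m.group(1) if m else None

url = 'https://race.netkeiba.com/race/shutuba.html?race_id=202005030211'
print(extract_race_id(url))  # 202005030211
```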
This gave me a deeper understanding of scraping.