[PYTHON] Horse Racing Data Scraping at Colaboratory

Scraping horse racing data with Colaboratory

If you want to scrape horse racing data and use it for machine learning, Colaboratory is convenient, so here is a note of the code for scraping horse racing data in Colaboratory.

(Please note that scraping may stop working if the site's HTML changes. Operation confirmed on 2020-08-30.)

Code below

sample.ipynb


# Install Chromium and selenium
# Paste each line, including the leading "!", into a Colaboratory code cell.
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

# Import the required libraries
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# The five parts of race_id; they are concatenated in this order to build the URL
race_date = "2020"
race_course_num = "06"
race_info = "03"
race_count = "05"
race_no = "01"
url = "https://race.netkeiba.com/race/result.html?race_id=" + race_date + race_course_num + race_info + race_count + race_no + "&rf=race_list"

#Get the data of the corresponding URL in HTML format
race_html=requests.get(url)
race_html.encoding = race_html.apparent_encoding  
race_soup=BeautifulSoup(race_html.text,'html.parser')
# Strip tags and whitespace from one table row and return it as a single
# string in which every cell is terminated by "'"
def make_data(data):
    data = re.sub(r"\n", "", str(data))
    data = re.sub(r" ", "", data)
    data = re.sub(r"</td>", "'", data)    # mark the end of each cell
    data = re.sub(r"<[^>]*?>", "", data)  # drop all remaining tags
    data = re.sub(r"\[", "", data)
    return data
# Get only the rows of the race result table
HorseList = race_soup.find_all("tr",class_="HorseList")

# Shape the race table
# Number of columns in the table = 15 (Order of arrival, frame, Horse number, Horse name, Sex and age, Weight, Jockey, time, Difference, Popularity, Win odds, Last 3F, Corner passing order, stable, Horse weight(Increase / decrease)); "Number of runners" is added as a 16th column
col = ["Order of arrival","frame","Horse number","Horse name","Sex and age","Weight","Jockey","time","Difference","Popularity","Win odds","Last 3F","Corner passing order","stable","Horse weight(Increase / decrease)","Number of runners"]

#Count the number of runners
uma_num = len(HorseList)

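# Clean each row, then split it on "'" into one column per cell.
# The split leaves an empty trailing field after the last cell; it lines up
# with the 16th name in col and is overwritten with the number of runners.
df_temp = pd.DataFrame(map(make_data,HorseList),columns=["temp"])
df = df_temp["temp"].str.split("'", expand=True)
df.columns = col
df["Number of runners"] = uma_num
df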
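
As an alternative to the regex-based make_data above, each cell can also be read directly with BeautifulSoup. This is only a minimal sketch and not the method used in this article: the sample HTML and the helper name rows_to_frame are made up for illustration, and the real result table has 15 cells per row rather than 3.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical, heavily shortened stand-in for the real result table.
SAMPLE_HTML = """
<table>
  <tr class="HorseList"><td>3</td><td> Horse A </td><td>2.4</td></tr>
  <tr class="HorseList"><td>7</td><td> Horse B </td><td>5.1</td></tr>
</table>
"""

def rows_to_frame(html, columns):
    """Build a DataFrame from every <tr class="HorseList"> row in html."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr", class_="HorseList"):
        # get_text(strip=True) drops the tags and the surrounding whitespace
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(cells)
    frame = pd.DataFrame(rows, columns=columns)
    frame["Number of runners"] = len(rows)
    return frame

sample_df = rows_to_frame(SAMPLE_HTML, ["Horse number", "Horse name", "Win odds"])
print(sample_df)
```

Reading the cells one by one avoids relying on "'" as a separator, so a value that happens to contain an apostrophe would not shift the columns.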


Finally

From here, you can scrape many more races by changing the date and the other values. Colaboratory is convenient after all, since it requires no environment setup.
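
Changing the values can be sketched as a small helper that builds the URL for each race. This is an illustration under assumptions, not code from the original notebook: the function name build_result_url and its parameter names are hypothetical guesses at what each part of race_id means, and 12 races per day is also an assumption.

```python
def build_result_url(year, course, meeting, day, race_no):
    """Return the result-page URL for one race.

    The five numbers are zero-padded and concatenated in the same order
    as race_date, race_course_num, race_info, race_count and race_no.
    """
    race_id = f"{year:04d}{course:02d}{meeting:02d}{day:02d}{race_no:02d}"
    return (
        "https://race.netkeiba.com/race/result.html"
        f"?race_id={race_id}&rf=race_list"
    )

# URLs for races 1 to 12 with the other values fixed (12 is an assumption)
urls = [build_result_url(2020, 6, 3, 5, n) for n in range(1, 13)]
print(urls[0])
```

When actually fetching these pages in a loop, it is a good idea to pause between requests, for example with time.sleep, so as not to put load on the site.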

References

https://qiita.com/Mokutan/items/89c871eac16b8142b5b2
https://qiita.com/ftoyoda/items/fe3e2fe9e962e01ac421
