Python scraping: Extracting the race environment from a horse racing site

Background

Somehow I became interested in horse racing and machine learning. The site I referred to did not cover what I wanted to scrape. Information on horses and jockeys is important in horse racing, but the finishing order can also change depending on the race environment (dirt or turf, sunny or rainy weather, and so on). I worked on extracting that this time, so I am summarizing it here as a memo. I will omit an overview of HTML, CSS, Beautiful Soup, and so on, and only look at how to use them.

The target is the horse racing site netkeiba.com (the URL is in the code below). The content to be scraped is the race-environment line shown on the race page (surface, weather, going, and start time).

Check the target

Check which element the target content corresponds to. In Chrome, you can inspect the structure of the web page by selecting the rightmost menu → More tools → Developer tools.


Sample code

The string we want to extract is contained in a span element, so we access it directly. There are several ways to extract it, and the sample code below may be a bit of a detour.

race_condition_scraping.py


#Download HTML
import requests
#Get information with Beautiful Soup
from bs4 import BeautifulSoup 

#Specify URL
r = requests.get("https://nar.netkeiba.com/?pid=race&id=p201942100701")
# Specify the HTML and the parser (how it should be read)
soup = BeautifulSoup(r.content, "html.parser")
# Extract all span elements from the HTML
tags = soup.find_all('span')
print(tags)
"""Line breaks for easy viewing
[<span>Introduction of my page</span>, <span>Favorite horse</span>, \
<span>Login/Member registration</span>, <span>(s)Login</span>, \
<span>(s)Join Free</span>, <span>(s)Log out</span>, \
<span>Da 1400m/Weather: cloudy/Baba: Shige/Start: 10:30</span>, \
<span>7</span>, <span>6</span>, <span>4</span>, <span>8</span>, \
<span>3</span>, <span>1</span>, <span>2</span>, <span>8</span>, \
<span>6</span>, <span>5</span>, <span>7</span>, <span>5</span>] 
"""
# List comprehension: the substring "Weather" always appears in the target span, so use it as the filter condition
names = [t.text for t in tags if "Weather" in t.text]
print(names)
"""
['Da 1400m\xa0/\xa0Weather: cloudy\xa0/\xa0Baba: Shige\xa0/\xa0Start: 10:30']
"""
# Split the string on "\xa0/\xa0"
weather_ = names[0].split('\xa0/\xa0')
print(weather_)
"""
['Da 1400m', 'Weather: cloudy', 'Baba: Shige', 'Start: 10:30']
"""

Bonus: Get main information

I studied the scraping code of @akihiro199630 and referred to the code in the article "I tried crawling and scraping a horse racing site, Part 1".


race_data_scraping.py


import requests
import lxml.html
import csv

#Specify URL
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701" 
r = requests.get(URL)
r.encoding = r.apparent_encoding #Prevent garbled characters
html = lxml.html.fromstring(r.text) # Convert the acquired string into an HtmlElement tree

rlt = [] #result

# The target data is packed in the td cells under (div id="race_main") → (div class) → (table width) → (tr)
for h in html.cssselect('#race_main > div > table > tr'):  # Specify the scraping location with a CSS selector
    # Keep a reference to the tr element
    h_1 = h
    # Get the text content of the element
    h_2 = h_1.text_content()
    # Split on newlines ("\n")
    h_3 = h_2.split("\n")
    # List comprehension: exclude empty strings
    h_4 = [tag for tag in h_3 if tag != '']
    # The 1st-place row has no time recorded, so it is one field short
    if len(h_4) != 13:
        # Forcibly insert a 0 at index 8 to keep the column count at 13
        h_4.insert(8, 0)
    # Append the row data to the result list
    rlt.append(h_4)

# Extraction results (these show the values from the last row processed in the loop)
print("h_1",h_1)
"""
h_1 <Element tr at 0x1c35a3d53b8>
"""
print("h_2",h_2)
"""

12
5
5
Supercruise

Male 3
56.0
Kosei Akimoto
1:34.2
7
12
459
(Urawa)Toshio Tomita
469( +1 )
"""
print("h_3",h_3)
"""Line breaks for easy viewing
h_3 ['', '12', '5', '5', 'Supercruise', '', 'Male 3', '56.0', 'Kosei Akimoto', \
'1:34.2', '7', '12', '459', '(Urawa)Toshio Tomita', '469( +1 )', '']
"""
print("h_4", h_4)
"""
h_4 ['12', '5', '5', 'Supercruise', 'Male 3', '56.0', 'Kosei Akimoto', '1:34.2', '7', '12', '459', '(Urawa)Toshio Tomita', '469( +1 )']
"""

# Save to a CSV file
with open("result.csv", 'w', newline='') as f:
    wrt = csv.writer(f)
    wrt.writerow(rlt.pop(0)) # Write the header row (the first extracted row)
    wrt.writerows(rlt) # Write the remaining extracted rows
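
As a quick sanity check (my own addition, not part of the referenced article), the saved file can be read back with the standard csv module:

# Minimal sketch: read result.csv back and print each row to confirm the output.
import csv

with open("result.csv", newline='') as f:
    for row in csv.reader(f):
        print(row)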

Bonus: Date of the race

If you want to reflect a horse's past performance in a neural network, you need the most recent data as of the day before the race. If this point is not taken into account, results from after the race date will leak into the input.

So I wrote some simple code to get the race date from the target site's URL, and I will summarize it here.

race_date.py


#Specify URL
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701" 

# The date matters because we need to go back into the past from the race date
# Split the part of the URL after "=p": 201942100701 → 2019 42 1007 01
# 2019 42 1007 01 → year, (racecourse number?), date (MMDD), race number

# Extract the last 12 digits (201942100701) from the URL
url_12_ymd = URL[-12:]
print(url_12_ymd)
# 201942100701
url_12_y = url_12_ymd[:4]
print(url_12_y)
# 2019
url_12_md = url_12_ymd[6:10]
print(url_12_md)
# 1007
url_12_race = url_12_ymd[10:]
print(url_12_race)
# 01
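
If the digits really do follow the pattern guessed above (year, racecourse number, MMDD, race number), the date part can be turned into a datetime, which makes "the day before the race" easy to compute. This is only a sketch under that assumption.

# Minimal sketch: build a date object from the pieces extracted above and step back one day.
# Assumes url_12_y is the year and url_12_md is the month and day (MMDD), as guessed above.
from datetime import datetime, timedelta

race_date = datetime.strptime(url_12_y + url_12_md, "%Y%m%d")
day_before = race_date - timedelta(days=1)
print(race_date.date())   # 2019-10-07
print(day_before.date())  # 2019-10-06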

Future flow

Now that the necessary information has been extracted, I wonder if I should quantify things like Da (dirt), cloudy weather, and weights and feed them into a neural network. If I stay motivated, I would like to continue. Since the site (*) I referred to can also provide important information other than the race environment, would it work if I integrated the code above into the code from that site?
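
As a rough idea of what that quantification might look like (this is my own sketch, not code from the referenced site), categorical fields such as the surface and the weather could be mapped to numbers before being fed to a neural network. The category mappings below are placeholders and would need to be filled in from real data.

# Minimal sketch: turn the race-environment strings into simple numeric features.
# The category mappings are placeholders, not an exhaustive list.
surface_map = {"Da": 0, "Turf": 1}                  # dirt / turf
weather_map = {"sunny": 0, "cloudy": 1, "rainy": 2}

def encode_environment(course, weather):
    """Return [surface code, distance in meters, weather code]."""
    surface, distance = course.split()              # e.g. 'Da 1400m' -> ('Da', '1400m')
    return [
        surface_map.get(surface, -1),               # -1 for unknown categories
        int(distance.rstrip("m")),
        weather_map.get(weather, -1),
    ]

print(encode_environment("Da 1400m", "cloudy"))     # [0, 1400, 1]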

Reference site

@akihiro199630 is someone who does a lot of scraping work. A god.
Python Crawling & Scraping Chapter 1 Summary
Python Crawling & Scraping Chapter 4 Summary
[Scrapy] Correct / process the extracted URL
