Somehow I became interested in horse racing and machine learning. The site I referred to did not provide the data I wanted to scrape. Information on horses and jockeys is important in horse racing, but the finishing order can change with the race environment (dirt or turf, sunny or rainy weather, and so on). I worked on extracting that information, so I am summarizing it here as a memorandum. I will omit introductory explanations of HTML, CSS, Beautiful Soup, and the like, and just show how to use them.
The target is the horse racing site (netkeiba.com) link. The content to be scraped is the items in the red frame.
Check which element the content enclosed in the frame above corresponds to. In Chrome, you can inspect the structure of the page by selecting the rightmost menu → More tools → Developer tools.
Since the string we want to extract is contained in a span element, we access it directly. There are several extraction methods, and the sample code below may take a bit of a detour.
race_condition_scraping.py
#Download HTML
import requests
#Get information with Beautiful Soup
from bs4 import BeautifulSoup
#Specify URL
r = requests.get("https://nar.netkeiba.com/?pid=race&id=p201942100701")
#Specify the HTML and the parser (how it is read)
soup = BeautifulSoup(r.content, "html.parser")
#Extract span element from html
tags = soup.find_all('span')
print(tags)
"""Line breaks for easy viewing
[<span>Introduction of my page</span>, <span>Favorite horse</span>, \
<span>Login/Member registration</span>, <span>(s)Login</span>, \
<span>(s)Join Free</span>, <span>(s)Log out</span>, \
<span>Da 1400m/Weather: cloudy/Baba: Shige/Start: 10:30</span>, \
<span>7</span>, <span>6</span>, <span>4</span>, <span>8</span>, \
<span>3</span>, <span>1</span>, <span>2</span>, <span>8</span>, \
<span>6</span>, <span>5</span>, <span>7</span>, <span>5</span>]
"""
#List comprehension: the string "weather" always appears in the target span, so use it as the filter condition
names = [t.text for t in tags if "weather" in t.text]
print(names)
"""
['Da 1400m\xa0/\xa0 weather: cloudy\xa0/\xa0 Baba: Shige\xa0/\xa0 start: 10:30']
"""
#Split on the separator "\xa0/\xa0"
weather_ = names[0].split('\xa0/\xa0')
print(weather_)
"""
['Da 1400m', 'Weather: cloudy', 'Baba: Shige', 'Start: 10:30']
"""
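The split fields can be turned into a small dictionary for later use. The helper below is my own sketch (the name `parse_conditions` and the `course` key are not from the article); it assumes the `'key: value'` layout shown in the output above.

```python
def parse_conditions(fields):
    #The first entry ('Da 1400m') has no 'key: value' form, so store it as 'course'
    info = {"course": fields[0]}
    for field in fields[1:]:
        key, _, value = field.partition(": ")
        info[key] = value
    return info

weather_ = ['Da 1400m', 'Weather: cloudy', 'Baba: Shige', 'Start: 10:30']
print(parse_conditions(weather_))
# {'course': 'Da 1400m', 'Weather': 'cloudy', 'Baba': 'Shige', 'Start': '10:30'}
```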
I studied the scraping code of @akihiro199630, and I referred to the code in the article "I tried crawling and scraping a horse racing site, part 1".
race_data_scraping.py
import requests
import lxml.html
import csv
#Specify URL
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
r = requests.get(URL)
r.encoding = r.apparent_encoding #Prevent garbled characters
html = lxml.html.fromstring(r.text) #Convert the retrieved HTML string to an HtmlElement
rlt = [] #Results
#The row information is packed in [(div id="race_main") → (div class) → (table width) → (tr)] → (td)
for h in html.cssselect('#race_main > div > table > tr'): #Specify the scraping location with a CSS selector
    #Element for this row
    h_1 = h
    #Get the text content of the element
    h_2 = h_1.text_content()
    #Split on newlines ("\n")
    h_3 = h_2.split("\n")
    #List comprehension: exclude empty strings
    h_4 = [tag for tag in h_3 if tag != '']
    #No time is recorded for the 1st-place row
    if len(h_4) != 13:
        #Forcibly insert a 0 at index 8 so every row has 13 columns
        h_4.insert(8, 0)
    #Append the row data (a list) to the results
    rlt.append(h_4)
#Extraction result
print("h_1",h_1)
"""
h_1 <Element tr at 0x1c35a3d53b8>
"""
print("h_2",h_2)
"""
12
5
5
Supercruise
Male 3
56.0
Kosei Akimoto
1:34.2
7
12
459
(Urawa)Toshio Tomita
469( +1 )
"""
print("h_3",h_3)
"""Line breaks for easy viewing
h_3 ['', '12', '5', '5', 'Supercruise', '', 'Male 3', '56.0', 'Kosei Akimoto', \
'1:34.2', '7', '12', '459', '(Urawa)Toshio Tomita', '469( +1 )', '']
"""
print("h_4", h_4)
"""
h_4 ['12', '5', '5', 'Supercruise', 'Male 3', '56.0', 'Kosei Akimoto', '1:34.2', '7', '12', '459', '(Urawa)Toshio Tomita', '469( +1 )']
"""
#Save to a CSV file
with open("result.csv", 'w', newline='') as f:
    wrt = csv.writer(f)
    wrt.writerow(rlt.pop(0)) #Write the first (header) row with writerow
    wrt.writerows(rlt) #Write the remaining extraction results with writerows
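The writerow/writerows pattern above can be checked with a small round trip. This sketch uses made-up sample rows (not real scraped data) and a throwaway filename, `result_demo.csv`:

```python
import csv

#Sample data standing in for the scraped results
header = ["Rank", "Frame", "Number", "Horse"]
rows = [["1", "5", "5", "Supercruise"], ["2", "7", "12", "SampleHorse"]]

#Write the header with writerow, then the remaining rows with writerows
with open("result_demo.csv", "w", newline="") as f:
    wrt = csv.writer(f)
    wrt.writerow(header)
    wrt.writerows(rows)

#Read the file back and confirm nothing was lost
with open("result_demo.csv", newline="") as f:
    back = list(csv.reader(f))
print(back == [header] + rows)
# True
```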
When you want to feed a raced horse's past performance into a neural network, you need the latest data available up to the day before the race. If this point is not taken into account, results from after the race date will leak into the features. So I wrote some simple code to get the race date from the target site's URL, summarized below.
race_date.py
#Specify URL
URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
#The date is important because we need to go back into the past from the race date
#Split the URL suffix after "=p": 201942100701 → 2019 42 1007 01
#2019 42 1007 01 → year, (venue number?), date, race number
#Extract the 12 date-related digits (201942100701) from the URL
url_12_ymd = URL[-12:]
print(url_12_ymd)
# 201942100701
url_12_y = url_12_ymd[:4]
print(url_12_y)
# 2019
url_12_md = url_12_ymd[6:10]
print(url_12_md)
# 1007
url_12_race = url_12_ymd[10:]
print(url_12_race)
# 01
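The slicing above can be wrapped into a helper that also computes the day before the race, which is the cutoff for usable past data. This is a hypothetical helper of my own (the name `parse_race_url` and the return shape are my choices), assuming the 12-digit ID layout guessed at above:

```python
from datetime import date, timedelta

def parse_race_url(url):
    race_id = url[-12:]        # '201942100701'
    year = int(race_id[:4])    # 2019
    month = int(race_id[6:8])  # 10
    day = int(race_id[8:10])   # 07
    race_no = race_id[10:]     # '01'
    race_day = date(year, month, day)
    #The day before the race is the latest date whose data may be used
    return race_day, race_day - timedelta(days=1), race_no

URL = "https://nar.netkeiba.com/?pid=race&id=p201942100701"
race_day, cutoff, race_no = parse_race_url(URL)
print(race_day, cutoff, race_no)
# 2019-10-07 2019-10-06 01
```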
Now that we have extracted the necessary information, I suppose the next step is to quantify Da (dirt), cloudy weather, weights, and so on, and feed them into a neural network. If I stay motivated, I would like to continue. Since the site (*) that I referred to can provide important information beyond the race environment, would it work if I integrated the code above into the code from that site?
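One way to quantify those strings is a simple code mapping. The encodings below are entirely my own choices for illustration (the article does not specify any), assuming the field values shown earlier:

```python
#Hypothetical category-to-number mappings (my own choices, not from the article)
track = {"Da": 0, "Turf": 1}
weather_codes = {"sunny": 0, "cloudy": 1, "rainy": 2}

#A sample of the parsed race conditions
sample = {"course": "Da 1400m", "Weather": "cloudy"}

#'Da 1400m' splits into the surface code and the distance in meters
surface, distance = sample["course"].split()
features = [track[surface], int(distance.rstrip("m")), weather_codes[sample["Weather"]]]
print(features)
# [0, 1400, 1]
```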
@akihiro199630 is someone who works a lot on scraping. A god. Python Crawling & Scraping Chapter 1 Summary / Python Crawling & Scraping Chapter 4 Summary / [Scrapy] Correct / process the extracted URL