I was interested in horse racing prediction as a data analysis theme, so I gave it a try.
The site I referred to is here.
If you want to build a predictive model from scratch, you need to take the following steps:
This time, I will briefly summarize step 1, the scraping part.
I scraped the data from netkeiba (db.netkeiba.com).
Important point
Retrieving a large amount of data at once puts a load on the server. Inserting time.sleep(1) makes the loop pause for one second after each request while it works through race_id_list. Reducing the server load this way is basic etiquette.
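import time

import pandas as pd
from tqdm.notebook import tqdm  # progress bar for Jupyter notebooks


def scrape_race_results(race_id_list):
    # Maps each race ID to the first HTML table found on its result page
    race_results = {}
    for race_id in tqdm(race_id_list):
        try:
            url = 'https://db.netkeiba.com/race/' + race_id
            race_results[race_id] = pd.read_html(url)[0]
            time.sleep(1)  # wait one second between requests
        except IndexError:
            # No result table for this ID: skip it
            continue
        except:
            # Any other error (or a manual interrupt): stop here
            # and return what has been collected so far
            break
    return race_results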
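The function above returns a dict keyed by race ID, while the saving step later works on a single table. One way to bridge the two is sketched below. The helper name combine_race_results and the demo tables are hypothetical, made up for illustration, and not part of the original procedure.

```python
import pandas as pd


def combine_race_results(race_results):
    """Stack per-race tables into one DataFrame indexed by race ID."""
    frames = []
    for race_id, table in race_results.items():
        table = table.copy()
        # Label every row of this race with its race ID
        table.index = [race_id] * len(table)
        frames.append(table)
    return pd.concat(frames)


# Tiny stand-in tables with made-up values instead of real scraped data
demo = {
    "202009020611": pd.DataFrame({"rank": [1, 2], "horse": ["A", "B"]}),
    "202009020612": pd.DataFrame({"rank": [1], "horse": ["C"]}),
}
results_new = combine_race_results(demo)
print(results_new)
```

Keeping the race ID as the index means every row can still be traced back to its race after the tables are merged.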
Put the ID of the race you want to check in race_id. For example, suppose the ID is 202009020611.
This ID breaks down as follows:
2020 → year
09 → venue (09 is Hanshin, 10 is Kokura, and so on)
02 → meeting number at that venue (not the calendar month)
06 → day of that meeting
11 → race number
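The ID layout can be sketched in code as follows. This is a minimal sketch that assumes the 12-digit layout of year, venue, meeting number, day of the meeting, and race number. The function names parse_race_id and build_race_id_list and the field names are hypothetical, and the ranges passed in are up to you.

```python
from typing import Dict, Iterable, List


def parse_race_id(race_id: str) -> Dict[str, str]:
    """Split a 12-digit race ID into its assumed components."""
    if len(race_id) != 12 or not race_id.isdigit():
        raise ValueError("race_id must be a 12-digit string")
    return {
        "year": race_id[0:4],
        "venue": race_id[4:6],
        "meeting": race_id[6:8],
        "day": race_id[8:10],
        "race": race_id[10:12],
    }


def build_race_id_list(
    year: int,
    venues: Iterable[int],
    meetings: Iterable[int],
    days: Iterable[int],
    races: Iterable[int],
) -> List[str]:
    """Generate candidate race IDs by combining zero-padded components."""
    # Materialize the iterables so they can be looped over repeatedly
    venues, meetings, days, races = map(list, (venues, meetings, days, races))
    return [
        f"{year:04d}{v:02d}{m:02d}{d:02d}{r:02d}"
        for v in venues
        for m in meetings
        for d in days
        for r in races
    ]


print(parse_race_id("202009020611"))
race_id_list = build_race_id_list(2020, [9], [2], [6], range(1, 13))
print(race_id_list[10])
```

Not every generated combination corresponds to a race that was actually held, so the scraping loop needs to tolerate IDs that have no result page.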
As a trial, you can fetch a single race this way and look at the resulting table.
We will analyze the data using basic pandas. For peace of mind, save it as both a pickle file and a CSV file.
Assuming that the acquired data is stored in results_new, this looks as follows.
results_new.to_pickle('results_new2017-2020')
results_new.to_csv('results_new2017-2020.csv',encoding="SHIFT-JIS")
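To check that the saved files can be loaded again, a round trip like the one below may help. It is a sketch only: results_new here is a small stand-in table with made-up values and Japanese column names, and the files are written to a temporary directory rather than to your working directory.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped table; the values are made up
results_new = pd.DataFrame(
    {"着順": [1, 2], "馬名": ["ウマA", "ウマB"]},
    index=["202009020611", "202009020611"],
)

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "results_new2017-2020")
    csv_path = os.path.join(tmp, "results_new2017-2020.csv")

    # Same calls as in the text, pointed at the temporary directory
    results_new.to_pickle(pickle_path)
    results_new.to_csv(csv_path, encoding="SHIFT-JIS")

    # The pickle restores the DataFrame as it was, including dtypes
    from_pickle = pd.read_pickle(pickle_path)

    # The CSV must be read with the same encoding it was written in
    from_csv = pd.read_csv(csv_path, encoding="SHIFT-JIS", index_col=0)
    # read_csv parses the race IDs as integers; turn them back into strings
    from_csv.index = from_csv.index.astype(str)

print(from_pickle.equals(results_new))
print(from_csv)
```

The pickle is the convenient format for continuing the analysis in pandas, while the CSV is handy for opening the data in other tools.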
That is a brief summary of how to acquire the data.