Introduction

I was interested in horse racing prediction as a data analysis theme, so I gave it a try.

The site I referred to is here.

Data analysis / machine learning starting with horse racing prediction

Steps toward horse racing prediction

If you want to build a predictive model from scratch, you need to take the following steps:

1. Scrape data from a horse racing site
2. Preprocess the data
   - Decide what you want to predict and which explanatory variables (features) to use
   - Convert strings to numeric types
   - Convert categorical variables to dummy variables
3. Split the data into training data and test data
4. Build the model
   - Check the model trained on the training data against the test data
   - Optimize the model if it overfits
5. Predict the race you want to predict
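To make the preprocessing and splitting steps more concrete, here is a minimal sketch using only pandas. The table, its column names (finish, weight, sex) and the top3 target are all made up for illustration and are not taken from the scraped data, so treat this as one possible shape of the steps rather than the actual pipeline.

```python
import pandas as pd

# Hypothetical miniature of a results table; every column name and value
# here is an assumption made for illustration only.
raw = pd.DataFrame({
    "finish": ["1", "2", "3", "4"],              # finishing position, as text
    "weight": ["57.0", "55.0", "54.0", "57.0"],  # weight carried, as text
    "sex": ["M", "F", "F", "M"],                 # categorical column
})

df = raw.copy()

# Convert strings to numeric types; unparseable values become NaN.
df["finish"] = pd.to_numeric(df["finish"], errors="coerce")
df["weight"] = pd.to_numeric(df["weight"], errors="coerce")

# One possible target: did the horse finish in the top three?
df["top3"] = (df["finish"] <= 3).astype(int)

# Turn the categorical column into dummy variables (sex_F, sex_M).
df = pd.get_dummies(df, columns=["sex"])

# Split by position: the first 75% of rows for training, the rest for testing.
n_train = int(len(df) * 0.75)
train, test = df.iloc[:n_train], df.iloc[n_train:]

print(train)
print(test)
```

Splitting by position instead of at random keeps the rows in their original order, which may matter if the rows are sorted by race date and you want to test on later races than you trained on.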

This post briefly covers step 1, scraping.

About scraping

I scraped the data from netkeiba.com.

Important point

Retrieving a large amount of data at once puts a load on the server. Inserting time.sleep(1) makes the loop wait one second between requests as it works through race_id_list. Keeping the server load down this way is basic etiquette.

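import time

import pandas as pd
from tqdm.notebook import tqdm  # progress bar for Jupyter notebooks


def scrape_race_results(race_id_list):
    """Return a dict that maps each race ID to its results table (a DataFrame)."""
    race_results = {}
    for race_id in tqdm(race_id_list):
        try:
            time.sleep(1)  # wait one second before every request to reduce server load
            url = 'https://db.netkeiba.com/race/' + race_id
            # read_html returns a list of the tables on the page; take the first one
            race_results[race_id] = pd.read_html(url)[0]
        except IndexError:
            continue  # skip this race ID and move on to the next
        except:
            break  # any other error or a manual interrupt: stop and keep what was collected
    return race_results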

Pass the IDs of the races you want to fetch in race_id_list. For example, take the ID 202009020611. It breaks down as follows:

2020 → year
09 → venue (09 is Hanshin, 10 is Kokura, and so on)
02 → meeting number (the 2nd meeting held at that venue that year)
06 → day of the meeting (the 6th day)
11 → race number
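The fixed-width layout of the ID can be sketched in code as follows. This is a minimal illustration, not part of the original scraper: RaceId, parse_race_id and build_race_ids are hypothetical names, and the field labels reflect my understanding that the third and fourth pairs of digits are the meeting number and the day within that meeting.

```python
from itertools import product
from typing import Iterable, List, NamedTuple


class RaceId(NamedTuple):
    """Fields of a 12-digit race ID (labels are illustrative assumptions)."""
    year: int
    venue: str    # two-digit venue code, kept as text (e.g. "09")
    meeting: int  # which meeting at that venue in the year
    day: int      # which day of that meeting
    race: int     # race number on that day


def parse_race_id(race_id: str) -> RaceId:
    """Split a race ID into its five fixed-width fields."""
    if len(race_id) != 12 or not (race_id.isascii() and race_id.isdigit()):
        raise ValueError(f"expected 12 digits, got {race_id!r}")
    return RaceId(
        year=int(race_id[0:4]),
        venue=race_id[4:6],
        meeting=int(race_id[6:8]),
        day=int(race_id[8:10]),
        race=int(race_id[10:12]),
    )


def build_race_ids(
    years: Iterable[int],
    venues: Iterable[str],
    meetings: Iterable[int],
    days: Iterable[int],
    races: Iterable[int],
) -> List[str]:
    """Build every ID for the given combinations, zero-padded to 12 digits."""
    return [
        f"{y:04d}{v}{m:02d}{d:02d}{r:02d}"
        for y, v, m, d, r in product(years, venues, meetings, days, races)
    ]


print(parse_race_id("202009020611"))
ids = build_race_ids([2020], ["09"], range(1, 3), range(1, 3), range(1, 13))
print(len(ids), ids[0], ids[-1])
```

A list built this way could be passed as race_id_list. The ranges in the usage lines are arbitrary, and not every generated ID will correspond to a race that actually took place.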

As a trial, you can pass a single ID such as this one and inspect the table that comes back.

The data will be analyzed with basic pandas. To be safe, save it as both a pickle file and a CSV.

Assuming the acquired data is stored in a DataFrame named results_new, saving it looks like this.

results_new.to_pickle('results_new2017-2020')
results_new.to_csv('results_new2017-2020.csv',encoding="SHIFT-JIS")
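Since scrape_race_results returns a dict of tables while to_pickle and to_csv are DataFrame methods, the tables need to be combined first. The sketch below shows one way to do that and to read both files back. The two small tables, their Japanese column names and the file names are stand-ins I made up for illustration, and labeling each row with its race ID is my own choice.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the dict returned by scrape_race_results; the column names
# and values are hypothetical, chosen only to include Japanese text.
race_results = {
    "202009020611": pd.DataFrame({"着順": [1, 2], "馬名": ["A", "B"]}),
    "202009020612": pd.DataFrame({"着順": [1, 2], "馬名": ["C", "D"]}),
}

# Label every row with its race ID, then stack the tables into one DataFrame.
frames = []
for race_id, table in race_results.items():
    table = table.copy()
    table.index = [race_id] * len(table)
    frames.append(table)
results_new = pd.concat(frames)
results_new.index.name = "race_id"

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, "results_new.pickle")
    csv_path = os.path.join(tmp, "results_new.csv")

    # Pickle preserves dtypes and the index exactly.
    results_new.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

    # CSV is plain text: pass the same encoding when reading, and read the
    # race ID as a string so it is not converted to an integer.
    results_new.to_csv(csv_path, encoding="shift_jis")
    from_csv = pd.read_csv(
        csv_path, encoding="shift_jis", dtype={"race_id": str}
    ).set_index("race_id")

print(from_pickle)
print(from_csv)
```

The pickle is the more convenient file to reload for further analysis in pandas, while the CSV can be opened in other tools as long as they read it with the same encoding.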

Conclusion

This post briefly summarized how to acquire the data.

