[Python] Practical Beautiful Soup ~ Scraping the trifecta (3-rentan) odds table on the official Kyotei (boat race) website ~

Purpose

Collecting data is a chore, isn't it? I want to analyze boat races in the future, so as practice in collecting data, I scrape the trifecta (3-rentan) odds table on the official boat race website (https://www.boatrace.jp/owpc/pc/race/odds3t?rno=12&jcd=02&hd=20200511).

Overview

- I can't think of a language suited to scraping other than Python, so I use Python 3.7.
- Beautiful Soup ~~the name sounds a little risqué~~, a Python library, looks handy for scraping.
- With Beautiful Soup's CSS selectors, you can specify the location of the table without deciphering the HTML piece by piece!
- Copy the CSS selector using the developer tools built into the browser (easy).
- Since this is hands-on Beautiful Soup practice, I won't explain each method in detail (there are plenty of other good articles!).
- This time, the goal is to pull the values out of the trifecta table and put them into a dictionary.

What you want to scrape and how to output

[Image: the trifecta odds table] ↑ This is the trifecta odds table! If you like gambling, you've probably seen it! I want to take the values from this table, put them into a Python dictionary, and look them up as follows!
print('sample1:', three_rentan_odds_dict['1']['2']['3'])
print('sample2:', three_rentan_odds_dict['6']['5']['4'])

# output:
# sample1: 47.2
# sample2: 285.7

A list would work too, of course, but with a dictionary you can look up a value directly by boat numbers without caring about its position in an array, so please don't nitpick this choice!

Advance preparation

I develop with Python 3.7, but any Python 3.x should work! It may be easier to copy and paste the code into Jupyter or something similar!

Packages to install

Install with pip

#For pip (urllib.request is part of the standard library, so it needs no install)
pip install beautifulsoup4 numpy

#For pipenv
pipenv install beautifulsoup4 numpy

What you must never do

Putting load on the target server

Intentionally or not, **never put load on the target server**. People have been arrested for it. Specifically, copying the source code shown here as is and running it once is fine, but **if you try to scrape the entire race schedule in a for loop, you will increase the load on the server and may cause trouble, so please do not do this**. Scraping for the purpose of data analysis does not in itself appear to be illegal.

Implementation

Fetch the HTML by specifying the URL

from urllib.request import urlopen
from bs4 import BeautifulSoup
#Using the trifecta odds table of the 12th race at Toda Racecourse on May 11, 2020 as an example
target_url = \
    'https://www.boatrace.jp/owpc/pc/race/odds3t?rno=12&jcd=02&hd=20200511'
#load html
html_content = urlopen(target_url).read()
print(type(html_content))

# output
# <class 'bytes'>
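
The example URL carries its target in three query parameters. Here is a minimal sketch of assembling it, assuming (from this one example URL, not from any official documentation) that rno is the race number, jcd is the venue code (02 for Toda), and hd is the date as YYYYMMDD. The helper name build_odds3t_url is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.boatrace.jp/owpc/pc/race/odds3t'


def build_odds3t_url(race_no, venue_code, date_yyyymmdd):
    """Hypothetical helper: assemble the odds page URL from its query parameters.

    Assumed meanings (inferred from the example URL only):
    rno = race number, jcd = venue code, hd = date as YYYYMMDD.
    """
    query = urlencode({
        'rno': race_no,
        'jcd': f'{venue_code:02d}',  # zero-padded, e.g. 2 -> '02'
        'hd': date_yyyymmdd,
    })
    return f'{BASE_URL}?{query}'


print(build_odds3t_url(12, 2, '20200511'))
# https://www.boatrace.jp/owpc/pc/race/odds3t?rno=12&jcd=02&hd=20200511
```

Build one URL at a time. As warned earlier, looping this over the whole schedule would put load on the server.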

Load the HTML into Beautiful Soup for scraping

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
print(type(soup))

# output
# <class 'bs4.BeautifulSoup'>

Use the select method provided by Beautiful Soup

With the select method, you can extract an element by specifying the location of its HTML tag with a CSS selector. This time I want to pull out the trifecta odds table, so I use the browser's developer tools to get the CSS selector of the target part. Specifically:

  1. Move the cursor over the table, right-click, and open **Inspect** (F12 should also work, but I use the mouse because F12 launches tilix on my machine).
  2. As you move the mouse over the HTML in the developer tools, the corresponding part of the page is highlighted, so find the element that highlights the table.
  3. Right-click that element and select Copy -> Copy selector.
  4. Prepare a variable named target_table_selector and paste the selector into it.
#Paste the copied css selector
target_table_selector = \
    'body > main > div > div > div > '\
    'div.contentsFrame1_inner > div:nth-child(6) > table'

#Fetch the html matched by the selector with the select_one method
odds_table = soup.select_one(target_table_selector)
print(type(odds_table))

# output:
# <class 'bs4.element.Tag'>
#Running print(odds_table) displays the html of only the specified table part.
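
A copied selector is tied to the exact layout of the page, so it is worth knowing what happens when it stops matching. The sketch below runs on a made-up miniature page (not the real site's markup) and shows that select_one returns None when nothing matches, while select returns an empty list.

```python
from bs4 import BeautifulSoup

# Made-up miniature page, only for illustration (not the real site's markup)
sample_html = '''
<main>
  <div class="contentsFrame1_inner">
    <div><table><tbody><tr><td class="oddsPoint">47.2</td></tr></tbody></table></div>
  </div>
</main>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# select_one returns the first matching Tag, or None when nothing matches
found = sample_soup.select_one('div.contentsFrame1_inner table')
missing = sample_soup.select_one('div.noSuchClass table')
print(type(found).__name__)  # Tag
print(missing)               # None

# select always returns a list-like result (empty when nothing matches)
cells = sample_soup.select('td.oddsPoint')
print([cell.text for cell in cells])  # ['47.2']
```

If the site's layout changes, odds_table would be None, and the next call on it would raise AttributeError. Checking for None right after select_one gives a clearer error.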

Extract elements from the odds table

Looking at the browser's developer tools from earlier, only the `'tbody'` part of `odds_table` is needed to extract the values, so specify it with `select_one`. Then, to store each row as a list item, use `select` to specify the `'tr'` part and get a list.

#specification of tbody
odds_table_elements = odds_table.select_one('tbody')

#Specify tr and store as a list
row_list = odds_table_elements.select('tr')
print(len(row_list))

# output:
# 20 (matches the number of rows in the table)
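
The 20 rows are no accident. With six boats, each first-place boat leaves 5 choices for second and 4 for third, so 5 x 4 = 20 orders per first boat and 6 x 20 = 120 in total. Here is a quick check of that arithmetic, assuming six boats per race as in the loops later in this article:

```python
from itertools import permutations

boats = range(1, 7)  # boat numbers 1 to 6
all_orders = list(permutations(boats, 3))
print(len(all_orders))  # 120

# Orders that start with boat 1: one table column's worth
per_first = sum(1 for order in all_orders if order[0] == 1)
print(per_first)  # 20
```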

Next, looking at the tags that hold the odds values, we can see that they are td tags with the class `oddsPoint`. Since we want to extract these for each row, we create a function first.

#Processing to be performed for each row
def getoddsPoint2floatlist(odds_tr):
    #Get the list of td tags in which the odds values are stored
    html_list = odds_tr.select('td.oddsPoint')
    # print(html_list[0])
    # example output:
    # <td class="oddsPoint">47.2</td>
    #Using .text extracts only the text enclosed by the tags
    text_list = [x.text for x in html_list]
    # print(text_list)
    # example output:
    # ['47.2', '60.3', '588.7', '52.8', '66.0', '248.7']
    #Odds are decimal numbers, so cast to float type
    float_list = [float(x) for x in text_list]
    return float_list
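
One caveat: float() raises ValueError on anything that is not a number. I have not confirmed what the site shows in a cell when odds are unavailable (for example, when a boat is scratched), so the following is only a sketch under the assumption that such a cell could hold non-numeric text. The helper name odds_text_to_float is hypothetical.

```python
def odds_text_to_float(text):
    """Hypothetical helper: convert one cell's text to float, or None if it is not a number."""
    try:
        return float(text.strip())
    except ValueError:
        # Assumed case: the cell holds a label or is empty instead of odds
        return None


print([odds_text_to_float(t) for t in ['47.2', ' 1276 ', '欠場', '']])
# [47.2, 1276.0, None, None]
```

Keeping None in place of a missing value preserves the position of each cell, which matters later when positions are mapped to boat numbers.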

Use the map function to apply this to every row, producing a matrix that holds only the odds values of the whole table

odds_matrix = list(map(
    lambda x: getoddsPoint2floatlist(x),
    row_list
))

print(odds_matrix)

# output
# [[47.2, 60.3, 588.7, 52.8, 66.0, 248.7],
#  [14.7, 13.3, 994.9, 361.6, 363.8, 1276.0],
#  [12.0, 11.1, 747.7, 67.1, 137.8, 503.6],
#  [26.7, 26.6, 1155.0, 96.5, 123.7, 414.5],
#  [157.0, 188.8, 566.8, 50.4, 64.3, 241.5],
#  [242.2, 215.7, 660.5, 261.5, 314.5, 1037.0],
#  [237.5, 190.8, 561.6, 36.4, 66.8, 183.4],
#  [403.5, 281.1, 926.8, 49.2, 73.1, 183.6],
#  [35.0, 25.4, 1276.0, 750.0, 930.3, 2462.0],
#  [219.2, 152.2, 959.6, 517.5, 799.1, 1950.0],
#  [59.6, 23.6, 963.4, 650.0, 1139.0, 1779.0],
#  [89.4, 38.4, 1433.0, 639.7, 1237.0, 2321.0],
#  [34.6, 23.8, 1019.0, 63.9, 119.7, 387.5],
#  [212.5, 143.8, 752.3, 36.9, 64.1, 174.3],
#  [76.3, 30.5, 1231.0, 270.8, 452.2, 952.1],
#  [79.6, 35.8, 1614.0, 44.9, 84.1, 244.4],
#  [83.7, 90.6, 2031.0, 110.1, 171.1, 391.8],
#  [356.3, 308.5, 1552.0, 63.2, 103.9, 201.7],
#  [159.7, 77.7, 1408.0, 326.7, 560.3, 1346.0],
#  [136.0, 69.0, 1562.0, 71.4, 148.1, 285.7]]

**This completes the scraping!!**

Bonus: Store in dictionary type

This is not an essential part of scraping, so I will omit detailed explanations.

import numpy as np
#Convert to a numpy array (20 rows x 6 columns)
odds_matrix = np.array(odds_matrix)
#Transpose so the values are read column by column, then flatten into a plain list
odds_list = odds_matrix.T.reshape(-1).tolist()

#Store in dictionary
three_rentan_odds_dict = {}
for fst in range(1, 7):
    three_rentan_odds_dict[str(fst)] = {}
    for snd in range(1, 7):
        if snd != fst:
            three_rentan_odds_dict[str(fst)][str(snd)] = {}
            for trd in range(1, 7):
                if trd != fst and trd != snd:
                    #Values come out in the order 1-2-3, 1-2-4, ..., 6-5-4
                    three_rentan_odds_dict[str(fst)][str(snd)][str(trd)] = \
                        odds_list.pop(0)

print('sample1:', three_rentan_odds_dict['1']['2']['3'])
print('sample2:', three_rentan_odds_dict['6']['5']['4'])

# output:
# sample1: 47.2
# sample2: 285.7

Complete
