[PYTHON] I tried crawling and scraping a horse racing site Part 2

Introduction

"Python Crawling & Scraping [Augmented and Revised Edition] -Practical Development Guide for Data Collection and Analysis-" Create an original program using the knowledge up to Chapter 3.

This time, I wrote a program that collects the URLs of individual horse pages from the search results of netkeiba's racehorse search, visits each URL, extracts each horse's information, and saves it in a database.

Search result page structure

netkeiba has a search box at the top of every page, so I will skip its description. The advanced search form lets you specify conditions such as pedigree, coat color, and sort order. This time, instead of using the search form, I jumped to the search result page from the breeding record column of Deep Impact's detail page by clicking the link circled in red (the reason relates to JavaScript, described later).

First, here is the search result page. Results are sorted by prize money by default. The pages of the horses linked from these results are what I will scrape.

Information such as horse name, stable, pedigree, and owner can already be read from this search result page, but first let's concentrate on collecting the URL of each horse's detail page. Each horse name links to that horse's detail page, whose URL ends in a 10-digit ID: the year of birth followed by 6 digits.

In the URL parameters of the search result page, the number "2002100816" from the end of Deep Impact's detail-page URL is passed as sire_id, narrowing the results to horses sired by Deep Impact. (The heading reads "Search results"; if you search by horse name instead, it reads "Search results for (horse name)".)

This time, I want to fetch not only the first result page but also the second and subsequent pages reached by clicking "Next". However, the link there looks like this:

<a href="javascript:paging('2')">Next</a>

When you click it, the next page is displayed while the URL stays https://db.netkeiba.com/, so you can tell the screen transition is performed with a POST. (The same applies to result pages reached via the advanced search form.) The actual paging function is:

function paging(page)
{
    document.sort.page.value = page;
    document.sort.submit();
}

You can see that the form is submitted with JavaScript. If you open the browser's developer tools and click "Next", you can observe the values that get POSTed. Since those are ordinary form fields, you can instead append the page parameter to the search-result URL, db.netkeiba.com/?pid=horse_list&_sire_id=2002100816&page=(page number), and fetch the second and subsequent pages with a plain GET.
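As a minimal sketch of this GET-based workaround with requests (the parameter names are taken from the observed URL; I'm assuming page numbering starts at 1):

import requests

BASE = "https://db.netkeiba.com/"

def fetch_result_page(session, sire_id, page):
    # Fetch one search-result page by GET instead of the site's POST paging
    params = {"pid": "horse_list", "_sire_id": sire_id, "page": page}
    response = session.get(BASE, params=params)
    response.encoding = response.apparent_encoding  # use the guessed encoding
    return response

session = requests.Session()
page2 = fetch_result_page(session, "2002100816", 2)  # second page of Deep Impact's offspring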

Detail page structure

This time, I parse the horse name, active/retired status, sex, and coat color at the top of the main content of the detail page, the table containing information such as date of birth, and the pedigree table below it, and save the results in the database.

(Screenshots: the detail pages of two Deep Impact offspring on db.netkeiba.com)

These are the detail pages of the top two Deep Impact offspring by prize money at the time of writing. Notice that the tables showing data such as date of birth have different numbers of rows. On netkeiba, a "recruitment information" row is added to the pages of horses owned by so-called one-share ownership clubs, so their tables have one more row than those of non-club horses; keep this in mind when scraping.

In addition, horses in regional (NAR) racing and foreign-bred horses carry marks such as □地 and ○外 attached to the horse name (though this barely applies to our target, Deep Impact's offspring), so I strip these marks and extract only the horse name itself. Also, as the first screenshot below shows, □地 horses (regional racing) have no active/retired indication, which also needs handling when scraping. (Screenshots: detail pages of regional-racing horses on db.netkeiba.com)
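Before the full listing, here is a minimal sketch of the branching just described (the katakana pattern and the three-field status line are assumptions based on the pages shown above; the completed code below does the same thing inside scrape_horse_page):

import re

def parse_title_block(name_text, status_text):
    # Keep only the katakana run, dropping marks such as □地 / ○外 (assumed format)
    name = re.search(r'[\u30A1-\u30FF]+', name_text).group()
    parts = status_text.split()
    if len(parts) == 3:
        status, gender, color = parts  # status (active/retired), sex, coat color
    else:
        status = None  # regional-racing horses have no status field
        gender, color = parts
    return name, status, gender, color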

Completed code

keiba_scraping.py


import requests
import lxml.html
import time
from pymongo import MongoClient
import re
import sys

def main(sire_id, n):
    client = MongoClient('localhost', 27017) # Connect to MongoDB on localhost
    collection = client.scraping.horse_data # horse_data collection in the scraping database (created if absent)
    collection.create_index('key', unique=True) # Unique index on the key field that uniquely identifies each horse

    session = requests.Session()

    for i in range(n):
        response = session.get("https://db.netkeiba.com/?pid=horse_list&_sire_id=" + sire_id + "&page=" + str(i))
        response.encoding = response.apparent_encoding # Switch to the encoding guessed by apparent_encoding
        urls = scrape_list_page(response) # Get the detail-page URLs from the result page
        for url in urls:
            key = extract_key(url) # The number at the end of the URL serves as the key
            h = collection.find_one({'key': key}) # Look up the key in the DB
            if not h: # Only fetch pages not yet stored in the DB
                time.sleep(1) # Wait one second per request to reduce load on the site
                response = session.get(url) # Fetch the detail page
                horse = scrape_horse_page(response) # Scrape the detail page
                collection.insert_one(horse) # Save the horse's data in the DB

def scrape_list_page(response):#Generator function to extract URL of detail page
    html = lxml.html.fromstring(response.text)
    html.make_links_absolute(response.url)
    for a in html.cssselect('#contents_liquid > div > form > table > tr > td.xml.txt_l > a'):
        url = a.get("href")
        yield url

def scrape_horse_page(response): # Parse a detail page
    response.encoding = response.apparent_encoding # Specify the encoding
    html = lxml.html.fromstring(response.text)

    # Get the name, status (active/retired), sex, and coat color from the top of the page
    for title in html.cssselect('#db_main_box > div.db_head.fc > div.db_head_name.fc > div.horse_title'):
        name = parse_name(title.cssselect('h1')[0].text.strip()) # Strip extra whitespace and pass the name to parse_name
        # Status, sex, and coat color come as one space-separated string, so split it up
        data = title.cssselect('p.txt_01')[0].text.split()
        if len(data) > 2:
            status, gender, color = data
        else:
            gender, color = data # Regional-racing horses have no active/retired field
            status = None

    # Get the sire, dam, and broodmare sire from the pedigree table
    for bloodline in html.cssselect('#db_main_box > div.db_main_deta > div > div.db_prof_area_02 > div > dl > dd > table'):
        sire = bloodline.cssselect('tr:nth-child(1) > td:nth-child(1) > a')[0].text
        dam = bloodline.cssselect('tr:nth-child(3) > td.b_fml > a')[0].text
        broodmare_sire = bloodline.cssselect('tr:nth-child(3) > td.b_ml > a')[0].text

    club_info = html.cssselect('#owner_info_td > a') # Link showing the club offer price (present only for club horses)
    for data in html.cssselect('#db_main_box > div.db_main_deta > div > div.db_prof_area_02 > table'):
        birthday = data.cssselect('tr:nth-child(1) > td')[0].text # Date of birth (kept as text)
        trainer = data.cssselect('tr:nth-child(2) > td > a')[0].text # Trainer
        owner = data.cssselect('tr:nth-child(3) > td > a')[0].text # Owner
        # On club-horse pages the rows from the breeder down shift by one, so branch on club_info
        if len(club_info) > 0:
            breeder = data.cssselect('tr:nth-child(5) > td > a')[0].text # Breeder
            prize_money = data.cssselect('tr:nth-child(8) > td')[0].text.strip().replace(' ','') # strip trims the ends, replace removes inner spaces
        else:
            breeder = data.cssselect('tr:nth-child(4) > td > a')[0].text
            prize_money = data.cssselect('tr:nth-child(7) > td')[0].text.strip().replace(' ','')


    horse = {
        'url': response.url,
        'key': extract_key(response.url),
        'name':name,
        'status': status,
        'gender':gender,
        'color':color,
        'birthday':birthday,
        'sire':sire, # father
        'dam':dam, # mother
        'broodmare_sire':broodmare_sire, # maternal grandfather
        'owner':owner,
        'breeder':breeder,
        'trainer':trainer,
        'prize_money' : prize_money
    }

    return horse

def extract_key(url):
    m = re.search(r'\d{10}', url).group() # Extract the 10-digit ID at the end of the URL with a regular expression
    return m

def parse_name(name):
    m = re.search(r'[\u30A1-\u30FF]+', name).group() # Keep only the katakana horse name, dropping marks such as □地 and ○外
    return m

if __name__ == "__main__":
    main(sys.argv[1], int(sys.argv[2])) # Command-line args: the sire ID (the number at the end of the horse's URL) and the number of result pages to fetch

Run it like this:

python keiba_scraping.py 2002100816 4

This runs the scraper over the list of Deep Impact's offspring, fetching four pages. Displaying the whole horse_data collection from the mongo shell... Kita━━━━━━(゜∀゜)━━━━━━!!!!! It's all there. You can also narrow things down by specifying various query conditions.
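As a hedged sketch of such a query from Python with pymongo (the field values '牝' and '抹消' are assumptions about what the scraped pages actually contain):

from pymongo import MongoClient

collection = MongoClient('localhost', 27017).scraping.horse_data

# Retired fillies and mares by Deep Impact (prize_money is stored as text)
for horse in collection.find({'gender': '牝', 'status': '抹消'}):
    print(horse['name'], horse['prize_money'])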

And with that, it's complete for the time being.

In conclusion

It is not a big program, but posting this article took a long time for motivation and time reasons. As for features I would like to add, one idea is recursive crawling: from Deep Impact's offspring to those offspring's own offspring. I would be glad if this serves as a reference or if you simply find it interesting. If anything is unclear or looks wrong, please leave a comment.
