[PYTHON] Scraping using lxml and saving to MySQL

As the title suggests, this post is about saving information scraped with requests and lxml into MySQL. Since there were few sites that explained the whole flow end to end, I am recording it here as a memo. Obviously, scraping puts load on the target server, so you need to insert delays and check robots.txt.
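
As a side note, robots.txt can be checked programmatically with the standard library. A minimal sketch (not part of the script in this post), using the URL scraped below:

import urllib.robotparser  #Standard library robots.txt parser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://news.yahoo.co.jp/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://news.yahoo.co.jp/list/?c=computer"))  #True if crawling this URL is allowed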

environment

Execution environment

Library used

- requests 2.13.0 -- HTTP library
- lxml 3.7.3 -- XML library
- PyMySQL 0.7.10 -- MySQL driver
- fake-useragent 0.1.5 -- request header generation library

Other libraries commonly used for scraping include urllib, htmllib, BeautifulSoup, and Scrapy, but this time I went with the configuration above. Since fake-useragent is only available via pip, I used a pip environment this time.

Scraping as an example

The example scrapes the article title, the URL of the link to each individual article, the category name, and the article creation date from the IT category of Yahoo! News (http://news.yahoo.co.jp/list/?c=computer).

(Image: 1.tiff)

As for the page structure, each index page contains 20 individual articles, and there are 714 index pages in total. An index page URL is http://news.yahoo.co.jp/list/?c=computer&p= followed by the page number, e.g. http://news.yahoo.co.jp/list/?c=computer&p=1. Each individual article falls into one of three types: articles with a thumbnail, articles without a thumbnail, and deleted articles.

Article example with a thumbnail (image: 2.tiff)

Article example without a thumbnail (image: 3.tiff)

Deleted article example (image: 4.tiff)

How to get an XPath

lxml uses XPath expressions to specify the part you want to parse. To get the XPath of the information you want, use Chrome's developer tools: open the index page in Chrome and choose Inspect from the right-click menu, then expand the HTML tree while watching the highlights on the page. When you reach the information you want to scrape, right-click it and choose Copy > Copy XPath to get its XPath.

(Image: 5.tiff)

XPath examples obtained for each type of individual article

- Example article with thumbnail (1st from the top)
  - Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[2]
  - Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a
  - Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[3]/span[1]
  - Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[1]/a/span[3]/span[2]
- Example article with thumbnail (20th from the top)
  - Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[2]
  - Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a
  - Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[3]/span[1]
  - Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[20]/a/span[3]/span[2]
- Example article without thumbnail (8th from the top)
  - Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[1]
  - Link URL: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a
  - Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[2]/span[1]
  - Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[8]/a/span[2]/span[2]
- Example of a deleted article (9th from the top)
  - Article title: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[1]
  - Link: none
  - Category: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[2]/span[1]
  - Article creation date: //*[@id="main"]/div[1]/div[1]/div[4]/ul/li[9]/div/span[2]/span[2]

Consider the regularity of the XPath for each piece of information from the XPath samples collected for each article type:

- The position of the individual article is specified by li[num], where num is its position from the top.
- The span[num] parts immediately after /a change from 2, 3, 3 (with a thumbnail) to 1, 2, 2 (without a thumbnail).
- For deleted articles, the /a part changes to /div.

Based on these observations, we generate the actual XPaths and pass them to lxml.
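
A minimal sketch of this idea (the Xpath class later in this post does the same thing for all four items):

def title_xpath(article_num, has_thumbnail=True):
    #span[2] holds the title when there is a thumbnail, span[1] when there is not
    span_index = 2 if has_thumbnail else 1
    return ("//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li["
            + str(article_num) + "]/a/span[" + str(span_index) + "]")

print(title_xpath(1))         #Title XPath of the 1st article (with thumbnail)
print(title_xpath(8, False))  #Title XPath of the 8th article (without thumbnail)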

The actual code part and how to use each library

requests, fake-useragent

import requests  #Get html by GET request to url
from fake_useragent import UserAgent  #Generate header

def make_scraping_url(page_num):
    scraping_url = "http://news.yahoo.co.jp/list/?c=computer&p=" + str(page_num)
    return scraping_url

counter_200 = 0
counter_404 = 0

def make_html_text(scraping_url):
    global counter_200
    global counter_404

    ua = UserAgent()  # fakeuser-agent object
    ran_header = {"User-Agent": ua.random}  #Generate a random header each time it is called
    html = requests.get(scraping_url, headers=ran_header)  #HTML object obtained by requests
    html_text = html.text

    if html.status_code == 200:
        counter_200 += 1
    elif html.status_code == 404:
        counter_404 += 1

    return html_text

The request header passed to requests is randomly generated with fake_useragent. counter_200 is a global counter of successful GET requests and counter_404 counts failed requests; they are used later for throttling and for the termination check.
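
A quick usage sketch of the two functions above (page 1 is just an example value):

url = make_scraping_url(1)       # "http://news.yahoo.co.jp/list/?c=computer&p=1"
html_text = make_html_text(url)  #Fetch the page with a random User-Agent header
print(len(html_text), counter_200, counter_404)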

lxml

import lxml.html  #xml parser

#Generate a DOM (Document Object Model) by parsing the html text
def make_dom(html_text):
    dom = lxml.html.fromstring(html_text)
    return dom

#Takes dom and xpath and returns text
def make_text_from_xpath(dom, xpath):
    text = dom.xpath(xpath)[0].text
    return text

#Takes dom and xpath and returns link
def make_link_from_xpath(dom, xpath):
    link = dom.xpath(xpath)[0].attrib["href"]
    return link

#For when the text you want is returned directly by text() in the xpath
def make_text_from_xpath_function(dom, xpath):
    text = dom.xpath(xpath)[1]
    return text

You can create a DOM (Document Object Model) by passing the HTML text to lxml, and you can then get each piece of information by passing an XPath to this object. How the XPath has to be written differs depending on the kind of information, so it is worth checking the XPath syntax each time. dom.xpath() basically returns a list, so you need to check where in the list your information is, for example by looping over it.
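
For example, using one of the XPaths collected earlier, the list behaviour looks like this (a small sketch; dom is assumed to be the DOM of an index page):

results = dom.xpath("//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[1]/a")
print(type(results), len(results))  #A list; len() is 0 when the XPath matches nothing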

Example of a for loop for finding where the information you want is in the returned list

def debug_check_where_link_number(page_num, article_num):
    scraping_url = make_scraping_url(page_num)
    html_text = make_html_text(scraping_url)
    dom = make_dom(html_text)
    xpath = Xpath(article_num)

    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    get_link = make_link_from_xpath(dom, xpath.info_list[1])
    l_get_link = get_link.split("/")
    counter = 0
    for item in l_get_link:
        print(counter)
        counter += 1
        print(item)

thumbnail_link will be explained a little further down.

XPath class

#A class that holds the items to scrape and their XPaths; it is first given the position of the individual article from the top
class Xpath:
    def __init__(self, article_num):
        self.article_num = article_num
        self.info_list = []

    #For articles with thumbnails
    def with_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a", # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[2]"] # timestamp

    #For articles without thumbnails
    def no_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[1]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a", # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[2]"] # timestamp

    #For deleted articles
    def deleted(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[1]",  # title
                          None,  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[2]"]  # timestamp

Create the Xpath class based on the XPath observations made earlier.
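
A short usage sketch (assuming the first article on the page has a thumbnail):

xpath = Xpath(1)
xpath.with_thumbnail()
for info in xpath.info_list:
    print(info)  #Title, link, category and timestamp XPaths, in that order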

Distinguishing article classification

#Determine if there is a thumbnail or link for an individual article
def check_thumbnail_link(dom, article_num):
    #An array showing the presence or absence of thumbnails and links
    thumbnail_link = [True, True]

    #Throw the xpath of the article category without thumbnails and get an error if thumbnails exist
    try:  #Not an error, that is, the thumbnail does not exist
        make_text_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a/span[2]/span[1]")
        thumbnail_link[0] = False
    except IndexError:  #Error, ie thumbnail exists
        pass

    #Throw the xpath of the link of the article with thumbnails, and if the link does not exist, an error will be returned
    try:  #No error, i.e. link exists
        make_link_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a")
    except IndexError:  #Error, that is, link does not exist
        thumbnail_link[1] = False

    return thumbnail_link

The article type is determined by trying an XPath and seeing whether it raises an error. This is quite a brute-force approach, so I am still thinking about a better way.
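
One possible alternative (my suggestion, not part of the original code) is to avoid the exception handling by checking whether each XPath matches anything, since dom.xpath() simply returns an empty list when there is no match:

#Alternative sketch with the same return format as check_thumbnail_link(), without try/except
def check_thumbnail_link_v2(dom, article_num):
    base = "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]"
    has_link = len(dom.xpath(base + "/a")) > 0  #Deleted articles have no <a>
    matches_no_thumbnail = len(dom.xpath(base + "/a/span[2]/span[1]")) > 0  #Matches only the no-thumbnail layout
    return [not matches_no_thumbnail, has_link]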

PyMySQL

Preparation on the MySQL side and frequently used MySQL statements

$ mysql.server start #Start MySQL server
$ mysql -u root -p #Launch MySQL Cursor

mysql> show databases; #Check the database
mysql> create database database_name #Creating a database
mysql> use database_name; #Select database to use
mysql> show tables; #Checking the table in the database
mysql> show columns from table_name; #Check column in table
mysql> select * from table_name; #Show all elements in the table
mysql> delete from table_name; #Delete all elements in the table, use when scraping wrong
mysql> create table table_name(column_name_1 column_type_1, column_name_2 column_type_2, ...); #Creating a table

#Table to be created this time
mysql> create table yahoo_news(page_num int not null,
                               title varchar(255),
                               link_num varchar(255),
                               category varchar(255),
                               article_time varchar(255));

I created a database named scraping and a table named yahoo_news. The reason link_num is varchar rather than int is that if link_num starts with 0, the leading 0 would be lost in an int column.
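
A small illustration of the problem (relying on MySQL's usual string-to-integer conversion):

mysql> select cast('0123456' as unsigned); #Returns 123456, the leading 0 is lost
mysql> select '0123456'; #As a string, the leading 0 is kept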

Processing on the Python side

import pymysql.cursors  #MySQL client

#MySQL connection information
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="hogehoge",
                             db="scraping",
                             charset="utf8")

#Cursor for throwing a query and each statement
cur = connection.cursor()
sql_insert = "INSERT INTO yahoo_news VALUES (%s, %s, %s, %s, %s)"   #Match the number of substitution strings with the number of columns
sql_check = "SELECT * FROM yahoo_news WHERE page_num = %s"

#Check page_num and skip the page if it is already in the database (this runs inside the page loop in the full code below)
cur.execute(sql_check, page_num)
check_num = cur.fetchall()  #Fetch the rows that came back
if check_num:   #Non-empty result, i.e. this page is already in the database
    continue

#Pass information for 20 items to MySQL
cur.executemany(sql_insert, l_all_get_text)  #executemany when passing a list of tuples, execute when passing a single tuple
connection.commit()  #Insert is not done without commit

#Cursor and database termination
cur.close()
connection.close()

Using the substitution string %s makes it easier to build statements. It was surprisingly easy: use execute when passing an int, a str, or a single tuple, and executemany when passing a list of tuples.
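
For example (with made-up values, just to show the difference):

#One row: pass a single tuple to execute
cur.execute(sql_insert, (1, "title_a", "0123456", "IT", "2/20(Mon) 12:00"))
#Several rows: pass a list of tuples to executemany
cur.executemany(sql_insert, [(2, "title_b", "0123457", "IT", "2/20(Mon) 12:05"),
                             (2, "title_c", "0123458", "IT", "2/20(Mon) 12:10")])
connection.commit()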

All code that summarizes each

# coding: utf-8
import requests  #Get html by GET request to url
from fake_useragent import UserAgent  #Generate header
import lxml.html  #xml parser
import pymysql.cursors  #MySQL client
import time  #For delay generation


def make_scraping_url(page_num):
    scraping_url = "http://news.yahoo.co.jp/list/?c=computer&p=" + str(page_num)
    return scraping_url


counter_200 = 0
counter_404 = 0


def make_html_text(scraping_url):
    global counter_200
    global counter_404

    ua = UserAgent()  # fakeuser-agent object
    ran_header = {"User-Agent": ua.random}  #Generate a random header each time it is called
    html = requests.get(scraping_url, headers=ran_header)  #HTML object obtained by requests
    html_text = html.text

    if html.status_code == 200:
        counter_200 += 1
    elif html.status_code == 404:
        counter_404 += 1

    return html_text


#Generate a DOM (Document Object Model) by parsing the html text
def make_dom(html_text):
    dom = lxml.html.fromstring(html_text)
    return dom


#Takes dom and xpath and returns text
def make_text_from_xpath(dom, xpath):
    text = dom.xpath(xpath)[0].text
    return text


#Takes dom and xpath and returns link
def make_link_from_xpath(dom, xpath):
    link = dom.xpath(xpath)[0].attrib["href"]
    return link


#For when the text you want is returned directly by text() in the xpath
def make_text_from_xpath_function(dom, xpath):
    text = dom.xpath(xpath)[1]
    return text


#A class that holds the items to scrape and their XPaths; it is first given the position of the individual article from the top
class Xpath:
    def __init__(self, article_num):
        self.article_num = article_num
        self.info_list = []


    #For articles with thumbnails
    def with_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a", # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[3]/span[2]"] # timestamp

    #For articles without thumbnails
    def no_thumbnail(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[1]",  # title
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a", # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/a/span[2]/span[2]"] # timestamp

    #For deleted articles
    def deleted(self):
        self.info_list = ["//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[1]",  # title
                          None,  # link
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[1]", # category
                          "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(self.article_num) + "]/div/span[2]/span[2]"]  # timestamp


#Determine if there is a thumbnail or link for an individual article
def check_thumbnail_link(dom, article_num):
    #An array showing the presence or absence of thumbnails and links
    thumbnail_link = [True, True]

    #Throw the xpath of the article category without thumbnails and get an error if thumbnails exist
    try:  #Not an error, that is, the thumbnail does not exist
        make_text_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a/span[2]/span[1]")
        thumbnail_link[0] = False
    except IndexError:  #Error, ie thumbnail exists
        pass

    #Throw the xpath of the link of the article with thumbnails, and if the link does not exist, an error will be returned
    try:  #No error, i.e. link exists
        make_link_from_xpath(dom, "//*[@id=\"main\"]/div[1]/div[1]/div[4]/ul/li[" + str(article_num) + "]/a")
    except IndexError:  #Error, that is, link does not exist
        thumbnail_link[1] = False

    return thumbnail_link


#Crawling operation: takes dom, page_num and article_num and returns a tuple of the scraped information
def crawling(dom, page_num, article_num):
    list_get_text = []  #List collecting the information of one article; converted to a tuple later because the MySQL insert takes tuples
    list_get_text.append(page_num)
    #Create an Xpath object
    xpath = Xpath(article_num)

    #Check for a thumbnail and a link and generate info_list
    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    for index, xpath_info in enumerate(xpath.info_list):
        if index == 1 and thumbnail_link[1] is True:    #For link
            get_link = make_link_from_xpath(dom, xpath_info)
            l_get_link = get_link.split("/")
            list_get_text.append(l_get_link[4])  #Which segment of the link to use here was found with debug_check_where_link_number()
        elif index == 1 and thumbnail_link[1] is False:
            list_get_text.append(None)  #None is stored as NULL in MySQL
        else:   #For text
            get_text = make_text_from_xpath(dom, xpath_info)
            list_get_text.append(get_text)

    tuple_get_text = tuple(list_get_text)

    return tuple_get_text


#MySQL connection information
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="hogehoge",
                             db="scraping",
                             charset="utf8")

#Cursor for throwing a query and each statement
cur = connection.cursor()
sql_insert = "INSERT INTO yahoo_news VALUES (%s, %s, %s, %s, %s)"   #Match the number of substitution strings with the number of columns
sql_check = "SELECT * FROM yahoo_news WHERE page_num = %s"


def main():
    global counter_200

    start_page = 1
    end_page = 5
    for page_num in range(start_page, end_page + 1):
        #Check page_num and skip the page if it is already in the database
        cur.execute(sql_check, page_num)
        check_num = cur.fetchall()  #Fetch the rows that came back
        if check_num:   #Non-empty result, i.e. this page is already in the database
            continue

        #End processing
        if counter_404 == 5:
            break

        l_all_get_text = []  #List for collecting information for 20 articles

        #Various generation and access to url are also done here
        scraping_url = make_scraping_url(page_num)
        html_text = make_html_text(scraping_url)
        dom = make_dom(html_text)

        for article_num in range(1, 21):
            tuple_get_text = crawling(dom, page_num, article_num)
            l_all_get_text.append(tuple_get_text)

        #Pass information for 20 items to MySQL
        cur.executemany(sql_insert, l_all_get_text)  #executemany when passing a list of tuples, execute when passing a single tuple
        connection.commit()  #Insert is not done without commit

        #Delay generation for each crawling
        time.sleep(3)   #Specified in seconds

        #Longer delay after a number of successful crawls
        if counter_200 == 10:
            counter_200 = 0
            time.sleep(60)  #Specified in seconds

    #Cursor and database termination
    cur.close()
    connection.close()


def debug_check_where_link_number(page_num, article_num):
    scraping_url = make_scraping_url(page_num)
    html_text = make_html_text(scraping_url)
    dom = make_dom(html_text)
    xpath = Xpath(article_num)

    thumbnail_link = check_thumbnail_link(dom, article_num)
    if thumbnail_link[0] is True and thumbnail_link[1] is True:
        xpath.with_thumbnail()
    elif thumbnail_link[0] is False:
        xpath.no_thumbnail()
    elif thumbnail_link[1] is False:
        xpath.deleted()

    get_link = make_link_from_xpath(dom, xpath.info_list[1])
    l_get_link = get_link.split("/")
    counter = 0
    for item in l_get_link:
        print(counter)
        counter += 1
        print(item)


if __name__ == "__main__":
    main()

Finally

If anything, this ended up as an implementation for building a site map of the site. The next step would be to access each individual article page using the link_num values collected by this script. I would appreciate it if you could point out any mistakes or possible improvements.
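
As a starting point for that next step, here is a minimal sketch (not part of the script above) that reads the saved link_num values back out of the yahoo_news table; how to turn them into individual article URLs still needs to be checked:

import pymysql.cursors  #MySQL client

connection = pymysql.connect(host="localhost",
                             user="root",
                             password="hogehoge",
                             db="scraping",
                             charset="utf8")
cur = connection.cursor()
cur.execute("SELECT link_num FROM yahoo_news WHERE link_num IS NOT NULL")
link_nums = [row[0] for row in cur.fetchall()]  #link_num of every saved article, excluding deleted ones
cur.close()
connection.close()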
