"Scraping & machine learning with Python" Learning memo

Introduction

Learning notes from Chapters 1 to 3 of "Scraping & Machine Learning with Python". Chapters 1 through 3 cover scraping; Chapter 4 onward covers the machine learning part.

Chapter 1

1-1. Data download

What is the urllib library?

A package of modules for working with URLs. Representative functions are listed below.

- urlretrieve()・・・Downloads data directly (the file is saved locally).
- urlopen()・・・Reads the data into memory. To fetch over FTP, just change the https:// URL passed to urlopen() to ftp://.

To send a request with GET parameters, build the key/value parameters as a dictionary.

Use the urllib.parse module to URL-encode the variables, then append the encoded string to the URL (don't forget the "?" in between).

Import the sys module to get command line arguments.
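
A minimal sketch of these points; the URL and parameter name are made up for illustration:

```python
import sys
import urllib.request
import urllib.parse

# build GET parameters from a dictionary and URL-encode them
url = "https://example.com/api"
params = urllib.parse.urlencode({"q": sys.argv[1]})  # command-line argument as a parameter

# urlretrieve(): download directly to a local file
urllib.request.urlretrieve(url + "?" + params, "result.html")

# urlopen(): read the response into memory
with urllib.request.urlopen(url + "?" + params) as res:
    print(res.read().decode("utf-8"))
```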

1-2. Scraping with Beautiful Soup

What is Beautiful Soup?

A library that parses HTML and XML. It cannot download data by itself; to download, use urllib.

What is pip?

Python's package management system.

What is PyPI?

Abbreviation for the Python Package Index, the repository from which pip installs packages.

There are various ways to get HTML elements

- Trace the hierarchy from a tag using dots (.)
- Find an element by id with the find() method
- Get all elements matching the given parameters with the find_all() method
- Use CSS selectors

If you understand the HTML structure and the basics of CSS, you can extract almost any data. However, if the page structure changes, the scraping code has to be updated.
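
A minimal sketch of these access patterns; the HTML fragment is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="title">Natsume Soseki</h1>
  <ul class="works">
    <li>Kokoro</li>
    <li>Botchan</li>
  </ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.html.body.h1.string)         # trace the hierarchy with dots
print(soup.find(id="title").string)     # find() by id
for li in soup.find_all("li"):          # find_all() gets every match
    print(li.string)
print(soup.select("ul.works > li")[0])  # CSS selector
```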

1-3. About CSS selectors

Example: Aozora Bunko's Natsume Soseki page https://www.aozora.gr.jp/index_pages/person148.html

The CSS selector pointing at the first li tag in the list of works is as follows.

body > ol:nth-child(8) > li:nth-child(1)

nth-child(n)・・・means the nth child element

Looking at the works page, the ol tag is not used anywhere else, so in that case the ol:nth-child(8) part can be omitted.

If you write a well-crafted CSS selector, you can retrieve a specific element in one shot.

**It is important to learn the selector syntax, just as it is important to learn regular expressions.**

The find() method can take multiple search conditions at once.

It is also possible to extract elements in combination with regular expressions.
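
A small sketch of both points, using BeautifulSoup on a made-up HTML fragment:

```python
import re
from bs4 import BeautifulSoup

html = '<a href="/cards/000148/card789.html" class="work">Kokoro</a>'
soup = BeautifulSoup(html, "html.parser")

# find() with multiple conditions (tag name plus attribute filters) at once
print(soup.find("a", attrs={"class": "work"}))

# combining find_all() with a regular expression on the href attribute
print(soup.find_all("a", href=re.compile(r"card\d+\.html")))
```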

1-4. Download all linked pages

If the link destination of an a tag is a relative path, convert it to an absolute path with the urllib.parse.urljoin() method.

To download everything, you need to follow the links recursively.

To use regular expressions, import the re module.
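
A minimal sketch of urljoin(); the relative path is made up for illustration:

```python
from urllib.parse import urljoin

base = "https://www.aozora.gr.jp/index_pages/person148.html"

# a relative link is resolved against the base URL
print(urljoin(base, "../cards/000148/card789.html"))
# an absolute link is returned unchanged
print(urljoin(base, "https://example.com/"))
```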

Chapter 2

2-1. Download from sites that require login

A package called requests is convenient for access that uses cookies.

Start a session with requests.session().

To check the data sent at login, use the browser's developer tools.

Open the "Network" tab of the developer tools; to see the submitted form data, check "Form Data" under the "Headers" tab.

2-2. Scraping via browser

"Selenium" is famous as a tool for remotely controlling a web browser.

If you run it headless (no screen display) from the command line, a browser window does not pop up for every run.

In addition to Chrome, Firefox, Opera, and so on, iOS and Android browsers can also be driven.

Accessing a site with Selenium is the same as accessing it with a real browser, so you do not need to manage sessions yourself.

You can do quite a lot with Selenium; most things a person does in a browser can be automated.

Furthermore, the execute_script() method lets you run arbitrary JavaScript.

Benefits of Selenium

- You can freely manipulate DOM elements in the HTML page → for example, you can remove decorative elements unrelated to the element you want to extract beforehand.
- You can call JavaScript functions in the page at any time → you can get any data on the page.
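
A minimal headless sketch, assuming Chrome and a matching chromedriver are installed (details vary slightly between Selenium versions):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# run Chrome without opening a visible window
options = Options()
options.add_argument("--headless")
browser = webdriver.Chrome(options=options)

browser.get("https://example.com/")
print(browser.title)

# execute arbitrary JavaScript in the page and receive its return value
height = browser.execute_script("return document.body.scrollHeight")
print(height)

browser.quit()
```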

2-3. Scraping Dojo

Wikipedia prohibits crawling, so scraping it directly is not allowed. There is a site that distributes dump data instead, so use that data.

```python
# result is assumed to hold the rows (lists of strings) extracted by the scraping code
for row in result:
    print(",".join(row))
```

About BeautifulSoup and Selenium methods

- find_all()・・・If you pass a list of tag names to this method, you can get several kinds of tags at once.
- find_elements_by_css_selector()・・・"elements" is **plural**, so it returns multiple elements at once.
- find_element_by_css_selector()・・・"element" is **singular**, so it returns only a single element. Calling this method when you actually want multiple elements is not a syntax error, but it will not work as expected, so be careful.

You can also take a screenshot with the browser.save_screenshot() method. This is useful when you want to see what the screen actually looks like while running in headless mode.
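
For example, continuing from a headless browser session like the sketch above (the file name is arbitrary):

```python
# save what the headless browser is currently rendering to a PNG file
browser.get("https://example.com/")
browser.save_screenshot("screen.png")
```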

Many aspects of scraping can only be understood by actually trying them out. Think about what kinds of operations are possible while analyzing the actual screen (HTML).

**It is important to understand the structure of the site. Knowledge of CSS is also required.**

2-4. Data acquisition from Web APIs

What is a Web API (Web Application Programming Interface)?

Functionality of a site published so that it can be used from the outside. Data is exchanged over HTTP and obtained in XML or JSON format.

Be aware that a Web API's specification may change at the provider's convenience.
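
A minimal sketch of calling a JSON Web API with requests; the endpoint and its parameters are hypothetical:

```python
import requests

# hypothetical weather API endpoint and query parameters
res = requests.get("https://api.example.com/weather", params={"city": "Yokohama"})
data = res.json()  # parse the JSON response into a Python dict
print(data)
```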

The format() method

Parts of a string can be filled in later with variable values.

(Example)

```python
s = "hogehoge{name}fugafuga"            # avoid shadowing the built-in str
print(s.format(name="Eric Clapton"))    # -> hogehogeEric Claptonfugafuga
```

What is a lambda expression?

A function that can be written in the form **function name = lambda arguments: expression**. (Example)

```python
k2c = lambda k: k - 273.15   # convert Kelvin to Celsius
print(k2c(300))
```

2-5. Cron and periodic crawling

On macOS and Linux, a daemon process called "cron" is used. On Windows, use the "Task Scheduler".

What is a daemon?

A program on a UNIX-like OS that stays resident in main memory and provides a specific function. A type of background process that runs independently of user operations.

**Typical periodically executed tasks**

1. Data collection
2. Backups and log backups
3. Alive/dead (health) monitoring

To configure cron, run the "crontab -e" command and edit the file it opens. When editing cron on a Mac, the nano editor is convenient.
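
For example, a crontab entry that runs a collection script every morning at 7:00 might look like this (the script path is made up):

```
# min hour day month weekday  command
0 7 * * * python3 /home/user/collect.py
```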

2-6. Scraping with Scrapy

What is Scrapy?

A framework for crawling and scraping.

**Basic workflow**

1. Create a project with the scrapy command
2. Write a Spider class that defines the crawling and data extraction process
3. Run scrapy from the command line

Create a subclass that inherits from the Spider class, and place it in the spiders directory.

Main methods of the Spider class

- parse()・・・Describes the parsing applied to the text after the data has been fetched.
- css()・・・Extracts DOM elements using CSS selectors.
- extract()・・・Returns the matched elements as a list.
- extract_first()・・・Returns the first element of the result.
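
A minimal Spider sketch along these lines; the spider name and the CSS selector are hypothetical:

```python
import scrapy

class SosekiSpider(scrapy.Spider):
    name = "soseki"
    start_urls = ["https://www.aozora.gr.jp/index_pages/person148.html"]

    def parse(self, response):
        # css() selects elements, extract() returns all matches as a list
        for title in response.css("ol li a::text").extract():
            yield {"title": title}
```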

Scrapy execution command example

```
scrapy crawl soseki --nolog
```

If "--nolog" is given, the operation log is suppressed. Without it, the operation log is printed to the console.

The method returns its results with yield; the convention is to yield values instead of returning them.

- Meaning of the word "yield": to produce, to bring about, to give rise to, to cause

What is the Scrapy shell?

A shell that runs Scrapy interactively. It is useful for verifying whether your CSS selectors extract data correctly.

2-7. Download all of Natsume Soseki's works with Scrapy

The command to generate a subclass of the Spider class:

```
scrapy genspider soseki3 www.aozora.gr.jp
```

The parse() method is called automatically after the URLs specified in start_urls have been fetched.

Use the response.follow() method to fetch linked pages.

To download a file, use scrapy.Request(). The callback parameter specifies the method to call after the request completes.
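
A minimal sketch of these two patterns inside a Spider; the selectors and method names are hypothetical:

```python
import scrapy

class Soseki3Spider(scrapy.Spider):
    name = "soseki3"
    start_urls = ["https://www.aozora.gr.jp/index_pages/person148.html"]

    def parse(self, response):
        # follow every link in the works list and parse each card page
        for href in response.css("ol li a::attr(href)").extract():
            yield response.follow(href, callback=self.parse_card)

    def parse_card(self, response):
        # request a file download; parse_file() runs when it completes
        url = response.css("a::attr(href)").extract_first()
        if url:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_file)

    def parse_file(self, response):
        self.log(f"downloaded {response.url} ({len(response.body)} bytes)")
```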

2-8. Download a dynamic website with Scrapy and Selenium

Scrapy can be extended by adding middleware; that mechanism can be used to incorporate Selenium.

**Format when specifying a middleware** (project directory name).(middleware file name).(middleware class name)

```python
# Example of middleware registration
custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "sakusibbs.selenium_middleware.SeleniumMiddleware": 0
    }
}
```

The start_requests() method defines the processing that runs automatically just before the requests are issued.
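
A minimal sketch of overriding start_requests(); the URLs are hypothetical:

```python
import scrapy

class MemberSpider(scrapy.Spider):
    name = "members"

    def start_requests(self):
        # issued instead of start_urls, just before crawling begins
        for url in ["https://example.com/page1", "https://example.com/page2"]:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").extract_first()}
```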

Chapter 3

3-1. Web data comes in various formats

The data distributed on the Web can be roughly divided into two types: text data and binary data.

- Text data. Format examples: plain text files, XML, JSON, YAML, CSV. Text data must be handled with attention to the character code and encoding.
- Binary data. Format examples: images (PNG, JPEG, GIF, etc.) and Excel files. The data size is smaller than for text data.

Note that the URL of Yokohama City's disaster-prevention data given in the book has changed. As of October 22, 2020 it is: https://www.city.yokohama.lg.jp/kurashi/bousai-kyukyu-bohan/bousai-saigai/bosai/data/data.files/0006_20180911.xml

Note that tag names written in uppercase are converted to lowercase when parsing XML with BeautifulSoup.

(Example)

```xml
<LocationInformation>
	<Type>Regional disaster prevention base</Type>
	<Definition>A base equipped with a place for affected residents to evacuate to, with functions for sending and receiving information, and with stockpiles.</Definition>
	<Name>Namamugi Elementary School</Name>
	<Address>4-15-1 Namamugi, Tsurumi-ku, Yokohama-shi, Kanagawa</Address>
	<Lat>35.49547584</Lat>
	<Lon>139.6710972</Lon>
	<Kana>Namamugi Shogakko</Kana>
	<Ward>Tsurumi Ward</Ward>
	<WardCode>01</WardCode>
</LocationInformation>
```

To get the LocationInformation elements from the above data:

- Wrong: soup.find_all("LocationInformation")
- Correct: soup.find_all("locationinformation")
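
A minimal sketch, assuming the XML above has been saved to a local file (the file name is arbitrary):

```python
from bs4 import BeautifulSoup

with open("shelter.xml", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# tag names are lowercased by the parser, so query with lowercase names
for info in soup.find_all("locationinformation"):
    print(info.find("name").string, info.find("ward").string)
```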

When dealing with Excel files, install xlrd in addition to openpyxl.

3-2. About databases

Python supports various databases. SQLite is included in the standard library and can be used immediately by importing sqlite3.

Note that p. 160 of the book contains an error.

MySQL must be installed in advance. On Linux, install it with apt-get. On Mac and Windows it is convenient to install MAMP, which also bundles the MySQL administration tool "phpMyAdmin".

Difference in how placeholders for variable values are written in SQL:

- SQLite・・・?
- MySQL・・・%s
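
A minimal sketch of the two placeholder styles; the table and values are made up, and the MySQL line assumes a driver such as PyMySQL or mysqlclient:

```python
import sqlite3

conn = sqlite3.connect("test.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS items (name TEXT, price INTEGER)")
cur.execute("INSERT INTO items VALUES (?, ?)", ("apple", 120))   # SQLite uses ?
conn.commit()

# with a MySQL driver the same insert would use %s placeholders:
# cur.execute("INSERT INTO items VALUES (%s, %s)", ("apple", 120))
```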

What is MongoDB?

One of the document-oriented databases. A relational database management system (RDBMS) requires a schema definition with CREATE TABLE, but a document database needs no schema definition.

What is TinyDB?

A library that provides a document-oriented database. It is easier to use from Python than MongoDB (MongoDB requires installing the MongoDB server itself, whereas TinyDB only needs to be installed with pip).

For data beyond a certain size, MongoDB is the better choice.
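
A minimal TinyDB sketch (requires `pip install tinydb`; the file name and fields are made up):

```python
from tinydb import TinyDB, Query

db = TinyDB("works.json")   # data is stored in a JSON file, no schema needed
db.insert({"title": "Kokoro", "author": "Natsume Soseki"})

Work = Query()
print(db.search(Work.author == "Natsume Soseki"))
```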
