[PYTHON] Web scraping technology and concerns

1. Web scraping

With the recent big-data trend, the ability to collect large amounts of data has become important.

Web scraping is one of the techniques that can be used for this.

This article summarizes web scraping methods and the precautions to take when using them.

2. What is web scraping?

Web scraping is a computer software technique for extracting information from websites. Software that does this is also known as a web crawler or web spider. Such programs typically acquire WWW content by implementing low-level HTTP or by embedding a web browser. (From Wikipedia)

3. Challenges in web scraping

3-1. IP blocking

3-2. Compliance with corporate ethics and the law

3-2-1. Copyright Law

To state the conclusion first: as long as scraping does not place an undue load on the server and the terms of use of the site do not prohibit it, a copyright exception applies **if the purpose is to analyze the information**. The general view seems to be that there is then no problem with recording on a storage medium, or adapting, another company's information obtained by scraping without obtaining the rights holder's consent. (As of 02/23/2020)

Article 47-5 of the Copyright Act (Information processing by computer and minor use accompanying the provision of the result, etc.)

Article 47-5 A person who performs any of the acts listed in the following items, which contribute to promoting the use of copyrighted works by creating new knowledge or information through computerized information processing (including a person who performs part of such an act, and limited to a person who performs the act in accordance with the standards specified by Cabinet Order), may use a work that has been provided or presented to the public (including by transmission enablement; the same applies hereinafter) (hereinafter referred to as a "publicly provided or presented work" in this Article and paragraph 2, item 2 of the following Article) (limited to a published work or a work made available for transmission), by any method, incidentally to that act and to the extent deemed necessary for the purpose of the act listed in each item, provided that the use is minor in light of the proportion of the work that is used, the amount of the part used, the accuracy of its display when used, and other factors (hereinafter referred to as "minor use" in this Article). However, this does not apply if the person makes the minor use while knowing that the provision or presentation of the work to the public infringes copyright (or, for provision or presentation made overseas, that it would constitute copyright infringement had it been made in Japan), or if the use would otherwise unreasonably harm the interests of the copyright holder in light of the type and purpose of the publicly provided or presented work and the manner of the minor use.

(i) Using a computer to search for the title or author name of a work in which the information sought by a search (hereinafter referred to as "search information" in this item) is recorded, the source identification code of search information that has been made available for transmission (meaning the characters, numbers, symbols, or other codes that identify the source of an automatic public transmission), or other information concerning the identification or location of search information, and providing the results.

(ii) **Performing information analysis by computer and providing the results.**

(iii) In addition to the acts listed in the preceding two items, an act of creating new knowledge or information through computerized information processing and providing the results, which is specified by Cabinet Order as contributing to the convenience of people's lives.

2 **A person who prepares for an act listed in the items of the preceding paragraph (limited to a person who collects, organizes, and provides information in preparation for the act in accordance with the standards specified by Cabinet Order) may, to the extent deemed necessary to prepare for minor use under that paragraph, reproduce a publicly provided or presented work or transmit it to the public (including transmission enablement in the case of automatic public transmission; the same applies hereinafter in this paragraph and in paragraph 2, item 2 of the following Article), or distribute copies of it.** However, this does not apply if doing so would unreasonably harm the interests of the copyright holder in light of the type and purpose of the publicly provided or presented work, the number of copies reproduced or distributed, and the manner of the reproduction, public transmission, or distribution.

3-2-2. Case law

A man who was arrested for running a program that automatically acquired new-book data from the Okazaki Municipal Central Library website, allegedly making some of the site's functions unavailable, has published his own account of the case (2010/5/25)

Ethics of scraping ① Overseas trouble cases

3-2-3. Confirmation of terms of use

There are services that prohibit scraping in order to protect personal information and prevent vandalism.

3-2-3-1. Example 1: Matching app Pairs

For example, the matching app Pairs explicitly prohibits scraping and crawling in its terms of use.

The Company does not permit posted content to be used by other users or by any third party other than the user who posted it, and users must not engage in acts that infringe the rights to other users' posted content. In addition, users must not automatically collect or analyze posted content by crawling or similar means. (Terms of Service | Pairs)

3-2-3-2. Example 2: Twitter

Similarly, Twitter prohibits scraping in its terms of service.

Accessing or searching, or attempting to access or search, Twitter by any means (automated or otherwise) other than through the currently available, published interfaces provided by Twitter (and in compliance with their terms of use), unless a separate agreement with Twitter specifically allows this. Crawling Twitter is permitted if done in accordance with the provisions of the robots.txt file. However, scraping without the prior consent of Twitter is expressly prohibited. (Rules | Twitter)
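Since the terms quoted above tie crawling permission to the robots.txt file, a crawler can check those rules before requesting a page. The following is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the user agent name `example-scraper` are assumptions made for illustration and are not taken from any real site. Passing this check says nothing about a site's terms of use, which still have to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/public/page",
                "https://example.com/private/data"):
        print(url, is_allowed(rules, "example-scraper", url))
```

For a live site, `RobotFileParser.set_url()` followed by `read()` downloads the file instead of parsing a string.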

4. Web scraping method

- Human copy and paste
- Text search and regular expression matching
- HTTP programming
- Data mining algorithms
- DOM parsing
- HTML parsers
- Web scraping software
- Vertical aggregation platforms
- Semantic annotation recognition
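Two of the methods listed above, regular expression matching and an HTML parser, can be compared on the same input. This is a rough sketch using only the Python standard library. The HTML snippet, the "yen" price format, and the function names are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragment, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>New books</h1>
  <ul>
    <li><a href="/books/1">Python Basics</a> - 2,800 yen</li>
    <li><a href="/books/2">Web Scraping</a> - 3,200 yen</li>
  </ul>
</body></html>
"""


def extract_prices_with_regex(html):
    """Regular expression matching: pull out every 'N yen' price as an int."""
    return [int(m.replace(",", "")) for m in re.findall(r"([\d,]+) yen", html)]


class LinkParser(HTMLParser):
    """HTML parser: collect (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a> element with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


def extract_links_with_parser(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(extract_prices_with_regex(SAMPLE_HTML))
    print(extract_links_with_parser(SAMPLE_HTML))
```

The regular expression is shorter but depends on the exact text layout, while the parser follows the tag structure and tends to survive small changes in the markup.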

5. Methods considered

5-1. Scraping using web scraping software (OctoParse)

Open the target web page in OctoParse's built-in browser and select the data you want to extract, and a crawler is created for you. No programming knowledge is required, so anyone can use it easily. When you run the crawler, it outputs the data from the website in the format you choose.

Support is available in Japanese and responds quickly.

Merits

- Because it operates a real browser, it easily handles sites that require infinite scrolling or login.
- It is less likely to be flagged by bot detection.

Demerits

- Browser operation makes processing heavy, so it is not suited to high-speed processing.

5-2. Scraping using Python

Merits

- Abundant frameworks and libraries
- Fast processing

Demerits

- Extra work is required to support browser behavior (infinite scrolling, content rendered by JavaScript)
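As a rough illustration of the Python approach, the sketch below fetches static HTML with the standard library and extracts each page's `<title>`. It waits between requests to limit server load, which section 3-2-1 names as a concern. The user agent string, the one-second delay, and all function names are assumptions for illustration. Because it only downloads the HTML, content generated by JavaScript is not visible to it, which is the demerit noted above.

```python
import time
import urllib.request
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_html(url, timeout=10.0):
    """Download a page as text. The User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-scraper/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_titles(urls, fetch=fetch_html, delay_seconds=1.0):
    """Return {url: title}, pausing between requests to limit server load."""
    titles = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # assumed delay; tune per site
        parser = TitleParser()
        parser.feed(fetch(url))
        titles[url] = parser.title.strip()
    return titles
```

The `fetch` parameter lets the download step be swapped out, for example with a function that returns saved HTML during testing, so the parsing logic can be checked without touching a real site.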

6. Sites worth consulting in practice

- [Introduction to Python] Basics of scraping with Beautiful Soup 4 (1/2)
- Beautiful Soup in 10 minutes
- Practice / Python scraping style in the field

References

- Summary of knowledge when web scraping with Python
- [Web scraping - Wikipedia](https://ja.wikipedia.org/wiki/%E3%82%A6%E3%82%A7%E3%83%96%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0)
- Is scraping illegal? Attorneys explain three legal issues and countermeasures in 5 minutes
- [Preserved version] Thorough explanation for beginners on how to scrape with Python! [Sample code available]
