[PYTHON] You will be an engineer in 100 days ――Day 70 ――Programming ――About scraping

This time the topic is scraping.

What is scraping?

Scraping is a technique for acquiring data from websites.

Scraping itself can be done in various languages.

Knowledge required for scraping

Roughly speaking, the following knowledge is useful.

**Communication mechanism** Obtaining information from the web requires communication. You need to understand HTTP, the communication mechanism that the web is built on.
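
To make the request and response exchange a little more concrete, here is a minimal sketch using only the standard library. So that it does not touch a real site, it starts a throwaway local server and fetches one page from it. `DemoHandler`, `fetch`, `run_demo`, the page content, and the `demo-scraper/0.1` User-Agent are all names made up for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up page served by the local demo server.
PAGE = b"<html><body><h1>Hello</h1></body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real website."""

    def do_GET(self):
        # An HTTP response is a status line, headers, and a body.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def fetch(host, port, path="/"):
    """Send one GET request and return (status, content_type, body_text)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path, headers={"User-Agent": "demo-scraper/0.1"})
        response = conn.getresponse()
        body = response.read().decode("utf-8")
        return response.status, response.getheader("Content-Type"), body
    finally:
        conn.close()


def run_demo():
    """Start the local server, fetch one page from it, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

Calling `run_demo()` returns the status code, the `Content-Type` header, and the HTML body, which are the three things a scraper usually looks at first.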

**HTML, JavaScript, CSS** Websites are built from HTML, JavaScript, and CSS. You need to understand how these components work within a site.

**Full-text search and regular expression matching** When you fetch information from a website, you extract only the parts you need.

To do that, you have to determine whether the information you need is present and whether a given piece of text matches it. Regular expressions are how you express those conditions.
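
As a small illustration of judging and extracting with regular expressions, the sketch below uses the standard `re` module. The sample text, the two patterns, and the function names are assumptions made up for this example, not part of any real site.

```python
import re

# Made-up sample text standing in for a downloaded page.
SAMPLE = """
Widget A - price: 1,200 yen (updated 2020-05-29)
Widget B - price: 980 yen (updated 2020-06-01)
Contact: info@example.com
"""

# One pattern per kind of information we want to pull out.
PRICE = re.compile(r"price:\s*([\d,]+)\s*yen")
DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def contains_price(text):
    """Judge whether the text contains any price at all."""
    return PRICE.search(text) is not None


def extract_prices(text):
    """Return every price as an int, dropping the thousands separators."""
    return [int(value.replace(",", "")) for value in PRICE.findall(text)]


def extract_dates(text):
    """Return every YYYY-MM-DD date string found in the text."""
    return DATE.findall(text)
```

`contains_price` answers "is the information there?", while `extract_prices` and `extract_dates` pull out only the matching parts.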

**Programming language** To access websites and parse their content efficiently, you need general programming knowledge as well as an understanding of the characteristics of the language you use.

**Library** Most programming languages have tools (libraries) for scraping. Writing everything from scratch is inefficient, so you need to learn how to use these libraries.

**Data mining algorithms** Knowledge of data analysis is needed to acquire information and efficiently output only the necessary parts.

**DOM analysis** The DOM (Document Object Model) is a standard specification for manipulating XML and HTML documents. It lets programming languages manipulate the elements and text in a document. With the DOM, the entire document is read and every element in it is represented as a node in a tree structure.

Scraping requires knowledge of the DOM.
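
To get a feel for the tree of nodes, here is a minimal sketch using `xml.dom.minidom` from the standard library. The XML document and the helper names (`node_text`, `list_books`, `tag_outline`) are made up for illustration.

```python
from xml.dom import minidom

# Made-up XML document for illustration.
DOCUMENT = """<library>
  <book id="b1"><title>Python Basics</title><year>2019</year></book>
  <book id="b2"><title>Web Scraping</title><year>2020</year></book>
</library>"""


def node_text(node):
    """Concatenate the text nodes directly under an element."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    )


def list_books(xml_source):
    """Parse the whole document into a tree and read values from its nodes."""
    dom = minidom.parseString(xml_source)
    books = []
    for book in dom.getElementsByTagName("book"):
        title = book.getElementsByTagName("title")[0]
        year = book.getElementsByTagName("year")[0]
        books.append(
            (book.getAttribute("id"), node_text(title), int(node_text(year)))
        )
    return books


def tag_outline(node, depth=0):
    """Walk the tree and return the element names, indented by depth."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        depth += 1
    for child in node.childNodes:
        lines.extend(tag_outline(child, depth))
    return lines
```

`list_books(DOCUMENT)` reads values by element name, and `tag_outline(minidom.parseString(DOCUMENT))` shows the parent and child relationships that make up the tree.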

**HTML parser (parsing)** Used to extract only the text portion of HTML, or the content of a specific tag.
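
Both uses can be sketched with `html.parser.HTMLParser` from the standard library. The HTML snippet and the `TextAndLinks` class are made up for this example; in practice many people use a third-party library for this instead.

```python
from html.parser import HTMLParser

# Made-up HTML for illustration.
HTML = """<html><head><title>Sample page</title>
<style>p { color: red; }</style></head>
<body><h1>News</h1>
<p>First <a href="/a">item</a></p>
<p>Second <a href="/b">item</a></p>
<script>console.log("ignored");</script>
</body></html>"""


class TextAndLinks(HTMLParser):
    """Collect visible text and the href of every <a> tag."""

    SKIPPED = {"script", "style"}  # their contents are not visible text

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside <script> or <style>.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse(html):
    """Return (list of text fragments, list of link targets)."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return parser.text_parts, parser.links
```

`parse(HTML)` returns the visible text fragments and the link targets separately, so either can be used on its own.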

Internet security

Because scraping accesses websites and acquires information from them, security issues are unavoidable.

If you use it incorrectly, you may cause trouble for the site or even be arrested, so you have to be careful.

Scraping is a useful technique, but you should be aware of the following:

**Violation of Terms of Service** If the terms of use of someone else's website state that scraping is prohibited, scraping it may violate those terms and may lead to a claim for damages.

However, for the terms of use to be binding on a user, the site has to take steps such as showing the terms to the user and having them click to agree before the transaction starts.

If you scrape content that anyone can view without registering as a member, the terms of use above may not be violated, but note that the law is constantly changing.

Also, if the site has taken measures to restrict crawler access (such as robots.txt) and you crawl it anyway, this may constitute a tort under civil law.
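
A crawler can check robots.txt rules before it requests a page. The sketch below uses `urllib.robotparser` from the standard library and parses a made-up robots.txt string, so it runs without network access. The rules, the `demo-scraper` name, and the example URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content. A real crawler would download this file
# from the target site before crawling it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "demo-scraper"  # hypothetical crawler name


def build_rules(robots_txt):
    """Parse robots.txt text into a rule object without any network access."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def may_fetch(rules, url):
    """Return True when the rules allow USER_AGENT to crawl the URL."""
    return rules.can_fetch(USER_AGENT, url)
```

With these rules, `may_fetch` rejects anything under `/private/`, and `rules.crawl_delay(USER_AGENT)` reports the delay the site asks for. Passing this check only covers robots.txt; it says nothing about the terms of use.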

**Copyright** Since the amount of content acquired by scraping is enormous, it is not realistic to obtain consent for each item.

Therefore, as an exception, copying for the purpose of information analysis appears to be permitted without the consent of the copyright holder (Article 47-7 of the Copyright Law).

Transferring content collected by scraping to another party (including distributing it online) is considered a violation of copyright law.

If the content has originality, it is protected as a "work" under copyright law.

Copying such content or storing it on your company's server without the consent of the copyright holder is copyright infringement.

**Fraudulent obstruction of business** A scraper accesses a website at regular intervals, but if the interval is too short, the load on the site's server becomes heavy and may interfere with the normal operation of the site.

In such a case, you may be deemed to have interfered with the site operator's business, and the crime of fraudulent obstruction of business may be established (Article 233 of the Penal Code).
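
One way to keep the interval from getting too short is to enforce a minimum gap between requests in code. The `Throttle` class below is a hypothetical helper written for this article, not something from a library. The `clock` and `sleep` parameters exist only so the behavior can be checked without really waiting.

```python
import time


class Throttle:
    """Block so that consecutive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect the interval; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept
```

Typical use would be to create `throttle = Throttle(1.0)` once and call `throttle.wait()` right before every request. The one-second value is only a placeholder; what counts as a safe interval depends on the site.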

Okazaki Municipal Central Library Case

Around March 2010, citizens reportedly complained that they could not connect to the collection search system on the Okazaki Municipal Library website.
After that, the website repeatedly became difficult to browse.

On April 15 of the same year, the library filed a damage report with the Okazaki Station of the Aichi Prefectural Police, saying it was receiving nuisance access.
On May 25, the man who had been accessing the site was arrested on suspicion of fraudulent obstruction of business, on the grounds that he had intentionally sent high-frequency requests to the collection search system.

There was nothing illegal about the crawler the man had created. The problem lay in the library's collection search system.

Even for a local government site, the Okazaki Municipal Central Library website was vulnerable to a degree that experts could hardly imagine. The incident resulted from a combination of the municipality's negligence and the ignorance of the people in charge.

By rights, the fault lies with the local government that operated the site poorly, but the law may not see it that way.

Municipal and national systems are often immature and not properly operated, so they may not be good scraping targets. Be careful when scraping them.

Other notes

1. "Web scraping and crawling Amazon product pages violates the Terms of Service. Is there also a legal problem?"

Acts that put load on the other party's server may constitute obstruction of business, such as fraudulent obstruction of business or obstruction of business by damaging a computer.

You need to take precautions such as moving on to the next request only after the response to the previous one has been received.

Also, since scraping copies the page, it may become copyright infringement if it goes beyond the scope of reproduction for private use. Keep it within the scope of your own browsing and data analysis.
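
The "wait for the response, then continue" idea can be sketched as a plain sequential loop. `crawl_sequentially` and its parameters are hypothetical names for this example. The `fetch` argument is whatever function you use to download one URL, and it is passed in so the loop itself stays independent of any particular HTTP library.

```python
import time


def crawl_sequentially(urls, fetch, pause=1.0, sleep=time.sleep):
    """Fetch URLs one at a time.

    A request is never started before the previous one has returned,
    and the loop pauses between consecutive requests.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause)  # rest only after the previous response has arrived
        results.append((url, fetch(url)))  # blocks until the response is in
    return results
```

Because `fetch(url)` blocks, a slow server automatically slows the crawler down instead of receiving a pile of parallel requests. The default pause of one second is an arbitrary placeholder.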

2. "Is it a violation of the Terms of Service to create a tool for web scraping and crawling Amazon product pages and then distribute or sell it? Is there any legal problem?"

It depends on how the terms of use are written, but if they prohibit only the use of such tools, then I think it is the act of using the tool after receiving it that violates the terms.

Depending on how the tool is used, distributing it may amount to aiding obstruction of business or copyright infringement.

Summary

First of all, make sure you understand these precautions before you start scraping. If you run code without preparation, you may get into trouble.

30 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython