Web scraping with Python ① (Prerequisite knowledge for scraping)

1. Background

**This article is the second in a series of write-ups on using Python for investment purposes, following the previous one on SQLite.** **I will share what I researched about scraping over two articles. In this first part (①), I summarize what you need to know before scraping.**

Previous article (the first in the investment series): Easy handling of databases in Python (SQLite3)

2. Basic terms for learning about web scraping

**・What is scraping?** Extracting the required information from a website.

**・What is crawling?** Following links to collect web pages.

**・What is a crawler?** A crawler is a program that patrols the Internet, collecting and saving data such as web pages, images, and videos. For example, web search engines such as Google and Bing use crawlers to collect web pages from all over the world in advance, which is why they can return search results so quickly.

3. Can I scrape any web page?

**・Web pages are basically copyrighted works.**
**・Some websites explicitly prohibit crawling in their "Terms of Service" or help pages.**
**・Do not crawl pages that are disallowed by robots.txt or robots meta tags.**
**・Even when crawling is permitted, be careful not to put a load on the web server.**

**・What is robots.txt?** robots.txt is a file placed on a site to inform crawlers of access restrictions. You can check it by appending "/robots.txt" to the URL of the root page (the top page of the website).
(Yahoo) root page: https://www.yahoo.co.jp/
(Yahoo) robots.txt: https://www.yahoo.co.jp/robots.txt
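As a minimal sketch of this check (assuming the requests library, installed with `pip install requests`), you can fetch and display the file directly:

python

import requests

# Check a site's robots.txt by appending /robots.txt to the root URL
root_url = "https://www.yahoo.co.jp"
response = requests.get(root_url + "/robots.txt")

# Show the first few lines of the file
print("\n".join(response.text.splitlines()[:10]))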

**・What is a robots meta tag?** It is used for a purpose similar to robots.txt and is written in the header part of the HTML file ("something like a description written about the page"), for example `<meta name="robots" content="noindex, nofollow">`.

[Robots.txt Specifications](https://developers.google.com/search/reference/robots_txt?hl=ja) / [What is a robots meta tag](https://wa3.i-3-i.info/word11810.html)

4. Parse robots.txt

**・What is parsing?** Analyzing data written according to a certain format or grammar and checking whether its syntax conforms to that grammar.

**・What is a parser?** A program that analyzes structured text data and converts it into a collection of data structures that the program can handle.

Python has a standard library called urllib.robotparser, but I felt reppy was easier to use, so I will use reppy here. **Note, however, that a robots.txt that does not follow the standard format may not be parsed correctly; in that case you need to check it by appending /robots.txt directly to the URL or by some other method.**
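For reference, here is a minimal sketch of the standard-library alternative mentioned above (urllib.robotparser); the Yahoo URL is just the example used earlier:

python

import urllib.robotparser

# Read and parse robots.txt with the standard library's RobotFileParser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.yahoo.co.jp/robots.txt")
rp.read()

# can_fetch(user_agent, url) checks whether the given crawler may access the URL
print(rp.can_fetch("*", "https://www.yahoo.co.jp/"))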

**How to use reppy's parser**
(・Install reppy with `pip install reppy`)
・Read robots.txt with `fetch`, then specify the user-agent with `agent` to choose which crawler to check.
・Pass the URL you want to crawl to `allowed` to confirm whether access is permitted.
・Check `delay` for a Crawl-delay (a crawl interval specified by the site). *If one is specified, you must follow it.

Below, the homepage of MSN is taken as an example.

python


from reppy.robots import Robots

# Read msn's robots.txt with fetch
robots = Robots.fetch('https://www.msn.com/robots.txt')

# Check with the wildcard user-agent ('*'), i.e. target all crawlers
agent = robots.agent('*')

# Use allowed to confirm whether each specified URL is accessible under robots.txt
print(agent.allowed('https://www.msn.com/ja-jp/news'))
print(agent.allowed('https://www.msn.com/ja-jp/health/search/filter'))

Execution result


True
False

This confirms that the first URL is allowed (OK) and the second is disallowed (NG).

Next, let's look at a case where the site specifies a crawl interval, taking the CrowdWorks homepage as an example.

python


from reppy.robots import Robots

# Read CrowdWorks' robots.txt with fetch
robots = Robots.fetch('https://crowdworks.jp/robots.txt')

# Specify bingbot (Microsoft Bing's search crawler) as the target and check its delay
agent = robots.agent("bingbot")
print(agent.delay)

#Check wildcards for comparison
agent = robots.agent("*")
print(agent.delay)

Execution result


10.0
None

You can see that a crawl interval of 10 seconds is specified for the Bing crawler, while none is specified for the wildcard user-agent.

5. Check the robots meta tag

Also check for access restrictions specified with HTML tags and HTTP headers. If a directive such as noindex or nofollow appears in a page's meta tag or in an <a> tag, it means indexing the page or following its links is not permitted.

[Meta tag description and search engine](https://info-search.yahoo.co.jp/archives/002841.php)

Let's specify a URL and check its meta tag using BeautifulSoup4.

  • You need to install the library with pip install beautifulsoup4.

python


import requests 
from bs4 import BeautifulSoup

# Get the web page specified with requests.get()
url = requests.get("https://www.yahoo.co.jp/")

# Create a BeautifulSoup object (take the HTML retrieved as text and parse it with html.parser)
soup = BeautifulSoup(url.text, "html.parser")
# soup = BeautifulSoup(url.content, "html.parser")

# find returns the first matching <meta> tag; attrs={'name': 'robots'} filters on the name attribute being "robots"
robots = soup.find("meta", attrs={'name': 'robots'})

print(robots)

Execution result


<meta content="noodp" name="robots"/>

Apparently Yahoo has a robots meta tag with the value noodp (NO Open Directory Project).
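As noted above, access restrictions can also be delivered via HTTP headers. A minimal sketch of checking the X-Robots-Tag response header with the same requests library (whether a given site actually sets this header is not guaranteed):

python

import requests

# Fetch the page and look at the X-Robots-Tag HTTP header,
# which can carry the same noindex/nofollow directives as the meta tag
response = requests.get("https://www.yahoo.co.jp/")
print(response.headers.get("X-Robots-Tag"))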

6. What about various investment sites? Let's take a look.

For reference, I checked some sites that often come up in discussions of investment-related scraping, using the methods introduced above. **This article does not draw conclusions about whether scraping each site is OK or NG. My impression while investigating was that finance-related businesses tend to prohibit it.** In some cases a separate API is provided, so it may be better to use that instead.


**Example ①: Stock Investment Memo** https://kabuoji3.com/
・robots.txt: `Allow:/` (everything under the root directory is allowed)
・Meta tag: none
・Site rules: not stated

  • This site's robots.txt could not be read successfully with reppy, probably because there is no space between the colon and the slash in `Allow:/`. You can check it by other means (for example, by fetching the file directly with requests, as shown earlier), but I will omit that here.

**Example ②: Kabutan (Stock Search)** https://kabutan.jp/
・robots.txt: `Disallow: /94446337/` (that directory and below are disallowed)
・Meta tag: none
・Site rules: imposing an unreasonable load is prohibited (see below)

Article 4 (**Prohibited acts**)

    1. The operator prohibits users from performing any of the following acts, or acts that may lead to them: **(7) Acts that impose an unreasonable load on this site's servers or interfere with the operation of this service.**

Terms of Use for Stock Search


**Example ③: Shikiho Online** https://shikiho.jp/
・robots.txt: none
・Meta tag: none
・Site rules: prohibited (see below)

Article 13 (Other **prohibited acts** by the user)

  1. In addition to the acts prohibited by other provisions of this agreement, the user shall not perform the following acts in this service: **(10) Automatically acquiring the information provided by this service using a computer, etc.**

Shikiho Online Terms of Use


**Example ④: Yahoo! Finance** https://finance.yahoo.co.jp/
・robots.txt: none
・Meta tag: none
・Site rules: destroying or interfering with server or network functions is prohibited (see below)

  • Note, however, that **scraping is explicitly stated as prohibited in the Yahoo! Finance Help**.

● Terms, 7. Compliance when using the service: When using our services, the following acts (including acts that induce them and preparatory acts) are **prohibited**: **4. Acts that destroy or interfere with the functionality of our servers or networks**

● Yahoo! Finance Help: **Automatic acquisition (scraping) of information posted on Yahoo! Finance is prohibited.**

Yahoo! Finance Terms / Yahoo! Finance Help


7. What should I do to avoid putting a load on the server?

There is no firmly established rule. The basic principle is not to burden the other party's service, and currently a guideline of about one request per second seems to be generally accepted when no Crawl-delay is specified. (This guideline apparently spread after the Okazaki Municipal Central Library incident.)
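As a minimal sketch of what this could look like in code, reusing reppy's `delay` from section 4 and falling back to a one-second wait when no Crawl-delay is specified (the example.com URLs are just placeholders):

python

import time
import requests
from reppy.robots import Robots

# Read robots.txt and get the agent for our crawler (wildcard here)
robots = Robots.fetch('https://www.example.com/robots.txt')
agent = robots.agent('*')

# Follow Crawl-delay if specified; otherwise fall back to the one-second guideline
interval = agent.delay if agent.delay is not None else 1.0

urls = ['https://www.example.com/page1', 'https://www.example.com/page2']
for url in urls:
    if agent.allowed(url):
        response = requests.get(url)
        print(url, response.status_code)
    time.sleep(interval)  # wait between requests so as not to load the server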

[Wiki of 2010 Okazaki Municipal Central Library Incident](https://ja.wikipedia.org/wiki/%E5%B2%A1%E5%B4%8E%E5%B8%82%E7%AB%8B%E4%B8%AD%E5%A4%AE%E5%9B%B3%E6%9B%B8%E9%A4%A8%E4%BA%8B%E4%BB%B6) [Collection of Internet materials under the National Diet Library Act (Material)](https://warp.ndl.go.jp/bulk_info.pdf)

8. Finally

Since this article also serves as a personal memo, it may have been difficult for beginners, but the topic is important, so I summarized it. In short, there are many gray areas here, and where a site does not explicitly say NG there is often no clear rule. In any case, it is NG to interfere with the other party's server operation, so it is important to hold back when you are not sure.

It also seems necessary to keep in mind, as a separate issue, that problems can arise depending on how the data collected by a crawler is used.

**The next article will cover actual scraping.**


[Create a robots.txt file](https://developers.google.com/search/docs/advanced/robots/create-robots-txt?hl=ja&visit_id=637434421534486729-1976705014&rd=1)
[Loading rules and Python conventions for web scraping](https://vaaaaaanquish.hatenablog.com/entry/2017/12/01/064227)
[Notes on scraping and crawling](https://docs.pyq.jp/column/crawler.html)
[List of precautions for web scraping](https://qiita.com/nezuq/items/c5e827e1827e7cb29011)
[Knowledge when web scraping with Python](https://vaaaaaanquish.hatenablog.com/entry/2017/06/25/202924#%E6%B3%95%E5%BE%8B%E3%81%AE%E8%A9%B1)
