This article, "What is scraping?", is written for beginners (and for my past self). It is an overview for people who are about to try scraping, so I hope it serves as a useful first step.
"Web scraping is a computer software technique for extracting information from websites." (Source: Wikipedia)
In other words, "scraping" is the technique of retrieving the information you want from a web page.
A term that is easily confused with it is "crawling": "a program follows links on the Internet to visit websites, and copies and saves the information on the web pages." (Source: weblio dictionary)
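As a minimal sketch of what "retrieving only the information you want" can look like, the following uses only Python's standard library to pull the `<h2>` headings out of a small HTML string. The sample HTML, the `HeadingExtractor` class, and the `extract_headings` function are all made up for illustration; a real scraper would usually fetch the page first and often use a dedicated parsing library.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collects the text of every <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears inside an <h2> element.
        if self._in_h2:
            self.headings[-1] += data


# Made-up page content standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
<h1>Shop</h1>
<h2>Apples</h2><p>100 yen</p>
<h2>Oranges</h2><p>80 yen</p>
</body></html>
"""


def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    return [heading.strip() for heading in parser.headings]


print(extract_headings(SAMPLE_HTML))  # ['Apples', 'Oranges']
```

Everything else on the page (the `<h1>` and the prices) is ignored, which is the "extraction" part of scraping.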
What's the difference? Aren't they the same thing? You may think so, and that impression is mostly right: both are techniques for gathering information. What differs is the emphasis. Scraping emphasizes extracting only the necessary information from a website (extraction), while crawling emphasizes visiting multiple websites and collecting their information (collection). So if you want to get only the information you need while traversing multiple web pages, you have to "crawl and scrape". People define the terms a little differently, but it is fine to think of them as techniques that complement each other (collection plus extraction).
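To make "crawl and scrape" concrete, here is a rough sketch that follows links between pages (crawling) and keeps only each page's title (scraping). To stay runnable without network access, it walks a made-up in-memory "site" stored in a dictionary; `FAKE_SITE`, `PageParser`, and `crawl_and_scrape` are hypothetical names, and a real crawler would download pages over HTTP instead.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML of that page.
FAKE_SITE = {
    "/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "/a": '<title>Page A</title><a href="/b">B</a>',
    "/b": '<title>Page B</title><a href="/">Home</a>',
}


class PageParser(HTMLParser):
    """Collects the links (for crawling) and the title (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl_and_scrape(start, site):
    """Visit every page reachable from `start` and return {path: title}."""
    seen = {start}
    queue = deque([start])
    titles = {}
    while queue:
        path = queue.popleft()
        parser = PageParser()
        parser.feed(site[path])          # "download" and parse the page
        titles[path] = parser.title      # scraping: keep only the title
        for link in parser.links:        # crawling: follow unseen links
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return titles


print(crawl_and_scrape("/", FAKE_SITE))
```

The `seen` set keeps the crawler from visiting the same page twice, which matters because the pages link back to each other.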
Because crawling acquires website information automatically, it can violate copyright law or a site's policies in some cases. Be very careful when you do it. Conversely, suppose you don't want your own site to be crawled. There are several ways to handle this, and the first is to state it clearly in your site policy. However, an automated crawler (a so-called bot) may not notice the policy, so also create a **robots.txt** file. If you write settings in this file such as whether crawling is allowed, well-behaved crawlers will stay away; only malicious ones will ignore it. As a reference site, I would like to introduce "Our Howto Note".
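From the crawler's side, respecting robots.txt can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content, the `my-bot` user agent, and the `example.com` URLs below are made-up examples, and `is_allowed` is a hypothetical helper; a real crawler would load the file from the target site rather than from a string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/index.html"))         # True
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data.html"))  # False
```

Note that this check is voluntary: robots.txt only states the site owner's wishes, and it is up to the crawler to run a check like this before fetching a page.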
Now that I have explained the difference between scraping and crawling, a perceptive reader may be wondering:
"Do I have to do crawling and scraping separately?"
There are many frameworks for crawling and for scraping, and in fact there are also frameworks that scrape while crawling. One of them is **Scrapy**.
For how to use Scrapy, I recommend the reference site "note.nkmk.me". It has a commentary on the Scrapy tutorial and easy-to-understand examples, so if you want to try Scrapy, please take a look. (I used it as a reference too.)
This is my first post on Qiita, so I kept it simple, partly as writing practice. I will add to it or correct it when something is pointed out or when my knowledge is updated.