[PYTHON] What is scraping? [Summary for beginners]


This article, "What is scraping?", is written for beginners (and for my past self). It is an overview for people who are about to try scraping, and I hope it serves as a useful first step.

What is scraping?

"Web scraping is a computer software technique for extracting information from websites." (Source: Wikipedia)

In other words, the technology that retrieves the information you want from a web page is called "scraping."

There is also "crawling", which is easily confused with scraping. Crawling is when "a program follows links on the Internet to visit websites, and copies and saves the information on those web pages." (Source: weblio dictionary)

You might wonder what the difference is, or whether the two are the same thing, and that impression is almost right. Both are techniques for gathering information; they differ in emphasis. Scraping emphasizes extracting only the necessary information from a website (= extraction), while crawling emphasizes visiting multiple websites and collecting information (= collection). So if you want only the information you need while traversing multiple web pages, you have to "crawl and scrape". People define the terms a little differently, but it is fine to think of them as complementary techniques (= collection and extraction).
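To make the distinction concrete, here is a minimal sketch using only the Python standard library. The HTML, the `price` class name, and the `PageParser` / `parse_page` names are all made up for illustration; a real scraper would download the page first. Pulling out the price is the "extraction" side, and gathering the links to visit next is the raw material for the "collection" side.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for this example).
SAMPLE_HTML = """
<html>
  <head><title>Sample shop</title></head>
  <body>
    <h1>Today's price</h1>
    <p class="price">1200 yen</p>
    <a href="/page2.html">Next page</a>
    <a href="/about.html">About</a>
  </body>
</html>
"""


class PageParser(HTMLParser):
    """Records link targets (for crawling) and the price text (for scraping)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.price = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # Links are what a crawler would follow next.
            self.links.append(attrs["href"])
        if tag == "p" and attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            # The price is the one piece of information we want to extract.
            self.price = data.strip()

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_price = False


def parse_page(html):
    """Return (price, links) found in the given HTML string."""
    parser = PageParser()
    parser.feed(html)
    return parser.price, parser.links


price, links = parse_page(SAMPLE_HTML)
print(price)
print(links)
```

In practice many people use a dedicated HTML library instead of `html.parser`, but the division of roles stays the same.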

Important points

Because crawling retrieves website information automatically, it can violate copyright law or a site's policy in some cases. Be very careful whenever you gather information this way. Conversely, suppose you do not want your own site to be crawled. There are several ways to handle this, and the first is to state it clearly in your site policy. However, automated crawlers (so-called bots) may not notice the policy, so also create **robots.txt**. If you write settings in this file, such as whether crawling is allowed, you can keep crawlers away unless they are malicious. As a reference site, I would like to introduce "Our Howto Note".
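From the crawler's side, respecting robots.txt can be checked with the standard library's `urllib.robotparser`. The following is a minimal sketch: the robots.txt content, the `my-bot` user agent, the `example.com` URLs, and the `is_allowed` helper are assumptions for illustration. The rules are parsed from a string here so the example runs without network access; against a real site you would point the parser at the site's robots.txt URL instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data.html"))
```

Note that robots.txt is only a request. Passing this check does not by itself mean that copyright law or the site's policy allows what you are doing.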


I explained the difference between scraping and crawling above, and a sharp reader may have wondered:

"Do I have to do crawling and scraping separately?"

There are many frameworks for crawling and for scraping, and in fact there are also frameworks that scrape while crawling. One of them is **Scrapy**.
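To picture what "scraping while crawling" means, here is a rough standard-library sketch of the loop involved. It is not Scrapy code and does not use Scrapy's API; the in-memory `PAGES` dictionary, the `example.com` URLs, and the names `HeadingAndLinkParser` and `crawl_and_scrape` are all assumptions for illustration. A framework such as Scrapy takes care of this kind of queueing and link following for you, along with the actual downloading.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website" (URL -> HTML) so the example needs no network.
PAGES = {
    "https://example.com/": '<h1>Top</h1><a href="/a.html">A</a><a href="/b.html">B</a>',
    "https://example.com/a.html": '<h1>Page A</h1><a href="/b.html">B</a>',
    "https://example.com/b.html": '<h1>Page B</h1><a href="/">Top</a>',
}


class HeadingAndLinkParser(HTMLParser):
    """Extracts the <h1> text (scraping) and the link targets (crawling)."""

    def __init__(self):
        super().__init__()
        self.heading = None
        self.links = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if self._in_h1:
            self.heading = data.strip()

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False


def crawl_and_scrape(start_url, pages):
    """Visit every page reachable from start_url and return {url: heading}."""
    results = {}
    queue = deque([start_url])
    seen = {start_url}  # avoid visiting the same page twice
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # page does not exist in our pretend website
        parser = HeadingAndLinkParser()
        parser.feed(html)
        results[url] = parser.heading          # scraping: keep only the heading
        for href in parser.links:              # crawling: follow the links
            next_url = urljoin(url, href)
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return results


headings = crawl_and_scrape("https://example.com/", PAGES)
for url, heading in headings.items():
    print(url, heading)
```

A real crawler would also need to wait between requests and check robots.txt, which this sketch leaves out.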

For how to use Scrapy, I recommend the reference site "note.nkmk.me". It has commentary on the Scrapy Tutorial and easy-to-understand examples, so please refer to it if you want to give Scrapy a try. (I used it as a reference too.)

In conclusion

This is my first post on Qiita, so I kept it simple; it also served as writing practice. I will add to it or correct it when something is pointed out or when my knowledge is updated.
