Introduction

**The Internet is a treasure trove of information! Let's make full use of web scraping and crawling to analyze data!** I suspect many people have had this thought. I am one of them.

I decided to study web scraping and crawling as an easy way to obtain data, for example to collect data for machine learning or for studying data science.

This article summarizes what I felt while studying web scraping and crawling.

What is web scraping?

Web scraping is a technique for extracting information from websites. More specifically, it **extracts information from HTML and similar sources**, and it is used to analyze the information published on websites.

With web scraping, as long as you can download a page's data (HTML, etc.), you can obtain information even from sites that do not offer a public API. And if you know how to get past a login, you can scrape pages that require one as well.
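
To make "extracting information from HTML" concrete, here is a minimal sketch that uses only Python's standard library. The sample HTML, the choice of `<h2>` as the target element, and the names `HeadingExtractor` and `extract_headings` are assumptions made for illustration; in practice the string would be the HTML you downloaded.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text inside every <h2> element."""

    def __init__(self):
        super().__init__()
        self._in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False

    def handle_data(self, data):
        # Text only counts while we are between <h2> and </h2>.
        if self._in_heading:
            self.headings[-1] += data


def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return [heading.strip() for heading in parser.headings]


# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Weather archive</h1>
  <h2>Tokyo</h2><p>Sunny, 21 C</p>
  <h2>Osaka</h2><p>Cloudy, 19 C</p>
</body></html>
"""

print(extract_headings(SAMPLE_HTML))  # ['Tokyo', 'Osaka']
```

Everything outside the `<h2>` elements is ignored, which is the essence of scraping: the page is downloaded whole, but only the parts you ask for are kept.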

What is crawling?

Crawling is a technique for following the links on a website to obtain information from its pages. A crawler can visit a site on a regular schedule and detect page updates. However, crawling puts load on the server, so some services prohibit it or refuse requests from bots.
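
One common way a site states what bots may visit is a robots.txt file, and Python's standard library can read one. The following is a rough sketch under assumptions: the robots.txt content, the user agent name `my-crawler`, and the helper `build_robot_rules` are all made up for illustration, and a site's terms of service may impose restrictions that robots.txt does not express.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the site it is about to crawl.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text into a rule object that can be queried."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


rules = build_robot_rules(SAMPLE_ROBOTS_TXT)

# "my-crawler" is an assumed user agent name.
print(rules.can_fetch("my-crawler", "https://example.com/articles/1"))      # True
print(rules.can_fetch("my-crawler", "https://example.com/private/report"))  # False
```

Checking each URL with `can_fetch` before requesting it, and pausing between requests, keeps the load on the server down.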

Programs that perform crawling are called **crawlers** or **spiders**.

One thing to keep in mind when crawling is the **link path**. If an extracted link is already an absolute URL there is no problem, but **a relative path that is followed carelessly can end in an infinite loop** (for example, when pages link to each other). A relative path only has meaning in relation to a base URL (the URL of the page where the link was found), so a crawler normally converts every extracted link to an absolute URL before following it. Absolute URLs also make it possible to recognize pages that have already been visited.
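
The link handling described above can be sketched as follows: each link is resolved against the page it was found on with `urllib.parse.urljoin`, and a set of visited URLs stops the crawl from looping. To keep the example runnable without network access, the "site" is a hypothetical dictionary that maps page URLs to the raw links found on them; `crawl` and `sample_get_links` are illustrative names, not part of any library.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

# Hypothetical site: page URL -> raw href values found on that page.
# The pages link to each other, so a naive crawler would never stop.
SAMPLE_SITE = {
    "https://example.com/": ["a.html", "/docs/b.html"],
    "https://example.com/a.html": ["./", "docs/b.html"],
    "https://example.com/docs/b.html": ["../a.html", "b.html#top"],
}


def sample_get_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return SAMPLE_SITE.get(url, [])


def crawl(start_url, get_links):
    """Breadth-first crawl that visits every reachable page exactly once."""
    visited = set()
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for href in get_links(url):
            # Resolve the link against the page it was found on, then drop
            # the #fragment so the same page is not counted twice.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in visited:
                queue.append(absolute)
    return order


print(crawl("https://example.com/", sample_get_links))
```

Even though all three pages link back to one another, the crawl ends after visiting each page once, because `./`, `../a.html`, and `b.html#top` all resolve to absolute URLs that are already in `visited`.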

Trade-offs in web scraping + crawling

It's surprisingly easy to get started with web scraping and crawling. In my case I use Python 3, so I rely on useful libraries such as **Beautiful Soup** and **Scrapy**.

Once you can scrape and crawl, you can access any URL and collect information by following the links on that page. However, the collected data contains a lot of unneeded noise, so the first job is to remove it.

This is where **CSS selectors** come in. With a CSS selector you can collect only the elements you specify. For example, to collect link targets for crawling, collect only the value of the href attribute of each a tag.
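
As an illustration of collecting only the href values of a tags, here is a sketch that does the equivalent of the CSS selector `a[href]` with the standard library alone, so that it runs without installing anything. With Beautiful Soup, the same idea would typically be written with `soup.select("a[href]")`. The sample HTML and the names `LinkCollector` and `collect_hrefs` are assumptions for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)
                break  # one href per tag is enough


def collect_hrefs(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical page fragment.
SAMPLE_HTML = """
<p>
  <a href="/news/1">First article</a>
  <a name="top">Anchor without href</a>
  <img src="logo.png">
  <a href="https://example.com/about">About</a>
</p>
"""

print(collect_hrefs(SAMPLE_HTML))  # ['/news/1', 'https://example.com/about']
```

The image and the anchor without an href are skipped, so only the two link targets remain.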

But CSS selectors aren't a cure-all either. For example, when collecting destination URLs for crawling, it's better to exclude links to the contact page and ad links. A CSS selector lets you collect only what you need, but the selector that works differs from site to site; there is no single selector that fits every web page.
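
Excluding contact and ad links usually comes down to site-specific rules. The sketch below filters a list of absolute URLs with two made-up rules: a path prefix `/contact` and an ad host `ads.example.net`. Both rules, the sample links, and the function names are hypothetical, and they are exactly the part that has to be re-examined for each site.

```python
from urllib.parse import urlparse

# Hypothetical, site-specific exclusion rules.
EXCLUDED_PATH_PREFIXES = ("/contact",)
EXCLUDED_HOSTS = {"ads.example.net"}


def is_crawl_target(url):
    """Return True if the URL passes the exclusion rules above."""
    parts = urlparse(url)
    if parts.hostname in EXCLUDED_HOSTS:
        return False
    # str.startswith accepts a tuple and checks every prefix in it.
    return not parts.path.startswith(EXCLUDED_PATH_PREFIXES)


def filter_links(urls):
    return [url for url in urls if is_crawl_target(url)]


SAMPLE_LINKS = [
    "https://example.com/articles/1",
    "https://example.com/contact",
    "https://ads.example.net/banner?id=42",
    "https://example.com/articles/2",
]

print(filter_links(SAMPLE_LINKS))
```

Only the two article URLs survive the filter. On another site the contact page might live at a different path and the ads on a different host, which is why rules like these do not carry over unchanged.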

Therefore, when collecting information from the Internet by web scraping and crawling, you have to choose between two kinds of program: **one that works on any site but leaves you with a large amount of noise to process**, or **one that collects only the information you need but requires you to work out the CSS selectors for each site**.

If anyone knows a way to build something that works universally and still collects only the information you need, please leave a comment.

If you want to use web scraping + crawling for data analysis

I've written about the trade-offs in web scraping and crawling, but the original purpose is to analyze data. (Some people may say theirs is to build a search engine...) If you use web scraping and crawling to collect data for analysis, I think you need to become able to judge for yourself whether general-purpose collection or site-specific collection is the better fit.

Neither method is always right, just as no result in data analysis is 100% certain. Rather than agonizing over which method to use, I think it's better to get your hands moving and start on the analysis.

Conclusion

This article summarized what I learned and noticed while studying web scraping and crawling. I hope it will be of some help to those who are already scraping or crawling, or who want to try.