[PYTHON] 10 questions to check before web scraping

With [web scraping](https://www.octoparse.jp/), you can collect the information you need in seconds and unlock the value behind it. Before you start, though, there are 10 questions to check.

1. Is web scraping illegal?

First check the site's [robots.txt](https://ja.wikipedia.org/wiki/Robots_exclusion_standard) file. If its rules allow crawling, we still recommend reading the target website's Terms of Service (ToS) in advance to assess the legal feasibility of your data project. Some sites state clearly that scraping is prohibited without permission. In that case, you must get permission.
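
As a rough illustration of the robots.txt check, the sketch below uses Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS`, the crawler name `my-crawler`, and the `example.com` URLs are made up for the example; for a real site you would load that site's own robots.txt, and you would still need to read its ToS.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/products"))
print(is_allowed(SAMPLE_ROBOTS, "my-crawler", "https://example.com/private/data"))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before you call `can_fetch()`.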

2. Decide which website you want to get data from

What is the purpose of collecting the data? Lead generation? Price monitoring? A sales list? SEO analysis? Where can you find high-quality information, and how can you locate the target data? Making an informed decision when choosing a data source is very important because it can significantly affect your results. The introductory articles in the Help Center, which cover scraping various kinds of information from popular websites, can give you hints.

3. Check whether the target website provides an API

If the target website provides an API, you can retrieve the data directly through the API platform it offers. There is no need to spend time scraping. For how to connect, refer to the examples published for that API platform.
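
A minimal sketch of what retrieving data through an API can look like, using only the Python standard library. The endpoint `https://api.example.com/v1/products`, the query parameters, and the bearer-token header are all hypothetical; a real API defines its own URLs, parameters, and authentication scheme in its documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, token=None):
    """Build a GET request with query parameters and an optional bearer token."""
    url = f"{base_url}?{urlencode(params)}"
    headers = {"Accept": "application/json"}
    if token:
        # Assumption: the API uses bearer tokens; many APIs use other schemes.
        headers["Authorization"] = f"Bearer {token}"
    return Request(url, headers=headers)


def fetch_json(request, timeout=10.0):
    """Send the request and decode the JSON response body."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical endpoint and parameters, for illustration only.
request = build_request(
    "https://api.example.com/v1/products",
    {"q": "laptop", "page": 1},
)
print(request.full_url)
```

Calling `fetch_json(request)` would perform the actual network call. It is kept separate here so that the request can be inspected before anything is sent.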

4. Clarify your time and budget

Scraping small amounts of data is quick and easy with free scraping tools or free Python scripts. However, if you want to scrape a large amount of data from many pages with different website structures, you need to automate the work. You can do that by spending time learning to program or by outsourcing. In fact, many dedicated data service providers offer data collection services, and Octoparse is one of them. Even with your computer turned off, you can extract a large amount of data in the cloud in about the time it takes to eat a meal.

5. What to do when the website requires login or uses filter links

The URL of a web page before you log in (or before you enter or select parameters) may differ from the URL after you log in or apply a filter. So instead of starting from the homepage URL and repeating the inputs each time, access the target page directly (the link shown after the search, or the data page reached after login).
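
One way to go straight to the filtered page is to build its URL yourself. The sketch below assumes the site encodes its filters as query parameters, which is common but not universal. The `example.com` search URL and the `page` and `sort` parameter names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filters(result_url, **filters):
    """Return result_url with the given filters added to (or replacing) its query string."""
    parts = urlsplit(result_url)
    query = dict(parse_qsl(parts.query))
    query.update({key: str(value) for key, value in filters.items()})
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical search-result URL copied from the browser after applying a filter once.
print(with_filters("https://example.com/search?q=shoes", page=2, sort="price"))
```

Pages behind a login additionally need the session cookie or token that the site issues after authentication, which this sketch does not cover.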

6. What to do if the website's bot detection is triggered and your IP address is likely to be banned

If a web crawler visits a site very frequently within a short period (a pattern that is unlikely to come from a human), the website can track and ban the local IP address. One solution is to slow the scraping process down enough that it does not trigger bot detection. However, if you need the latest data or need it quickly, use the IP rotation feature.
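
For a hand-written crawler, slowing down can be as simple as enforcing a minimum, slightly randomized gap between requests. The `Throttle` class below is a hypothetical helper written for this article, not part of Octoparse or any library, and the delay values are arbitrary. What counts as a safe rate depends entirely on the target site.

```python
import random
import time


class Throttle:
    """Enforce a minimum, optionally randomized, gap between consecutive requests."""

    def __init__(self, min_delay, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # minimum seconds between requests
        self.jitter = jitter        # extra random delay, from 0 up to this many seconds
        self._clock = clock         # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None           # time of the previous request, None before the first

    def wait(self):
        """Sleep if the previous call was too recent; return the seconds slept."""
        slept = 0.0
        if self._last is not None:
            target_gap = self.min_delay + random.uniform(0.0, self.jitter)
            remaining = target_gap - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

A crawler would create one instance, for example `throttle = Throttle(min_delay=2.0, jitter=1.0)`, and call `throttle.wait()` immediately before each request.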

7. How to deal with CAPTCHA

In Octoparse, you can solve a [CAPTCHA](https://helpcenter.octoparse.jp/hc/ja/articles/360015816473-Octoparse%E3%81%AFCAPTCHA-reCAPTHCA%E3%82%92%E5%87%A6%E7%90%86%E3%81%A7%E3%81%8D%E3%81%BE%E3%81%99%E3%81%8B-?source=search&auth_token=eyJhbGciOiJIUzI1NiJ9.eyJhY2NvdW50X2lkIjo5MDc3MjYzLCJ1c2VyX2lkIjozOTUyNDkzNjQyNzQsInRpY2tldF9pZCI6OTQxLCJjaGFubmVsX2lkIjo2MywidHlwZSI6IlNFQVJDSCIsImV4cCI6MTU3ODAxNTI4NH0.WOZ-IR83jS4KbRxYvM21mEEFBYI338aV022wJyH5yhc) manually, just as you normally would when browsing a website. However, it is better not to trigger one in the first place. Don't scrape the website too aggressively; scrape it at a human-like pace.

8. Extracted data export format

You can export the data in the following formats: Excel, JSON, CSV, HTML, or MySQL. Alternatively, you can use the [API](https://helpcenter.octoparse.jp/hc/ja/articles/360017791934-API?source=search&auth_token=eyJhbGciOiJIUzI1NiJ9.eyJhY2NvdW50X2lkIjo5MDc3MjYzLCJ1c2VyX2lkIjozOTUyNDkzNjQyNzQsInRpY2tldF9pZCI6OTQxLCJjaGFubmVsX2lkIjo2MywidHlwZSI6IlNFQVJDSCIsImV4cCI6MTU3ODAxNTI4NH0.WOZ-IR83jS4KbRxYvM21mEEFBYI338aV022wJyH5yhc) to export it to your own system.
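
If you scrape with your own Python script instead, two of these formats are easy to produce with the standard library. This is only a sketch: the `name` and `price` fields and the sample rows are invented, and it says nothing about how Octoparse implements its own export.

```python
import csv
import io
import json


def to_csv(records, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize a list of dicts to pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical scraped rows, for illustration only.
rows = [
    {"name": "Widget", "price": 1200},
    {"name": "Gadget", "price": 980},
]
print(to_csv(rows, ["name", "price"]))
print(to_json(rows))
```

To write to disk, open the target file with `newline=""` for CSV, as the `csv` module documentation recommends, and write the returned string.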

9. What to do if the website changes and you lose your data

If you need to keep up with the latest data, crawlers written in a programming language can stop working whenever the structure of the target website changes. Rewriting a script is not easy; it can be tedious and time-consuming. In Octoparse, instead of rewriting code, you can bring your crawler up to date simply by clicking through the web page again in the built-in browser.
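
If you do maintain a script, it helps to notice quickly when a layout change has broken it. The sketch below is one possible sanity check, not an established method: it counts how often required fields come back empty and flags a run as suspicious. The function names, the field names, and the 50% threshold are all assumptions chosen for illustration.

```python
def missing_field_counts(records, required_fields):
    """Count, per required field, how many records have it missing or blank."""
    counts = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts


def looks_broken(records, required_fields, max_missing_ratio=0.5):
    """Heuristic: a run looks broken if it is empty or a field is mostly missing."""
    if not records:
        return True
    counts = missing_field_counts(records, required_fields)
    return any(count / len(records) > max_missing_ratio for count in counts.values())


# Hypothetical output of a crawler after the site renamed its price element.
scraped = [
    {"title": "Widget", "price": ""},
    {"title": "Gadget", "price": None},
]
print(missing_field_counts(scraped, ["title", "price"]))
print(looks_broken(scraped, ["title", "price"]))
```

A check like this only tells you that something changed. Updating the selectors in the script is still manual work.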

10. Analysis of collected data

What has the biggest impact on a business is not collecting the data but analyzing it. Being able to make decisions based on that data is what matters.
