[PYTHON] 10 Open Source Web Crawlers for 2020

What is a web crawler? It is a program that automatically collects information published on the Internet, such as text, images, and videos, and stores it in a database. Web crawlers play a key role in the big data boom by making it easy for anyone to collect data.

Among these, there are many open source web crawler frameworks. Open source web crawlers allow users to program on top of the source code or framework, provide resources to assist scraping, and simplify data extraction. In this article, we introduce 10 recommended open source web crawlers.

  1. Scrapy

**Language: Python**

Scrapy is the most popular open source web crawler framework for Python. It helps you efficiently extract data from websites, process it as needed, and save it in your preferred format (JSON, XML, CSV). Built on the Twisted asynchronous networking framework, it can accept requests and process them quickly. With a Scrapy project you can build large-scale crawling and scraping efficiently and flexibly; a minimal spider is sketched after the feature list below.

Features:
- Fast and powerful
- Detailed documentation
- New features can be added without touching the core
- Active community and abundant resources
- Can run in a cloud environment
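
A minimal Scrapy spider sketch: the target site (quotes.toscrape.com, a public scraping sandbox) and its CSS selectors are assumptions used only for illustration.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Example target: a public scraping sandbox; the selectors below are illustrative
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Recursively follow the pagination link, if present
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Running it with `scrapy runspider quotes_spider.py -o quotes.json` saves the scraped items as JSON.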

  2. Heritrix

**Language: Java**

Heritrix (https://webarchive.jira.com/wiki/spaces/Heritrix/overview) is a highly extensible, Java-based open source web crawler designed for web archiving. It respects robots.txt exclusion directives and meta robot tags, and collects data at a measured, adaptive pace that does not disrupt normal website activity. It provides a web-based user interface, accessible from a browser, for operator control and monitoring of crawls.

Features:
- Replaceable, plug-compatible modules
- Web-based interface
- Respects robots.txt and meta robot tags
- Excellent extensibility

  3. Web-Harvest

**Language: Java**

Web-Harvest is an open source web crawler written in Java. It can collect data from specified pages. To do this, it primarily leverages technologies such as XSLT, XQuery, and regular expressions to manipulate or filter the content of HTML/XML-based websites. It can easily be extended with custom Java libraries to enhance its extraction capabilities.

Features:
- Powerful text and XML manipulation processors for data processing and control flow
- A variable context for storing and using variables
- Supports real scripting languages that can be easily integrated into web crawlers

  4. MechanicalSoup

**Language: Python**

MechanicalSoup is a Python library for automating interaction with websites. It provides an API similar to, and built on, the Python giants Requests (for HTTP sessions) and BeautifulSoup (for document navigation). It can automatically store and send cookies, follow redirects, follow links, and submit forms. MechanicalSoup is very useful when you want to simulate human behavior rather than just scrape data; a short sketch follows the feature list below.

Features:
- Ability to simulate human behavior
- Can scrape fairly simple websites at high speed
- Supports CSS and XPath selectors
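
A minimal MechanicalSoup sketch of the "simulate a human" workflow: open a page, fill in a form, and submit it. The target URL (httpbin.org's demo form) and its field names are assumptions used only for illustration.

```python
import mechanicalsoup

browser = mechanicalsoup.StatefulBrowser()

# Example target: httpbin.org's demo form (illustrative only)
browser.open("https://httpbin.org/forms/post")

# Select the first <form> on the page and fill in two of its fields
browser.select_form("form")
browser["custname"] = "Taro"
browser["comments"] = "Please deliver after 18:00"

# Submit the form; cookies and redirects are handled automatically
response = browser.submit_selected()
print(response.status_code)
print(browser.get_url())  # current URL after any redirects
```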

  5. Apify SDK

**Language: JavaScript**

The Apify SDK (https://sdk.apify.com/) is one of the best web crawlers built with JavaScript. This scalable scraping library enables the development of data extraction and web automation jobs with headless Chrome and Puppeteer. With unique and powerful tools such as RequestQueue and AutoscaledPool, you can start from several URLs and recursively follow links to other pages, each performing a scraping task at the maximum capacity of your system.

Features:
- Large-scale, high-performance scraping
- A pool of proxies to avoid detection
- Supports Node.js plugins such as Cheerio and Puppeteer

  6. Apache Nutch

**Language: Java**

Apache Nutch is an open source web crawler framework written in Java. With its sophisticated modular architecture, developers can create plugins for media-type parsing, data retrieval, querying, and clustering. Pluggable and modular, Nutch also offers extensible interfaces for custom implementations.

Features:
- Highly extensible
- Follows robots.txt rules
- A vibrant community and active development
- Pluggable parsing, protocols, storage, and indexing

  7. Jaunt

**Language: Java**

Jaunt is based on Java and is designed for web scraping, web automation, and JSON querying. It provides a fast, ultra-lightweight headless browser with web scraping capabilities, access to the DOM, and control over each HTTP request/response, but it does not support JavaScript.

Features:
- Processes individual HTTP requests/responses
- Easy to connect with REST APIs
- Supports HTTP, HTTPS, and basic authentication
- Regex-enabled querying of the DOM and JSON

  8. Node-crawler

**Language: JavaScript**

Node-crawler is a powerful and popular production web crawler based on Node.js. It is fully written in Node.js and supports non-blocking I/O, which makes the crawler's pipeline manipulation mechanism very convenient. At the same time, it supports fast DOM selection (no need to write regular expressions), which improves the efficiency of crawler development.

Features:
- Rate control
- Priority handling for URL requests
- Configurable pool size and retries
- Automatic jQuery insertion with a server-side DOM via Cheerio (default) or JSDOM

  9. PySpider

**Language: Python**

PySpider is a powerful web crawler framework written in Python. With an easy-to-use web UI and a distributed architecture built from components such as a scheduler, fetcher, and processor, it makes it easy to manage multiple crawls. It supports various databases for data storage, such as MongoDB and MySQL; a minimal crawl script is sketched after the feature list below.

Features:
- User-friendly interface
- RabbitMQ, Beanstalk, Redis, and Kombu as message queues
- Distributed architecture
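
A minimal PySpider handler sketch, modeled on the framework's default project template and written in the web UI's script editor. The start URL (https://example.com/) is a placeholder.

```python
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)          # re-run the entry point once a day
    def on_start(self):
        # Placeholder start URL; replace with the site you want to crawl
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # treat fetched pages as fresh for 10 days
    def index_page(self, response):
        # Queue every outgoing link for detail scraping
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # Returned dicts are stored in the configured result backend
        return {
            "url": response.url,
            "title": response.doc("title").text(),
        }
```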

  10. StormCrawler

**Language: Java**

StormCrawler is an open source SDK for building distributed web crawlers with Apache Storm. The project is under the Apache License v2 and consists mostly of a collection of reusable resources and components written in Java. It is ideal when the URLs to fetch and parse are provided as a stream, but it is also a good solution for large recursive crawls, especially when low latency is required.

Features:
- Highly extensible and suitable for large-scale recursive crawls
- Easy to extend with additional libraries
- Excellent thread management that reduces crawl latency

Summary

Open source web crawlers are very powerful and extensible, but they are limited to developers. There are many scraping tools, such as Octoparse, that let you extract data easily without writing any code. If you are not familiar with programming, these tools may be more suitable and will make scraping easier.

Original article: https://www.octoparse.jp/blog/10-best-open-source-web-crawler
