[PYTHON] Algorithm-based web scraping library Scrapely

Introduction

This article is mainly a translation of the Scrapely README. I tried the library out as I worked through the README's contents. If you only have a few seconds and want the gist, read the "What is Scrapely" and "Summary" sections of this article.

What is Scrapely

A library for extracting structured data from HTML pages. Given an example web page and the data to be extracted from it, Scrapely builds a parser for all similar pages.

Data is extracted using an algorithm called Instance Based Learning. [^1] [^2]
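
To get a feel for what "learning from one example" means, here is a deliberately simplified sketch in Python. It is not Scrapely's actual algorithm or API: the functions `learn_template` and `apply_template`, the `context` parameter, and the two sample HTML strings are all hypothetical. The sketch remembers the text immediately before and after the sample value in an example page, then uses that context to pull the corresponding value out of a similar page.

```python
def learn_template(page, value, context=10):
    """Remember the text surrounding `value` in an example page (toy illustration)."""
    start = page.find(value)
    if start == -1:
        raise ValueError("sample value not found in example page")
    end = start + len(value)
    prefix = page[max(0, start - context):start]
    suffix = page[end:end + context]
    return prefix, suffix


def apply_template(template, page):
    """Return the text between the learned prefix and suffix, or None if absent."""
    prefix, suffix = template
    start = page.find(prefix)
    if start == -1:
        return None
    start += len(prefix)
    end = page.find(suffix, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical pages that share the same layout but carry different data.
example = "<html><body><h1>w3lib 1.1</h1><span>Scrapy project</span></body></html>"
similar = "<html><body><h1>Django 1.3</h1><span>Django Software Foundation</span></body></html>"

name_template = learn_template(example, "w3lib 1.1")
author_template = learn_template(example, "Scrapy project")

print(apply_template(name_template, similar))    # Django 1.3
print(apply_template(author_template, similar))  # Django Software Foundation
```

A fixed-width text context like this breaks as soon as the markup around the value changes, so treat it only as an illustration of the idea.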

Installation

Scrapely works in Python 2.7 or 3.3+. It requires numpy and w3lib Python packages.

pip install scrapely

Use from the command line

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a             annotate      ls              s       ta
add_template  del_template  ls_annotations  scrape  td
al            help          ls_templates    t       tl

scrapely> 

scrapely.tool is used as follows:

python -m scrapely.tool <scraper_file> [command arg ...]

<scraper_file> is the name of the file in which the template information is saved.
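
The usage line above also allows a command to be given directly on the command line. As a small sketch, the hypothetical helper `build_tool_command` below assembles that argument list in Python. It only builds the list and does not run anything.

```python
import sys


def build_tool_command(scraper_file, command=None, *args):
    """Build the argv list for: python -m scrapely.tool <scraper_file> [command arg ...]"""
    argv = [sys.executable, "-m", "scrapely.tool", scraper_file]
    if command is not None:
        argv.append(command)
        argv.extend(args)
    return argv


# Without a command, the tool opens its interactive "scrapely>" prompt.
print(build_tool_command("myscraper.json")[1:])

# With a command, the command and its arguments follow the scraper file.
print(build_tool_command("myscraper.json", "scrape", "http://pypi.python.org/pypi/Django/1.3")[1:])
```

Assuming Scrapely is installed, the resulting list could be handed to `subprocess.run` to run the tool from a script.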

Short commands such as a and ta are aliases for annotate and add_template, respectively.

Command name    Description
add_template    add_template {url} [--encoding ENCODING] - (alias: ta)
annotate        annotate {template_id} {data} [-n number] [-f field] - add or test annotation (aliases: a, t)
del_template    del_template {template_id} - delete template (alias: td)
ls_annotations  ls_annotations {template} - list annotations (alias: al)
ls_templates    list templates (aliases: ls, tl)
scrape          scrape {url} - scrape url (alias: s)
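
The alias relationships in the command list above can be written down as a lookup table. This is only a sketch for reading the transcripts in this article: the names `ALIASES`, `resolve_command`, and `expand` are hypothetical, and this is not how scrapely.tool itself implements aliases.

```python
# Alias table transcribed from the command list above.
ALIASES = {
    "ta": "add_template",
    "a": "annotate",
    "t": "annotate",
    "td": "del_template",
    "al": "ls_annotations",
    "ls": "ls_templates",
    "tl": "ls_templates",
    "s": "scrape",
}


def resolve_command(name):
    """Return the full command name for an alias; full names pass through unchanged."""
    return ALIASES.get(name, name)


def expand(line):
    """Expand the alias at the start of a command line, leaving the arguments untouched."""
    head, _, rest = line.partition(" ")
    full = resolve_command(head)
    return f"{full} {rest}" if rest else full


print(expand("ta http://pypi.python.org/pypi/w3lib/1.1"))  # add_template http://pypi.python.org/pypi/w3lib/1.1
print(expand("tl"))                                        # ls_templates
```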

Create a scraper and add a template

scrapely> add_template http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1

List the templates available in the scraper

scrapely> ls_templates
[0] http://pypi.python.org/pypi/w3lib/1.1

Test selection criteria before adding annotations

scrapely> annotate 0 "w3lib 1.1"
[0] '<h1>w3lib 1.1</h1>'
[1] '<title>Python Package Index : w3lib 1.1</title>'

The command above matched two elements.

Specify which match to use with -n

scrapely> annotate 0 "w3lib 1.1" -n 0
[0] '<h1>w3lib 1.1</h1>'
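
The matching and selection steps above can be imitated with the standard library. The sketch below is not Scrapely's implementation: the class `TextMatcher`, the function `find_matches`, and the sample page are hypothetical, and ranking the shortest match first is my own assumption that happens to reproduce the order shown above. It lists every element whose text contains the search string, then picks one by index in the same way as -n.

```python
from html.parser import HTMLParser


class TextMatcher(HTMLParser):
    """Collect elements whose own text contains a search string."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []    # currently open tags, innermost last
        self.matches = []  # rendered matching elements, in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.needle in data:
            tag = self.stack[-1]
            self.matches.append(f"<{tag}>{data}</{tag}>")


def find_matches(html, needle):
    """Return matching elements, shortest (tightest) match first (an assumption)."""
    parser = TextMatcher(needle)
    parser.feed(html)
    parser.close()
    return sorted(parser.matches, key=len)


# Hypothetical, heavily reduced stand-in for the template page.
page = (
    "<html><head><title>Python Package Index : w3lib 1.1</title></head>"
    "<body><h1>w3lib 1.1</h1><span>Scrapy project</span></body></html>"
)

matches = find_matches(page, "w3lib 1.1")
for index, element in enumerate(matches):
    print(f"[{index}] {element!r}")

# Counterpart of "-n 0": keep only the match at index 0.
print(matches[0])
```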

Add annotations with field names to the template

scrapely> annotate 0 "w3lib 1.1" -n 0 -f name
[new](name) '<h1>w3lib 1.1</h1>'
scrapely> annotate 0 "Scrapy project" -n 0 -f author
[new] '<span>Scrapy project</span>'

List the annotations in the template

scrapely> ls_annotations 0
[0-0](name) '<h1>w3lib 1.1</h1>'
[0-1](author) '<span>Scrapy project</span>'

Scrape similar pages using the added template

scrapely> scrape http://pypi.python.org/pypi/Django/1.3
[{'author': ['Django Software Foundation'], 'name': ['Django 1.3']}]
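
Note that the result is a list of records and that every field holds a list of values. If you only want one value per field, a small post-processing step is enough. The sketch below uses the result shown above as a literal; the helper `first_values` is hypothetical and not part of Scrapely.

```python
# Result copied from the scrape command above: a list of records,
# each mapping a field name to a list of extracted values.
result = [{'author': ['Django Software Foundation'], 'name': ['Django 1.3']}]


def first_values(record):
    """Keep only the first extracted value of each field (None if the list is empty)."""
    return {field: (values[0] if values else None) for field, values in record.items()}


flat = [first_values(record) for record in result]
print(flat)  # [{'author': 'Django Software Foundation', 'name': 'Django 1.3'}]
```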

Scrapely and Scrapy have similar names, but...

Scrapy is an application framework for building web crawlers, whereas Scrapely is a library for extracting structured data from HTML pages. Scrapely is more like BeautifulSoup or lxml than Scrapy. [^3]

Summary

In ordinary site scraping, you write selectors by hand. With Scrapely, I was able to scrape similar pages just by specifying a sample URL and sample data. There was also an open source service that built on this property so that people without programming knowledge could scrape sites. [^4] That is my summary (and impression) of Scrapely.

Finally

This was today's Friday I/O. At Wamuu Co., Ltd., every Friday is a day to work on something that interests us and publish the results in some form. Thank you very much.
