Scraping with Node, Ruby and Python

I decided to give scraping a try, so I looked into how to do it in Node, Ruby, and Python. In each case the task is the same: fetch the title of google.co.jp.


Try it with Node

Fetch the page with request, then parse it with cheerio and search for elements using jQuery-like selectors.

First, install the modules from the terminal.

$ npm install request cheerio

Create a file and implement it.

scrape.js


var request = require('request'),
    cheerio = require('cheerio');

var url = 'http://google.co.jp';

// fetch the page, then load the body into cheerio and read out the <title>
request(url, function (error, response, body)
{
    if (!error && response.statusCode === 200)
    {
        var $ = cheerio.load(body),
            title = $('title').text();
        console.log(title);
    }
});

Try running it.

$ node scrape.js
Google

cheerio implements not only element selection but also some of jQuery's methods such as $.addClass and $.append, so it seems like a good fit for cases where you also want to manipulate the DOM.
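For example, something along these lines works as a rough sketch, reusing the $ loaded in the callback above (the class name and appended markup are made up for illustration):


// jQuery-style manipulation on the parsed document, then serialize it back to HTML
$('title').addClass('scraped');                    // illustrative class name
$('body').append('<p>appended by cheerio</p>');    // illustrative markup
console.log($.html());                             // dump the modified HTML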

Try it with Ruby

A quick look around turns up Nokogiri first; it's the de facto standard. Fetch the page with open-uri and parse it with Nokogiri.

$ gem install nokogiri

open-uri ships with the standard library, so only Nokogiri needs to be installed. Create a file.

scrape.rb


require 'open-uri'
require 'nokogiri'

url = 'http://www.google.co.jp/'
html = open(url)                 # fetch the page with open-uri
doc = Nokogiri::HTML.parse(html) # parse it into a searchable document

puts doc.css('title').text       # pull out the <title> with a CSS selector

The object returned by HTML.parse can be searched with XPath, CSS selectors, or either interchangeably. CSS selectors are easy and nice.
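As a quick sketch of that, here is the same <title> lookup written each way, continuing from the doc object above:


puts doc.xpath('//title').text   # XPath
puts doc.css('title').text       # CSS selector
puts doc.at('title').text        # at/search accept either form and return the first match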

Try running it.

$ ruby scrape.rb
Google

Nice and quick to get done.

Try it with Python

Scrapy came up first, but it's a fairly large library, so I'll casually try BeautifulSoup instead. The standard library has HTMLParser, but BeautifulSoup seems to handle a lot for you.

The installation didn't work with pip, so I installed it with easy_install.

$ easy_install BeautifulSoup

The flow is: fetch the page with urllib and parse it with BeautifulSoup.

scrape.py


import urllib
import BeautifulSoup

url = 'http://www.google.co.jp/'
html = urllib.urlopen(url).read()         # fetch the raw HTML
soup = BeautifulSoup.BeautifulSoup(html)  # parse it with BeautifulSoup 3

print soup.find('title').string           # text of the first <title> element

Try running it.

$ python scrape.py
Google
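find grabs the first match, while findAll collects every one. A small sketch, assuming the same BeautifulSoup 3 module and the soup object built above:


# list the href of every link on the page
for a in soup.findAll('a'):
    print a.get('href')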

Try it with Scrapy (Python)

Scrapy seems to be a library that covers both crawling and scraping. I tried it out briefly, so here are my notes.

$ pip install scrapy

Take a quick look at the tutorial in the documentation. First, generate a project skeleton with scrapy.

$ scrapy startproject hello

Create a file directly under the spiders directory and write the crawling and scraping logic there.

hello/hello/spiders/scrape.py


from scrapy.spider import Spider
from scrapy.selector import Selector

class HelloSpider(Spider):
    name = "hello"                            # referenced by `scrapy crawl hello`
    allowed_domains = ["google.co.jp"]
    start_urls = ["http://www.google.co.jp/"]

    # called with the response for each start URL
    def parse(self, response):
        sel = Selector(response)
        title = sel.css('title::text').extract()
        print title

Either XPath or CSS selectors can be used to pick out elements (a comparison of the two follows the output below). Now run it from the terminal.

$ scrapy crawl hello

Output result

[u'Google']
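For comparison, here is a minimal sketch of the same parse method pulling the title out with both selector styles, using the same old-style Selector API as above:


    def parse(self, response):
        sel = Selector(response)
        print sel.css('title::text').extract()       # CSS selector
        print sel.xpath('//title/text()').extract()  # equivalent XPath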

Crawling is included as part of the package, so Scrapy looks like a good choice when you want to build something more substantial.
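As a rough sketch of that crawling side (not from the example above; this assumes a newer Scrapy release where CrawlSpider and LinkExtractor live under scrapy.spiders and scrapy.linkextractors, and the domain is purely illustrative):


from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FollowSpider(CrawlSpider):
    name = "follow"
    allowed_domains = ["example.com"]       # illustrative domain
    start_urls = ["http://example.com/"]

    # follow every in-domain link and pass each fetched page to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print response.css('title::text').extract()  # title of each crawled page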
