Scraping with Python + PyQuery

PyQuery

Python has a handy module called PyQuery that provides a jQuery-like API. Beautiful Soup seems to be the more popular choice, but I find PyQuery easier to use. Because it is built on lxml, it should also be fast and reliable.

If you pass a URL to the constructor, PyQuery fetches the page for you. You can also pass an HTML string or a file object. After that, call the resulting object with a jQuery-style selector string to get all the matching elements.

You can also manipulate each element by passing a lambda expression or a function. If you know jQuery, you can probably guess what is possible. See the manual for details.

DOM operation example

Use the .each() method to set attributes on each selected element. Because class is a reserved word in Python, write class_ instead; PyQuery maps it to the HTML class attribute.

sample.py


from pyquery import PyQuery as pq


html = '''
<ul>
  <li> item 1 </li>
  <li> item 2 </li>
  <li> item 3 </li>
</ul>
'''

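dom = pq(html)
# each() calls the function with the index and the raw lxml element,
# so wrap the element in pq() to use the PyQuery API on it.
dom('li').each(lambda index, node: pq(node).attr(class_='red', x='123'))

print(dom)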

When I ran it, the class attribute and an arbitrary custom attribute x were set on every li element.

<ul>
  <li x="123" class="red"> item 1 </li>
  <li x="123" class="red"> item 2 </li>
  <li x="123" class="red"> item 3 </li>
</ul>

For the class attribute, dom('li').addClass('red') does the same thing.

Image URL acquisition sample

I wrote a sample program that fetches a web page and extracts its image URLs. It selects the img tags and iterates over the matched elements with .items().

img_scraper.py


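#!/usr/bin/env python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2
from pprint import pprint

from pyquery import PyQuery as pq

url = 'http://www.yahoo.co.jp'

# Passing a URL makes PyQuery fetch and parse the page.
dom = pq(url)
result = set()
for img in dom('img').items():
    img_url = img.attr['src']
    if not img_url:
        # Skip img tags that have no src attribute.
        continue
    if img_url.startswith('http'):
        result.add(img_url)
    else:
        # Resolve relative paths against the page URL.
        result.add(urljoin(url, img_url))

pprint(result)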

When I ran it, the result was as follows:

set(['http://i.yimg.jp/images/sicons/box16.gif',
     'http://k.yimg.jp/images/clear.gif',
     'http://k.yimg.jp/images/common/tv.gif',
     'http://k.yimg.jp/images/icon/photo.gif',
     'http://k.yimg.jp/images/new2.gif',
     'http://k.yimg.jp/images/sicons/ybm161.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/iconMail.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/icon_point.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/info_btn-140325.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/logo7.gif',
     'http://lpt.c.yimg.jp/im_sigg6mIfJALB8FuA5LAzp6.HPA---x120-y120/amd/20150208-00010001-dtohoku-000-view.jpg'])
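The URL handling in the script relies on urljoin from the standard library. The sketch below shows how it resolves different kinds of src values against the page URL. The relative paths are hypothetical; only the base URL and the absolute image URL come from the sample above.

```python
from urllib.parse import urljoin

base = 'http://www.yahoo.co.jp'

# Relative path: resolved against the base URL.
relative = urljoin(base, 'images/logo.gif')

# Root-relative path: replaces the path of the base URL.
root_relative = urljoin(base, '/images/logo.gif')

# Absolute URL: returned unchanged.
absolute = urljoin(base, 'http://k.yimg.jp/images/clear.gif')

# Protocol-relative URL: inherits the scheme of the base URL.
protocol_relative = urljoin(base, '//k.yimg.jp/images/clear.gif')

print(relative)
print(root_relative)
print(absolute)
print(protocol_relative)
```

Because urljoin already leaves absolute URLs untouched, the startswith('http') check in the script is a readability choice rather than a requirement.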

If you select a tags instead of img tags and follow the collected links in combination with gevent, you can build a crawler in no time.

Google Finance Scraper

This is a script that scrapes financial statements from Google Finance. It is long, so I am only posting a link to the Gist.

https://gist.github.com/knoguchi/6952087
