[PYTHON] Scraping Alexa's web rank with pyQuery

What I want to do

Alexa, an access-statistics site, publishes a ranking of the most-visited sites for each country (up to 500th place, as 20 HTML pages of 25 sites each).
I want to scrape those HTML pages to build a list of URLs in rank order.

**I tried using pyQuery.** I also found a library called Scrapy, but it bundles a crawler and looked like more trouble than I needed, so I skipped it. BeautifulSoup looks good too, but this time I will try pyQuery.

Installation

$ yum install libxml2-devel libxslt-devel
$ pip install pyquery

pyQuery is built on lxml, which needs libxml2 and libxslt, so install their development packages first. If you don't have pip, install that as well.

Reference (trying the pyQuery sample)

I tried scraping an earthquake information site using the sample code from [here][Ref1].

`pqsample.py`


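import pyquery

# Passing a URL makes PyQuery fetch the page; parser='html' selects lxml's HTML parser.
query = pyquery.PyQuery("http://www.jma.go.jp/jp/quake/quake_local_index.html", parser='html')
# Walk every <tr> under the element with class "infotable" and print its text.
for tr in query('.infotable')('tr'):
    print(query(tr).text())  # the print() form runs on both Python 2 and 3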

This code loops over the <tr> tags under the element with class="infotable" and prints the text of each one. I confirmed this structure in the HTML with Chrome's developer tools.

Running python pqsample.py duly returned the following earthquake information. Easy indeed.

Information announcement date and time Occurrence date and time Epicenter place name Magnitude Maximum seismic intensity
December 03, 2014 14:38 Around 14:32 on the 3rd Northern Nagano Prefecture M1.6 Seismic intensity 1
December 03, 2014 06:03 Around 06:00 on the 3rd Northern Nagano Prefecture M2.0 Seismic intensity 1

Alexa ranking analysis

Now that I knew it works, I moved on to scraping the site I was actually after. Open the target page in Chrome, click the magnifying-glass icon in the developer tools window (Ctrl+Shift+I), and click the element you want to examine. The DOM tree is then displayed. (In Firefox, you can check it in the Inspector.)

With this tree structure, the <li> tags can be listed using class="site-listing" as the key. The rank is in the element with class count, and the domain is in the <a> tag under desc-paragraph. I wrote code that loops over them with for and prints these as CSV.

`alexa.py`


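import pyquery

# The ranking is split over 20 pages (index 0 to 19) of 25 sites each.
for page in range(20):
    url = "http://www.alexa.com/topsites/countries;" + str(page) + "/PE"
    query = pyquery.PyQuery(url, parser='html')
    # Each site-listing <li> holds the rank in .count and the domain in the <a> under .desc-paragraph.
    for li in query('.site-listing')('li'):
        item = query(li)
        print(item('.count').text() + ", " + item('.desc-paragraph')('a').text())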

This time I wanted the ranking for Peru, so I specified the page for the country code /PE. Put any other country code here to get that country's pages. The code loops over the 20 HTML pages. Now run python alexa.py.
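
The URL pattern in alexa.py can be wrapped in a small helper so that the country code becomes a parameter. This is a minimal Python 3 sketch; alexa_page_urls is a hypothetical name, and the pattern is simply the one used in the script above, which the site may change at any time.

```python
def alexa_page_urls(country_code, pages=20):
    """Build the per-page ranking URLs for one country (page index starts at 0)."""
    base = "http://www.alexa.com/topsites/countries;{page}/{country}"
    return [base.format(page=page, country=country_code) for page in range(pages)]


urls = alexa_page_urls("PE")
print(urls[0])
print(urls[-1])
```

Calling alexa_page_urls with a different country code yields the matching list of 20 page URLs to feed into the same scraping loop.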

The CSV is done. Great success. From here it can be used to build a table in Excel or to run connection tests with curl.
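
Joining fields with ", " works as long as no field itself contains a comma. If that cannot be guaranteed, Python's standard csv module takes care of the quoting. This is a minimal Python 3 sketch, assuming the (rank, name) pairs have already been collected; rows_to_csv, the header names, and the sample rows are hypothetical.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize (rank, site) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "site"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows; the third one contains a comma and gets quoted automatically.
sample_rows = [("1", "example.com"), ("2", "example.org"), ("3", "Example, Inc.")]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

Writing to a file instead only requires passing an open file object to csv.writer in place of the in-memory buffer.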

Summary

- With the Chrome + pyQuery combo, it is easy to scrape information you would otherwise collect by copy and paste, which is convenient.
- The Alexa API can be used from AWS, but it does not seem to provide the TOP list, so this approach is a good fit.
- I may write a follow-up on simple connection tests with curl soon.

Reference site

[Scraping with Python (pyquery)][Ref1]

[Ref1]: http://d.hatena.ne.jp/kouichi501t/20130407/1365328955