- Scraped with Python + PyQuery
- Got the top 525 most-accessed Japanese sites and output them to CSV
- This makes it possible to look into the HTML / design characteristics of popular sites (maybe)
I thought it would be interesting to analyze the HTML and design of popular sites, so as a first step I wrote a script to collect their URLs.
Alexa publishes a ranking for each country, so I'll get the data from there.
Alexa Internet is a long-established service that has published statistics such as access counts and usage of websites since 1996, freely viewable by anyone. It has been a subsidiary of Amazon since 1999.
refs. http://freesoft.tvbok.com/cat94/site10/alexa.html
The script I wrote is on GitHub: https://github.com/saxsir/fjats
By the way, you can get more data by using the official (paid) API.
- Since the pages are publicly accessible, scraping itself is (probably) not illegal, but hammering the site will cause trouble for the other side, so do this at your own risk.
refs.
- List of precautions for web scraping
- Let's talk about the law of web scraping!
First, install a scraping library. PyQuery seems to be popular, so let's use it.
I use pyenv and virtualenv because I want to keep the environment isolated, but you don't have to. In that case, just run `pip install pyquery` as usual.
$ cd /path/to/your/workspace
$ pyenv virtualenv 3.4.3 fjats
$ pyenv local fjats
$ pip install --upgrade pip
$ pip install pyquery
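If the install worked, a quick import check (just a sanity check, not part of the script) should print "ok":

$ python -c "from pyquery import PyQuery as pq; print(pq('<p>ok</p>').text())"
ok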
Now all that's left is to write the script. Here is the finished version first.
main.py
import csv
from pyquery import PyQuery as pq
from datetime import datetime as dt
from time import sleep
from random import randint

ranks = []
for i in range(21):
    # http://www.alexa.com/topsites/countries;0/JP
    url = 'http://www.alexa.com/topsites/countries;%s/JP' % i
    doc = pq(url, parser='html')
    ul = [doc(li) for li in doc('.site-listing')]
    ranks += [(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
    print('Fetch %s' % url)  # check that the script is running
    sleep(randint(1,3))

with open('topsites-jp_%s.csv' % dt.now().strftime('%y-%m-%d-%H-%M'), 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(('Ranking', 'URL'))
    writer.writerows(ranks)
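Running it looks roughly like this; the Fetch lines come from the print() call in the loop, and the CSV file name depends on when you run it:

$ python main.py
Fetch http://www.alexa.com/topsites/countries;0/JP
Fetch http://www.alexa.com/topsites/countries;1/JP
...
Fetch http://www.alexa.com/topsites/countries;20/JP

When the loop finishes, a topsites-jp_<timestamp>.csv file is written to the current directory.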
From here on is an explanation of the code, so if you just want to run it, paste the source above and execute it.
from pyquery import PyQuery as pq

# Try getting the world's top 25 sites first
url = 'http://www.alexa.com/topsites'
doc = pq(url, parser='html')

# From the fetched DOM, get the elements whose class is "site-listing"
# (check the class name of the part you want in advance with Chrome DevTools etc.)
ul = [doc(li) for li in doc('.site-listing')]

# Print the rank and the site name, separated by a comma
ranks = ['%s, %s' % (li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
print(ranks)
Let's run this in the interpreter. (Just start `python` and copy-paste the code.)
$ python
Python 3.4.3 (default, Mar 27 2015, 14:54:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pyquery import PyQuery as pq
>>>
>>> # Try getting the world's top 25 sites first
... url = 'http://www.alexa.com/topsites'
>>> doc = pq(url, parser='html')
>>>
>>> # From the fetched DOM, get the elements whose class is "site-listing"
... # (check the class name of the part you want in advance with Chrome DevTools etc.)
... ul = [doc(li) for li in doc('.site-listing')]
>>>
>>> # Print the rank and the site name, separated by a comma
... ranks = ['%s, %s' % (li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
>>>
>>> print(ranks)
['1, Google.com', '2, Facebook.com', '3, Youtube.com', '4, Yahoo.com', '5, Baidu.com', '6, Amazon.com', '7, Wikipedia.org', '8, Taobao.com', '9, Twitter.com', '10, Qq.com', '11, Google.co.in', '12, Live.com', '13, Sina.com.cn', '14, Weibo.com', '15, Linkedin.com', '16, Yahoo.co.jp', '17, Google.co.jp', '18, Ebay.com', '19, Tmall.com', '20, Yandex.ru', '21, Blogspot.com', '22, Vk.com', '23, Google.de', '24, Hao123.com', '25, T.co']
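By the way, if you want to sanity-check the selectors without hitting Alexa every time, you can feed PyQuery a hand-written snippet. The markup below is only a rough approximation of the structure the selectors assume (a .site-listing element containing a .count and a .desc-paragraph with a link), not the page's actual HTML:

from pyquery import PyQuery as pq

html = '''<ul>
  <li class="site-listing">
    <div class="count">1</div>
    <p class="desc-paragraph"><a href="/siteinfo/example.com">Example.com</a></p>
  </li>
</ul>'''
doc = pq(html)
ul = [doc(li) for li in doc('.site-listing')]
print(['%s, %s' % (li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul])
# => ['1, Example.com']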
It looks like the data is coming through, so let's rewrite this to fetch the Japanese rankings.
Looking at it in a browser, a URL like http://www.alexa.com/topsites/countries;0/JP shows ranks 1 to 25. The 0 part can be incremented up to 20, so loop over 0-20 with a for statement (21 pages × 25 sites = 525) and collect the data.
from pyquery import PyQuery as pq

ranks = []
for i in range(21):
    # http://www.alexa.com/topsites/countries;0/JP
    url = 'http://www.alexa.com/topsites/countries;%s/JP' % i
    doc = pq(url, parser='html')
    ul = [doc(li) for li in doc('.site-listing')]
    ranks += [(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]
This part:

[(li('.count').text(), li('.desc-paragraph')('a').text()) for li in ul]

is a little confusing, so step by step:
# Each element becomes a tuple like this
('1', 'Site 1')  # (li('.count').text(), li('.desc-paragraph')('a').text())

# The list comprehension ([... for li in ul]) turns them into a list like this and returns it
[('1', 'Site 1'), ('2', 'Site 2') ...]

# which then gets concatenated onto ranks (ranks += ...), like Array#concat in Ruby or concat() in JavaScript
[('1', 'Site 1'), ('2', 'Site 2') ... ('525', 'Site 525')]
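Here is the same flow with plain data, in case the comprehension is hard to read (the site names are made up just for illustration):

page1 = [('1', 'Site 1'), ('2', 'Site 2')]      # what the comprehension returns for one page
page2 = [('26', 'Site 26'), ('27', 'Site 27')]  # ...and for the next page

ranks = []
for page in (page1, page2):
    ranks += page  # += extends the list in place, like concat
print(ranks)
# => [('1', 'Site 1'), ('2', 'Site 2'), ('26', 'Site 26'), ('27', 'Site 27')]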
I shaped the data this way because I want to write it out to CSV later.
import csv
from datetime import datetime as dt

with open('topsites-jp_%s.csv' % dt.now().strftime('%y-%m-%d-%H-%M'), 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    writer.writerow(('Ranking', 'URL'))
    writer.writerows(ranks)
The timestamp is added to the CSV file name so it's easy to tell when the data was fetched.
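To check the result, you can read the file back with the csv module; the file name below is just a placeholder, use whatever name the script actually produced:

import csv

with open('topsites-jp_15-05-24-12-00.csv') as f:  # placeholder name; substitute your own
    for row in csv.reader(f):
        print(row)
# The first row is the header: ['Ranking', 'URL']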
A small gesture toward behaving like a polite crawler (not much of one...).
from time import sleep
from random import randint
sleep(randint(1,3))
This makes the script wait a random 1 to 3 seconds between requests.
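If you also want to identify your script to the server, one option (not something the script above does) is to fetch the page yourself with urllib, send a User-Agent header, and hand the HTML string to PyQuery. The header value here is just an example:

from urllib.request import Request, urlopen
from pyquery import PyQuery as pq

url = 'http://www.alexa.com/topsites/countries;0/JP'
req = Request(url, headers={'User-Agent': 'fjats-scraper (contact: you@example.com)'})
html = urlopen(req).read().decode('utf-8', errors='ignore')
doc = pq(html)  # same kind of PyQuery object as before, just fetched manually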
refs.
- Official repository
- Scraping Alexa's web rank with pyQuery
- Reading and writing CSV with Python
- Getting the current time in Python
- List of precautions for web scraping
- Let's talk about the law of web scraping!