This post is the December 24th entry in the Crawler / Scraping Advent Calendar 2014.
When browsing websites, you sometimes want to download all the files of a certain format (zip, pdf, etc.) at once.
You could do this by hand, but this kind of task is easy to script with a language such as Python or Ruby.
This time I wrote a download script in Python.
The standard library alone would be enough, but this time I used the following libraries.
pip install requests
pip install BeautifulSoup
The script is as follows.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import time
from BeautifulSoup import BeautifulSoup

BASE_URL = u"http://seanlahman.com/"
EXTENSION = u"csv.zip"

urls = [
    u"http://seanlahman.com/baseball-archive/statistics/",
]

for url in urls:
    download_urls = []
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    links = soup.findAll('a')

    # URL extraction
    for link in links:
        href = link.get('href')
        if href and EXTENSION in href:
            download_urls.append(href)

    # File download (limited to 3 for the time being)
    for download_url in download_urls[:3]:
        # 1 second sleep
        time.sleep(1)
        file_name = download_url.split("/")[-1]
        if BASE_URL in download_url:
            r = requests.get(download_url)
        else:
            r = requests.get(BASE_URL + download_url)

        # Save file (binary mode, since the content is a zip archive)
        if r.status_code == 200:
            f = open(file_name, 'wb')
            f.write(r.content)
            f.close()
There is still plenty of room for improvement, such as error handling and building the download URLs more robustly, but for now this lets you download files of whatever format you want (zip, pdf, etc.); a rough sketch of those improvements follows below.
With Python or a similar language, scraping like this is very easy, so I think it is a good idea to build up a collection of scripts while adapting them to each site.
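As one possible sketch of those improvements (note: this assumes Python 3 with the newer beautifulsoup4 package installed via pip install requests beautifulsoup4, not the BeautifulSoup 3 library used above), the string concatenation of URLs could be replaced with urllib.parse.urljoin and the requests calls wrapped in exception handling:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch only: urljoin for URL adjustment plus basic error handling.
# Assumes Python 3, requests, and beautifulsoup4 (bs4).
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://seanlahman.com/baseball-archive/statistics/"
EXTENSION = "csv.zip"

try:
    r = requests.get(PAGE_URL, timeout=10)
    r.raise_for_status()
except requests.RequestException as e:
    print("Failed to fetch page:", e)
else:
    soup = BeautifulSoup(r.content, "html.parser")
    # Resolve relative links against the page URL instead of concatenating strings
    hrefs = [urljoin(PAGE_URL, a["href"]) for a in soup.find_all("a", href=True)]
    for download_url in [h for h in hrefs if EXTENSION in h][:3]:
        time.sleep(1)
        file_name = download_url.split("/")[-1]
        try:
            resp = requests.get(download_url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as e:
            print("Failed to download", download_url, ":", e)
            continue
        with open(file_name, "wb") as f:
            f.write(resp.content)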