This post is the December 24th entry in the Crawler / Scraping Advent Calendar 2014.
When browsing websites, you sometimes want to download all the files of a certain format (zip, pdf, etc.) at once.
You could do this by hand, but this kind of task is easy to script with a language such as Python or Ruby.
This time I wrote a download script in Python.
The standard library alone would be enough, but this time I used the following libraries.
pip install requests
pip install BeautifulSoup
The script is as follows.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import time
from BeautifulSoup import BeautifulSoup

BASE_URL = u"http://seanlahman.com/"
EXTENSION = u"csv.zip"

urls = [
    u"http://seanlahman.com/baseball-archive/statistics/",
]

for url in urls:
    download_urls = []
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    links = soup.findAll('a')

    # URL extraction
    for link in links:
        href = link.get('href')
        if href and EXTENSION in href:
            download_urls.append(href)

    # File download (limited to 3 for the time being)
    for download_url in download_urls[:3]:
        # 1 second sleep
        time.sleep(1)
        file_name = download_url.split("/")[-1]
        if BASE_URL in download_url:
            r = requests.get(download_url)
        else:
            r = requests.get(BASE_URL + download_url)

        # Save file (binary mode, since the content is a zip archive)
        if r.status_code == 200:
            f = open(file_name, 'wb')
            f.write(r.content)
            f.close()
There is still plenty of room for improvement, such as error handling and building the download URLs more robustly, but for now this lets you download files of whatever format you want (zip, pdf, etc.); a rough sketch of those improvements follows below.
With Python or a similar language, scraping like this is very easy, so I think it is a good idea to build up a collection of scripts while adapting them to each site.
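As one possible sketch of those improvements (note: this assumes Python 3 with the newer beautifulsoup4 package installed via pip install requests beautifulsoup4, not the BeautifulSoup 3 library used above), the string concatenation of URLs could be replaced with urllib.parse.urljoin and the requests calls wrapped in exception handling:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Sketch only: urljoin for URL adjustment plus basic error handling.
# Assumes Python 3, requests, and beautifulsoup4 (bs4).
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

PAGE_URL = "http://seanlahman.com/baseball-archive/statistics/"
EXTENSION = "csv.zip"

try:
    r = requests.get(PAGE_URL, timeout=10)
    r.raise_for_status()
except requests.RequestException as e:
    print("Failed to fetch page:", e)
else:
    soup = BeautifulSoup(r.content, "html.parser")
    # Resolve relative links against the page URL instead of concatenating strings
    hrefs = [urljoin(PAGE_URL, a["href"]) for a in soup.find_all("a", href=True)]
    for download_url in [h for h in hrefs if EXTENSION in h][:3]:
        time.sleep(1)
        file_name = download_url.split("/")[-1]
        try:
            resp = requests.get(download_url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as e:
            print("Failed to download", download_url, ":", e)
            continue
        with open(file_name, "wb") as f:
            f.write(resp.content)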