[PYTHON] Collect cat images at the speed of a second and aim for the Cat Hills tribe

Overview

The other day I wrote a script that automatically collects cat images with feedparser. Thanks to it, my cat image collection has been coming along nicely ...

However, it runs slowly. Not fast at all ... this script just isn't quick enough. So I'll try modifying it to see whether it can be made faster.

First, analyze the current situation

First, I measured just how slow the current script is. Below is the source with the timing logic added.

get_cat.py


# -*- coding: utf-8 -*-
import feedparser
import urllib
import os
import time


def download_picture(q, count=10):
    u"""Fetch count images matching the query q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    # Make sure the output directory (named after the query) exists.
    save_dir = os.path.join(os.path.dirname(__file__), q)
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    for entry in feed['entries']:
        url = entry.content[0].src
        # Download each image with urllib and save it under save_dir.
        urllib.urlretrieve(url, os.path.join(save_dir, os.path.basename(url)))
        print('download:' + url)

if __name__ == "__main__":
    time1 = time.time()
    download_picture("cat", 10)
    time1_2 = str(time.time() - time1)
    print("complete!(" + time1_2 + ")")

Result


complete!(6.05635690689)

It took about 6 seconds to download 10 images, and it stayed around 6 seconds over several tries. At that pace the script can only run roughly 14,400 times, i.e. fetch about 144,000 images, in 24 hours. That's far from ideal.
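As a sanity check, the estimate is just the measured batch time divided into a day's worth of seconds (the 6-second figure is the measurement above; everything else is plain arithmetic):

# Rough throughput estimate from the measurement above.
SECONDS_PER_BATCH = 6.0      # ~6 s for one run of get_cat.py
IMAGES_PER_BATCH = 10
SECONDS_PER_DAY = 24 * 60 * 60

runs_per_day = SECONDS_PER_DAY / SECONDS_PER_BATCH
print(int(runs_per_day))                      # -> 14400 runs per day
print(int(runs_per_day * IMAGES_PER_BATCH))   # -> 144000 images per day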

httplib2

I heard about a library called httplib2 through the grapevine. If anything, it seems more capable than the standard library's urllib: it supports keep-alive connections, gzip/deflate compression, and local caching of responses, among other things.

Isn't it wonderful? Let's use it now.

$ sudo pip install httplib2

Install it quickly, then modify the program.
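Before the full rewrite, here is a minimal sketch of the httplib2 calls the new script relies on (the URL is just a placeholder, and ".cache" is simply the directory name handed to httplib2 for its response cache):

# -*- coding: utf-8 -*-
import httplib2

# The argument is a directory where httplib2 stores cached responses.
http = httplib2.Http(".cache")

# request() returns a (response, content) tuple: `response` holds the status
# and headers, `content` is the raw body as a byte string.
response, content = http.request("https://example.com/some/image.jpg")
print(response.status)   # e.g. 200
print(len(content))      # size of the downloaded body

# The same Http object can be reused for many URLs; connections are kept
# alive and previously cached responses can be reused.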

get_cat2.py


# -*- coding: utf-8 -*-
import feedparser
import httplib2
import os
import time


def download_picture(q, count=10):
    u"""Fetch count images matching the query q."""
    count = str(count)
    feed = feedparser.parse("https://picasaweb.google.com/data/feed/base/all?q=" + q + "&max-results=" + count)
    # One Http object is reused for every request; ".cache" is its local cache directory.
    http = httplib2.Http(".cache")
    save_dir = os.path.join(os.path.dirname(__file__), q)
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    for entry in feed['entries']:
        url = entry.content[0].src
        # request() returns (response, content); write the body straight to disk.
        with open(os.path.join(save_dir, os.path.basename(url)), 'wb') as f:
            f.write(http.request(url)[1])
        print('download:' + url)

if __name__ == "__main__":
    time1 = time.time()
    download_picture("cat", 10)
    time1_2 = str(time.time() - time1)
    print("complete!(" + time1_2 + ")")

What changed: the per-image urllib.urlretrieve call is gone; instead a single httplib2.Http object (backed by a local cache directory) fetches each URL, and the response body is written straight to a file.

Run it right away!

Execution result


First, the original program. About the same as before.
complete!(5.79861092567)

The improved version, first run. Hmm?
complete!(5.06348490715)

The improved version, second run. Oh...
complete!(1.20627403259)

The improved version, third run. Whoa!
complete!(0.768098115921)

Fast! This is plenty practical. Calling it "collection at the speed of a second" is no exaggeration anymore.
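Most of that run-to-run speedup is very likely the ".cache" directory at work: once a response has been stored there, httplib2 can serve or revalidate it much more cheaply on later runs. A quick way to see whether a response came from the cache (sketch only, placeholder URL) is the fromcache flag on the response:

# -*- coding: utf-8 -*-
import httplib2

http = httplib2.Http(".cache")

# First request: fetched over the network and stored in .cache.
response, content = http.request("https://example.com/some/image.jpg")
print(response.fromcache)  # False on a cold cache

# Second request: if the server sent cache-friendly headers, this can be
# answered from the local cache instead of re-downloading.
response, content = http.request("https://example.com/some/image.jpg")
print(response.fromcache)  # True when the cached copy was used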

**Conclusion: use httplib2 to speed up cat image collection.**
