[PYTHON] I refactored "a script that goes back through a specific user's tweets on Twitter and saves the posted images at once".

This is a refactoring of @tyokuyoku's "I made a script that goes back through a specific user's tweets on Twitter and saves the posted images at once". The original is not split into functions and the nesting is quite deep, so I thought it could be improved, and refactored it. I posted the result as a comment on that article, but I have since reviewed it further, so I will explain it here. I refactor by trial and error myself, so if you have any other good ideas, I would appreciate your comments.

First, turn the deepest part of the nesting into a function. What this part does is take each found image and save it to a file, one after another. Since this is the crawl process, I named the function crawl.

def crawl():
    for image, filename in images():
        path = os.path.join(SAVE_DIRECTORY, filename)
        with open(path, 'wb') as localfile:
            localfile.write(image)

There is no images function yet, but given a function that yields image data and file names one after another, all that is left is to save them to files. Now let's create such an images function. First, implement only the part that retrieves the image data.

def images():
    for url in image_url():
        image = urllib.urlopen(url)
        yield image.read()
        image.close()

There is no image_url function yet either, but given a function that enumerates the URLs of the image data, all that is left is to read the image data at each URL and yield it. With return, the function would hand back a single result and finish; a generator function using yield can instead hand over results one after another. As a minimal illustration of the difference (a toy sketch, not part of the script):
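def count_up():
    # yield suspends the function and resumes it on the next request,
    # so results are handed over one at a time instead of all at once
    yield 1
    yield 2
    yield 3

list(count_up())    # => [1, 2, 3]; a return would have ended after a single value

With that in mind, let's extend images() so that it also reports the file name. This deviates from the single responsibility principle of SOLID, but...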

def images():
    for urls, timestamp in image_urls():
        date_time = timestamp.strftime("%Y%m%d_%H%M%S")
        for index, url in enumerate(urls):
            _, ext = os.path.splitext(url)
            filename = '%s_%s_%s%s' % (USER_NAME, date_time, index, ext)
            image = urllib.urlopen(url)
            yield image.read(), filename
            image.close()

The file name is built from the tweet time, but the image data itself does not carry that time, so the image_urls function is asked to report it. Since several images can be attached to one tweet, one tweet time may correspond to several images. The tweet time is passed as a standard Python datetime and converted to the year-month-day_hour-minute-second form used in the file name. The original program hard-coded the extension ".jpg", but there are ".png" images and others too, so instead the extension contained in the URL is extracted and used. A quick check of the two conversions involved (the URL is a made-up example):
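import os.path
import datetime

_, ext = os.path.splitext("https://pbs.twimg.com/media/example.png")
# ext == ".png"

datetime.datetime(2016, 1, 2, 3, 4, 5).strftime("%Y%m%d_%H%M%S")
# => '20160102_030405'

Next is the image_urls function.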

def image_urls():
    for media, timestamp in medias():
        urls = [content["media_url_https"]
                for content in media
                if content.get("type") == "photo"]
        yield urls, timestamp

A tweet carries media information such as images alongside its short text, so given a medias function that enumerates the media information, we can pick out just the URLs of the "photo"-type entries and pass them on. Since the media information does not include the tweet time, the medias function is asked to report the tweet time as well. For reference, here is a simplified sketch of the relevant fields of a tweet object as returned by the v1.1 user_timeline API (all values are made up):
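{
    "created_at": "Mon Jan 02 03:04:05 +0000 2016",
    "text": "a short sentence",
    "extended_entities": {
        "media": [
            {"type": "photo",
             "media_url_https": "https://pbs.twimg.com/media/example.jpg"}
        ]
    }
}

Next is the medias function.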

def medias():
    for tweet in tweets():
        created_at = tweet["created_at"]
        timestamp = dateutil.parser.parse(created_at)
        extended_entities = tweet.get("extended_entities", {})
        media = extended_entities.get("media", ())
        yield media, timestamp

Given a tweets function that enumerates tweets in order, we can pull out the tweet time and media and pass them on. A tweet may have no extended_entities or no media information at all; in that case an empty tuple is yielded, signalling that there are no images. A quick check of the fallback behaviour with dict.get (toy value):
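tweet = {"created_at": "Mon Jan 02 03:04:05 +0000 2016"}   # a tweet with no media
tweet.get("extended_entities", {}).get("media", ())
# => (), so downstream functions simply see an empty list of URLs

Finally, the tweets function.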

USER_NAME = 'Username you want to get'
NUM_OF_TWEET_AT_ONCE = 200      # max: 200
NUM_OF_TIMES_TO_CRAWL = 16      # max: 3200 / NUM_OF_TWEET_AT_ONCE
SAVE_DIRECTORY = os.path.join('.', 'images')

def tweets():
    import twitkey
    twitter = OAuth1Session(twitkey.CONSUMER_KEY,
                            twitkey.CONSUMER_SECRET,
                            twitkey.ACCESS_TOKEN,
                            twitkey.ACCESS_TOKEN_SECRET)
    url = ("https://api.twitter.com/1.1/statuses/user_timeline.json"
           "?screen_name=%s&include_rts=false" % USER_NAME)
    params = {"count": NUM_OF_TWEET_AT_ONCE}
    for i in range(NUM_OF_TIMES_TO_CRAWL):
        req = twitter.get(url, params=params)
        if req.status_code != requests.codes.ok:
            return
        timeline = json.loads(req.text)
        for tweet in timeline:
            yield tweet
            # max_id is inclusive, so step just below the last tweet seen
            # to avoid fetching the same tweet again on the next request
            params["max_id"] = tweet["id"] - 1

You need the Twitter API to fetch tweets, and to access it you must register with Twitter in advance to obtain the keys and tokens. Write that information as variables in a file called twitkey.py, which keeps the confidential information separate from the program source. The original script stored it in a dictionary variable, but plain variable assignments work just as well.

twitkey.py


#coding: UTF-8

CONSUMER_KEY = ""
CONSUMER_SECRET = ""
ACCESS_TOKEN = ""
ACCESS_TOKEN_SECRET = ""

In the end, the deep nesting that had accumulated in the for statements was flattened by turning each step into a generator function using yield. Splitting the code into functions also means each function name doubles as a comment describing what that step does, which I think makes the processing easier to follow.
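Seen as a whole, the script is now a straight chain of generators, each consuming the one below it:

crawl()            # saves (image, filename) pairs to disk
  images()         # yields (image bytes, filename)
    image_urls()   # yields (list of photo URLs, timestamp)
      medias()     # yields (media entries, timestamp)
        tweets()   # yields raw tweet dicts from the API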

Finally, here is the whole script, including the main routine, the imports, and the progress-display print statements.

#!/usr/bin/env python2
# -*- coding:utf-8 -*-

import sys
import os.path
import dateutil.parser
import urllib
import requests
import json
from requests_oauthlib import OAuth1Session


USER_NAME = 'Username you want to get'
NUM_OF_TWEET_AT_ONCE = 200      # max: 200
NUM_OF_TIMES_TO_CRAWL = 16      # max: 3200 / NUM_OF_TWEET_AT_ONCE
SAVE_DIRECTORY = os.path.join('.', 'images')


def tweets():
    import twitkey
    twitter = OAuth1Session(twitkey.CONSUMER_KEY,
                            twitkey.CONSUMER_SECRET,
                            twitkey.ACCESS_TOKEN,
                            twitkey.ACCESS_TOKEN_SECRET)
    url = ("https://api.twitter.com/1.1/statuses/user_timeline.json"
           "?screen_name=%s&include_rts=false" % USER_NAME)
    params = {"count": NUM_OF_TWEET_AT_ONCE}
    for i in range(NUM_OF_TIMES_TO_CRAWL):
        req = twitter.get(url, params=params)
        if req.status_code != requests.codes.ok:
            print "ERROR:", req.status_code
            return
        timeline = json.loads(req.text)
        for tweet in timeline:
            print "TWEET:", tweet["text"]
            yield tweet
            # max_id is inclusive, so step just below the last tweet seen
            # to avoid fetching the same tweet again on the next request
            params["max_id"] = tweet["id"] - 1


def medias():
    for tweet in tweets():
        created_at = tweet["created_at"]
        timestamp = dateutil.parser.parse(created_at)
        extended_entities = tweet.get("extended_entities", {})
        media = extended_entities.get("media", ())
        print "CREATE:", created_at
        yield media, timestamp


def image_urls():
    for media, timestamp in medias():
        urls = [content["media_url_https"]
                for content in media
                if content.get("type") == "photo"]
        print "IMAGE:", len(urls)
        yield urls, timestamp


def images():
    for urls, timestamp in image_urls():
        date_time = timestamp.strftime("%Y%m%d_%H%M%S")
        for index, url in enumerate(urls):
            _, ext = os.path.splitext(url)
            filename = '%s_%s_%s%s' % (USER_NAME, date_time, index, ext)
            image = urllib.urlopen(url)
            print "URL:", url
            yield image.read(), filename
            image.close()


def crawl():
    # make sure the save directory exists before writing (Python 2 has no exist_ok)
    if not os.path.isdir(SAVE_DIRECTORY):
        os.makedirs(SAVE_DIRECTORY)
    for image, filename in images():
        path = os.path.join(SAVE_DIRECTORY, filename)
        print "SAVE:", path
        with open(path, 'wb') as localfile:
            localfile.write(image)


if __name__ == '__main__':
    if len(sys.argv) > 1:
        USER_NAME = sys.argv[1]
    crawl()
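Run it with a screen name as the first argument to override the hard-coded USER_NAME (the script file name below is just an example):

python2 crawl_images.py some_screen_name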
