[PYTHON] Continuous acquisition by Twitter API (Tips)

As you know, when you fetch tweets with the Twitter API, both the number of requests you can make in a given period and the number of tweets you can get per request are limited.

Here are some tips for fetching tweets continuously while staying within these limits. Note that this post does not cover POST requests (although the approach is almost the same).

Be sure to check the primary sources for the numerical values. The sample code is published on Gist. (There appear to be Twitter packages for Python, but I do not use them here.)

Twitter API Rate Limits (limits on the number of requests within a certain period of time)

The limits are as follows.

GET endpoints

The standard API rate limits described in this table refer to GET (read) endpoints. Note that endpoints not listed in the chart default to 15 requests per allotted user. All request windows are 15 minutes in length.  These rate limits apply to the standard API endpoints only, does not apply to premium APIs.

Excerpt from Rate limits — Twitter Developers

The number of requests is limited per 15-minute window. According to this page, for example, the limits for "search/tweets" are as follows (for the Standard API).

| Endpoint | Resource family | Requests / window (user auth) | Requests / window (app auth) |
|---|---|---|---|
| GET search/tweets | search | 180 | 450 |
Excerpt from [Rate limits — Twitter Developers](https://developer.twitter.com/en/docs/basics/rate-limits)

Getting the current limit status

You can get the current limit status from the endpoint https://api.twitter.com/1.1/application/rate_limit_status.json. The optional parameter resources specifies the Resource family.
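
As a small illustration of the resources parameter, here is a sketch that only builds the request URL with the standard library. The helper name build_status_url is hypothetical, and joining several families with commas is assumed from the parameter description; nothing is actually sent.

```python
from urllib.parse import urlencode

STATUS_ENDPOINT = "https://api.twitter.com/1.1/application/rate_limit_status.json"


def build_status_url(families=None):
    """Return the rate_limit_status URL (hypothetical helper).

    families: iterable of Resource family names, or None to omit the
    optional 'resources' parameter.
    """
    if not families:
        return STATUS_ENDPOINT
    # Several families are joined with commas into a single parameter
    query = urlencode({"resources": ",".join(families)})
    return f"{STATUS_ENDPOINT}?{query}"


print(build_status_url(["search"]))
print(build_status_url(["search", "users"]))
```

The comma is percent-encoded by urlencode, which is the usual form for a query string.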

Example of the information returned

user auth (OAuth v1.1)


{
  "rate_limit_context": {
    "access_token": "*******************"
  },
  "resources": {
    "search": {
      "/search/tweets": {
        "limit": 180,
        "remaining": 180,
        "reset": 1591016735
      }
    }
  }
}

app auth (OAuth v2.0)


{
  "rate_limit_context": {
    "application": "dummykey"
  },
  "resources": {
    "search": {
      "/search/tweets": {
        "limit": 450,
        "remaining": 450,
        "reset": 1591016736
      }
    }
  }
}

Each reset value is an epoch time (UNIX time) indicating when the limit will be reset.

In the above example, user auth (OAuth v1.1) has epoch time 1591016735 = 2020-06-01 22:05:35 and app auth (OAuth v2.0) has 1591016736 = 2020-06-01 22:05:36 (both in local time, JST here). These are the times at which the limits are reset.

Once the limit has been used up, remaining becomes 0.
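
The conversion from the reset value to a readable time can be sketched as follows. JST (UTC+9) is assumed because it matches the times quoted above, and the function name reset_to_datetime is only for illustration.

```python
from datetime import datetime, timedelta, timezone

# JST (UTC+9) is an assumption that matches the times quoted in the text
JST = timezone(timedelta(hours=9))


def reset_to_datetime(reset, tz=JST):
    """Convert an epoch 'reset' value to a timezone-aware datetime."""
    return datetime.fromtimestamp(reset, tz=tz)


print(reset_to_datetime(1591016735))                   # user auth example, in JST
print(reset_to_datetime(1591016736, tz=timezone.utc))  # app auth example, in UTC
```

Using a timezone-aware datetime keeps the result independent of the machine's local time setting.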

Summary of items

The items of the acquired information are as follows. (Example of search family)

| Category | Family | Endpoint | Key | Value |
|---|---|---|---|---|
| rate_limit_context | | | access_token (user auth (v1.1)) | Contents of Access Token |
| | | | application (app auth (v2.0)) | dummykey (seems to be fixed) |
| resources | search | /search/tweets | limit | Maximum number of requests within the time limit |
| | | | remaining | Remaining number of requests allowed within the time limit |
| | | | reset | Time when the time limit is reset (epoch time) |
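
To make the nesting of these items concrete, here is a minimal sketch that flattens a response into rows. The function name flatten_limits is hypothetical; the sample dict is the app auth response shown earlier.

```python
def flatten_limits(status):
    """Flatten a rate_limit_status response into a list of row dicts."""
    rows = []
    # 'resources' maps family -> endpoint -> {limit, remaining, reset}
    for family, endpoints in status.get("resources", {}).items():
        for endpoint, items in endpoints.items():
            rows.append({
                "family": family,
                "endpoint": endpoint,
                "limit": items["limit"],
                "remaining": items["remaining"],
                "reset": items["reset"],
            })
    return rows


sample = {
    "rate_limit_context": {"application": "dummykey"},
    "resources": {
        "search": {
            "/search/tweets": {"limit": 450, "remaining": 450, "reset": 1591016736}
        }
    },
}
for row in flatten_limits(sample):
    print(row)
```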

Case where the wrong Resource Family is specified

The example below shows what happens when you mistakenly specify 'user' as the Resource Family. (The correct value is 'users', with an s.)

user auth


{
  "rate_limit_context": {
    "access_token": "*******************"
  }
}

app auth


{
  "rate_limit_context": {
    "application": "dummykey"
  }
}

Both return "rate_limit_context" but do not contain "resources".
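
Because "resources" may be missing, a lookup that tolerates its absence is convenient. The following is a sketch with a hypothetical helper name, lookup_limit; it returns None instead of raising KeyError.

```python
def lookup_limit(status, family, endpoint):
    """Return the limit entry for an endpoint, or None if it is absent.

    A missing 'resources' key (for example, after a misspelled family
    name) is treated the same as a missing endpoint.
    """
    return status.get("resources", {}).get(family, {}).get(endpoint)


good = {
    "resources": {
        "search": {
            "/search/tweets": {"limit": 180, "remaining": 180, "reset": 1591016735}
        }
    }
}
bad = {"rate_limit_context": {"application": "dummykey"}}  # wrong family case

print(lookup_limit(good, "search", "/search/tweets"))
print(lookup_limit(bad, "user", "/search/tweets"))
```

A caller can then treat None as "the family or endpoint name is probably wrong" rather than as a rate limit condition.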

Response when Rate Limit error (acquisition result)

On a Rate Limit error, res.status_code (the HTTP status code) is 429. (420 may also be returned [^1].)

[^1]: Normally 429 "Too Many Requests: Returned when a request cannot be served due to the app's rate limit having been exhausted for the resource." is returned, but very rarely 420 "Enhance Your Calm: Returned when an app is being rate limited for making too many requests." may be returned instead. The latter may happen if you accidentally make multiple requests at the same time (unverified). Update (2020/06/03): an explanation was found on the Twitter Developers page "Connecting to a streaming endpoint", so it was added to the text.

| Code | Text | Description |
|---|---|---|
| 420 | Enhance Your Calm | Returned when an app is being rate limited for making too many requests. |
| 429 | Too Many Requests | Returned when a request cannot be served due to the app's rate limit having been exhausted for the resource. See Rate Limiting. |
Excerpt from [Response codes — Twitter Developers](https://developer.twitter.com/en/docs/basics/response-codes)

[Updated on 06/03/2020] A detailed explanation of 420 was found in the following.

420 Rate Limited

The client has connected too frequently. For example, an endpoint returns this status if:

  • A client makes too many login attempts in a short period of time.
  • Too many copies of an application attempt to authenticate with the same credentials.
Excerpt from [Connecting to a streaming endpoint — Twitter Developers](https://developer.twitter.com/en/docs/tweets/filter-realtime/guides/connecting)

In the JSON body, the code field inside errors is set to 88.

{
  "errors": [
    {
      "code": 88,
      "message": "Rate limit exceeded"
    }
  ]
}

| Code | Text | Description |
|---|---|---|
| 88 | Rate limit exceeded | Corresponds with HTTP 429. The request limit for this resource has been reached for the current rate limit window. |
Excerpt from [Response codes — Twitter Developers](https://developer.twitter.com/en/docs/basics/response-codes)

For exceptions raised by libraries such as requests, see the documentation of each library.
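
Putting the status codes and the JSON error code together, a check for "is this a rate limit error?" might look like the sketch below. The function name is_rate_limited and the fallback to the body when the status code is something else are my own assumptions, not something the documentation prescribes.

```python
RATE_LIMIT_STATUS_CODES = (420, 429)
RATE_LIMIT_ERROR_CODE = 88


def is_rate_limited(status_code, body=None):
    """Return True if the response looks like a rate limit error.

    status_code: HTTP status code of the response
    body:        decoded JSON body (dict), or None if it could not be parsed
    """
    if status_code in RATE_LIMIT_STATUS_CODES:
        return True
    # Fall back to the error code in the JSON body (an assumption)
    errors = (body or {}).get("errors", [])
    return any(err.get("code") == RATE_LIMIT_ERROR_CODE for err in errors)


print(is_rate_limited(429, {"errors": [{"code": 88, "message": "Rate limit exceeded"}]}))
print(is_rate_limited(200, {"statuses": []}))
```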

Process flow

A processing flow that takes the Rate Limit into account would look roughly like this (pseudocode):

while True:
    try:
        res = request_to_api()    # GET/POST request to the API (placeholder)
        res.raise_for_status()
    except requests.exceptions.HTTPError:
        # 429 (or 420) is returned when the Rate Limit is reached
        if res.status_code in (420, 429):
            reset = get_rate_limit_reset()    # get Rate Limit information  <- here (placeholder)
            wait_until(reset)                 # wait quietly until the reset time (placeholder)
            continue
        raise    # exception handling other than 420/429
    except OtherException:    # placeholder for other exceptions
        raise    # exception handling

    # Processing when the data was acquired successfully
    break    # or return, yield, etc.
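
The same flow can be sketched as a runnable function by injecting the parts that touch the network. Everything here is hypothetical scaffolding: RateLimited stands in for "HTTP 420/429 was returned", request and get_reset are supplied by the caller, and the one-second margin and the retry cap are arbitrary choices.

```python
import time


class RateLimited(Exception):
    """Hypothetical exception raised by request() on HTTP 420/429."""


def call_with_rate_limit(request, get_reset, sleep=time.sleep, now=time.time,
                         max_retries=3):
    """Call request(); when rate limited, wait until reset and retry.

    request:   callable that performs the API call and raises RateLimited
    get_reset: callable returning the epoch time at which the limit resets
    sleep/now: injectable for testing; default to the time module
    """
    for attempt in range(max_retries + 1):
        try:
            return request()
        except RateLimited:
            if attempt == max_retries:
                raise  # give up after the last retry
            # Wait quietly until the reset time (never a negative duration)
            wait = max(0, get_reset() - now())
            sleep(wait + 1)  # one extra second as a safety margin
```

Passing sleep and now as parameters makes it possible to exercise the loop without actually waiting 15 minutes.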

The following is a concrete implementation of the "get Rate Limit information" part.

Sample: getting the limit information

GetTweetStatus class

For a mere sample of information acquisition there is not much merit in implementing it as a class, but with actual incorporation into a program in mind it seemed better to write it in a form that is easy to modularize, so I made it a class called GetTweetStatus. (I also wanted to keep things such as the API key and the Bearer token from being accessed from outside as much as possible.)

class GetTweetStatus:
    """Sample class for getting Twitter Rate Limit information"""

    def __init__(self, apikey, apisec, access_token="", access_secret=""):
        self._apikey = apikey
        self._apisec = apisec
        self._access_token = access_token
        self._access_secret = access_secret
        self._access_token_mask = re.compile(r'(?P<access_token>"access_token":)\s*".*"')

The last line, re.compile(), prepares a pattern for masking the received access_token when it is displayed.
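
How the mask behaves can be checked in isolation. This sketch reuses the same pattern and replacement string as the class; the helper name mask_access_token and the token value are dummies for illustration.

```python
import json
import re

# Same pattern as in __init__: capture the key, then replace the quoted value
ACCESS_TOKEN_MASK = re.compile(r'(?P<access_token>"access_token":)\s*".*"')


def mask_access_token(obj):
    """Serialize obj as indented JSON with the access_token value masked."""
    text = json.dumps(obj, indent=2, ensure_ascii=False)
    return ACCESS_TOKEN_MASK.sub(r'\g<access_token> "*******************"', text)


sample = {"rate_limit_context": {"access_token": "1234-dummyTokenValue"}}
print(mask_access_token(sample))
```

The pattern relies on json.dumps with indent putting each key on its own line, since "." does not match a newline.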

Acquisition part

user auth (OAuth v1.1)

GetTweetStatus.get_limit_status_v1()


    def get_limit_status_v1(self, resource_family="search"):
        """Get status using OAuth v1.1"""

        # Use OAuth1Session because OAuth is complicated
        oauth1 = OAuth1Session(self._apikey, self._apisec, self._access_token, self._access_secret)

        params = {
            'resources': resource_family  # help, users, search, statuses etc.
        }

        try:
            res = oauth1.get(STATUS_ENDPOINT, params=params, timeout=5.0)
            res.raise_for_status()
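        except (requests.Timeout, requests.ConnectionError) as err:
            # requests raises requests.Timeout, not the built-in TimeoutError
            raise requests.ConnectionError("Cannot get Limit Status") from err
        except Exception as err:
            raise Exception("Cannot get Limit Status") from err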
        return res.json()

app auth (OAuth v2.0)

GetTweetStatus.get_limit_status_v2()


    def get_limit_status_v2(self, resource_family="search"):
        """Get status using OAuth v2.0 (Bearer)"""
        bearer = self._get_bearer()  # Get Bearer

        headers = {
            'Authorization':'Bearer {}'.format(bearer),
            'User-Agent': USER_AGENT
        }
        params = {
            'resources': resource_family  # help, users, search, statuses etc.
        }

        try:
            res = requests.get(STATUS_ENDPOINT, headers=headers, params=params, timeout=5.0)
            res.raise_for_status()
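        except (requests.Timeout, requests.ConnectionError) as err:
            # requests raises requests.Timeout, not the built-in TimeoutError
            raise requests.ConnectionError("Cannot get Limit Status") from err
        except Exception as err:
            raise Exception("Cannot get Limit Status") from err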
        return res.json()

The flow is:

  1. Get the Bearer Token first.
  2. Set it in the Authorization header.
  3. Set the Resource Family as a parameter.
  4. Send the request to STATUS_ENDPOINT.

Bearer generation

This is the _get_bearer() method called in the line bearer = self._get_bearer() above.

GetTweetStatus._get_bearer(), _get_credential()


    def _get_bearer(self):
        """Get Bearer"""
        cred = self._get_credential()
        headers = {
            'Authorization': 'Basic ' + cred,
            'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
            'User-Agent': USER_AGENT
            }
        data = {
            'grant_type': 'client_credentials',
            }

        try:
            res = requests.post(TOKEN_ENDPOINT, data=data, headers=headers, timeout=5.0)
            res.raise_for_status()
        except (requests.Timeout, requests.ConnectionError) as err:
            raise Exception("Cannot get Bearer") from err
        except requests.exceptions.HTTPError:
            if res.status_code == 403:
                raise requests.exceptions.HTTPError("Auth Error")
            raise requests.exceptions.HTTPError("Other Exception")
        except Exception:
            raise Exception("Cannot get Bearer")
        rjson = res.json()
        return rjson['access_token']

    def _get_credential(self):
        """Credential generation"""
        pair = self._apikey + ':' + self._apisec
        bcred = b64encode(pair.encode('utf-8'))
        return bcred.decode()

The flow is:

  1. Join APIKEY and APISEC with a colon and Base64-encode the result.
  2. Set it in the Authorization header.
  3. Set the payload data to grant_type="client_credentials".
  4. Send a POST request to the endpoint.
  5. The Bearer Token is in "access_token" of the returned JSON, so take it from there.
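
Step 1 can be tried on its own with dummy values. Twitter's application-only authentication guide also describes URL-encoding the key and secret before joining them; the sketch below adds that step, which is not in the sample above (for typical keys the encoded result is the same). The function name make_basic_credential is hypothetical.

```python
from base64 import b64decode, b64encode
from urllib.parse import quote


def make_basic_credential(api_key, api_secret):
    """Build the value that follows 'Basic ' in the Authorization header."""
    # URL-encode each part first (an extra step not in the sample above)
    pair = quote(api_key, safe="") + ":" + quote(api_secret, safe="")
    return b64encode(pair.encode("utf-8")).decode("ascii")


cred = make_basic_credential("key", "secret")  # dummy values
headers = {"Authorization": "Basic " + cred}
print(headers)
print(b64decode(cred).decode())  # decodes back to the joined pair
```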

Display part

This is implemented as a method. In actual use, it would probably be implemented as something that returns reset.

GetTweetStatus.disp_limit_status()


    def disp_limit_status(self, version=2, resource_family="search"):
        """Display Rate Limit by version"""
        if version == 2:
            resj = self.get_limit_status_v2(resource_family=resource_family)
        elif version == 1:
            resj = self.get_limit_status_v1(resource_family=resource_family)
        else:
            raise Exception(f"Version error: {version}")

        # JSON display
        print(self._access_token_mask.sub(r'\g<access_token> "*******************"',
                                          json.dumps(resj, indent=2, ensure_ascii=False)))
        # Itemized display (example of getting remaining/reset)
        print("resources:")
        if 'resources' in resj:
            resources = resj['resources']
            for family in resources:
                print(f"  family: {family}")
                endpoints = resources[family]
                for endpoint in endpoints:
                    items = endpoints[endpoint]
                    print(f"    endpoint: {endpoint}")
                    limit = items['limit']
                    remaining = items['remaining']
                    reset = items['reset']
                    e2d = epoch2datetime(reset)
                    duration = get_delta(reset)
                    print(f"      limit: {limit}")
                    print(f"      remaining: {remaining}")
                    print(f"      reset: {reset}")         # <- in actual use, this is the value to return
                    print(f"      reset(epoch2datetime): {e2d}")
                    print(f"      duration: {duration} sec")
        else:
            print("  Not Available")

Time-related utilities & headers

Below are the time-related utilities and the beginning of the file.

getTwitterStatus.py


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Twitter Rate Limit Information acquisition sample"""

import os
import sys
import json
from base64 import b64encode
import datetime
import time
import re
import argparse

#!pip install requests
import requests
#!pip install requests_oauthlib
from requests_oauthlib import OAuth1Session


USER_AGENT = "Get Twitter Status Application/1.0"
TOKEN_ENDPOINT = 'https://api.twitter.com/oauth2/token'
STATUS_ENDPOINT = 'https://api.twitter.com/1.1/application/rate_limit_status.json'


def epoch2datetime(epoch):
    """Convert epoch time (UNIX time) to datetime (localtime)"""
    return datetime.datetime(*(time.localtime(epoch)[:6]))


def datetime2epoch(d_utc):
    """Convert datetime (UTC) to epoch time (UNIX time)"""
    # Treat the naive datetime as UTC and convert it to epoch seconds
    return int(d_utc.replace(tzinfo=datetime.timezone.utc).timestamp())


def get_delta(target_epoch_time):
    """Return the difference between target_epoch_time and the current time (seconds)"""
    return target_epoch_time - int(round(time.time(), 0))
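
When the difference is used as a sleep duration, it helps to clamp it so that it never goes negative. The following is a sketch along the lines of get_delta; the function name seconds_until_reset and the one-second margin are assumptions of mine.

```python
import time


def seconds_until_reset(reset, now=None, margin=1):
    """Return how many seconds to wait until 'reset' (never negative).

    reset:  epoch time taken from the rate limit status
    now:    current epoch time; defaults to time.time()
    margin: extra seconds to absorb clock differences (an assumption)
    """
    if now is None:
        now = time.time()
    return max(0, int(round(reset - now))) + margin


# 899 seconds before the reset time used in the earlier example
print(seconds_until_reset(1591016735, now=1591015836))
# The reset time has already passed: only the margin remains
print(seconds_until_reset(1591016735, now=1591020000))
```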

main() part

While I was at it, I made it possible to specify the OAuth version and the Resource Family with command line arguments.

main()


def main():
    """main()"""
    # Check environment variables such as API_KEY and API_SEC
    apikey = os.getenv('API_KEY', default="")
    apisec = os.getenv('API_SEC', default="")
    access_token = os.getenv('ACCESS_TOKEN', default="")
    access_secret = os.getenv('ACCESS_SECRET', default="")

    if apikey == "" or apisec == "":    # If the environment variables cannot be obtained
        print("Please set the environment variables API_KEY and API_SEC.", file=sys.stderr)
        print("If you use OAuth v1.1, also set the environment variables ACCESS_TOKEN and ACCESS_SECRET.",
              file=sys.stderr)
        sys.exit(255)

    # Argument settings
    parser = argparse.ArgumentParser()
    parser.add_argument('-a', '--oauthversion', type=int, default=0,
                        metavar='N', choices=(0, 1, 2),
                        help=u'OAuth version specification[1|2]')
    parser.add_argument('-f', '--family', type=str, default='search',
                        metavar='Family',
                        help=u'API family specification. Separated by commas for multiple')

    args = parser.parse_args()
    oauthversion = args.oauthversion
    family = args.family

    # Create a GetTweetStatus object
    gts = GetTweetStatus(apikey, apisec, access_token=access_token, access_secret=access_secret)

    # Get and display the Rate Limit using User Auth (OAuth v1.1)
    if (oauthversion in (0, 1)) and (access_token != "" and access_secret != ""):
        print("<<user auth (OAuth v1)>>")
        gts.disp_limit_status(version=1, resource_family=family)

    # Get and display the Rate Limit using App Auth (OAuth v2.0)
    if oauthversion in (0, 2):
        print("<<app auth (OAuth v2)>>")
        gts.disp_limit_status(version=2, resource_family=family)

if __name__ == "__main__":
    main()

Full code: getTwitterStatus.py (on Gist) [^2]

[^2]: I tried using Gist for the first time. I'm not sure whether I'm using it correctly.

Execution result

$ python3 getTwitterStatus.py
<<user auth (OAuth v1)>>
{
  "rate_limit_context": {
    "access_token": "*******************"
  },
  "resources": {
    "search": {
      "/search/tweets": {
        "limit": 180,
        "remaining": 180,
        "reset": 1591016735
      }
    }
  }
}
resources:
  family: search
    endpoint: /search/tweets
      limit: 180
      remaining: 180
      reset: 1591016735
      reset(epoch2datetime): 2020-06-01 22:05:35
      duration: 899 sec
<<app auth (OAuth v2)>>
{
  "rate_limit_context": {
    "application": "dummykey"
  },
  "resources": {
    "search": {
      "/search/tweets": {
        "limit": 450,
        "remaining": 450,
        "reset": 1591016736
      }
    }
  }
}
resources:
  family: search
    endpoint: /search/tweets
      limit: 450
      remaining: 450
      reset: 1591016736
      reset(epoch2datetime): 2020-06-01 22:05:36
      duration: 900 sec
$ 

Limitation on the number of tweets acquired

There is also a limit on the number of tweets you can get in a single request, separate from the time-based limit.

It is 200 ($count \leq 200$) for statuses/user_timeline and 100 ($count \leq 100$) for search/tweets. There are various other restrictions, but in the case of search/tweets, an item called next_results is included in search_metadata so that the results can be retrieved continuously.

For search/tweets

{
  "statuses": [
  ...
  ],
  "search_metadata": {
    "completed_in": 0.047,
    "max_id": 1125490788736032770,
    "max_id_str": "1125490788736032770",
    "next_results": "?max_id=1124690280777699327&q=from%3Atwitterdev&count=2&include_entities=1&result_type=mixed",
    "query": "from%3Atwitterdev",
    "refresh_url": "?since_id=1125490788736032770&q=from%3Atwitterdev&result_type=mixed&include_entities=1",
    "count": 2,
    "since_id": 0,
    "since_id_str": "0"
  }
}

search_metadata contains next_results, so if you send it as the parameters of a new request, you get the next part of the search results (in units of the number specified in count).
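
Since next_results is a query string, it can be turned into a parameter dict with the standard library. This is a sketch; the function name next_params is hypothetical, and the sample value is the one from the response above.

```python
from urllib.parse import parse_qsl


def next_params(search_metadata):
    """Return the parameters for the next request, or None when finished."""
    next_results = search_metadata.get("next_results")
    if not next_results:
        return None
    # Drop the leading '?' and decode the query string into a dict
    return dict(parse_qsl(next_results.lstrip("?")))


metadata = {
    "next_results": "?max_id=1124690280777699327&q=from%3Atwitterdev"
                    "&count=2&include_entities=1&result_type=mixed",
}
print(next_params(metadata))
```

parse_qsl also decodes percent-encoding, so q comes back as from:twitterdev and can be passed as-is to a library that encodes parameters itself.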

As long as you don't hit the time-based limit, you can repeat this to get the results continuously. That is, you can get $count$ (up to 100) $\times$ $limit$ (180 for user auth) $= 18{,}000$ tweets within one Rate Limit window.

In the case of the above sample, $count = 2$, so if you continue as it is you can get only $2 \text{ tweets/request} \times 180 \text{ requests/15 minutes} = 360 \text{ tweets/15 minutes}$ before reaching the limit (provided you actually make that many requests, of course).
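
The same arithmetic as a tiny sketch, using the limits quoted earlier (180 for user auth, 450 for app auth). The function name is only for illustration.

```python
def max_tweets_per_window(count, limit):
    """Upper bound on tweets obtainable in one 15-minute window."""
    return count * limit


print(max_tweets_per_window(100, 180))  # count=100, user auth
print(max_tweets_per_window(2, 180))    # count=2, as in the sample above
print(max_tweets_per_window(100, 450))  # count=100, app auth
```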

When all the search results have been retrieved, next_results disappears from search_metadata.

In addition, next_results occasionally reappears when you request again, so it may be worth waiting a while and retrying.
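
The repetition described above can be sketched as a generator. The actual HTTP call is left to a caller-supplied function, fetch, which is hypothetical: it takes a params dict and returns the decoded JSON. The cap of 180 pages simply mirrors the user auth limit per window.

```python
from urllib.parse import parse_qsl


def iter_search_pages(fetch, params, max_pages=180):
    """Yield pages of statuses until next_results disappears.

    fetch:     caller-supplied callable taking a params dict and returning
               the decoded JSON response (hypothetical; no HTTP here)
    params:    parameters of the first request
    max_pages: safety cap; 180 matches the user auth limit per window
    """
    for _ in range(max_pages):
        page = fetch(params)
        yield page.get("statuses", [])
        next_results = page.get("search_metadata", {}).get("next_results")
        if not next_results:
            break  # all results have been retrieved
        # Use next_results as the parameters of the next request
        params = dict(parse_qsl(next_results.lstrip("?")))
```

Rate limit handling is deliberately left out here; in practice fetch would be wrapped in the wait-and-retry flow described earlier.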

If there is no metadata

In the case of statuses/user_timeline etc., *_metadata is not included, so you need to make good use of max_id and build the equivalent of search's next_results yourself. (Actually, I have only used search, so I'm not sure, but I think this is not far off.)
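
One way to build that equivalent is to take the smallest id in the page just received and subtract one, since max_id is inclusive. This is a sketch I have not run against the real API; the function name next_max_id is hypothetical and each tweet is assumed to be a dict with an integer id.

```python
def next_max_id(tweets):
    """Return max_id for the next request, or None if the page is empty.

    max_id is inclusive, so one less than the smallest id already seen
    avoids fetching the same tweet again.
    """
    if not tweets:
        return None
    return min(tweet["id"] for tweet in tweets) - 1


page = [{"id": 1125490788736032770}, {"id": 1124690280777699328}]
print(next_max_id(page))
print(next_max_id([]))
```

An empty page (None here) would then be the signal to stop, playing the role that a missing next_results plays for search.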

In the case of search, the past 7 days are targeted, whereas user_timeline returns a user's most recent tweets (up to 3,200 according to the Twitter documentation) rather than a fixed time range, so I think the purpose is different in the first place ...

Summary

Reference (based on what was referenced in this article)
