[PYTHON] The "next_results" parameter of the Twitter Search API doesn't work!? Causes and remedies

I got stuck while collecting tweets with the Twitter Search API! There was a pitfall in an unexpected place...

About Twitter's Search API

This time, I used the Search API to get tweets. Here is a brief summary of the Search API specifications.

There are three types of Search API:

- Standard Search API (free)
- Premium Search API (paid)
- Enterprise Search API (paid)

This time, we will use the Standard Search API, which can be used for free.

Features of Standard Search API

- Free to use
- The number of requests is limited
  - With user auth (OAuth1): 180 requests / 15 minutes
  - With app auth (OAuth2): 450 requests / 15 minutes
- **Only tweets from the last 7 days can be retrieved**
  - The paid APIs can retrieve older tweets.
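To put the request limits in perspective, here is a small arithmetic sketch. It assumes requests are spread evenly over the 15-minute window and that every request returns the maximum of 100 tweets (the upper bound of the count parameter described below); the function names are made up for this illustration.

```python
WINDOW_SECONDS = 15 * 60        # length of one rate-limit window
MAX_TWEETS_PER_REQUEST = 100    # upper bound of the count parameter


def min_interval_seconds(requests_per_window):
    """Shortest even spacing between requests that stays within the limit."""
    return WINDOW_SECONDS / requests_per_window


def max_tweets_per_window(requests_per_window):
    """Best case: every request returns a full page of tweets."""
    return requests_per_window * MAX_TWEETS_PER_REQUEST


for name, limit in [('user auth', 180), ('app auth', 450)]:
    print(name, min_interval_seconds(limit), max_tweets_per_window(limit))
# user auth 5.0 18000
# app auth 2.0 45000
```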

Request parameters

| Parameter | Description | Remarks |
|---|---|---|
| q | Search query (required) | Supports the same kind of search as tweet search on Twitter; string only |
| geocode | Location the tweet was posted from | Specified as latitude, longitude, radius |
| lang | Language of the tweets | |
| locale | Language of the query | Currently only ja (Japanese) is valid |
| result_type | Type of tweets to retrieve | recent for the latest tweets, popular for popular tweets, mixed for both |
| count | Number of tweets to retrieve | Default is 15, maximum is 100 |
| until | Tweet date | Gets tweets before YYYY-MM-DD (cannot go back more than 7 days) |
| since_id | ID value | Gets tweets with IDs greater than the specified ID |
| max_id | ID value | Gets tweets with IDs less than or equal to the specified ID |
| include_entities | Whether to include entities | If false is specified, tweets are returned without entities information |
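As an illustration of how these parameters end up in a request, here is a minimal sketch that builds the query string with the standard library's urllib.parse.urlencode. The parameter values are examples chosen for this sketch. Note that the # in the hashtag becomes %23 once it is encoded, which matters later in this article.

```python
from urllib.parse import urlencode

SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'

# Example values only; other parameters from the table are added the same way.
params = {
    'q': '#Qiita',
    'lang': 'ja',
    'result_type': 'recent',
    'count': 100,
}

# urlencode percent-encodes each value and joins the pairs with '&'.
query_string = urlencode(params)
print(query_string)
# q=%23Qiita&lang=ja&result_type=recent&count=100
print(SEARCH_URL + '?' + query_string)
```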

Response parameters

| Parameter | Description | Remarks |
|---|---|---|
| statuses | List of tweets | The tweet objects are stored in a list |
| search_metadata | Search metadata | Contains metadata about the search |

Response example


{
  "statuses": [
(Omitted because it is a tweet object),
    ...
  ],
  "search_metadata": {
    "max_id": 250126199840518145,
    "since_id": 24012619984051000,
    "refresh_url": "?since_id=250126199840518145&q=%23freebandnames&result_type=mixed&include_entities=1",
    "next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
    "count": 4,
    "completed_in": 0.035,
    "since_id_str": "24012619984051000",
    "query": "%23freebandnames",
    "max_id_str": "250126199840518145"
  }
}
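The following sketch shows one way to read search_metadata from such a response. It uses a trimmed copy of the example above, with statuses left empty, and treats a missing next_results key as "no further page", which is the same assumption the program later in this article makes.

```python
import json

# Trimmed copy of the response example (statuses left empty for brevity).
raw = '''
{
  "statuses": [],
  "search_metadata": {
    "max_id": 250126199840518145,
    "next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
    "count": 4,
    "query": "%23freebandnames"
  }
}
'''

response = json.loads(raw)
metadata = response['search_metadata']

# Use .get() so that a response without next_results yields None
# instead of raising KeyError.
next_results = metadata.get('next_results')
has_next_page = next_results is not None

print(has_next_page)
print(next_results)
```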

The problem

Background

I was trying to collect a lot of tweets with the hashtag #Qiita.

However, the Standard Search API can return at most 100 tweets per request. Therefore, I tried to get 1000 tweets by calling the API repeatedly, using next_results from the search_metadata of each response. next_results holds a query string, and sending a request with this query returns the 101st and subsequent tweets. In other words:

Request -> Response -> Parse next_results -> Use it as the next request parameters -> Request -> ...

This is repeated until 1000 tweets have been collected.

(Reference: Get more than 100 tweets with Twitter API search/tweets (PHP))
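The flow above can be sketched without touching the real API. In the sketch below, fetch_page is a hypothetical in-memory stand-in for the Search API, and for simplicity its next_results is already a dict rather than a query string; FAKE_TWEETS and collect are likewise names invented for this illustration.

```python
# Fake data source: 250 tweets, newest (largest ID) first.
FAKE_TWEETS = [{'id': i} for i in range(250, 0, -1)]


def fetch_page(params):
    """Return up to params['count'] tweets with id <= params['max_id'] (if given)."""
    count = int(params.get('count', 15))
    max_id = params.get('max_id')
    candidates = [t for t in FAKE_TWEETS
                  if max_id is None or t['id'] <= int(max_id)]
    page = candidates[:count]
    metadata = {}
    if len(candidates) > count:
        # More tweets remain: point the next page just below the oldest one here.
        metadata['next_results'] = {'max_id': page[-1]['id'] - 1, 'count': count}
    return {'statuses': page, 'search_metadata': metadata}


def collect(target, count=100):
    """Follow next_results until `target` tweets are collected or pages run out."""
    params = {'count': count}
    collected = []
    while len(collected) < target:
        response = fetch_page(params)
        collected.extend(response['statuses'])
        next_results = response['search_metadata'].get('next_results')
        if next_results is None:
            break  # last page reached
        params = next_results
    return collected[:target]


tweets = collect(target=1000)
print(len(tweets))
```

Because the fake source only holds 250 tweets, the loop stops after the third page even though the target is 1000.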

However, **requests are only executed 3 times** and only 200 tweets can be retrieved! (The third response contains 0 tweets.) Even though there are clearly more than 200 tweets...

Program

The code is written in Python. The various API keys are registered in environment variables.

get_tweet.py


from requests_oauthlib import OAuth1Session
import os
import json

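# API keys, read from environment variables
# (the variable names below are assumed; use the names you registered)
CONSUMER_KEY = os.environ['CONSUMER_KEY']        # API key
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']  # API secret
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_SECRET = os.environ['ACCESS_SECRET']

# URL for getting tweets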
SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'


def search(params):
    twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
    req = twitter.get(SEARCH_URL, params = params)
    tweets = json.loads(req.text)
    return tweets

# Substitute for PHP's parse_str function
def parseToParam(parse_str, parse=None):
    if parse is None:
        parse = '&'
    return_params = {}
    parsed_str = parse_str.split(parse)
    for param_string in parsed_str:
        param, value = param_string.split('=', 1)
        return_params[param] = value
    return return_params

def main():
    search_word = '#Qiita'
    tweet_data = []

    # Tweet Search
    params = {
                'q'  : search_word,
            'count'  : 100,
             }
    tweet_count = 0

    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # Strip the leading '?'
            params = parseToParam(next_results)
            tweet_count += len(tweets['statuses'])
        else:
            break

if __name__=='__main__':
    main()

Investigation of the cause

Since the response parameter next_results is used as the parameters of the next request, I checked two things:

- Request parameters
- Response parameter next_results

Check request parameters and response parameters

First request

Request parameters

{
  'q'    : '#Qiita',
  'count': 100
}

Response parameter next_results

?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1

Second request

Request parameters

{
 'max_id': '1250763045871079425', 
  'q'    : '%23Qiita',
  'count': 100,
  'include_entities': '1'
}

Response parameter next_results

?max_id=1250673475351572480&q=%2523Qiita&count=100&include_entities=1

Third request

Request parameters

{
 'max_id': '1250673475351572480', 
  'q'    : '%2523Qiita',
  'count': 100,
  'include_entities': '1'
}

Response parameter next_results

None

Survey results

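The same query should have been carried over from request to request, but it actually changes as follows:

#Qiita -> %23Qiita -> %2523Qiita

#Qiita and %23Qiita are equivalent under URL encoding, but %2523Qiita is a completely different query. (You can check this by trying a URL encoding / decoding tool, such as the one listed in the references.)

In other words, the cause of the problem seems to be that **the % is encoded again** (% becomes %25) in the step from %23Qiita to %2523Qiita. This fits with the q value parsed from next_results still being URL-encoded, so that it gets encoded once more when it is sent as a request parameter.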
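The change in the query can be reproduced with the standard library alone. This is only a demonstration of double URL encoding using urllib.parse.quote and unquote; it does not call the Twitter API.

```python
from urllib.parse import quote, unquote

raw_query = '#Qiita'

encoded_once = quote(raw_query, safe='')      # '#' becomes '%23'
encoded_twice = quote(encoded_once, safe='')  # '%' becomes '%25'

print(encoded_once)   # %23Qiita
print(encoded_twice)  # %2523Qiita

# Decoding once only undoes the outer layer, leaving the literal
# text '%23Qiita' rather than the hashtag '#Qiita'.
print(unquote(encoded_twice))           # %23Qiita
print(unquote(unquote(encoded_twice)))  # #Qiita
```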

Solution

After parsing the response parameter next_results, replace **%25 in the request parameter with %**.

Program modification

The parameter replacement is added inside the while loop.

get_tweet.py


    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # Strip the leading '?'
            params = parseToParam(next_results)
            # Added: restore the doubly encoded '%25' to '%'
            params['q'] = params['q'].replace('%25', '%')
            tweet_count += len(tweets['statuses'])
        else:
            break
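As an alternative to the string replacement (this is not the method used in this article), the values in next_results could be decoded while parsing, so that they are encoded only once when the request is sent. Below is a minimal sketch using the standard library's urllib.parse.parse_qsl; the function name parse_next_results is hypothetical, and the variant has not been tested against the live API.

```python
from urllib.parse import parse_qsl


def parse_next_results(next_results):
    """Turn a next_results string into a dict of decoded request parameters."""
    query = next_results.lstrip('?')  # drop the leading '?'
    # parse_qsl splits on '&' and '=' and percent-decodes each value.
    return dict(parse_qsl(query))


params = parse_next_results(
    '?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1'
)
print(params['q'])  # #Qiita
```

With this approach, q goes back to '#Qiita', the same value as in the first request, and the resulting dict could be used in place of the one returned by parseToParam.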

Summary

In the query q contained in the response parameter next_results, the % produced by URL encoding was encoded a second time. As a result, the query was not carried over correctly and tweets could not be retrieved.

The solution was to restore the doubly encoded % in next_results with a string replacement.

References

- Twitter API official documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
- Twitter developer documentation, Japanese translation: http://westplain.sakuraweb.com/translate/twitter/Documentation/REST-APIs/Public-API/GET-search-tweets.cgi
- Get more than 100 tweets with Twitter API search/tweets (PHP): https://blog.apar.jp/php/3007/
- URL encoding / decoding: https://tech-unlimited.com/urlencode.html
