It's obvious once you think about it carefully, but it cost me quite a bit of time, so I'm leaving this as a memo.
To collect data for deep learning experiments, I wrote a Python script that retrieves tweets through the Twitter API and saves them to a database every 15 minutes. Occasionally, though, the number of tweets retrieved jumped abnormally. The average is about 20 tweets per 15 minutes, but some runs suddenly picked up 600 to 700. This happened almost every day, but not at a fixed time, and the number of occurrences per day varied.
The script saves the ID of the newest tweet from the previous run. On the next run, it walks back from the newest tweet until it reaches that saved tweet, collecting everything in between.
--------------------Current fetch--------------------
Tweet 1 ID 899673612013064192 <- Start here and walk back
Tweet 2 ID 899673575619141633
Tweet 3 ID 899673508619276288
. . .
. . .
. . .
Tweet n ID 899669914251796480
--------------------Current fetch--------------------
--------------------Previous fetch--------------------
Tweet 1' ID 899669914251796480 <- Stop when this tweet is reached
Tweet 2' ID 899669747448414209
Tweet 3' ID 899669628170911750
. . .
. . .
. . .
Tweet n' ID 899668363969941506
As shown above, 100 tweets are fetched from the latest tweet downward, and the loop ends when a tweet's ID matches the newest ID saved on the previous run. In Python it looked like this:
fetch.py
for tweet in fetched:
    if tweet["id_str"] == last_time_id:  # last_time_id is a string
        break
    else:
        tweets.append(tweet)
The cause was that the tweet with the ID last_time_id had been deleted before the next fetch. If it was a retweet, the retweet may have been undone instead.
In other words, no tweet matching last_time_id exists anymore, so the comparison with tweet["id_str"] never succeeds and the script keeps fetching older tweets without ever reaching the stop condition.
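The failure can be reproduced without calling the API. The following is a minimal sketch: `collect_new` is a hypothetical helper wrapping the loop above, and the dictionaries are hand-made stand-ins for API results, using IDs from the listing earlier in this memo.

```python
def collect_new(fetched, last_time_id):
    """Collect tweets until id_str exactly matches last_time_id (the original logic)."""
    tweets = []
    for tweet in fetched:
        if tweet["id_str"] == last_time_id:
            break
        tweets.append(tweet)
    return tweets


# Newest first. These are stand-ins, not real API responses.
timeline = [
    {"id_str": "899673612013064192"},
    {"id_str": "899673575619141633"},
    {"id_str": "899669914251796480"},  # saved as last_time_id on the previous run
    {"id_str": "899669747448414209"},
    {"id_str": "899669628170911750"},
]
last_time_id = "899669914251796480"

# Normal case: the loop stops at the saved tweet.
print(len(collect_new(timeline, last_time_id)))  # 2

# The saved tweet was deleted in the meantime: nothing matches,
# so every remaining (already stored) tweet is collected again.
after_delete = [t for t in timeline if t["id_str"] != last_time_id]
print(len(collect_new(after_delete, last_time_id)))  # 4
```

With a real timeline the second case does not stop after four tweets. It keeps going for as long as older tweets are returned, which would explain the sudden jumps.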
id_str in the API response is a string version of id, which is originally a numeric value (apparently the string is provided because the numeric value causes errors in some languages). Comparing it for exact equality turned out to be the cause.
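As far as I understand, the usual reason for the string field is that tweet IDs are 64-bit integers, which is more than a language that stores every number as a double-precision float (JavaScript, for example) can represent exactly. Python integers have arbitrary precision, so converting id_str with int() loses nothing. A small sketch of the difference, using Python's float to stand in for such a language:

```python
tweet_id_str = "899673575619141633"  # an ID from the listing above

as_int = int(tweet_id_str)   # exact: Python ints are arbitrary precision
as_double = float(as_int)    # what a double-only language would store

print(str(as_int) == tweet_id_str)  # True: the round trip through int is lossless
print(int(as_double) == as_int)     # False: the double cannot hold this value exactly
```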
A tweet's ID is a number that increases over time, so the following change makes the bug go away.
fetch_fixed.py
for tweet in fetched:
    if tweet["id"] <= int(last_time_id):
        break
    else:
        tweets.append(tweet)
I just converted both sides to numbers and changed the comparison to <=.
With this, even if the tweet with the ID last_time_id has been deleted, the tweet immediately before it has a smaller ID, so the loop breaks at that point.
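To check this behaviour offline, here is a minimal sketch under the same kind of assumptions as the rest of this memo: `collect_new_fixed` is a hypothetical helper wrapping the fixed loop, and the timeline is built by hand from IDs in the listing above rather than fetched from the API.

```python
def collect_new_fixed(fetched, last_time_id):
    """Collect tweets until one is as old as, or older than, last_time_id."""
    tweets = []
    for tweet in fetched:
        if tweet["id"] <= int(last_time_id):
            break
        tweets.append(tweet)
    return tweets


# Newest first. Hand-made stand-ins for API results.
ids = [
    899673612013064192,
    899673575619141633,
    899669914251796480,  # saved as last_time_id on the previous run
    899669747448414209,
]
timeline = [{"id": i, "id_str": str(i)} for i in ids]
last_time_id = "899669914251796480"

# Saved tweet still present: stops at it, as before.
print(len(collect_new_fixed(timeline, last_time_id)))  # 2

# Saved tweet deleted: stops at the next older tweet instead of running on.
after_delete = [t for t in timeline if t["id"] != int(last_time_id)]
print(len(collect_new_fixed(after_delete, last_time_id)))  # 2
```

Both cases return only the two tweets that are newer than the saved ID.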