A Python program that collects tweets containing specific keywords daily and saves them in csv


I have an idea that social media data such as Twitter could be used to monitor the risk of new coronavirus infection clusters, so I would like to collect tweets by keyword search, for example "today" and "drinking party". Since the free search API can only reach back one week, I will build a mechanism that collects the data automatically every day, with possible use in future research in mind.

If you have a good idea for a search word that could be used to evaluate the risk of a new coronavirus infection cluster, please leave a comment!

Reference URLs

Official site: https://developer.twitter.com/en/docs/twitter-api
A very easy-to-understand explanation site: https://gaaaon.jp/blog/twitterapi

Sadly, this article does not even reach the level of the "I tried it, for my own satisfaction" code mentioned at the link above, so in some cases I may make the article private.


Execute the following nomikai_tweets.py.

# coding: utf-8
# nomikai_tweets.py

import pandas as pd
import json
import schedule
from time import sleep
from requests_oauthlib import OAuth1Session
import datetime
from datetime import date, timedelta
import pytz

def convert_to_datetime(datetime_str):
    """Convert a Twitter created_at string to a timezone-aware datetime"""
    tweet_datetime = datetime.datetime.strptime(datetime_str,'%a %b %d %H:%M:%S %z %Y')
    return tweet_datetime

def job():
    """Job that main() runs repeatedly"""

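    # Search keywords ("today" and "drinking party"), excluding retweets
    # (the words are shown translated here; with lang='ja' below, an actual query would use the Japanese words)
    keyword = "today drinking party exclude:retweets"
    # Output directory; create it beforehand
    DIR = 'nomikai/'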

    #Information obtained by API communication and developer registration
    Consumer_key = 'bT*****************'
    Consumer_secret = 've*****************'
    Access_token = '25*****************'
    Access_secret = 'NT*****************'
    url = "https://api.twitter.com/1.1/search/tweets.json"
    twitter = OAuth1Session(Consumer_key, Consumer_secret, Access_token, Access_secret)

    #Parameters used for collection
    max_id = -1
    count = 100
    params = {'q' : keyword, 'count' : count, 'max_id' : max_id, 'lang' : 'ja', 'tweet_mode' : 'extended'}

    # Date handling: compute the day boundaries in Japan time (JST) so they can be compared with the tweets' UTC timestamps
    today = datetime.datetime.now(pytz.timezone('Asia/Tokyo'))
    today_beggining_of_day = today.replace(hour=0, minute=0, second=0, microsecond=0)
    yesterday_beggining_of_day = today_beggining_of_day - timedelta(days=1)
    yesterday_str = datetime.datetime.strftime(yesterday_beggining_of_day, '%Y-%m-%d')

    # Column names of the final DataFrame; each record collected in the while loop follows this order
    columns = ['time', 'user.id', 'user.location', 'full_text', 'user.followers_count', 'user.friends_count', 'user.description', 'id']
    records = []

    while True:
        if max_id != -1: # Continue from just before the oldest tweet id already fetched
            params['max_id'] = max_id - 1
        req = twitter.get(url, params = params)

        if req.status_code == 200: # The request succeeded
            search_timeline = json.loads(req.text)

            if search_timeline['statuses'] == []:  # All available tweets have been fetched
                break

            for tweet in search_timeline['statuses']:
                # Tweet time (UTC)
                tweet_datetime = convert_to_datetime(tweet['created_at'])
                # Was the tweet posted yesterday in JST?
                in_jst_yesterday = today_beggining_of_day > tweet_datetime >= yesterday_beggining_of_day

                if not in_jst_yesterday:  # Skip tweets that are not from yesterday
                    continue

                # Store the record (fields reconstructed from the column names above)
                records.append([tweet_datetime,
                                tweet['user']['id'],
                                tweet['user']['location'],
                                tweet['full_text'],
                                tweet['user']['followers_count'],
                                tweet['user']['friends_count'],
                                tweet['user']['description'],
                                tweet['id']])

            max_id = search_timeline['statuses'][-1]['id']

        else: # Wait 15 minutes when the rate limit is hit
            print("Total", len(records), "tweets were extracted", sep=" ")
            print('waiting for 15 min ...')
            sleep(15 * 60)

    df = pd.DataFrame(records, columns=columns)
    # Keep the column datetime-typed (UTC) even when no tweets matched
    df['time'] = pd.to_datetime(df['time'], utc=True)
    df = df.set_index("time")
    df.index = df.index.tz_convert('Asia/Tokyo')
    df.to_pickle(DIR + yesterday_str + keyword +".pkl")
    df.to_csv(DIR + yesterday_str + keyword +".csv")
    print(today, "Total", df.shape[0], "tweets were extracted!\nnext start at 01:00 tomorrow")

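def main():
    print("start at 01:00 tomorrow")
    # Run at 01:00 every day
    schedule.every().day.at("01:00").do(job)

    while True:
        # Run the job when it is due, then check again one second later
        schedule.run_pending()
        sleep(1)

if __name__ == '__main__':
    main()
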

I want to start the analysis as soon as enough data has been collected. One week of past data is not enough to evaluate even the periodic variation by day of the week.
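To illustrate the day-of-week point: with only one week of data, each weekday has a single observation, so no average can be formed. The sketch below groups daily tweet counts by weekday once several weeks exist. The function name `mean_count_by_weekday` is hypothetical and the counts are made up for illustration; they are not results from the collected data.

```python
from collections import defaultdict
from datetime import date


def mean_count_by_weekday(daily_counts):
    """Average tweet count per weekday (0 = Monday ... 6 = Sunday)."""
    grouped = defaultdict(list)
    for day, count in daily_counts.items():
        grouped[day.weekday()].append(count)
    return {weekday: sum(c) / len(c) for weekday, c in grouped.items()}


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    counts = {
        date(2020, 8, 7): 120,   # Friday
        date(2020, 8, 14): 100,  # Friday
        date(2020, 8, 10): 40,   # Monday
    }
    print(mean_count_by_weekday(counts))
```

In practice the `{date: count}` mapping would be built from the number of rows in each daily CSV file.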


I learned that collecting data every day requires care in handling Japan time (JST) and standard time (UTC). Note that datetime.datetime.now() called without a timezone depends on the environment in which the program runs, so source code that relies on it will not work properly on a machine in another country. The same applies to schedule.every().day.at("01:00").do(job), which uses the machine's local time.
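As a minimal sketch of date logic that does not depend on the machine's local clock, the snippet below derives "yesterday in JST" from a UTC timestamp and computes the wait until the next 01:00 JST. The helper names `jst_yesterday_window` and `seconds_until_next_run` are hypothetical and do not appear in the script above. A fixed UTC+9 offset is assumed, which holds because Japan does not observe daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: JST is a fixed UTC+9 offset (no daylight saving time in Japan).
JST = timezone(timedelta(hours=9), name="JST")


def jst_yesterday_window(now_utc):
    """Return (start, end) of yesterday in JST for a timezone-aware `now_utc`."""
    now_jst = now_utc.astimezone(JST)
    today_start = now_jst.replace(hour=0, minute=0, second=0, microsecond=0)
    return today_start - timedelta(days=1), today_start


def seconds_until_next_run(now_utc, hour=1):
    """Seconds from `now_utc` until the next `hour`:00 in JST."""
    now_jst = now_utc.astimezone(JST)
    next_run = now_jst.replace(hour=hour, minute=0, second=0, microsecond=0)
    if next_run <= now_jst:
        next_run += timedelta(days=1)
    return (next_run - now_jst).total_seconds()


if __name__ == "__main__":
    now = datetime.now(timezone.utc)  # aware UTC time, the same on any machine
    start, end = jst_yesterday_window(now)
    print("collect tweets with", start.isoformat(), "<= created_at <", end.isoformat())
    print("sleep", seconds_until_next_run(now), "seconds until 01:00 JST")
```

Because both helpers take an explicit aware timestamp, they give the same result wherever the program runs, and they can be checked with fixed inputs.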

Of the past tweets containing "today" and "drinking party" that I was able to extract, about 10% also contained "online." Also, many Twitter users seem to dislike company drinking parties.
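A share like the one above could be computed from a saved CSV roughly as follows. This is a sketch, not the analysis actually used for that figure: the helper names `load_full_texts` and `share_containing` are hypothetical, the file path in the usage comment is a placeholder, and only the `full_text` column name comes from the script.

```python
import csv


def load_full_texts(csv_path, column="full_text"):
    """Read one text column from a CSV file written by df.to_csv()."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]


def share_containing(texts, word):
    """Fraction of `texts` that contain `word` (0.0 for an empty list)."""
    if not texts:
        return 0.0
    hits = sum(1 for text in texts if word in text)
    return hits / len(texts)


if __name__ == "__main__":
    # Made-up sample strings, for illustration only (not collected data).
    sample = ["online drinking party today", "drinking party with coworkers today"]
    print(share_containing(sample, "online"))
    # With a real file (placeholder path):
    # texts = load_full_texts("nomikai/<date><keyword>.csv")
```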
