Collect tweets using tweepy in Python and save them in MongoDB

I usually use C# and Java for work, but I've always been interested in Python. Data analysis and machine learning are so popular right now that I decided to take this opportunity to study Python!

For Python, I'm reading and learning from the recently published [Introduction to Python 3](http://www.amazon.co.jp/%E5%85%A5%E9%96%80-Python-3-Bill-Lubanovic/dp/4873117380/ref=sr_1_1?ie=UTF8&qid=1457003044&sr=8-1&keywords=%E5%85%A5%E9%96%80+Python3).

Many people have already written about this topic, but I wrote a program that saves Twitter search results in MongoDB, so I'd like to share it. I'd be very happy if you could suggest ways to dig deeper into things like this!

Environment

Various settings

Please set these values according to your environment.

config.py


# coding=utf-8

# mongodb
HOST = 'localhost'
PORT = 27017
DB_NAME = 'twitter-archive'
COLLECTION_NAME = 'tweets'

# twitter
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
ACCESS_TOKEN_KEY = ''
ACCESS_TOKEN_SECRET = ''
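
As an aside, instead of hard-coding the Twitter credentials here, they could also be read from environment variables so that config.py can be shared safely. This is just a sketch of that idea; the environment variable names are my own, not part of the original setup.

# config.py (alternative sketch: read secrets from environment variables)
import os

# mongodb
HOST = os.environ.get('MONGO_HOST', 'localhost')
PORT = int(os.environ.get('MONGO_PORT', '27017'))
DB_NAME = 'twitter-archive'
COLLECTION_NAME = 'tweets'

# twitter (falls back to an empty string if the variable is not set)
CONSUMER_KEY = os.environ.get('TWITTER_CONSUMER_KEY', '')
CONSUMER_SECRET = os.environ.get('TWITTER_CONSUMER_SECRET', '')
ACCESS_TOKEN_KEY = os.environ.get('TWITTER_ACCESS_TOKEN_KEY', '')
ACCESS_TOKEN_SECRET = os.environ.get('TWITTER_ACCESS_TOKEN_SECRET', '')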

Search keywords

I decided to manage the keywords used in the Twitter search in a YAML file.

keywords.yml


# Define the Twitter search keywords as a list.
# The following is an example.
- 'hamburger'
- 'baseball'
- 'Christmas'
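
For reference, archive.py later reads this file and joins the keywords into a single OR query string. A quick standalone check of that step, assuming PyYAML is installed:

import yaml

# Read the keyword list and build the string for an OR search.
with open('keywords.yml', 'r') as file:
    keywords = yaml.safe_load(file)  # -> ['hamburger', 'baseball', 'Christmas']

query_string = ' OR '.join(keywords)
print(query_string)  # hamburger OR baseball OR Christmas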

Log output class

I created a wrapper class while investigating how to use the logging module. There is still a lot I don't understand, so I'm studying the detailed settings, but I've confirmed that it can write logs.

logger.py


# coding=utf-8

import logging
from logging.handlers import TimedRotatingFileHandler


class Logger:
    def __init__(self, log_type):
        logger = logging.getLogger(log_type)
        logger.setLevel(logging.DEBUG)
        # Rotate the log file daily and keep 30 generations (I'm still studying the detailed settings).
        handler = TimedRotatingFileHandler(filename='archive.log', when='D', backupCount=30)
        formatter = logging.Formatter('[%(asctime)s] %(name)s %(levelname)s %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        self.logger = logger

    def info(self, msg, *args, **kwargs):
        self.logger.info(msg, *args, **kwargs)

    def debug(self, msg, *args, **kwargs):
        self.logger.debug(msg, *args, **kwargs)

    def error(self, msg, *args, **kwargs):
        self.logger.error(msg, *args, **kwargs)

    def exception(self, msg, *args, exc_info=True, **kwargs):
        # Pass exc_info as a keyword argument so the stack trace is recorded correctly.
        self.logger.exception(msg, *args, exc_info=exc_info, **kwargs)
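
Usage looks like this; just a small check I ran, not part of the batch itself:

from logger import Logger

log = Logger('example')
log.info('Starting the archive batch.')
log.debug('since_id = {0}'.format(12345))
try:
    1 / 0
except ZeroDivisionError as e:
    # exception() also records the stack trace in archive.log.
    log.exception(str(e))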

Main search & save process

I'm planning to run this as a batch job once a week to accumulate tweets regularly. Understanding the Twitter API specification was harder than I expected. I thought about how to control the requests with since_id and max_id so that duplicate tweets are not fetched. Was there a better way to do this?

archive.py


# coding: utf-8

import sys

import yaml
from pymongo import MongoClient, DESCENDING
from tweepy import OAuthHandler, API, TweepError
from tweepy.parsers import JSONParser

import config
from logger import Logger


def archive():

    # Read the list of search keywords from the YAML file and build the string for an OR search.
    with open('keywords.yml', 'r') as file:
        keywords = yaml.safe_load(file)
    query_string = ' OR '.join(keywords)

    # Initialize the logger.
    logger = Logger('archive')

    # Create the client for the Twitter search.
    auth = OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
    auth.set_access_token(config.ACCESS_TOKEN_KEY, config.ACCESS_TOKEN_SECRET)
    # I want to receive the results as JSON, so set JSONParser.
    # Even when the search rate limit is reached, the library should handle the waiting for us.
    twitter_client = API(auth, parser=JSONParser(), wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    if twitter_client is None:
        logger.error('Authentication failed.')
        sys.exit(-1)

    # Initialize the MongoDB collection that stores the tweets.
    client = MongoClient(config.HOST, config.PORT)
    tweet_collection = client[config.DB_NAME][config.COLLECTION_NAME]

    # Find the newest tweet already stored, so that only tweets with a later id are fetched.
    last_tweet = tweet_collection.find_one(sort=[('id', DESCENDING)])
    since_id = None if last_tweet is None else last_tweet['id']

    # On the first request max_id is not sent, so initialize it to -1.
    max_id = -1

    # The search ends when tweet_count reaches max_tweet_count.
    # Set max_tweet_count to a large value.
    tweet_count = 0
    max_tweet_count = 100000

    logger.info('Collecting up to {0} tweets.'.format(max_tweet_count))
    while tweet_count < max_tweet_count:
        try:
            params = {
                'q': query_string,
                'count': 100,
                'lang': 'ja'
            }
            # Pass max_id and since_id as parameters only when they are set.
            if max_id > 0:
                params['max_id'] = str(max_id - 1)
            if since_id is not None:
                params['since_id'] = since_id

            search_result = twitter_client.search(**params)
            statuses = search_result['statuses']

            # Check whether we have reached the end of the search results.
            if statuses is None or len(statuses) == 0:
                logger.info('No more tweets were found.')
                break

            tweet_count += len(statuses)
            logger.debug('Fetched {0} tweets so far.'.format(tweet_count))

            result = tweet_collection.insert_many(statuses)
            logger.debug('Saved to MongoDB. Inserted IDs: {0}'.format(result.inserted_ids))

            # Update max_id with the id of the last tweet fetched.
            max_id = statuses[-1]['id']

        except (TypeError, TweepError) as e:
            print(str(e))
            logger.exception(str(e))
            break


if __name__ == '__main__':
    archive()
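
About the duplicate problem mentioned above: one extra safety net, which is not in the code above, would be a unique index on the tweet id field so that MongoDB itself rejects anything that slips past the since_id / max_id control. A sketch of that idea, reusing the same config.py:

import config
from pymongo import MongoClient, ASCENDING
from pymongo.errors import BulkWriteError

client = MongoClient(config.HOST, config.PORT)
collection = client[config.DB_NAME][config.COLLECTION_NAME]

# Reject documents whose 'id' already exists in the collection.
collection.create_index([('id', ASCENDING)], unique=True)

# With the unique index in place, insert_many() raises BulkWriteError on
# duplicates, so the insert in archive.py would need ordered=False plus
# error handling, for example:
# try:
#     tweet_collection.insert_many(statuses, ordered=False)
# except BulkWriteError:
#     pass  # duplicates are skipped; the remaining tweets are still inserted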

Summary

I haven't mastered Python at all yet, but it felt like a language in which I can write what I want to write. I will keep studying. Next, I'd like to try analyzing the collected tweets with data analysis libraries!
