Getting a large amount of Starbucks Twitter data with Python and trying data analysis, Part 1

I always work on my MacBook Air at Starbucks, and I cause trouble by staying for hours. To express my thanks to Starbucks, I would like to analyze some data that might help them. This article is about collecting a large number of tweets that contain "Starbucks" in the text and seeing what data analysis can tell us about them. This is not stealth marketing, although in the sense of giving something back to Starbucks, maybe it is (・ ω ・)

1. Get account information to connect to Twitter API

If you search Google for ["twitter api account"](https://www.google.co.jp/search?q=twitter+api+%E3%82%A2%E3%82%AB%E3%82%A6%E3%83%B3%E3%83%88), you will find many sites that clearly explain how to register. Follow them to obtain the information needed to access the API (specifically consumer_key, consumer_secret, access_token, and access_secret).

2. Installation of various required libraries

This assumes that a basic Python environment such as IPython is already in place. If you have the libraries listed here, you are mostly set. In addition, install an authentication library for using the Twitter REST APIs.

pip install requests_oauthlib

mongoDB is used to store the data, so install it by referring to here and [here](http://qiita.com/hajimeni/items/3c93fd981e92f66a20ce). For an overview of mongoDB, see "The Little MongoDB Book".

To access mongoDB from Python, install pymongo as well.

pip install pymongo

3. Initialization process

from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, datetime, time, pytz, re, sys, traceback, pymongo
#from pymongo import Connection     # The Connection class is obsolete, so MongoClient is used instead
from pymongo import MongoClient
from collections import defaultdict
import numpy as np

KEYS = { # Fill in the keys you obtained with your account below
    'consumer_key': '**********',
    'consumer_secret': '**********',
    'access_token': '**********',
    'access_secret': '**********',
}

twitter = None
connect = None
db      = None
tweetdata = None
meta    = None

def initialize(): # Initial processing: Twitter connection information and connection to mongoDB
    global twitter, connect, db, tweetdata, meta
    twitter = OAuth1Session(KEYS['consumer_key'], KEYS['consumer_secret'],
                            KEYS['access_token'], KEYS['access_secret'])
#   connect = Connection('localhost', 27017)     # The Connection class is obsolete, so MongoClient is used instead
    connect = MongoClient('localhost', 27017)
    db = connect.starbucks
    tweetdata = db.tweetdata
    meta = db.metadata

initialize()
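
Hard-coding the keys in the script makes it easy to publish them by accident. As a minimal sketch of one alternative, the keys could be read from environment variables instead. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_keys` are assumptions made for this illustration; they are not part of the original code.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    'consumer_key': 'TWITTER_CONSUMER_KEY',
    'consumer_secret': 'TWITTER_CONSUMER_SECRET',
    'access_token': 'TWITTER_ACCESS_TOKEN',
    'access_secret': 'TWITTER_ACCESS_SECRET',
}

def load_keys(env=None):
    """Build a KEYS-style dict from environment variables.

    Raises KeyError naming every variable that is missing or empty,
    so a misconfiguration is reported before any request is sent.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

With this in place, `KEYS = load_keys()` could replace the literal dict, and the rest of the initialization would stay the same.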

4. Searching for tweets

Use the code below to import tweets that include "Starbucks" in the text into mongoDB.

# Get 100 tweets from the Twitter REST APIs for the given search word
def getTweetData(search_word, max_id, since_id):
    global twitter
    url = 'https://api.twitter.com/1.1/search/tweets.json'
    params = {'q': search_word,
              'count': '100',
    }
    # Set max_id if it is specified
    if max_id != -1:
        params['max_id'] = max_id
    # Set since_id if it is specified
    if since_id != -1:
        params['since_id'] = since_id
    req = twitter.get(url, params = params)   # Get tweet data

    # Unpack the acquired data
    if req.status_code == 200: # On success
        timeline = json.loads(req.text)
        metadata = timeline['search_metadata']
        statuses = timeline['statuses']
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0
        return {"result":True, "metadata":metadata, "statuses":statuses, "limit":limit, "reset_time":datetime.datetime.fromtimestamp(float(reset)), "reset_time_unix":reset}
    else: # On failure
        print("Error: %d" % req.status_code)
        return {"result":False, "status_code":req.status_code}

# Converts a date string into a datetime in the Japan time zone
def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date,'%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))

#Returns the current time in UNIX Time
def now_unix_time():
    return time.mktime(datetime.datetime.now().timetuple())
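
To see what `str_to_date_jp` does to a concrete value, here is a small standalone sketch of the same conversion. It uses only the standard library with a fixed +09:00 offset instead of `pytz` (Japan does not observe daylight saving time), which is a substitution made for this example. The function name `to_jst` is also made up here.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset for Japan Standard Time (UTC+9, no daylight saving time).
JST = timezone(timedelta(hours=9), 'JST')

def to_jst(created_at):
    """Parse a Twitter 'created_at' string and convert it to Japan time."""
    # %z reads the '+0000' offset, so the parsed value is timezone-aware.
    utc_dt = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
    return utc_dt.astimezone(JST)

print(to_jst('Wed Mar 11 04:43:42 +0000 2015').isoformat())
# 2015-03-11T13:43:42+09:00
```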

Here is the tweet acquisition loop.

#-------------Get Tweet data repeatedly-------------#
sid = -1
mid = -1
count = 0

res = None
while True:
    try:
        count = count + 1
        sys.stdout.write("%d, " % count)
        res = getTweetData(u'Starbucks', max_id=mid, since_id=sid)
        if not res['result']:
            # Exit if the request failed
            print("status_code", res['status_code'])
            break

        if int(res['limit']) == 0:    # The rate limit has been reached, so take a break
            # Use the pause to add the datetime-typed field 'created_datetime'
            print("Adding created_at field.")
            for d in tweetdata.find({'created_datetime':{ "$exists": False }},{'_id':1, 'created_at':1}):
                #print(str_to_date_jp(d['created_at']))
                tweetdata.update_one({'_id' : d['_id']},
                     {'$set' : {'created_datetime' : str_to_date_jp(d['created_at'])}})

            # Compute the waiting time. Resume 5 seconds after the limit resets
            diff_sec = int(res['reset_time_unix']) - now_unix_time()
            print("sleep %d sec." % (diff_sec+5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        else:
            # metadata processing
            if len(res['statuses'])==0:
                sys.stdout.write("statuses is none. ")
            elif 'next_results' in res['metadata']:
                # Store the result in mongoDB
                meta.insert_one({"metadata":res['metadata'], "insert_date": now_unix_time()})
                for s in res['statuses']:
                    tweetdata.insert_one(s)
                # Take max_id for the next request from next_results
                next_url = res['metadata']['next_results']
                pattern = r".*max_id=([0-9]*)\&.*"
                ite = re.finditer(pattern, next_url)
                for i in ite:
                    mid = i.group(1)
                    break
            else:
                sys.stdout.write("next is none. finished.")
                break
    except SSLError as e:
        print("SSLError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except ConnectionError as e:
        print("ConnectionError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except ReadTimeout as e:
        print("ReadTimeout: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except Exception:
        print("Unexpected error:", sys.exc_info()[0])
        traceback.print_exc()
        raise
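
The waiting-time calculation in the loop can be pulled out into a small function, which makes it easy to check on its own. This is only a sketch under assumptions: the name `seconds_to_wait` and the `margin` parameter are invented here, and it uses `time.time()` for the current UNIX time.

```python
import time

def seconds_to_wait(reset_time_unix, now=None, margin=5):
    """Return how many seconds to sleep until the rate limit resets.

    reset_time_unix: value of the 'x-rate-limit-reset' header (UNIX time).
    now: current UNIX time; defaults to time.time().
    margin: extra seconds added as a buffer, like the 5 seconds in the loop.
    """
    now = time.time() if now is None else now
    diff_sec = int(reset_time_unix) - now
    # If the reset time has already passed, there is nothing to wait for.
    return diff_sec + margin if diff_sec > 0 else 0

print(seconds_to_wait('1426045500', now=1426045200))  # 305
print(seconds_to_wait('1426045500', now=1426045600))  # 0
```

Passing `now` explicitly keeps the function deterministic, so it can be tested without actually sleeping.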

5. Twitter REST API data structure

The structure of the data returned by "[GET search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets)" of the Twitter REST APIs is as follows.

Structure of TwitterListResponse

A description of the main elements of the tweet information.

| Item | Description |
|---|---|
| id | Tweet ID. Newer tweets have larger IDs and older tweets have smaller ones. By asking for IDs larger or smaller than a given ID when searching, you can retrieve the tweets after or before it. |
| id_str | String version of id, provided for environments that cannot handle 64-bit integers exactly. |
| user | User information. It has the following elements (only typical ones are listed). |
| user.id | User ID. A numeric ID that you do not normally see. |
| user.name | The user's display name (the longer one). |
| user.screen_name | User name used when addressing the user with @ and so on. |
| user.description | User description. Profile-like text. |
| user.friends_count | Number of accounts the user follows |
| user.followers_count | Number of followers |
| user.statuses_count | Number of tweets (including retweets) |
| user.favourites_count | Number of favorites |
| user.location | Where the user lives |
| user.created_at | Registration date of this user |
| text | Tweet body |
| retweeted_status | Present when the tweet is a retweet, in which case it holds the original tweet; absent for a normal tweet. |
| retweeted | Whether it has been retweeted (True / False) |
| retweet_count | Number of retweets |
| favorited | Whether it has been favorited (True / False) |
| favorite_count | Number of favorites |
| coordinates | Latitude / longitude |
| entities | Additional information shown below |
| entities.user_mentions | Users specified with @ in the body |
| entities.hashtags | Hashtags in the body |
| entities.urls | URLs in the body |
| source | Information about the app / site used to tweet |
| lang | Language |
| created_at | Tweet date and time |
| place | Location information related to the tweet |
| in_reply_to_screen_name | Screen name of the author of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id | ID of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id_str | String version of in_reply_to_status_id |
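
As an illustration of how these fields nest, the sketch below flattens one status into a small dict. The sample tweet is hand-written for this example (it is not real data) and contains only a few of the fields listed above, and `summarize_tweet` is a hypothetical helper.

```python
def summarize_tweet(status):
    """Pick a few commonly used fields out of one status dict."""
    entities = status.get('entities', {})
    return {
        'id': status['id'],
        'screen_name': status['user']['screen_name'],
        'text': status['text'],
        # In API responses, 'retweeted_status' appears only on retweets.
        'is_retweet': 'retweeted_status' in status,
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],
        'mentions': [m['screen_name'] for m in entities.get('user_mentions', [])],
    }

# Hand-written sample, not real data.
sample = {
    'id': 1,
    'text': 'Working at Starbucks #coffee',
    'user': {'screen_name': 'example_user'},
    'entities': {'hashtags': [{'text': 'coffee'}], 'user_mentions': []},
}
print(summarize_tweet(sample))
```

The same function could be applied to each document returned by `tweetdata.find()`, since the statuses are stored in mongoDB unchanged.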

Metadata structure

A description of the metadata returned when searching with 'https://api.twitter.com/1.1/search/tweets.json'.

| Item | Description |
|---|---|
| query | Search word |
| count | Number of tweets retrieved in a single search |
| completed_in | Time taken to complete the search, in seconds |
| max_id | Newest ID among the retrieved tweets |
| max_id_str | String version of max_id? (Both appear to be strings, though...) |
| since_id | Oldest ID among the retrieved tweets |
| since_id_str | String version of since_id? (Both appear to be strings, though...) |
| refresh_url | URL to use when you want newer tweets for the same search word |
| next_results | URL to use when you want older tweets for the same search word |
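
`next_results` is a query string, so `max_id` can also be extracted with the standard library's URL parser rather than a regular expression. A minimal sketch, assuming a value shaped like the one the API returns; the sample string and the function name `extract_max_id` are made up for illustration.

```python
from urllib.parse import parse_qs

def extract_max_id(next_results):
    """Return max_id from a 'next_results' query string, or None if absent."""
    # The value starts with '?', which parse_qs would otherwise keep
    # as part of the first parameter name.
    params = parse_qs(next_results.lstrip('?'))
    values = params.get('max_id')
    return int(values[0]) if values else None

sample = '?max_id=575000000000000000&q=Starbucks&count=100&include_entities=1'
print(extract_max_id(sample))          # 575000000000000000
print(extract_max_id('?q=Starbucks'))  # None
```

Unlike the pattern `.*max_id=([0-9]*)\&.*`, this does not depend on `max_id` being followed by another parameter.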

Summary of the data obtained this time

Total number of acquisitions
Acquisition data period
From 2015-03-11 04:43:42 to 2015-03-22 00:01:12
Average interval between tweets
4.101 sec/tweet
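
Figures like these can be derived from the stored `created_datetime` values. The sketch below works on a plain list of datetimes so that it runs without a database. The function name `summarize_period` is an assumption for illustration, and the sample list holds only the two end points of the period quoted above, not the real data.

```python
from datetime import datetime

def summarize_period(datetimes):
    """Return the first/last timestamp, the count, and both rate figures."""
    first, last = min(datetimes), max(datetimes)
    seconds = (last - first).total_seconds()
    count = len(datetimes)
    return {
        'first': first,
        'last': last,
        'count': count,
        'seconds': seconds,
        # Undefined when all tweets share one timestamp.
        'tweets_per_sec': count / seconds if seconds else None,
        'sec_per_tweet': seconds / count,
    }

# Only the two end points of the acquisition period; real data has many more rows.
period = [datetime(2015, 3, 11, 4, 43, 42), datetime(2015, 3, 22, 0, 1, 12)]
print(summarize_period(period)['seconds'])  # 933450.0
```

Reporting both `tweets_per_sec` and `sec_per_tweet` makes it harder to mix up the two units when quoting a rate.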

Current issues

With GET search/tweets, once you have retrieved somewhere in the upper 100,000s of tweets, you can no longer get any earlier ones: the 'statuses' element comes back empty, and the 'next_results' element is not returned at all. I have not solved this yet, but I did get about 200,000 tweets, so I will start analyzing this data next time. **Update:** According to a comment I received, it seems that only about one week of tweets can be retrieved.

Continue to Part 2.

The full code described on this page is here

Referenced page

Access the Twitter API with Python
Twitter official REST API document

