[PYTHON] Experiment to collect tweets for a long period of time (aggregation & content confirmation)

Progress so far.

-[x] Program ready
-[x] Fixed bugs in the library
-[x] Start execution!
-[x] Was there any problem with the first peak?
-[ ] Aggregate the counts for now
-[ ] Take a peek at the contents from the outside

Try to aggregate by hour

About a week has passed, and for now everything seems to be running safely. What I want next is the current number of stored tweets, or rather the total for each hour.

>>> db.xxx.count()

Running this shows that the count itself keeps increasing, but that alone says little about whether anything is being collected correctly ... Somehow I need to turn this into a tweet count per hour.

Aggregate with MapReduce

MongoDB seems to offer several ways to aggregate, but for now I will try MapReduce, following the "thin book". ...... This may well be my first time touching JavaScript, so I'm googling as I go.

check.js


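db.xxxxx.mapReduce(
    // map: emit one count per tweet, keyed by the hour it was posted (JST)
    function() {
        // created_at is in UTC; add 9 hours (32400000 ms) to shift it to JST.
        // Note: getFullYear()/getHours() etc. read the shell's local time zone.
        var d = new Date(Date.parse(this.created_at) + 32400000);
        var p = d.getFullYear() + "/" + ("0" + (d.getMonth() + 1)).slice(-2) + "/" + ("0" + d.getDate()).slice(-2) + " " + ("0" + d.getHours()).slice(-2) + ":00:00";
        emit(
            p,
            1
        );
    },

    // reduce: add up the counts emitted for the same hour
    function(key, values) {
        return Array.sum(values);
    },

    {
        query: {},           // no filter: aggregate every document
        out: "TweetsCount"   // collection that receives the result
    }
)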

Like this. (Is this really the only way to format a date and time in JavaScript?)

# mongo localhost/TwitterDB check.js

Running it like this stores the result in a separate collection, "TweetsCount".

(Excerpt only at the peak occurrence point)
{ "_id" : "2016/10/28 13:00:00", "value" : 67 }
{ "_id" : "2016/10/28 14:00:00", "value" : 47 }
{ "_id" : "2016/10/28 15:00:00", "value" : 102 }
{ "_id" : "2016/10/28 16:00:00", "value" : 103 }
{ "_id" : "2016/10/28 17:00:00", "value" : 2850 }
{ "_id" : "2016/10/28 18:00:00", "value" : 5317 }
{ "_id" : "2016/10/28 19:00:00", "value" : 4324 }

It seems the counts can be obtained for now. The numbers feel small overall (the peak looks about one digit lower than I expected ... maybe because of the search key). Running the aggregation at this scale does not seem to be a significant load. That may change once 100 days' worth of data has accumulated ...
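For comparison, the same hourly bucketing can be sketched in plain Python. This is only an illustration of what the map and reduce steps compute, not the script that was actually run: `hour_key` and `count_by_hour` are hypothetical helper names, and the sample documents are made up and contain only the `created_at` field.

```python
from collections import Counter
from datetime import datetime, timedelta

# Twitter's created_at format, e.g. "Fri Oct 28 08:15:00 +0000 2016" (UTC)
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"
JST_OFFSET = timedelta(hours=9)  # same shift as the 32400000 ms in check.js


def hour_key(created_at):
    """Return the JST hour bucket for a created_at string (hypothetical helper)."""
    jst = datetime.strptime(created_at, TWITTER_TIME_FORMAT) + JST_OFFSET
    return jst.strftime("%Y/%m/%d %H:00:00")


def count_by_hour(docs):
    """Count documents per hour bucket (map and reduce in one step)."""
    return Counter(hour_key(doc["created_at"]) for doc in docs)


if __name__ == "__main__":
    # Made-up sample documents, for illustration only
    sample = [
        {"created_at": "Fri Oct 28 08:15:00 +0000 2016"},
        {"created_at": "Fri Oct 28 08:59:59 +0000 2016"},
        {"created_at": "Fri Oct 28 09:00:00 +0000 2016"},
    ]
    for key, value in sorted(count_by_hour(sample).items()):
        print(key, value)
```

Because the 9-hour shift is applied to a naive datetime, the result here does not depend on the time zone of the machine running it.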

Let's check the contents of the DB

The acquisition program currently running prints nothing to standard output, so there is no way to check what is being stored. You can look directly at MongoDB, but that does not show the acquisition status in real time, and the raw JSON contains so many unneeded fields that it is frankly hard to read.

So I want a script that can show the content of tweets in real time (or with a slight delay).

There seem to be various ways to do this, but a quick, no-frills version that shows only what I want to see looks like this (about 2 hours to implement).

InsertChecker.py


#!/usr/bin/env python
# -*- coding:utf-8 -*-

from pymongo import MongoClient
from pymongo.errors import PyMongoError
import json

import sys
import time

#Variables related to MongoDB connection
HOST = 'mongo'          #host
PORT = 27017            #port(Default:27017)
DB_NAME = 'TwitterDB'   #DB name
COL_NAME= 'xxxxxx'      #Collection name


# ------Main processing from here------

try:
    Client = MongoClient(HOST, PORT)
    db = Client[DB_NAME]
    Collection = db[COL_NAME]
    print('DB ready')

    #Get the very first count
    #(count_documents() is used because Collection.count() was removed in PyMongo 4)
    twCnt = Collection.count_documents({})

except PyMongoError as exc:
    #Connection error: there is nothing to watch, so stop here
    print('DB connection error')
    sys.exit(1)

while True:         #infinite loop
    try:
        #Count once per pass so that tweets stored while printing are not skipped
        #(count_documents() replaces Collection.count(), removed in PyMongo 4)
        newCnt = Collection.count_documents({})

        if newCnt > twCnt:
            #Print only the documents added since the previous pass
            for doc in Collection.find(skip=twCnt, limit=newCnt - twCnt):

                if "retweeted_status" in doc:
                    if "extended_tweet" in doc["retweeted_status"]:
                        #RT long sentence
                        print("RT @" + doc["retweeted_status"]["user"]["screen_name"] + ": " + doc["retweeted_status"]["extended_tweet"]["full_text"])
                    else:
                        #RT short sentence
                        print("RT @" + doc["retweeted_status"]["user"]["screen_name"] + ": " + doc["retweeted_status"]["text"])
                else:
                    if "extended_tweet" in doc:
                        #Long sentence
                        print(doc["extended_tweet"]["full_text"])
                    else:
                        #Short sentence
                        print(doc["text"])

                print(u"{name}({screen}) {created} via {src}".format(name=doc["user"]["name"], screen=doc["user"]["screen_name"],
                    created=doc["created_at"], src=doc["source"]))
                print(u"--------------------------------------------------")

        twCnt = newCnt

        #Check interval is about every 3 seconds(Is 10 seconds acceptable depending on the flow velocity?)
        time.sleep(3)

    except KeyboardInterrupt:
        # CTRL+C
        print('Terminated with CTRL+C.')
        ExitCode = 0
        break

(The print() calls in the middle are written in mixed styles because they were copied and pasted from different reference materials.)

What it does is simple: it checks the number of documents in the collection at regular intervals and, if the number has grown, formats and prints the new ones.

Because of the specification change that extended the tweet length, the if statements branch into the following four patterns.
・ RT of an extended-specification (long) tweet
・ RT of a standard (short) tweet
・ Normal tweet in the extended specification (long sentence)
・ Normal tweet in the standard specification (short sentence)

After that, it prints the user name and the time. Even something this rough and ready gives a display like "tail -f"! Wonderful!
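The four-way branch can also be pulled out into a small function, which makes it easy to try against hand-written documents. This is a minimal sketch rather than the code used in InsertChecker.py: `display_text` is a hypothetical name, and it assumes the documents carry the same fields the script reads (`retweeted_status`, `extended_tweet`, `full_text`, `text`, `user.screen_name`).

```python
def display_text(doc):
    """Return the text to print for one tweet document (hypothetical helper)."""
    rt = doc.get("retweeted_status")
    if rt is not None:
        # Retweet: show the original tweet, using the long form if present
        if "extended_tweet" in rt:
            body = rt["extended_tweet"]["full_text"]
        else:
            body = rt["text"]
        return "RT @" + rt["user"]["screen_name"] + ": " + body

    # Normal tweet: long form if present, otherwise the standard text
    if "extended_tweet" in doc:
        return doc["extended_tweet"]["full_text"]
    return doc["text"]


if __name__ == "__main__":
    # Made-up documents covering two of the four patterns
    print(display_text({"text": "short tweet"}))
    print(display_text({
        "text": "RT @someone: cut off...",
        "retweeted_status": {
            "user": {"screen_name": "someone"},
            "text": "original short tweet",
        },
    }))
```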

Minor remodeling

This already does what I wanted, but it is human nature to want more convenience.

Display the number of Favs and RTs

How many times has a retweeted tweet actually been retweeted? I would like to see that.

print("\n<RT: " + str(doc["retweeted_status"]["retweet_count"]) + " cnt, Fav: " + str(doc["retweeted_status"]["favorite_count"]) + "cnt >")

Insert something like this into the branches that handle RTs (matching the indentation). You can then watch the numbers count up each time the tweet is retweeted.
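As a rough sketch of where that line fits, the count display can be written as a helper that returns nothing for normal tweets. `rt_counts_line` is a hypothetical name, and the sample values are made up. The field names `retweet_count` and `favorite_count` are the ones used in the line above.

```python
def rt_counts_line(doc):
    """Return the RT/Fav count line for a retweet, or "" for a normal tweet."""
    rt = doc.get("retweeted_status")
    if rt is None:
        return ""
    return ("\n<RT: " + str(rt["retweet_count"])
            + " cnt, Fav: " + str(rt["favorite_count"]) + "cnt >")


if __name__ == "__main__":
    # Made-up retweet document with only the fields used here
    sample = {"retweeted_status": {"retweet_count": 12, "favorite_count": 3}}
    print(rt_counts_line(sample))
```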

Make the time display easier to read

The time in the JSON sent from Twitter is in UTC and in a format that is hard to read for people in Japan, so I changed it.

d = datetime.strptime(doc["created_at"],'%a %b %d %H:%M:%S +0000 %Y') + timedelta(hours=9)

After converting to a datetime like this, change the place where the time is passed (created=...) to:

created=d.strftime("%Y/%m/%d %H:%M:%S")

With that replacement it is done. Don't forget "from datetime import datetime, timedelta".
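A variation on the same conversion, assuming Python 3, is to let strptime read the "+0000" offset with %z and convert with a fixed JST time zone instead of adding the 9 hours by hand. This is only a sketch of an alternative, not what the script above does. `to_jst_string` and `JST` are names introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC+9 offset; Japan has no daylight saving time
JST = timezone(timedelta(hours=9), "JST")


def to_jst_string(created_at):
    """Convert Twitter's created_at string to 'YYYY/MM/DD HH:MM:SS' in JST."""
    # %z parses the "+0000" part, so the result is a time-zone-aware datetime
    utc_time = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return utc_time.astimezone(JST).strftime("%Y/%m/%d %H:%M:%S")


if __name__ == "__main__":
    print(to_jst_string("Fri Oct 28 08:15:30 +0000 2016"))
```

In the format() call this would be passed as created=to_jst_string(doc["created_at"]).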

Remove HTML tags from client name display

The script can show which client a tweet was sent from, but this field includes HTML tags, which makes it hard to read. If you enable regular expressions with "import re" and do something like this where doc["source"] is specified, the tags disappear nicely.

src=re.sub("<.+?>", "", doc["source"])

(A source made up of nothing but tags would come out empty, but it has worked fine so far.)
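The effect of that substitution can be checked on its own with a short sketch. The `source` value below is a made-up sample in the shape Twitter typically sends, and `client_name` is a hypothetical helper name.

```python
import re

# Same pattern as above, compiled once: "<", then as few characters as possible, then ">"
TAG_PATTERN = re.compile(r"<.+?>")


def client_name(source):
    """Strip HTML tags from the source field, leaving only the client name."""
    return TAG_PATTERN.sub("", source)


if __name__ == "__main__":
    # Made-up sample value for illustration
    sample = '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'
    print(client_name(sample))
```

The non-greedy ".+?" matters here: a greedy "<.+>" would match from the first "<" to the last ">" and remove the client name as well.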

What to do next (no plan)

For now, this makes the following possible.
・ Number of tweets per hour
・ Display of acquired tweets in real time

After all, it is reassuring to see the tweets flowing in properly.

The next goals are probably something like these.
・ Extracting the tweets that are currently retweeted the most
・ Twitter client usage statistics
・ Extracting store names (so to speak) from tweets and counting the mentions

I will build these as time permits.

(To be continued, persistently.)
