[PYTHON] Extract sudden buzzwords with the Twitter streaming API

The streaming API cannot filter Japanese text, and it only delivers about 1% of all tweets, so I suspect many people find it hard to put to good use. I wondered whether there was still something it could be used for, gave it some thought, and built the tool described here: it extracts words that suddenly start trending.

Preparation

All you need is word-count data aggregated every 10 minutes from Twitter's streaming API and stored in SQLite. The DB fields are ymd, hour, minute, word_dict, and tweet_cnt: the date ('2014-06-01'), the hour ('22'), the minute ('10', in 10-minute steps from '00' to '50'), a pickled dictionary of word counts, and the number of tweets in that 10-minute bin. In hindsight, I made a mistake in the DB design: there was little point in splitting the date, hour, and minute into separate fields. For how to use the Twitter API, please refer to the page mentioned in "Dividing twitter data into SPAM and HAM".
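For reference, here is a minimal sketch of how such a table could be created and a single row inserted. This is only my illustration of the schema above, not the original collection script, and the example counts are made up.

python

# coding: utf-8
# Minimal sketch (not the original collector): create the word_count table
# and store one 10-minute bin. `word_counts` stands in for the counts you
# would aggregate from tokenized streamed tweets.
import sqlite3 as sqlite
import cPickle as pickle

con = sqlite.connect('stream_word_cnt.db')
con.execute("""create table if not exists word_count
               (ymd text, hour text, minute text, word_dict text, tweet_cnt integer)""")

word_counts = {u'word1': 3, u'word2': 1}  # made-up 10-minute aggregate
con.execute("insert into word_count values (?,?,?,?,?)",
            ('2014-06-01', '22', '10', pickle.dumps(word_counts), 128))
con.commit()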

What I did

In short: the 10-minute word counts are merged into larger windows, the top N words per window are written out, and each window's vocabulary is compared with the preceding windows to extract newly appearing words and a cosine-based dissimilarity score.

Reference code

python


# coding: utf-8

import sqlite3 as sqlite
from collections import defaultdict
import cPickle as pickle
import copy
import matplotlib.pyplot as plt
import re, unicodedata
from normalizewords import Replace  # normalizes "!!!!!"-style runs to "!", letter case, and full-width characters
from math import floor, sqrt

# Add words to ignore here (e.g. emoticons), according to your purpose.
stopword = ()

class DataCreater:
    date = []                # start timestamp of each window
    tweet_cnt = []           # tweets per window
    ex_words_store = []      # top-word lists from earlier windows
    now_words_store = []     # top-word lists from recent windows
    new_break_words = set()  # words newly appearing in the current window
    bin_cnt = 0
    p = re.compile("[!-/:-@[-`{-~]")  # ASCII symbols; words containing them are dropped
    
    def __init__(self, dbname, word_limit, ofname, oftcntname, ofnewword, ofcos):
        # open the DB and the four output files: top-N words, tweet counts,
        # newly appearing words, and cosine dissimilarity
        self.con = sqlite.connect(dbname)
        self.word_limit = word_limit
        self.ofword = open(ofname, 'w',1000)
        self.oftcnt = open(oftcntname, 'w',1000)
        self.ofnewwords = open(ofnewword,'w',1000)
        self.ofcos = open(ofcos,'w',1000)
    
    def __del__(self):
        self.ofword.close()
        self.oftcnt.close()
        self.ofnewwords.close()
        self.ofcos.close()

    def re_aggregate(self, span, con_bin=10):
        # Merge `span` 10-minute rows into one window; once `con_bin` windows
        # have accumulated, compare each new window against the stored history.
        response = self.get_data_yield()
        itr, tweet_cnt, word_cnt = self.initialize_cnt()
        s_date = None
        while 1:
            res = response.fetchone()
            if res is None: break

            #print res[0]+', '+res[1]+', '+res[2]+', '+str(res[4])
            tweet_cnt += int(res[4])
            word_cnt = self.aggregate_word_cnt(pickle.loads(str(res[3])), word_cnt) 
            
            if itr==0:
                s_date = res[0]+' '+res[1].zfill(2)+':'+res[2]
                
            if itr == span-1:  # one full window (span * 10 minutes) aggregated
                date = res[0]+' '+res[1].zfill(2)+':'+res[2]
                sorted_word_list = self.sort_word_dict(word_cnt)
                self.output_topN_word(s_date, sorted_word_list)
                self.output_tweet_cnt(s_date, tweet_cnt)
                self.date.append(s_date)
                self.tweet_cnt.append(tweet_cnt)
                s_date = date                
                    
                self.bin_cnt += 1
                self.store_now_dict(sorted_word_list[:self.word_limit])
                print len(self.now_words_store)
                if self.bin_cnt >= con_bin:  # enough history to compare against
                    if len(self.ex_words_store)!=0:
                        self.store_new_words(sorted_word_list[:self.word_limit])
                        cos_sim = self.calculate_cos_similarity(sorted_word_list[:self.word_limit])
                        self.output_new_words(s_date)
                        self.output_cos_sim(s_date,cos_sim)
                        self.ex_words_store = copy.deepcopy( self.now_words_store )
                        self.now_words_store.pop(0)
                        self.new_break_words = set()
                    else:
                        self.ex_words_store = copy.deepcopy( self.now_words_store )
                        self.now_words_store.pop(0)
                    
                itr, tweet_cnt, word_cnt = self.initialize_cnt()
            else:
                itr += 1
                
    def get_data_yield(self, start_date=None, end_date=None):
        # start_date/end_date are currently unused; every row is fetched
        return self.con.execute("select ymd, hour, minute, word_dict, tweet_cnt from word_count")

    def initialize_cnt(self):
        return 0, 0, defaultdict(int)

    def aggregate_word_cnt(self, new_word_dic, orig_word_dic):
        # merge counts into the running dict, skipping stopwords and words
        # that contain ASCII symbols after NFKC normalization
        for k,v in new_word_dic.items():
            if k not in stopword:
                m = self.p.search(unicodedata.normalize('NFKC', unicode(k)))
                if m is None:
                    orig_word_dic[k] += v
        return orig_word_dic
    
    def sort_word_dict(self, word_dict):
        # sort (word, count) pairs by count, descending
        return sorted(word_dict.items(), key=lambda kv: kv[1], reverse=True)
            
    def calculate_cos_similarity(self, sorted_wordlist):
        # Actually returns a *dissimilarity*:
        # 1 - |overlap| / sqrt(|previous words| * |current top words|)
        ex_words = []
        now_word_list = []
        for w_list in self.ex_words_store:
            ex_words += w_list
        for k, _ in sorted_wordlist:
            now_word_list.append(k)
        numerator = sum([1 for c in now_word_list if c in ex_words])
        denominator = sqrt(len(ex_words) * len(now_word_list))
        return 1 - float(numerator) / denominator if denominator != 0 else 1
            
    def output_topN_word(self, date, sorted_word_list):
        # write "date<TAB>word:count<TAB>word:count..." for the top N words
        s_list = [st[0] + ':' + str(st[1]) for st in sorted_word_list[:self.word_limit]]
        self.ofword.write(date + '\t' + '\t'.join(s_list) + '\n')

    def output_tweet_cnt(self, date, cnt):
        s = date+'\t'+str(cnt)
        self.oftcnt.write(s+'\n')

    def output_new_words(self,date):
        s = '\t'.join(list(self.new_break_words))
        print date, s
        self.ofnewwords.write(date+'\t'+s+'\n')
        
    def output_cos_sim(self, date, cos_sim):
        self.ofcos.write(date+'\t'+str(cos_sim)+'\n')
        
    def store_now_dict(self, sorted_wordlist):
        # remember this window's top words
        words = [k for k, _ in sorted_wordlist]
        self.now_words_store.append(words)

    def store_new_words(self, sorted_wordlist):
        # collect words in the current top list that never appeared in
        # the previous windows' top lists
        ex_words = []
        for w_list in self.ex_words_store:
            ex_words += w_list
        for k, _ in sorted_wordlist:
            if k not in ex_words:
                self.new_break_words.add(k)

if __name__=='__main__':
    dbname = 'stream_word_cnt.db'
    ofname = 'topN_words'
    oftcnt = 'tweet_count'
    ofnewword = 'new_word'
    ofcos = 'cos_similarity'
    word_top = 100  # how many top words to keep per window
    span = 6        # number of 10-minute bins per window (6 -> 1 hour)

    dc = DataCreater(dbname, word_top, ofname, oftcnt, ofnewword, ofcos)
    dc.re_aggregate(span, con_bin=24)  # con_bin: number of windows compared against (24 -> one day)

    # plot tweet counts per window with roughly 7 date ticks on the x axis
    tick = max(1, int(floor(len(dc.tweet_cnt)/7)))

    plt.plot(range(len(dc.tweet_cnt)), dc.tweet_cnt)
    plt.xticks(range(0, len(dc.tweet_cnt), tick), dc.date[::tick], rotation=40)
    plt.show()
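To make the dissimilarity measure concrete, here is a small standalone check of the same formula, with made-up word lists:

python

# Toy check of the dissimilarity used above (word lists are made up):
# 1 - |overlap| / sqrt(|previous words| * |current top words|)
from math import sqrt

ex_words = ['rain', 'thunder', 'lunch', 'work']    # words from previous windows
now_words = ['rain', 'goal', 'japan', 'worldcup']  # current top words

numerator = sum([1 for w in now_words if w in ex_words])
denominator = sqrt(len(ex_words) * len(now_words))
dissim = 1 - float(numerator) / denominator if denominator != 0 else 1
print dissim  # -> 0.75; values close to 1 mean the vocabulary changed sharply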

Experimental result

I experimented with data I had stored from June 10th to 25th. Words related to the World Cup were extracted around 5 o'clock in the morning, words related to the earthquake appeared on the 16th, and words such as "heavy rain" and "thunder" were extracted on days of heavy rain. The method is very simple, but it produced good results. The dissimilarity computed from the cosine similarity also spiked strongly during the World Cup (especially while the match against Japan was being broadcast) and at the time of the earthquake.

Summary

Looking at the top 10 words, most of them turned out to be words from spam tweets. This suggests the output can also be used to harvest spam words, giving a simple spam filter that needs no maintenance (sketched below). I get the feeling that streaming API data can be put to use in ways like this.
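As a rough sketch of that idea (my own illustration, not part of the script above): words that stay in the top N across nearly every window can be harvested from the topN_words output file as a stop list. The 90% threshold here is arbitrary.

python

# Rough sketch: collect words that appear in the top N of almost every
# window in the `topN_words` file written by the script above. Words that
# are always at the top are likely spam or boilerplate.
from collections import defaultdict

appearances = defaultdict(int)
n_windows = 0
for line in open('topN_words'):
    n_windows += 1
    for pair in line.rstrip('\n').split('\t')[1:]:  # skip the date column
        word = pair.rsplit(':', 1)[0]               # entries are "word:count"
        appearances[word] += 1

# arbitrary threshold: present in at least 90% of windows
spam_words = [w for w, c in appearances.items() if c >= 0.9 * n_windows]
print spam_words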

If you notice anything strange, I would appreciate a comment.
