Collect tweets about "Corona" with python and automatically detect words that became a hot topic due to the influence of "Corona"

This is an explanation of how to collect Twitter data with python and how to detect bursts for time-series text data.

Technically, it's similar to the previous article below.

Past articles: I collected tweets about "Princess Kuppa" with python and tried burst detection https://qiita.com/pocket_kyoto/items/de4b512b8212e53bbba3

To confirm the versatility of the method used previously, I practiced collecting Twitter data and detecting bursts of words that co-occur with "corona," using "corona" as the search keyword, as of March 10, 2020.

Collect tweets about "Corona"

The collection method is basically the same as the past articles.

First, prepare for tweet collection, such as loading the library.

#Login key information for collecting Twitter data
KEYS = { #List the key you got with your account
        'consumer_key':'*********************',
        'consumer_secret':'*********************',
        'access_token':'*********************',
        'access_secret':'*********************',
       }

#Collection of Twitter data (preparation for collection)
import json
from requests_oauthlib import OAuth1Session
twitter = OAuth1Session(KEYS['consumer_key'],KEYS['consumer_secret'],KEYS['access_token'],KEYS['access_secret'])
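For reference, the third-party libraries used in this article (requests_oauthlib, janome, wordcloud, matplotlib) can typically be installed with pip, for example: pip install requests-oauthlib janome wordcloud matplotlib (package versions are not specified in the original).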

For information on how to obtain the login keys for collecting Twitter data, the site in Reference [1] explains it clearly.

The function for collecting tweets is defined as follows. Since the tweet location is not used this time, the location arguments default to None. Also, since only up to 100 tweets can be retrieved per request, you need to make repeated requests in a for loop; it was cleaner to manage that loop outside of the data acquisition function, so I implemented it that way. This part follows the style of Reference [2].

#Twitter data acquisition function
def getTwitterData(key_word, latitude=None, longitude=None, radius=None, mid=-1):
    
    url = "https://api.twitter.com/1.1/search/tweets.json"
    params ={'q': key_word, 'count':'100', 'result_type':'recent'} #Acquisition parameters
    if latitude is not None: #Only latitude is checked; longitude and radius are assumed to be given together
        params['geocode'] = '%s,%s,%skm' % (latitude, longitude, radius)
    
    params['max_id'] = mid #Get only tweets with IDs equal to or older than mid
    req = twitter.get(url, params = params)

    if req.status_code == 200: #When normal communication is possible

        tweets = json.loads(req.text)['statuses'] #Get tweet information from response

        #Find the smallest (oldest) tweet ID so the next request can page further back (there is probably a cleaner way to write this; despite their names, the variables hold tweet IDs, not user IDs)
        user_ids = []
        for tweet in tweets:
            user_ids.append(int(tweet['id']))
        if len(user_ids) > 0:
            min_user_id = min(user_ids)
        else:
            min_user_id = -1
        
        #Meta information
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0  
            
        return {'tweets':tweets, 'min_user_id':min_user_id, 'limit':limit, 'reset':reset}

    else: #When normal communication is not possible
        print("Failed: %d" % req.status_code)
        return {}

I created a control function (getTwitterDataRepeat) that calls the above function repeatedly. To avoid hitting the request limit, it automatically waits when it is about to reach the limit.

#Continuous acquisition of Twitter data
import datetime, time
def getTwitterDataRepeat(key_word, latitude=None, longitude=None, radius=None, mid=-1, repeat=10):
    
    tweets = []
    
    for i in range(repeat):

        res = getTwitterData(key_word, latitude, longitude, radius, mid)
        
        if 'tweets' not in res: #Exit the loop if an error occurs
            break
        else:
            sub_tweets = res['tweets']
            for tweet in sub_tweets:
                tweets.append(tweet)
            
        if int(res['limit']) == 0:    #Take a break if you reach the limit

            #Calculate the waiting time and resume 5 seconds after the limit resets
            now_unix_time = time.mktime(datetime.datetime.now().timetuple())  #Get the current time
            diff_sec = int(res['reset']) - now_unix_time
            print ("sleep %d sec." % (diff_sec+5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        
        mid = res['min_user_id'] - 1
        
    print("Number of tweets acquired:%s" % len(tweets))
    return tweets

With this implementation, tweets can be collected automatically without worrying about the request limit. After that, I wanted to collect tweets separately for each time slot, so I ran the following script.

#Function borrowed from Reference [3]
import time, calendar
def YmdHMS(created_at):
    time_utc = time.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
    unix_time = calendar.timegm(time_utc)
    time_local = time.localtime(unix_time)  #Convert to local time (corrected on 2018/9/24)
    return time.strftime("%Y/%m/%d %H:%M:%S", time_local)
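
#Example (input value assumed here for illustration): converting Twitter's created_at format to local time
#The result depends on the machine's timezone; in JST the line below prints 2020/03/10 12:04:05
print(YmdHMS('Tue Mar 10 03:04:05 +0000 2020'))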

#Get tweets about Corona every 6 hours for a week
tweet_corona = {}
mid = -1

for t in range(4*7):
    tweets = getTwitterDataRepeat("corona", mid=mid, repeat=10)    
    old_tweet = tweets[-1]  #The oldest tweet we've collected

    key = YmdHMS(old_tweet["created_at"])  #YmdHMS function
    tweet_corona[key] = tweets  #Save the time of the oldest tweet as a key

    mid = old_tweet["id"] - 15099494400000*6 #Collect about 6 hours back

I wanted to go back about 6 hours per iteration, so I subtract 15,099,494,400,000 * 6 from the ID of the oldest tweet. The value 15,099,494,400,000 comes from the specification of Twitter's tweet IDs: a tweet ID packs a millisecond timestamp, the number of the machine that issued the ID, and a sequence number into 64 bits (Reference [4]).
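
As a sanity check, this constant can be reproduced from that layout, since the millisecond timestamp occupies the upper bits of the 64-bit ID (shifted left by 22):

#1 hour in tweet-ID terms: milliseconds per hour shifted into the timestamp bits
MS_PER_HOUR = 60 * 60 * 1000
ID_OFFSET_PER_HOUR = MS_PER_HOUR << 22

print(ID_OFFSET_PER_HOUR)      # 15099494400000
print(ID_OFFSET_PER_HOUR * 6)  # the offset used above to go back about 6 hours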

Compare tweets about "Corona" in chronological order

So far, we have collected tweets containing "Corona" in chronological order. First, to get a feel for the data, I would like to visualize the frequency of word occurrence over time.

I defined the following function, which morphologically analyzes the tweets with janome and counts the frequency of each word.

#Morphological analysis of sentences and conversion to Bag of Words
from janome.tokenizer import Tokenizer
import collections
import re

def CountWord(tweets):
    tweet_list = [tweet["text"] for tweet in tweets]
    all_tweet = "\n".join(tweet_list)

    t = Tokenizer()

    #Convert to base form, keep nouns only, drop single characters, and limit to sequences of kanji, hiragana, and katakana
    c = collections.Counter(token.base_form for token in t.tokenize(all_tweet) 
                            if token.part_of_speech.startswith('名詞') and len(token.base_form) > 1 
                            and token.base_form.isalpha() and not re.match('^[a-zA-Z]+$', token.base_form)) 

    freq_dict = {}
    mc = c.most_common()
    for elem in mc:
        freq_dict[elem[0]] = elem[1]

    return freq_dict

WordCloud was used as the visualization method. I implemented it as follows.

#WordCloud visualization functions
def color_func(word, font_size, position, orientation, random_state, font_path):
    return 'white'

from wordcloud import WordCloud
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\meiryo.ttc', size=50) #Japanese support

def DrawWordCloud(word_freq_dict, fig_title):

    #Change the default settings and set the colormap to "rainbow"
    wordcloud = WordCloud(background_color='white', min_font_size=15, font_path='C:\WINDOWS\Fonts\meiryo.ttc',
                          max_font_size=200, width=1000, height=500, prefer_horizontal=1.0, relative_scaling=0.0, colormap="rainbow")    
    wordcloud.generate_from_frequencies(word_freq_dict)
    plt.figure(figsize=[20,20])
    plt.title(fig_title, fontproperties=fp)
    plt.imshow(wordcloud,interpolation='bilinear')
    plt.axis("off")

With these functions, I visualized the frequency of word occurrence for each time slot.
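
The loop that produced the figures below is omitted here; a minimal sketch of it is as follows, assuming one Bag of Words and one WordCloud per time slot, and also collecting each slot's frequency dictionary into the list datetime_freq_dicts used in the burst-detection section further down.

#Sketch: build a frequency dictionary and a WordCloud for each time slot
datetime_freq_dicts = []

for key in sorted(tweet_corona.keys()):
    freq_dict = CountWord(tweet_corona[key])  #Word frequencies for this time slot
    datetime_freq_dicts.append(freq_dict)
    DrawWordCloud(freq_dict, key)             #Use the time slot as the figure title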

output: (WordCloud images for each time slot; omitted)


Words that naturally co-occur with "corona," such as "new type," "virus," and "infection," dominate the results. From this visualization it is difficult to see which words became hot topics because of "corona," so I will try to detect them automatically.

Try to automatically detect the words that became hot topics due to the influence of corona

Using the data set collected this time and a method called burst detection, I would like to automatically detect the words that became hot topics due to the influence of "corona". Burst detection is introduced in the book "Machine Learning of Web Data (Machine Learning Professional Series)", but there are few explanatory articles on the net. This time, I will implement and apply a burst detection method with reference to the commentary article by Tohoku University's Inui-Suzuki Laboratory (Reference [5]), which is famous for its natural language processing research.

This time, I detected bursts using an index called Moving Average Convergence Divergence (MACD). As a burst detection method, the method proposed by Kleinberg in 2002 seems to be often used as a baseline, but MACD, proposed by He and Parker in 2010, seems to be simpler and less computationally expensive.

↓ The explanation of MACD below is quoted as-is from the Inui-Suzuki Laboratory article, because it is easy to understand.


[Explanation of MACD]

The MACD at a certain time is

MACD = (exponential moving average of the time-series values over the past f periods) − (exponential moving average of the time-series values over the past s periods)
Signal = (exponential moving average of the MACD values over the past t periods)
Histogram = MACD − Signal

Here, f, s, and t are parameters (f < s), collectively written as MACD(f, s, t). In this experiment, MACD(4, 8, 5), which was also used in the experiments of He and Parker (2010), was adopted. When MACD is used as a technical indicator, the state "Signal < MACD" is read as an uptrend and "MACD < Signal" as a downtrend, and the Histogram is said to indicate the strength of the trend. This time, each 15-minute period was treated as one bin, and the number of times a word appeared on Twitter within that period divided by 15, that is, the appearance speed [times/minute], was used as the observed value for trend analysis by MACD. The exponential moving averages required for the MACD calculation can be computed incrementally, so this trend analysis can be implemented as a streaming algorithm, which we think makes it suitable for trend analysis on big data.


From the above explanation, MACD was implemented as follows.

# Moving Average Convergence Divergence (MACD) calculation
# (simple moving averages are used here in place of exponential moving averages)
class MACDData():
    def __init__(self, f, s, t):
        self.f = f  #Short-term window
        self.s = s  #Long-term window
        self.t = t  #Signal window
        
    def calc_macd(self, freq_list):
        n = len(freq_list)
        self.macd_list = []
        self.signal_list = []
        self.histgram_list = []
        
        for i in range(n):

            if i < self.f:
                #Not enough history yet
                self.macd_list.append(0)
                self.signal_list.append(0)
                self.histgram_list.append(0)
            else:
                #MACD: short-term moving average minus long-term moving average
                f_window = freq_list[i-self.f+1:i+1]
                s_window = freq_list[max(0, i-self.s):i+1]
                macd = sum(f_window)/len(f_window) - sum(s_window)/len(s_window)
                self.macd_list.append(macd)
                #Signal: moving average of the MACD values over the past t periods
                signal_window = self.macd_list[max(0, i-self.t+1):i+1]
                signal = sum(signal_window)/len(signal_window)
                self.signal_list.append(signal)
                #Histogram: MACD minus Signal (strength of the trend)
                histgram = macd - signal
                self.histgram_list.append(histgram)
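
As a quick check with made-up numbers, the Histogram peaks where the appearance rate spikes:

#Toy example (values are made up): a word whose appearance rate jumps mid-period
macd_data = MACDData(4, 8, 5)
macd_data.calc_macd([1, 1, 1, 1, 8, 9, 10, 2, 1, 1])
print(macd_data.histgram_list)  #The maximum of the histogram marks the burst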

Using this program, I would like to automatically detect the words that became hot topics due to the influence of corona between Wednesday, March 4, 2020 and Tuesday, March 10, 2020.

Program that feeds the data into the class above (collapsed in the original article)
#Burst detection for words ranked in the top 100 in each time slot's tweets

top_100_words = []

i = 0

for freq_dict in datetime_freq_dicts:

    for k,v in freq_dict.items():
        top_100_words.append(k)
        i += 1

        if i >= 100:
            i = 0
            break
            
top_100_words = list(set(top_100_words))  #Limited to unique words
print(len(top_100_words))

#Acquisition of MACD calculation result
word_list_dict = {}

for freq_dict in datetime_freq_dicts:
    
    for word in top_100_words:
        if word not in word_list_dict:
            word_list_dict[word] = []
        
        if word in freq_dict:
            word_list_dict[word].append(freq_dict[word])
        else:
            word_list_dict[word].append(0)
            
#Normalization
word_av_list_dict = {}

for k, v in word_list_dict.items():
    word_av_list = [elem/sum(v) for elem in v]
    word_av_list_dict[k] = word_av_list

#Calculation (same parameters as He and Parker (2010))
f = 4
s = 8
t = 5

word_macd_dict = {}

for k, v in word_av_list_dict.items():
    word_macd_data = MACDData(f,s,t)
    word_macd_data.calc_macd(v)
    word_macd_dict[k] = word_macd_data

#Burst detection
word_burst_dict = {}

for k,v in word_macd_dict.items():
    burst = max(v.histgram_list)  #Since Histgram shows the strength of the trend, take the maximum value within the period
    word_burst_dict[k] = burst

The result of inputting the data is as follows.

i = 1
for k, v in sorted(word_burst_dict.items(), key=lambda x: -x[1]):
    print(str(i) + "Rank:" + str(k))
    i += 1

output:
1st place: Kuro
2nd place: Lotte Marines
3rd place: Ground
4th place: Ward office
5th place: Dignity
6th place: brim
7th place: Self-study
8th place: Deliveryman
9th place: Methanol
10th place: Kohoku
11th place: Serum
12th place: Eplus
13th place: Harassment
14th place: Equipment
15th place: Snack
16th place: Sagawa Express
17th place: Libero
18th place: Miyuki
19th place: Goddess
20th place: Psychedelic
21st place: Live
22nd place: Yokohama City University
23rd place: Depression
24th place: whole volume
25th place: Korohara
26th place: Epizootic
27th place: Refund
28th place: Appearance
29th place: Obligation
30th place: Display
: (Omitted) :


"Tsuba", "Kuro", "Lotte Marines", etc. were detected as words that became a hot topic due to the influence of "Corona". The results for the other words were generally convincing.

(image omitted)

Next, I also tried to estimate the time when it became a hot topic.

Visualization program (collapsed in the original article)
#Visualization of results
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\meiryo.ttc', size=10) #Japanese support

x = np.array(sorted(tweet_corona.keys()))
y1 = np.array(word_macd_dict["Lotte Marines"].histgram_list)
y2 = np.array(word_macd_dict["Self-study"].histgram_list)
y3 = np.array(word_macd_dict["Deliveryman"].histgram_list)
y4 = np.array(word_macd_dict["methanol"].histgram_list)
y5 = np.array(word_macd_dict["snack"].histgram_list)
y6 = np.array(word_macd_dict["harassment"].histgram_list)


plt.plot(x, y1, marker="o")
plt.plot(x, y2, marker="+", markersize=10, markeredgewidth=2)
plt.plot(x, y3, marker="s", linewidth=1)
plt.plot(x, y4, marker="o")
plt.plot(x, y5, marker="+", markersize=10, markeredgewidth=2)
plt.plot(x, y6, marker="s", linewidth=1)

plt.xticks(rotation=90)

plt.title("Burst detection result", fontproperties=fp)
plt.xlabel("Date and time", fontproperties=fp)
plt.ylabel("Burst detection result", fontproperties=fp)
plt.ylim([0,0.2])
plt.legend(["Lotte Marines", "Self-study", "Deliveryman", "methanol", "snack", "harassment"], loc="best", prop=fp)

The visualization result is as follows.

image.png

The Yakult Swallows vs. Lotte Marines game played without spectators was held on Saturday, March 7, so the timing seems to have been estimated correctly. As of Tuesday, March 10, "methanol" appears to be one of the hottest words.

(3/18 postscript) Results from 3/11 (Wednesday) to 3/18 (Wednesday)

The results of inputting the data from 3/11 (Wednesday) to 3/18 (Wednesday) are as follows.

i = 1
for k, v in sorted(word_burst_dict.items(), key=lambda x: -x[1]):
    print(str(i) + "Rank:" + str(k))
    i += 1

output:
1st place: Terms
2nd place: Saiyan
3rd place: Majestic Legon
4th place: tough
5th place: Civil
6th place: Earthling
7th place: Juan
8th place: City
9th place: Cannabis
10th place: Paraiso
11th place: Fighting conference
12th place: Ranbu
13th place: Laura Ashley
14th place: Musical
15th place: Impossible
16th place: Estimate
17th place: Honey
18th place: Chasing
19th place: Lemon
20th place: Performance
21st place: Receipt
22nd place: Sword
23rd place: Investigation
24th place: Macron
25th place: Crowdfunding
26th place: Okeya
27th place: Grandmother
28th place: Smile
29th place: Full amount
30th place: Owned
: (Omitted) :


These and other words were detected as having momentarily become hot topics.

The time when it became a hot topic is as follows.

(image omitted)

Summary and future work

This time, I tried burst detection with the theme of "corona". Technically, it reuses the content of the past article, but I think reasonable analysis results were obtained. The past article used "Princess Kuppa" as the theme; this time we were able to confirm that the method itself is highly versatile.

I would like to continue taking on the challenge of Twitter data analysis.

References

[1] [2019] Specific method to register with Twitter API and obtain access key token https://miyastyle.net/twitter-api
[2] Get a large amount of Starbucks Twitter data with python and try data analysis Part 1 https://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
[3] How to handle tweets acquired by Streaming API http://blog.unfindable.net/archives/4302
[4] Scalable numbering and snowflake https://kyrt.in/2014/06/08/snowflake_c.html
[5] Tohoku University Inui-Suzuki Laboratory: Project 311 / Trend Analysis http://www.cl.ecei.tohoku.ac.jp/index.php?Project%20311%2FTrend%20Analysis
[6] Dan He and D. Stott Parker (2010) "Topic Dynamics: An Alternative Model of 'Bursts' in Streams of Topics" https://dollar.biz.uiowa.edu/~street/HeParker10.pdf
