[PYTHON] I analyzed tweets about the new coronavirus posted on Twitter

Overview

On Twitter, the new coronavirus is discussed actively day and night. I wondered whether analyzing these tweets could reveal meaningful tendencies among Twitter users. In this article, I therefore collect tweets about the new coronavirus posted on Twitter and run a simple analysis on them.

Please feel free to point out mistakes or hard-to-read parts, or to offer advice. Thank you.

Data details

The tweets used in this article were posted between January 1, 2020 and April 1, 2020 and contain at least one of "corona", "COVID-19", or "infectious disease". Only Japanese tweets are included, and only tweets with more than 100 RTs are used. The resulting dataset consists of 47,041 tweets.

Each tweet was saved as a dictionary (associative array) with the following structure.

{
'text': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 
'date': datetime.datetime(2020, 1, 1, 1, 0, 1),
'retweets': 123,
'favorites': 456,
'user_id': 7890123,
'hashtags': ['#yyy', '#zzz'],
'url': ['https://aaaaaa.com', 'http://bbb.com']
}
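
In the analyses below, these records are handled as a single pandas DataFrame named tweets. A minimal sketch of the conversion, assuming the records were saved as a list of such dictionaries in a JSON file (the file name tweets.json is hypothetical, and the date field is assumed to be stored as a string):

import json
import pandas as pd

# Load the saved tweet records (a list of dictionaries like the one above).
with open("tweets.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# One row per tweet; the columns correspond to the dictionary keys.
tweets = pd.DataFrame(records)
tweets["date"] = pd.to_datetime(tweets["date"])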

Exploratory Data Analysis (EDA)

From the data used in this article we can extract several quantities, such as text length, posting time, number of RTs, number of likes, presence or absence of hashtags, and presence or absence of URLs. We use these quantities to read off the characteristics of the data.

import os, sys, json, re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from datetime import datetime
import datetime as dt
%matplotlib inline

Number of characters, number of RTs, number of likes, presence / absence of URL, presence / absence of hashtag

Character count
tweet_len = tweets["text"].str.len()
tweets["text_len"] = tweet_len
tweets["text_len"].hist(bins=range(0, 141, 5))
plt.xlabel("len. of text")
plt.ylabel("num. of tweets")
plt.title("Histgram on length of texts in tweets")

(Figure: 文字数.png, histogram of tweet text lengths)

At the time of data acquisition, a small number of tweets with more than 140 characters were found (why?), but they are omitted here for readability of the figure. Many tweets in the dataset contain a large number of characters (≒ information). Since this dataset only uses tweets exceeding 10 RTs, it may be that the RT count tends to grow as the number of characters grows. (This point is verified below.)

Number of RTs, number of likes
tweets["retweets"].hist(bins=range(0,4001,100))
plt.xlabel("num. of RT")
plt.ylabel("num. of tweets")
plt.title("Histgram on the number of RT.")
-------------------------------------------------
tweets["favorites"].hist(bins=range(0,4001,100))
plt.xlabel("num. of favorites")
plt.ylabel("num. of tweets")
plt.title("Histgram on the number of favorites.")

(Figures: RT数.png and いいね数.png, histograms of RT counts and like counts)

Actually there are tweets with more than 100,000 RTs / likes, but the range is restricted here for readability of the figures. Both distributions fall off, but since the number of tweets keeps increasing up to around 300 RTs, one can guess that once a tweet starts being retweeted, its RT count tends to keep growing up to a certain point.

Now let's look at the correlation between the number of RTs and the number of characters.

fig, ax = plt.subplots()
ax.scatter(tweets["text_len"], tweets["retweets"], s=1)
plt.xlim(0, 140)
plt.ylim(0, 5000)
plt.xlabel("len. of texts in tweets")
plt.ylabel("num. of RT")
plt.title("Scatter plot of RT and len. of texts.")

(Figure: 文字数とRT数の相関.png, scatter plot of RT count vs. text length)

Correlation coefficient: 0.022. Looking at the figure, many of the tweets with large RT counts have close to 140 characters, but there appears to be no correlation. So we cannot say that "tweets that get many RTs tend to be long", nor the reverse.
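
The correlation coefficient quoted above can be computed directly with pandas; a minimal sketch:

# Pearson correlation between text length and RT count.
corr = tweets["text_len"].corr(tweets["retweets"])
print(f"Correlation coefficient: {corr:.3f}")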

Presence or absence of hashtag / URL
tweets.loc[tweets["hashtags"].str.len() > 0, "has_hashtag"] = 1
tweets.loc[tweets["hashtags"].str.len() <= 0, "has_hashtag"] = 0
tweets["has_hashtag"].hist()
plt.xlabel("has hashtag (1) or not (0)")
plt.ylabel("num. of tweets")
plt.title("Histgram of whether tweets have hashtag(s) or not.")
----------------------------------------------------------------
tweets.loc[tweets["url"].str.len() > 0, "has_url"] = 1
tweets.loc[tweets["url"].str.len() <= 0, "has_url"] = 0
tweets["has_url"].hist()
plt.xlabel("has URL (1) or not (0)")
plt.ylabel("num. of tweets")
plt.title("Histgram of whether tweets have URL(s) or not.")

(Figures: ハッシュタグの有無.png and URLの有無.png, histograms of hashtag and URL presence)

In this dataset few tweets have hashtags, while many have URLs. Since more than half of the tweets carry a URL, most tweets with 10 or more RTs convey information not only through their text but also through a URL.
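
The shares behind this observation can be read off directly from the indicator columns defined above; a small sketch:

# Fraction of tweets that carry at least one hashtag / at least one URL.
print("share with hashtag:", tweets["has_hashtag"].mean())
print("share with URL:", tweets["has_url"].mean())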

Summary so far

So far the dataset seems to share the characteristics of tweets in general. (Strictly speaking, we would need to build a dataset of general tweets of the same size and compare...) Tweets in this dataset tend to have long bodies and to add information via URLs rather than hashtags. The RT and like counts fall off smoothly, which is presumably also true of tweets in general.

Analysis using time series

In the following, we will look at the changes in various quantities over the 92 days from 1/1 to 4/1.
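
The plots below use a per-day DataFrame df with one row per day and the columns tweets (tweets posted that day), retweets (total RTs of those tweets), and kansen (newly confirmed infections that day). A minimal sketch of how it could be built, assuming the infection counts are available as a separate daily sequence kansen_per_day taken from the reference [^1] (that sequence itself is not shown here):

# Aggregate the tweet-level data into one row per day (1/1 ... 4/1).
daily = tweets.groupby(tweets["date"].dt.date).agg(
    tweets=("text", "size"),       # number of tweets posted that day
    retweets=("retweets", "sum"),  # total RTs of the tweets posted that day
)
daily = daily.reset_index(drop=True)
daily.index = daily.index + 1      # day number 1..92, matching the x-ticks below
# kansen_per_day: hypothetical list of 92 daily confirmed-infection counts
daily["kansen"] = list(kansen_per_day)
df = daily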

Number of tweets per day
sns.set()
fig, ax = plt.subplots(figsize=(16.0, 8.0))
ax.bar(df.index, df["tweets"], color='#348ABD')
ax.plot(df.index, df["kansen"], color="blue")
ax.set_xticks([1,32,61,92])
ax.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax.set_xlabel("date")

(Figure: 日ごとのツイート数2.png, number of tweets per day with newly confirmed infections overlaid)

The horizontal axis is the date. The bars show the number of tweets per day, and the line shows the number of newly confirmed coronavirus infections in Japan [^1]. The two share the same vertical scale.

Several peaks can be seen in the figure. At each peak, the number of tweets rises over a few days rather than a single day, so these peaks are probably not outliers: something that caught users' attention likely happened in those periods.

The number of confirmed infections is also plotted, but it does not seem to correlate much with the number of tweets. ~~I had a hard time~~ This suggests that users react more strongly to the events that follow (political decisions, cancellation of events, and so on) than to the increase in the number of patients with the new coronavirus (the spread of infection) itself.
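
That impression can also be checked numerically; a small sketch:

# Correlation between tweets per day and newly confirmed infections per day.
print(df["tweets"].corr(df["kansen"]))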

Now let's analyze what causes each peak.

(Figure: 日ごとのツイート数_mod.png, number of tweets per day, annotated with likely causes of the peaks)

In this figure, the peaks from the previous figure (plus a few dates I was personally curious about) are annotated with the events that appear to have caused them, based on reading the actual tweets and on announcements from the Ministry of Health, Labour and Welfare.

For the important peaks around 1/28 and 2/26, I read more than 100 tweets from the dataset, but their contents were so varied that I could not confirm what caused the peaks. I plan to check this in detail later, when analyzing frequent words and RT counts. Probably.

Total number of RTs per day
sns.set_style("dark")
fig, ax1 = plt.subplots(figsize=(16.0, 8.0))
ax1.bar(df.index, df["retweets"], color='#348ABD')
ax2 = ax1.twinx()
ax2.plot(df.index, df["kansen"], color="blue")
ax2.set_ylim(0,2500)
ax1.set_xticks([1,32,61,92])
ax1.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax1.set_xlabel("date")
ax1.set_ylabel("num. of retweets")
ax2.set_ylabel("num. of infected people")

(Figure: 日ごとの累計RT数.png, total RTs per day with newly confirmed infections overlaid)

In the figure, the bars and the left vertical axis show the total number of RTs per day, and the line and the right vertical axis show the number of confirmed infections per day. The shape is very similar to that of the number of tweets per day. Again, there seems to be no relation to the number of infections confirmed each day.

Now, let's compare the number of tweets per day with the number of RTs per day.

sns.set_style("dark")
fig, ax1 = plt.subplots(figsize=(16.0, 8.0))
ax1.bar(df.index, df["tweets"], color='#348ABD', alpha=0.7)
ax2 = ax1.twinx()
ax2.plot(df.index, df["retweets"], color="red")
ax2.set_ylim(0,3000000)
ax1.set_xticks([1,32,61,92])
ax1.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax1.set_xlabel("date")
ax1.set_ylabel("num. of tweets")
ax2.set_ylabel("num. of retweets")

(Figure: 日ごとのツイート数_日ごとのRT数_mod.png, number of tweets per day vs. total RTs per day)

In the figure, the bars and the left vertical axis show the number of tweets per day, and the red line and the right vertical axis show the total number of RTs per day. On most days the two move together. Only around 3/25 to 3/28 does the pattern look different: the number of tweets (with 10 RT or more) is high relative to the number of RTs. One possible reason is that there were unusually many coronavirus-related topics during this period, so that users could not keep up with all of them (and therefore did not retweet as much).
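
One way to make this deviation easier to see is to plot the average number of RTs per tweet for each day; a small sketch (if the interpretation above is right, the days around 3/25 to 3/28 should show up as a dip):

# Average RTs per tweet for each day.
ratio = df["retweets"] / df["tweets"]
fig, ax = plt.subplots(figsize=(16.0, 8.0))
ax.plot(df.index, ratio)
ax.set_xticks([1, 32, 61, 92])
ax.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax.set_xlabel("date")
ax.set_ylabel("avg. RTs per tweet")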

Future plans

~~I just wanted to get this posted quickly.~~ Based on the above, here is what I plan to do next.

- Analysis of frequent words
  - Why are there peaks around 1/24 and 2/26?
  - MeCab does not work on Windows for some reason.
- Tweet clustering
  - For example, by clustering tweets per day we could quantify the number of topics on that day, which should organize the tweet contents and make them easier to analyze.
- Building a model to predict RT counts
  - With an appropriate regression model, it might be possible to analyze things like "what kind of tweets tend to get RTs (≒ attract people's interest)"...?
  - It may also be better to frame this as a classification problem.
- Using user information
  - For example, tweets by experts should gather more RTs than those by non-expert users. By using information about the poster, it may be possible to analyze Twitter users' tendencies regarding the new coronavirus in more detail.

In conclusion

This is my first post on Qiita. The analysis did not yield any groundbreaking insights, but I feel I now have a plan for how to dig into this dataset in the future. EDA really is important for deciding that direction, isn't it? I originally wanted to build a proper predictive model and analyze its evaluation and properties, but the urge to publish won out, so I will save that for next time.

Thank you for reading this far. I apologize for the rough analysis and writing; please feel free to share your suggestions, opinions, and advice.

[^1]: Reference: https://www.asahi.com/special/corona/
