On Twitter, the new coronavirus is discussed actively day and night. I wondered whether analyzing these tweets could reveal meaningful tendencies among Twitter users. In this article, I therefore collect tweets about the new coronavirus posted on Twitter and run a simple exploratory analysis on them.
Please feel free to point out any mistakes or hard-to-read parts, or to offer advice. Thank you.
The tweet data used in this article consists of tweets posted between January 1, 2020 and April 1, 2020 that contain any of the keywords "corona", "COVID-19", or "infectious disease". It is limited to Japanese tweets, and only tweets with 10 or more RTs are used. The resulting dataset consists of 47,041 tweets.
Each tweet was saved as a dictionary in the following format.
{
    'text': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'date': datetime.datetime(2020, 1, 1, 1, 0, 1),
    'retweets': 123,
    'favorites': 456,
    'user_id': 7890123,
    'hashtags': ['#yyy', '#zzz'],
    'url': ['https://aaaaaa.com', 'http://bbb.com']
}
From this data we can obtain several quantities, such as text length, posting time, number of RTs, number of likes, and the presence or absence of hashtags and URLs. We use these quantities to read off the characteristics of the dataset.
import os, sys, json, re
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from datetime import datetime
import datetime as dt
%matplotlib inline
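The analysis below assumes the saved dictionaries have been loaded into a pandas DataFrame named `tweets`. A minimal sketch of such a loading step, assuming (hypothetically) that the records are stored one per line in a JSON Lines file `tweets.jsonl`:

```python
# A sketch (not from the original post): load the saved tweet dictionaries
# into the `tweets` DataFrame used throughout this article.
import json
import pandas as pd

records = []
with open("tweets.jsonl", encoding="utf-8") as f:  # hypothetical file name
    for line in f:
        rec = json.loads(line)
        rec["date"] = pd.to_datetime(rec["date"])  # parse the timestamp
        records.append(rec)

tweets = pd.DataFrame(records)
print(len(tweets))  # 47041 for this dataset
```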
# Add the text length as a column and plot its distribution
tweet_len = tweets["text"].str.len()
tweets["text_len"] = tweet_len
tweets["text_len"].hist(bins=range(0, 141, 5))
plt.xlabel("len. of text")
plt.ylabel("num. of tweets")
plt.title("Histogram of text length in tweets")
At the time of data acquisition, a small number of tweets with more than 140 characters were observed (why?), but they are omitted here to keep the figure readable. Many tweets in the dataset contain a large number of characters (≒ information). Since the dataset only includes tweets with 10 or more RTs, it may be that the number of RTs tends to increase with the number of characters. (This point is checked below.)
tweets["retweets"].hist(bins=range(0,4001,100))
plt.xlabel("num. of RT")
plt.ylabel("num. of tweets")
plt.title("Histgram on the number of RT.")
-------------------------------------------------
tweets["favorites"].hist(bins=range(0,4001,100))
plt.xlabel("num. of favorites")
plt.ylabel("num. of tweets")
plt.title("Histgram on the number of favorites.")
In fact, some tweets exceed 100,000 RTs / likes, but the range is restricted here to keep the figures readable. Both distributions fall off overall; however, the number of tweets keeps rising up to around 300 RTs, so one can guess that once a tweet has started to be retweeted, its RT count tends to keep growing up to a certain point.
Now let's look at the correlation between the number of RTs and the number of characters.
# Scatter plot of text length vs. RT count
fig, ax = plt.subplots()
ax.scatter(tweets["text_len"], tweets["retweets"], s=1)
plt.xlim(0, 140)
plt.ylim(0, 5000)
plt.xlabel("len. of texts in tweets")
plt.ylabel("num. of RT")
plt.title("Scatter plot of RT and len. of texts.")
Correlation coefficient: 0.022

Looking at the figure, there appear to be many heavily retweeted tweets with close to 140 characters, but there is essentially no correlation. We therefore cannot say that "tweets that get many RTs tend to be long", nor the converse.
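For reference, the coefficient quoted above can be computed directly from the DataFrame (a small sketch):

```python
# Pearson correlation between text length and RT count
corr = tweets["text_len"].corr(tweets["retweets"])
print(f"Correlation coefficient: {corr:.3f}")
```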
tweets.loc[tweets["hashtags"].str.len() > 0, "has_hashtag"] = 1
tweets.loc[tweets["hashtags"].str.len() <= 0, "has_hashtag"] = 0
tweets["has_hashtag"].hist()
plt.xlabel("has hashtag (1) or not (0)")
plt.ylabel("num. of tweets")
plt.title("Histgram of whether tweets have hashtag(s) or not.")
----------------------------------------------------------------
tweets.loc[tweets["url"].str.len() > 0, "has_url"] = 1
tweets.loc[tweets["url"].str.len() <= 0, "has_url"] = 0
tweets["has_url"].hist()
plt.xlabel("has URL (1) or not (0)")
plt.ylabel("num. of tweets")
plt.title("Histgram of whether tweets have URL(s) or not.")
In this dataset, few tweets have hashtags, while many have URLs. Since more than half of the tweets include a URL, most tweets with 10 or more RTs appear to convey information not only through their text but also through a link.
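The proportions behind this observation can be checked with the flags defined above (a quick sketch):

```python
# Fraction of tweets containing at least one hashtag / at least one URL
print("hashtag ratio:", tweets["has_hashtag"].mean())
print("URL ratio:", tweets["has_url"].mean())
```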
In these respects, the dataset seems to share the characteristics of tweets in general (strictly speaking, one would need to build a comparable dataset of general tweets and compare them...). Tweets in this dataset tend to have long bodies and to add information via URLs rather than hashtags, and both the RT and like counts fall off smoothly; the same is probably true of tweets in general.
In the following, we look at how various quantities changed over the 92 days from 1/1 to 4/1.
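The plots below use a day-level DataFrame `df` with one row per day. A minimal sketch of how it might be built from `tweets` is given here; note that the `kansen` column (daily confirmed infections) comes from the external source cited in the footnote, not from the tweets, so it is only indicated as an assumption:

```python
# A sketch (assumed, not from the original post): aggregate tweet-level data
# into one row per day, matching the column names used in the plots below.
daily = tweets.set_index("date").resample("D")
df = pd.DataFrame({
    "tweets": daily["text"].count(),      # number of tweets per day
    "retweets": daily["retweets"].sum(),  # total RTs per day
})
df.index = range(1, len(df) + 1)          # day numbers 1..92, used on the x axis
# df["kansen"] (daily confirmed infections) would be filled in from the
# external source cited in footnote [^1].
```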
sns.set()
fig, ax = plt.subplots(figsize=(16.0, 8.0))
ax.bar(df.index, df["tweets"], color='#348ABD')  # bars: tweets per day
ax.plot(df.index, df["kansen"], color="blue")    # line: confirmed infections per day (same scale)
ax.set_xticks([1,32,61,92])
ax.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax.set_xlabel("date")
The horizontal axis is the date. The bars show the number of tweets per day, and the line shows the number of new coronavirus infections confirmed in Japan[^1]. The vertical-axis scale is shared by both.
Several peaks are visible in the figure above. At each peak the number of tweets rises over several days rather than a single day, so these peaks are probably not outliers; something seems to have been attracting users' attention during those periods.
The number of confirmed infections is also plotted ~~which took some effort~~, but there does not seem to be much correlation with the number of tweets. From this, users appear to react more strongly to downstream events (such as political decisions or event cancellations) than to the increase in the number of new coronavirus patients (the spread of infection) itself.
Now let's analyze what causes each peak.
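As a first step, the days with the largest tweet counts can be listed as candidate peak dates (a sketch using the day-level `df` assumed above):

```python
# List the ten days with the most tweets as candidate peaks
print(df.sort_values("tweets", ascending=False).head(10))
```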
For each peak in the previous figure, as well as a few dates of personal interest, I checked the actual tweet contents and the announcements from the Ministry of Health, Labour and Welfare, and annotated the figure with the events that appear to have caused each peak.
For the two important peaks around 1/28 and 2/26, I read through more than 100 tweets in the dataset, but their content was too varied for me to pin down what caused the peaks. I plan to confirm this in detail later by analyzing frequent words and RT counts.
sns.set_style("dark")
fig, ax1 = plt.subplots(figsize=(16.0, 8.0))
ax1.bar(df.index, df["retweets"], color='#348ABD')  # bars (left axis): total RTs per day
ax2 = ax1.twinx()
ax2.plot(df.index, df["kansen"], color="blue")       # line (right axis): confirmed infections per day
ax2.set_ylim(0,2500)
ax1.set_xticks([1,32,61,92])
ax1.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax1.set_xlabel("date")
ax1.set_ylabel("num. of retweets")
ax2.set_ylabel("num. of infected people")
In the figure above, the bars and the left vertical axis show the total number of RTs per day, and the line and the right vertical axis show the number of confirmed infections per day. The shape closely resembles that of the daily tweet counts; again, there seems to be no clear relation to the number of infections confirmed each day.
Now, let's compare the number of tweets per day with the number of RTs per day.
sns.set_style("dark")
fig, ax1 = plt.subplots(figsize=(16.0, 8.0))
ax1.bar(df.index, df["tweets"], color='#348ABD', alpha=0.7)  # bars (left axis): tweets per day
ax2 = ax1.twinx()
ax2.plot(df.index, df["retweets"], color="red")              # line (right axis): total RTs per day
ax2.set_ylim(0,3000000)
ax1.set_xticks([1,32,61,92])
ax1.set_xticklabels(["01/01", "02/01", "03/01", "04/01"])
ax1.set_xlabel("date")
ax1.set_ylabel("num. of tweets")
ax2.set_ylabel("num. of retweets")
In the figure above, the bars and the left vertical axis show the number of tweets per day, and the red line and the right vertical axis show the total number of RTs per day. On most days the two quantities appear to move together. Only around 3/25-28 does the pattern look different: the number of tweets (with 10 or more RTs) is high relative to the number of RTs. One possible explanation is that this period had unusually many coronavirus-related topics compared with other periods, so users could not keep up with all of them (and therefore did not retweet each one as much).
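This visual impression can be checked roughly by computing the day-level correlation between the two series (a sketch):

```python
# Correlation between tweets per day and total RTs per day
print(df["tweets"].corr(df["retweets"]))
```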
~~I wanted to get this posted quickly.~~ Based on the above, here is what I plan to do next.
- Analysis of frequent words
  - Why are there peaks around 1/24 and 2/26?
  - MeCab on Windows doesn't work for some reason.
- Tweet clustering
  - For example, by clustering the tweets of each day we could quantify the number of topics on that day, and the contents of the tweets should become better organized and easier to analyze.
- Building an RT-count prediction model
  - With an appropriate regression model, it might be possible to analyze questions like "what kinds of tweets tend to be retweeted (≒ attract people's interest)"...?
  - It may be better to frame this as a classification problem.
- Making use of user information
  - For example, experts' tweets probably receive more RTs than those of non-expert users. By exploiting information about the poster in this way, the tendencies of Twitter users regarding the new coronavirus could be analyzed in more detail.
This was my first post on Qiita. The analysis did not yield any groundbreaking findings, but I feel I have now grasped how to dig into this dataset going forward. EDA really is important for setting that direction. I actually wanted to build a solid predictive model and analyze its performance and properties, but my desire to get this posted won out, so I will leave that for next time.
Thank you for reading this far. I apologize for the rough analysis and writing; please feel free to share any suggestions, opinions, or advice.
[^1]: Reference: https://www.asahi.com/special/corona/