This is a continuation of the previous article.
We will expand the dataset we built last time for further analysis.
This article has two goals:

- Understand how the topic of the new coronavirus is changing on Twitter
- Gain insight into tweet content and user interests by quantifying the content of tweets and predicting the number of RTs
To that end, we analyze tweet data posted on Twitter.
The dataset consists of tweets posted between January 1, 2020 and **April 30, 2020** that contain any of "Corona", "COVID-19", or "Infectious Diseases". Every tweet in the dataset has been retweeted (RT) more than 100 times.
(29 days worth of tweets have been added to the dataset since the last dataset.)
The dataset grew from 47071 tweets last time to 79562 tweets this time.
The figure above shows the number of tweets per day in the constructed dataset. The circled numbers in the figure correspond to the observations below. (See the previous article for a discussion of 01/01 to 04/01 in this graph.)
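A per-day count like the one plotted in the figure can be obtained with a simple tally. The following is only a minimal sketch: the record layout, the `created_at` field name, and the sample values are assumptions for illustration, since the article does not show the dataset's actual schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample records; the real dataset's schema is not shown in the article.
tweets = [
    {"created_at": "2020-04-07T09:15:00", "text": "..."},
    {"created_at": "2020-04-07T21:40:00", "text": "..."},
    {"created_at": "2020-04-08T08:05:00", "text": "..."},
]


def count_tweets_per_day(records):
    """Return {date string: number of tweets posted on that date}, sorted by date."""
    counts = Counter(
        datetime.fromisoformat(r["created_at"]).date().isoformat() for r in records
    )
    return dict(sorted(counts.items()))


daily_counts = count_tweets_per_day(tweets)
print(daily_counts)
```

The resulting mapping can be passed to any plotting library to draw a daily time series.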
From the frequency of occurrence of words in daily tweets, you can infer what the user was interested in that day. In this article, we have applied the following pre-processing to all tweets.
```python
import re

import zenhan
from tqdm import tqdm


def pre_process(texts):
    # mecab_wakati() and load_stopwords() are the author's helpers (not shown here).
    stopwords = load_stopwords()  # load once instead of once per tweet
    texts_mod = []
    for text in tqdm(texts):
        text = re.sub(r'\d+', '', text)  # remove digits (possibly unnecessary)
        text = zenhan.z2h(text)          # full-width to half-width
        text = mecab_wakati(text)        # word segmentation (this alone may be enough)
        text = text.lower()              # unify letter case
        # remove stopwords (possibly unnecessary)
        for sw in stopwords:
            if sw in text:
                text = text.replace(sw, '')
        texts_mod.append(text)
    return texts_mod
```
Note that the morphological analysis keeps only the nouns, adjectives, and adjectival nouns (na-adjectives) in each tweet.
Next, let's aggregate the words of each day and output the frequently-used words.
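The aggregation can be sketched as follows. This is an illustrative version rather than the code used for the article: the `(date, text)` input format, the sample tweets, and the `EXCLUDED` set are all assumptions, and the texts are assumed to be already space-separated by the preprocessing step.

```python
from collections import Counter, defaultdict

# Hypothetical input: (date, preprocessed text) pairs, where each text is
# already space-separated by the word-segmentation step.
dated_texts = [
    ("2020-04-07", "emergency declaration tokyo"),
    ("2020-04-07", "emergency declaration closed"),
    ("2020-04-07", "emergency notice"),
    ("2020-04-08", "expansion countermeasures"),
    ("2020-04-08", "expansion impact"),
]

# Search keywords left out of the ranking (illustrative values).
EXCLUDED = {"corona", "covid-19"}


def top_words_per_day(pairs, n=10):
    """Return {date: [the n most frequent words on that date]}."""
    counters = defaultdict(Counter)
    for date, text in pairs:
        counters[date].update(w for w in text.split() if w not in EXCLUDED)
    return {d: [w for w, _ in c.most_common(n)] for d, c in sorted(counters.items())}


top_words = top_words_per_day(dated_texts, n=2)
print(top_words)
```

Keeping one `Counter` per date means the tweets only need to be scanned once, whatever the number of days.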
date | Frequent words |
---|---|
2020-02-01 | pneumonia,Absent,Coronavirus,Countermeasures,Expansion,Virus,Wuhan,prevention,Medical,Alcohol |
2020-02-02 | pneumonia,Wuhan,Countermeasures,Treatment,Coronavirus,Virus,Influenza,Presentation,Expansion,death |
2020-02-03 | pneumonia,Expansion,Inspection,Countermeasures,Wuhan,Influenza,Virus,Coronavirus,diffusion,Mask |
2020-02-04 | pneumonia,Wuhan,Verification,hospital,Expansion,Absent,Hong Kong, who,Countermeasures,death |
2020-02-05 | pneumonia,Mask,Cruise,Verification,Inspection,Absent,Wuhan,Countermeasures,Virus,Expansion |
2020-02-06 | pneumonia,Wuhan,Virus,Coronavirus,Correspondence,Mask,Inspection,the study,Impact,Absent |
2020-02-07 | Doctor,Cruise,pneumonia,Verification,Virus,Expansion,Coronavirus,Influenza,Wuhan,Absent |
2020-02-08 | pneumonia,Wuhan,death,Mask,Coronavirus,Presentation,Hospitalization,Doubt,Absent,Countermeasures |
2020-02-09 | pneumonia,Aerosol,Coronavirus,Cruise,Verification,Countermeasures,Absent, who,Wuhan,Expansion |
2020-02-10 | Cruise,Inspection,pneumonia,Verification,Countermeasures,Coronavirus,Presentation,Expansion,Wuhan,Correspondence |
2020-02-11 | pneumonia,Coronavirus,Correspondence,Wuhan,Verification,Medical,Absent,Abe,Countermeasures,Inspection |
2020-02-12 | covid,quarantine,pneumonia, who,Inspection,Countermeasures,Mask,Cruise,Absent,Virus |
2020-02-13 | Verification,death,pneumonia,Inspection,operation,Tokyo,Taxi,Kanagawa,Breaking news,Countermeasures |
2020-02-14 | Countermeasures,Verification,Inspection,Abe,Correspondence,pneumonia,Expansion,Absent,Presentation,Doctor |
2020-02-15 | Countermeasures,Inspection,Abe,Correspondence,Coronavirus,Absent,Verification,administration,Tokyo,Specialty |
2020-02-16 | Countermeasures,Abe,Inspection,Correspondence,Absent,pneumonia,Specialty,Expansion,Coronavirus,Verification |
2020-02-17 | Countermeasures,Expansion,Absent,Coronavirus,Presentation,Abe,Opposition,Inspection,Correspondence,pneumonia |
2020-02-18 | Countermeasures,Opposition,Correspondence,Inspection,Symptoms,pneumonia,Held,Absent,Verification,Expansion |
2020-02-19 | Countermeasures,Iwata,Specialty,Princess,Diamond mode,Expansion,Abe,Correspondence, covid,Kentaro |
2020-02-20 | Countermeasures,Held,Cruise,Verification,Expansion,Iwata,Event,Correspondence,plans,Absent |
2020-02-21 | Expansion,Verification,Countermeasures,Inspection,pneumonia,Held,Please,Notice,plans,Event |
2020-02-22 | Inspection,Countermeasures,Verification,Expansion,pneumonia,Correspondence,Cruise,Coronavirus,Virus,Abe |
2020-02-23 | Inspection,Countermeasures,Abe,Absent,Correspondence,Coronavirus,pneumonia,Verification,Expansion,Virus |
2020-02-24 | Inspection,Countermeasures,Abe,Absent,Expansion,Correspondence,pneumonia,hospital,Presentation,Coronavirus |
2020-02-25 | Inspection,Countermeasures,Expansion,Absent,Held,Correspondence,Coronavirus,Postponed,Abe,Verification |
2020-02-26 | Inspection,Expansion,Countermeasures,Held,Notice,Performance,Event,Postponed,Correspondence,plans |
2020-02-27 | Expansion,Countermeasures,Notice,Held,Inspection,Event,plans,Performance,Postponed,Prevention |
2020-02-28 | Expansion,Countermeasures,Notice,Prevention,Held,Inspection,Abe,Verification,budget,Please |
2020-02-29 | Countermeasures,Abe,Inspection,Verification,Expansion,Correspondence,Absent,Coronavirus,Presentation,pneumonia |
2020-03-01 | Countermeasures,Inspection,Absent,Abe,Expansion,Mask,Coronavirus,pneumonia,Verification,Correspondence |
2020-03-02 | Expansion,Countermeasures,Abe,Inspection,Impact,Coronavirus,Specialty,emergency,Absent,Notice |
2020-03-03 | Expansion,Countermeasures,pneumonia,Inspection,Absent,Verification,Abe,Impact,Wuhan,emergency |
2020-03-04 | Expansion,Countermeasures,Abe,Inspection,Verification,Absent,pneumonia,emergency,Correspondence,Held |
2020-03-05 | Countermeasures,Abe,Expansion,Verification,Absent,Impact,Presentation,Postponed,emergency,Correspondence |
2020-03-06 | Countermeasures,Expansion,Inspection,Impact,Verification,Absent,Abe,Correspondence,Held,Coronavirus |
2020-03-07 | Countermeasures,Abe,Inspection,Expansion,Mask,Verification,Absent,Correspondence,Medical,Impact |
2020-03-08 | Countermeasures,Inspection,Absent,Abe,Verification,Coronavirus,Italy,Virus,Expansion,hospital |
2020-03-09 | Countermeasures,Expansion,Abe,Impact,Absent,Medical,Inspection,emergency,Postponed,Virus |
2020-03-10 | Countermeasures,Expansion,Inspection,Absent,Specialty,Abe,emergency,Correspondence,Virus,Impact |
2020-03-11 | Inspection,Expansion,Held,Absent,Countermeasures,Impact,Medical,plans,Postponed,consumption |
2020-03-12 | Inspection,Countermeasures,Expansion,Medical,Absent,Presentation,Pandemic, who,Impact,plans |
2020-03-13 | Countermeasures,Expansion,Inspection,Postponed,Impact,Held,Presentation,Possible,emergency,plans |
2020-03-14 | Inspection,Countermeasures,Abe,Absent,Declaration,President,Correspondence,Expansion,Coronavirus,Trump |
2020-03-15 | Inspection,Countermeasures,Expansion,Absent,Medical,hospital,Abe,Impact,Correspondence,emergency |
2020-03-16 | Inspection,Countermeasures,Expansion,Correspondence,Absent,Coronavirus,pneumonia,Impact,Abe,Italy |
2020-03-17 | Inspection,Expansion,Countermeasures,Verification,Abe,Absent,Impact, who,Performance,Held |
2020-03-18 | Countermeasures,Expansion,Inspection,Virus,Absent,Impact,Held,death,plans,Benefits |
2020-03-19 | Expansion,Countermeasures,Inspection,Impact,Absent,Economy,Verification,Held,Osaka,Correspondence |
2020-03-20 | Expansion,Countermeasures,Absent,Inspection,Presentation,Impact,Held,Verification,Italy,Postponed |
2020-03-21 | Countermeasures,Expansion,Inspection,Absent,Tokyo,Impact,Italy,hospital,Held,Presentation |
2020-03-22 | Countermeasures,Inspection,Expansion,Absent,Self-restraint,Economy,Verification,Impact,Italy,scale |
2020-03-23 | Countermeasures,Expansion,Tokyo,Absent,Postponed,Held,Economy,Presentation,Inspection,Verification |
2020-03-24 | Countermeasures,Postponed,Tokyo,Expansion,Inspection,Impact,Absent,Medical,Held,Economy |
2020-03-25 | Countermeasures,Tokyo,Economy,Inspection,Postponed,Verification,Expansion,Impact,Absent,emergency |
2020-03-26 | Countermeasures,Expansion,Tokyo,Self-restraint,Inspection,Notice,Verification,Absent,Impact,Postponed |
2020-03-27 | Expansion,Countermeasures,Inspection,Tokyo,Self-restraint,Impact,Verification,Presentation,Notice,Held |
2020-03-28 | Countermeasures,Verification,Expansion,Self-restraint,Tokyo,Absent,Abe,Inspection,Medical,hospital |
2020-03-29 | Countermeasures,Self-restraint,Absent,Tokyo,Inspection,Expansion,Verification,Economy,hospital,Abe |
2020-03-30 | Ken,Countermeasures,Expansion,Tokyo,pneumonia,Absent,Died,Verification,Self-restraint,Inspection |
2020-03-31 | Expansion,Countermeasures,Verification,Tokyo,Presentation,Self-restraint,Absent,Impact,Inspection,Please |
2020-04-01 | Countermeasures,Verification,Expansion,Mask,Medical,Absent,Tokyo,Presentation,Abe,Status |
2020-04-02 | Mask,Expansion,Countermeasures,Verification,Tokyo,Medical,Absent,Inspection,support,Impact |
2020-04-03 | Expansion,Countermeasures,Benefits,Verification,Tokyo,Household,Impact, nhk,Notice,Held |
2020-04-04 | Countermeasures,Expansion,Verification,Inspection,Medical,Mask,Tokyo,hospital,Absent, nhk |
2020-04-05 | Countermeasures,Tokyo,Expansion,Inspection,Medical,Absent,Verification,hospital,Self-restraint,Coronavirus |
2020-04-06 | emergency,Expansion,Countermeasures,Declaration,Inspection,Tokyo,Impact,Notice,Self-restraint,Verification |
2020-04-07 | emergency,Declaration,Expansion,Countermeasures,Tokyo,Notice,Verification,Absent,Closed,Please |
2020-04-08 | Expansion,Countermeasures,emergency,Declaration,Verification,Impact,Notice,Absent,Self-restraint,Postponed |
2020-04-09 | Countermeasures,Expansion,Verification,emergency,Inspection,Declaration,Impact,Tokyo,Absent,Medical |
2020-04-10 | Expansion,Countermeasures,Verification,Impact,Inspection,Medical,emergency,Absent,Tokyo,Notice |
2020-04-11 | Countermeasures,Expansion,Abe,Absent,Inspection,Self-restraint,Medical,Verification,emergency,hospital |
2020-04-12 | Countermeasures,Inspection,Verification,Abe,Expansion,Medical,Absent,Mask,hospital,Tokyo |
2020-04-13 | Expansion,Countermeasures,emergency,Inspection,Impact,Verification,Absent,Declaration,Closed,Presentation |
2020-04-14 | Expansion,Countermeasures,emergency,Medical,Impact,Verification,Correspondence,Absent,hospital,Abe |
2020-04-15 | Countermeasures,Expansion,hospital,Medical,Abe,Absent,Correspondence,Impact,Tokyo,Verification |
2020-04-16 | Countermeasures,Expansion,Benefits,emergency,Absent,Medical,Inspection,support,Impact,Mask |
2020-04-17 | Countermeasures,Expansion,Inspection,emergency,Medical,Impact,Correspondence,Abe,Benefits,Absent |
2020-04-18 | Medical,Countermeasures,Inspection,Absent,Expansion,Verification,Mask,Abe,hospital,necessary |
2020-04-19 | Countermeasures,Medical,Expansion,Absent,Mask,Inspection,Correspondence,the study,Impact,hospital |
2020-04-20 | Inspection,Countermeasures,Expansion,Medical,Impact,Verification,Absent, pcr,Turned out,Abe |
2020-04-21 | Countermeasures,Expansion,Impact,Absent,Medical,Benefits,Mask,Inspection,Abe,support |
2020-04-22 | Inspection,Expansion,Countermeasures,Medical,Absent,Verification,Impact, pcr, nhk,Self-restraint |
2020-04-23 | Okae,Inspection,Kumi,Medical,hospital,Expansion,Countermeasures,Absent,home,support |
2020-04-24 | Expansion,Inspection,Countermeasures,Medical,Impact,Verification,Absent, nhk,Tokyo,Please |
2020-04-25 | Inspection,Countermeasures,Medical,Expansion,Self-restraint,Economy,Verification,Absent, nhk,support |
2020-04-26 | Inspection,Countermeasures,Medical,Absent,Mask,Self-restraint,Expansion,Impact,Tokyo, rt |
2020-04-27 | Countermeasures,Expansion,Inspection,Medical,support,Absent,Self-restraint,Impact,Economy,new |
2020-04-28 | Expansion,Inspection,Countermeasures,Impact,Absent,support,Medical,Tokyo, news, nhk |
2020-04-29 | Countermeasures,Inspection,emergency,Expansion,Medical,Absent,Impact,Symptoms,Abe,Economy |
2020-04-30 | Inspection,Countermeasures,Expansion,Medical,Impact,Absent,Abe,Self-restraint,support,Verification |
The table above shows the 10 most frequent words for each day from February to April. Note that "Coronavirus", "COVID-19", and "Infectious disease" are excluded from the output. The parts of the table that deserve attention are shown in red.
① From January to early February, the frequent words include "pneumonia", "Influenza", "Wuhan", "Verification" (confirmation of cases), and "Cruise". This shows that the discussion centered on the symptoms of the new coronavirus and on the situation in Wuhan and on the cruise ship, where infections had been confirmed.
② From February 14, words related to the actions of the Japanese government, such as "Countermeasures", "Presentation" (announcements), and "Correspondence" (responses), appear frequently, which confirms that the topics users cared about had changed. In the graph shown in [Data Details](#Data Details), the number of tweets also rises sharply from around February 14. This suggests that users were more interested in the government's actions than in the damage caused by the new coronavirus itself.
③ For the period around February 26, the previous article stated that the cause of the increase in the number of tweets was unknown. Here, many of the frequent words relate to events: "Held", "Performance", "Event", and "Postponed". It is therefore likely that the number of tweets and RTs increased because events that users cared about were postponed due to the new coronavirus.
Taken together, ① to ③ suggest that users tend to be more interested in things that happen closer to them. (The topics move from Wuhan and the cruise ship, to the government response, to the postponement of events, each more directly related to users than the last.)
④ From March 23, the word "Tokyo" appears among the frequent words. This is thought to reflect the rapid increase in infections in Tokyo. In the graph shown in [Data Details](#Data Details), the number of tweets also rises sharply from this date, which shows a high level of interest in the situation in Tokyo.
⑤ On April 6 and 7, many tweets mention the state of emergency declared for several cities, including Tokyo. Many users were probably interested because it directly affected daily life through requests to refrain from going out and the closure of various stores.
⑥ On February 19, March 30, and April 23, the frequent words reflect specific stories: Kentaro Iwata's posts about the cruise ship, the news about Ken Shimura, and the news about Kumiko Okae, respectively. In the graph shown in [Data Details](#Data Details), each of these dates has a peak lasting several days, which shows that many users remained interested in these topics for several days. Topics about an individual also appear to fade faster than topics that affect many users.
Next, we quantify the content of tweets and build a model that predicts the number of RTs (that is, how popular a tweet is), in order to verify whether the number of RTs differs depending on the content of the tweet and the words it contains.
The content of each tweet is quantified with TF-IDF, and the number of RTs is predicted with LightGBM.
TF-IDF: Many explanatory articles are available on the Web, so please see them for details[^1][^2]. Qualitatively, it quantifies the importance of each word in a sentence. Each tweet is represented as a vector with one dimension per distinct word across all tweets. For this dataset, TF-IDF produced a matrix of 79562 tweets x 52366 dimensions.
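As a rough illustration of this step, the sketch below builds a TF-IDF matrix with scikit-learn's `TfidfVectorizer`. The choice of library and its default settings are assumptions, since the article does not show its vectorization code, and the three sample texts are toy stand-ins for preprocessed tweets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for preprocessed, space-separated tweets.
texts = [
    "emergency declaration tokyo",
    "event postponed notice",
    "tokyo inspection expansion",
]

vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(texts)  # sparse matrix: (number of tweets, vocabulary size)

print(x.shape)
print(sorted(vectorizer.vocabulary_))  # the distinct words, one per dimension
```

Note that the default `token_pattern` of `TfidfVectorizer` ignores one-character tokens, which may matter for Japanese text where single-character words are common.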
LightGBM: For details, please see the explanations on the Web[^3][^4]. It is a gradient boosting model built on decision trees and can perform both regression and classification. It can also readily report which dimensions contributed to the regression or classification, so combining it with TF-IDF makes it easy to see which content contributed to the number of RTs.
This time, 80% of the tweets in the dataset were used to train the model, and the remaining 20% were used as test data to check the accuracy and the contribution of each dimension. Since the features contain only the tweet content and not the posting date, the problem that LightGBM solves here can be interpreted as "what are users interested in regarding the new coronavirus, regardless of time?"
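An 80/20 split of this kind might look like the sketch below. The article does not say how the split was made, so the use of scikit-learn's `train_test_split`, the random shuffling, and the fixed `random_state` are all assumptions; the feature rows and RT counts are toy values.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: one feature row and one RT count per tweet.
x = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1],
     [0.3, 0.7], [0.6, 0.4], [0.1, 0.9], [0.8, 0.2], [0.4, 0.6]]
y = [120, 450, 101, 3000, 210, 150, 980, 105, 640, 330]

# 80% for training, 20% for testing; random_state is fixed only for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(len(x_train), len(x_test))
```

A random split mixes tweets from all dates into both sets, which matches the time-independent reading of the prediction problem.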
```python
import lightgbm as lgb

# x_train / x_test: TF-IDF features, y_train / y_test: number of RTs
train_data = lgb.Dataset(x_train, label=y_train)
test_data = lgb.Dataset(x_test, label=y_test, reference=train_data)  # not used below

params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
}

gbm = lgb.train(params, train_data)

preds = gbm.predict(x_test)
```
Next, we quantitatively evaluate how well the trained model predicts the number of RTs. We use Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), two metrics that are often used to evaluate regression models that predict numeric values. We leave the details of each to other articles. Put simply, the smaller the value, the more accurate the prediction.
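For reference, the standard definitions of the two metrics are given below. The notation is introduced here for illustration: $y_i$ is the actual number of RTs of the $i$-th test tweet, $\hat{y}_i$ is the predicted number, and $n$ is the number of test tweets.

```latex
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

Because RMSE squares each error before averaging, it penalizes a few very large errors more heavily than MAE does.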
However, the values of these metrics alone do not tell us how good the model is, so they need to be compared with some other model. As a baseline, we use a model (BaseLine) that always predicts the number of RTs of every tweet to be the average number of RTs in the dataset (1429.36).
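The comparison can be sketched in a few lines of plain Python. The RT counts and predictions below are toy values chosen for illustration, not figures from the dataset, and the baseline mean is taken from the toy data itself.

```python
import math

# Toy RT counts and model predictions (illustrative values only).
y_true = [100, 200, 300, 1400]
y_pred = [150, 150, 400, 1000]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Baseline: always predict the mean RT count of the (toy) data.
mean_rt = sum(y_true) / len(y_true)
baseline_pred = [mean_rt] * len(y_true)

print("model   ", mae(y_true, y_pred), rmse(y_true, y_pred))
print("baseline", mae(y_true, baseline_pred), rmse(y_true, baseline_pred))
```

A model is only useful if it beats this constant prediction on the same test data.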
Model | MAE | RMSE |
---|---|---|
LightGBM | 1462.95 | 4565.19 |
BaseLine | 1429.36 | 4559.26 |
In the table above, BaseLine outperforms LightGBM on both metrics, suggesting that the number of RTs cannot be predicted accurately from the content of the tweet alone (TF-IDF + LightGBM).
Next, let's output the Feature Importance of each dimension, measured as the number of times the dimension is used in a split in LightGBM's trees.
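```python
import pandas as pd

# x_condition: the fitted vectorizer (use get_feature_names() on scikit-learn < 1.0)
importance = pd.DataFrame(gbm.feature_importance(),
                          index=x_condition.get_feature_names_out(), columns=['importance'])
display(importance.sort_values("importance", ascending=False).head(100))
```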
word | Feature Importance |
---|---|
Corona | 138 |
Coronavirus | 111 |
Comedian | 99 |
Our company | 95 |
rt | 93 |
cov | 85 |
beneficial | 81 |
Ryokan | 78 |
Uncle | 77 |
just | 77 |
New model | 71 |
com | 64 |
The higher a word appears in the table above, the more important LightGBM judged it to be for predicting the number of RTs. However, words such as "Corona", "rt", and "com" (probably from URLs) appear in many tweets regardless of the number of RTs, which qualitatively confirms that LightGBM did not learn well. On the other hand, "Comedian", for example, is probably a word extracted from tweets about Ken Shimura, so LightGBM may have captured the feature noted above, that topics about prominent individuals produce steep peaks.
Based on the above, LightGBM could not solve the problem of "what are users interested in regarding the new coronavirus, regardless of time?" Since users' interests shift over time ("trends in Japan", "postponement of events", and so on), this suggests that there is no content about the new coronavirus that will always attract users' attention regardless of when it is posted.
This article suggests the following points.

- Regarding users' interests, based on the frequent words in tweets and the change in the number of tweets per day:
  - Topics involving prominent individuals produce a local peak in the number of related tweets lasting several days, but interest declines quickly.
  - Users are more interested in topics that are closer to them. Surprisingly, users are more concerned about the postponement of events than about the infectious disease itself or the government response.
- There is no content that reliably increases the number of RTs regardless of when it is posted.
In the future, we would like to carry out an analysis using user information, based on the hypothesis that tweets from well-known users (for example, accounts with many followers or official accounts) are seen by more users and therefore attract attention more easily.
[^1]: TF-IDF Reference (1): https://qiita.com/AwaJ/items/5937665d5a4152cc24cf
[^2]: TF-IDF Reference (2): https://dev.classmethod.jp/articles/yoshim_2017ad_tfidf_1-2/
[^3]: LightGBM Reference (1): https://www.codexa.net/lightgbm-beginner/
[^4]: LightGBM Reference (2): https://qiita.com/ryo_naka/items/f479e5b7cb49fb55f150