[PYTHON] I analyzed the tweets about the new coronavirus posted on Twitter Part 2

Overview

This is a continuation of the previous article.

We will expand the dataset we built last time for further analysis.

This article has two goals:

- Understand how topics related to the new coronavirus change over time on Twitter
- Gain insight into tweet content and user interests by quantifying the content of tweets and predicting the number of retweets (RTs)

To pursue these goals, we analyze tweet data posted on Twitter.

Data details

The tweet data used in this article consists of tweets posted between January 1, 2020 and **April 30, 2020** that contain any of "Corona", "COVID-19", or "Infectious Diseases". Each tweet in the dataset has been retweeted (RT) more than 100 times.

(Compared with the previous dataset, 29 days' worth of tweets have been added.)

The size of the dataset grew from 47071 tweets last time to 79562 tweets this time.

[Figure: Number of tweets per day]

The figure above shows the number of tweets per day in the constructed dataset. The circled numbers in the figure correspond to the numbered observations below. (See the previous article for a discussion of the period from 01/01 to 04/01 in this graph.)

Analysis of frequently-used words

The frequency of words in each day's tweets indicates what users were interested in on that day. In this article, the following pre-processing is applied to all tweets.

import re

import zenhan
from tqdm import tqdm


def pre_process(texts):
    # mecab_wakati() and load_stopwords() are helper functions defined elsewhere
    stopwords = load_stopwords()  # Load once instead of once per tweet
    texts_mod = []
    for text in tqdm(texts):
        text = re.sub(r'\d+', '', text)  # Remove digits (possibly unnecessary)
        text = zenhan.z2h(text)  # Full-width to half-width
        text = mecab_wakati(text)  # Word segmentation (this alone may be enough)
        text = text.lower()  # Normalize to lowercase

        # Remove stopwords (possibly unnecessary)
        for sw in stopwords:
            text = text.replace(sw, '')

        texts_mod.append(text)

    return texts_mod

Note that the morphological analysis keeps only the nouns, adjectives, and adjectival verbs of each tweet.

Next, let's aggregate the words of each day and output the frequently-used words.
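The aggregation step can be sketched as follows. This is a minimal illustration and not the code used to produce the table: it assumes each tweet is available as a `(date, text)` pair whose text has already been pre-processed into space-separated words. The function name `top_words_per_day`, the `excluded` parameter, and the toy data are hypothetical and introduced here only for illustration.

```python
from collections import Counter, defaultdict


def top_words_per_day(tweets, excluded=frozenset(), top_n=10):
    """Return {date: [most frequent words]} for pre-tokenized tweets.

    tweets   -- iterable of (date, text) pairs; text is space-separated words
    excluded -- words left out of the ranking (e.g. the search keywords)
    """
    counts = defaultdict(Counter)
    for date, text in tweets:
        words = [w for w in text.split() if w not in excluded]
        counts[date].update(words)
    return {
        date: [word for word, _ in counter.most_common(top_n)]
        for date, counter in counts.items()
    }


# Toy data made up for illustration (not taken from the dataset)
sample = [
    ("2020-02-01", "pneumonia wuhan corona"),
    ("2020-02-01", "pneumonia mask corona"),
    ("2020-02-02", "cruise corona cruise"),
]
result = top_words_per_day(sample, excluded={"corona"}, top_n=2)
for date, words in sorted(result.items()):
    print(date, ",".join(words))
```

Keeping one `Counter` per date makes it cheap to change `top_n` or the excluded words without re-reading the tweets.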

**Table: Daily frequent words**
Date Frequent words (top 10)
2020-02-01 pneumonia,Absent,Coronavirus,Countermeasures,Expansion,Virus,Wuhan,prevention,Medical,Alcohol
2020-02-02 pneumonia,Wuhan,Countermeasures,Treatment,Coronavirus,Virus,Influenza,Presentation,Expansion,death
2020-02-03 pneumonia,Expansion,Inspection,Countermeasures,Wuhan,Influenza,Virus,Coronavirus,diffusion,Mask
2020-02-04 pneumonia,Wuhan,Verification,hospital,Expansion,Absent,Hong Kong, who,Countermeasures,death
2020-02-05 pneumonia,Mask,Cruise,Verification,Inspection,Absent,Wuhan,Countermeasures,Virus,Expansion
2020-02-06 pneumonia,Wuhan,Virus,Coronavirus,Correspondence,Mask,Inspection,the study,Impact,Absent
2020-02-07 Doctor,Cruise,pneumonia,Verification,Virus,Expansion,Coronavirus,Influenza,Wuhan,Absent
2020-02-08 pneumonia,Wuhan,death,Mask,Coronavirus,Presentation,Hospitalization,Doubt,Absent,Countermeasures
2020-02-09 pneumonia,Aerosol,Coronavirus,Cruise,Verification,Countermeasures,Absent, who,Wuhan,Expansion
2020-02-10 Cruise,Inspection,pneumonia,Verification,Countermeasures,Coronavirus,Presentation,Expansion,Wuhan,Correspondence
2020-02-11 pneumonia,Coronavirus,Correspondence,Wuhan,Verification,Medical,Absent,Abe,Countermeasures,Inspection
2020-02-12 covid,quarantine,pneumonia, who,Inspection,Countermeasures,Mask,Cruise,Absent,Virus
2020-02-13 Verification,death,pneumonia,Inspection,operation,Tokyo,Taxi,Kanagawa,Breaking news,Countermeasures
2020-02-14 Countermeasures,Verification,Inspection,Abe,Correspondence,pneumonia,Expansion,Absent,Presentation,Doctor
2020-02-15 Countermeasures,Inspection,Abe,Correspondence,Coronavirus,Absent,Verification,administration,Tokyo,Specialty
2020-02-16 Countermeasures,Abe,Inspection,Correspondence,Absent,pneumonia,Specialty,Expansion,Coronavirus,Verification
2020-02-17 Countermeasures,Expansion,Absent,Coronavirus,Presentation,Abe,Opposition,Inspection,Correspondence,pneumonia
2020-02-18 Countermeasures,Opposition,Correspondence,Inspection,Symptoms,pneumonia,Held,Absent,Verification,Expansion
2020-02-19 Countermeasures,Iwata,Specialty,Princess,Diamond mode,Expansion,Abe,Correspondence, covid,Kentaro
2020-02-20 Countermeasures,Held,Cruise,Verification,Expansion,Iwata,Event,Correspondence,plans,Absent
2020-02-21 Expansion,Verification,Countermeasures,Inspection,pneumonia,Held,Please,Notice,plans,Event
2020-02-22 Inspection,Countermeasures,Verification,Expansion,pneumonia,Correspondence,Cruise,Coronavirus,Virus,Abe
2020-02-23 Inspection,Countermeasures,Abe,Absent,Correspondence,Coronavirus,pneumonia,Verification,Expansion,Virus
2020-02-24 Inspection,Countermeasures,Abe,Absent,Expansion,Correspondence,pneumonia,hospital,Presentation,Coronavirus
2020-02-25 Inspection,Countermeasures,Expansion,Absent,Held,Correspondence,Coronavirus,Postponed,Abe,Verification
2020-02-26 Inspection,Expansion,Countermeasures,Held,Notice,Performance,Event,Postponed,Correspondence,plans
2020-02-27 Expansion,Countermeasures,Notice,Held,Inspection,Event,plans,Performance,Postponed,Prevention
2020-02-28 Expansion,Countermeasures,Notice,Prevention,Held,Inspection,Abe,Verification,budget,Please
2020-02-29 Countermeasures,Abe,Inspection,Verification,Expansion,Correspondence,Absent,Coronavirus,Presentation,pneumonia
2020-03-01 Countermeasures,Inspection,Absent,Abe,Expansion,Mask,Coronavirus,pneumonia,Verification,Correspondence
2020-03-02 Expansion,Countermeasures,Abe,Inspection,Impact,Coronavirus,Specialty,emergency,Absent,Notice
2020-03-03 Expansion,Countermeasures,pneumonia,Inspection,Absent,Verification,Abe,Impact,Wuhan,emergency
2020-03-04 Expansion,Countermeasures,Abe,Inspection,Verification,Absent,pneumonia,emergency,Correspondence,Held
2020-03-05 Countermeasures,Abe,Expansion,Verification,Absent,Impact,Presentation,Postponed,emergency,Correspondence
2020-03-06 Countermeasures,Expansion,Inspection,Impact,Verification,Absent,Abe,Correspondence,Held,Coronavirus
2020-03-07 Countermeasures,Abe,Inspection,Expansion,Mask,Verification,Absent,Correspondence,Medical,Impact
2020-03-08 Countermeasures,Inspection,Absent,Abe,Verification,Coronavirus,Italy,Virus,Expansion,hospital
2020-03-09 Countermeasures,Expansion,Abe,Impact,Absent,Medical,Inspection,emergency,Postponed,Virus
2020-03-10 Countermeasures,Expansion,Inspection,Absent,Specialty,Abe,emergency,Correspondence,Virus,Impact
2020-03-11 Inspection,Expansion,Held,Absent,Countermeasures,Impact,Medical,plans,Postponed,consumption
2020-03-12 Inspection,Countermeasures,Expansion,Medical,Absent,Presentation,Pandemic, who,Impact,plans
2020-03-13 Countermeasures,Expansion,Inspection,Postponed,Impact,Held,Presentation,Possible,emergency,plans
2020-03-14 Inspection,Countermeasures,Abe,Absent,Declaration,President,Correspondence,Expansion,Coronavirus,Trump
2020-03-15 Inspection,Countermeasures,Expansion,Absent,Medical,hospital,Abe,Impact,Correspondence,emergency
2020-03-16 Inspection,Countermeasures,Expansion,Correspondence,Absent,Coronavirus,pneumonia,Impact,Abe,Italy
2020-03-17 Inspection,Expansion,Countermeasures,Verification,Abe,Absent,Impact, who,Performance,Held
2020-03-18 Countermeasures,Expansion,Inspection,Virus,Absent,Impact,Held,death,plans,Benefits
2020-03-19 Expansion,Countermeasures,Inspection,Impact,Absent,Economy,Verification,Held,Osaka,Correspondence
2020-03-20 Expansion,Countermeasures,Absent,Inspection,Presentation,Impact,Held,Verification,Italy,Postponed
2020-03-21 Countermeasures,Expansion,Inspection,Absent,Tokyo,Impact,Italy,hospital,Held,Presentation
2020-03-22 Countermeasures,Inspection,Expansion,Absent,Self-restraint,Economy,Verification,Impact,Italy,scale
2020-03-23 Countermeasures,Expansion,Tokyo,Absent,Postponed,Held,Economy,Presentation,Inspection,Verification
2020-03-24 Countermeasures,Postponed,Tokyo,Expansion,Inspection,Impact,Absent,Medical,Held,Economy
2020-03-25 Countermeasures,Tokyo,Economy,Inspection,Postponed,Verification,Expansion,Impact,Absent,emergency
2020-03-26 Countermeasures,Expansion,Tokyo,Self-restraint,Inspection,Notice,Verification,Absent,Impact,Postponed
2020-03-27 Expansion,Countermeasures,Inspection,Tokyo,Self-restraint,Impact,Verification,Presentation,Notice,Held
2020-03-28 Countermeasures,Verification,Expansion,Self-restraint,Tokyo,Absent,Abe,Inspection,Medical,hospital
2020-03-29 Countermeasures,Self-restraint,Absent,Tokyo,Inspection,Expansion,Verification,Economy,hospital,Abe
2020-03-30 Ken,Countermeasures,Expansion,Tokyo,pneumonia,Absent,Died,Verification,Self-restraint,Inspection
2020-03-31 Expansion,Countermeasures,Verification,Tokyo,Presentation,Self-restraint,Absent,Impact,Inspection,Please
2020-04-01 Countermeasures,Verification,Expansion,Mask,Medical,Absent,Tokyo,Presentation,Abe,Status
2020-04-02 Mask,Expansion,Countermeasures,Verification,Tokyo,Medical,Absent,Inspection,support,Impact
2020-04-03 Expansion,Countermeasures,Benefits,Verification,Tokyo,Household,Impact, nhk,Notice,Held
2020-04-04 Countermeasures,Expansion,Verification,Inspection,Medical,Mask,Tokyo,hospital,Absent, nhk
2020-04-05 Countermeasures,Tokyo,Expansion,Inspection,Medical,Absent,Verification,hospital,Self-restraint,Coronavirus
2020-04-06 emergency,Expansion,Countermeasures,Declaration,Inspection,Tokyo,Impact,Notice,Self-restraint,Verification
2020-04-07 emergency,Declaration,Expansion,Countermeasures,Tokyo,Notice,Verification,Absent,Closed,Please
2020-04-08 Expansion,Countermeasures,emergency,Declaration,Verification,Impact,Notice,Absent,Self-restraint,Postponed
2020-04-09 Countermeasures,Expansion,Verification,emergency,Inspection,Declaration,Impact,Tokyo,Absent,Medical
2020-04-10 Expansion,Countermeasures,Verification,Impact,Inspection,Medical,emergency,Absent,Tokyo,Notice
2020-04-11 Countermeasures,Expansion,Abe,Absent,Inspection,Self-restraint,Medical,Verification,emergency,hospital
2020-04-12 Countermeasures,Inspection,Verification,Abe,Expansion,Medical,Absent,Mask,hospital,Tokyo
2020-04-13 Expansion,Countermeasures,emergency,Inspection,Impact,Verification,Absent,Declaration,Closed,Presentation
2020-04-14 Expansion,Countermeasures,emergency,Medical,Impact,Verification,Correspondence,Absent,hospital,Abe
2020-04-15 Countermeasures,Expansion,hospital,Medical,Abe,Absent,Correspondence,Impact,Tokyo,Verification
2020-04-16 Countermeasures,Expansion,Benefits,emergency,Absent,Medical,Inspection,support,Impact,Mask
2020-04-17 Countermeasures,Expansion,Inspection,emergency,Medical,Impact,Correspondence,Abe,Benefits,Absent
2020-04-18 Medical,Countermeasures,Inspection,Absent,Expansion,Verification,Mask,Abe,hospital,necessary
2020-04-19 Countermeasures,Medical,Expansion,Absent,Mask,Inspection,Correspondence,the study,Impact,hospital
2020-04-20 Inspection,Countermeasures,Expansion,Medical,Impact,Verification,Absent, pcr,Turned out,Abe
2020-04-21 Countermeasures,Expansion,Impact,Absent,Medical,Benefits,Mask,Inspection,Abe,support
2020-04-22 Inspection,Expansion,Countermeasures,Medical,Absent,Verification,Impact, pcr, nhk,Self-restraint
2020-04-23 Okae,Inspection,Kumi,Medical,hospital,Expansion,Countermeasures,Absent,home,support
2020-04-24 Expansion,Inspection,Countermeasures,Medical,Impact,Verification,Absent, nhk,Tokyo,Please
2020-04-25 Inspection,Countermeasures,Medical,Expansion,Self-restraint,Economy,Verification,Absent, nhk,support
2020-04-26 Inspection,Countermeasures,Medical,Absent,Mask,Self-restraint,Expansion,Impact,Tokyo, rt
2020-04-27 Countermeasures,Expansion,Inspection,Medical,support,Absent,Self-restraint,Impact,Economy,new
2020-04-28 Expansion,Inspection,Countermeasures,Impact,Absent,support,Medical,Tokyo, news, nhk
2020-04-29 Countermeasures,Inspection,emergency,Expansion,Medical,Absent,Impact,Symptoms,Abe,Economy
2020-04-30 Inspection,Countermeasures,Expansion,Medical,Impact,Absent,Abe,Self-restraint,support,Verification

The table above shows the 10 most frequent words for each day from February to April. Note that "Coronavirus", "COVID-19", and "Infectious disease" are excluded from the output. The parts of the table that deserve particular attention are shown in red.

(1) From January to early February, words such as "pneumonia", "influenza", "Wuhan", "confirmation", and "Cruise" are frequent. This shows that users were discussing the symptoms of the new coronavirus and the situation in Wuhan and on the cruise ship, where infections had been confirmed.

(2) From February 14 onward, words related to the actions of the Japanese government, such as "countermeasures", "announcements", and "responses", appear frequently, which confirms that the topics of interest to users changed. In the graph shown in [Data Details](#Data Details), the number of tweets also increases sharply from around February 14. This suggests that users are more interested in the government's actions than in the damage caused by the new coronavirus itself.

(3) For the period around February 26, the previous article stated that the cause of the increase in the number of tweets was unknown. Here, many of the frequent words relate to events, such as "holding", "lecture", "event", and "postponement". It is therefore likely that the number of tweets and RTs increased because events that users cared about were postponed due to the new coronavirus.

Taken together, (1) to (3) suggest that users tend to be more interested in things that happen closer to them. (The topics move from Wuhan and the cruise ship, to the government response, to the postponement of events, becoming more directly related to users at each step.)

(4) From March 23 onward, "Tokyo" appears among the frequent words. This is probably due to the rapid increase in infections in Tokyo. In the graph shown in [Data Details](#Data Details), the number of tweets also increases sharply from this date, which shows that there is a great deal of interest in the situation in Tokyo.

(5) On April 6 and 7, there were many mentions of the emergency declaration covering several areas including Tokyo. Many users were probably interested because it directly affects daily life through requests to refrain from going out and the closure of various stores.

(6) On February 19, March 30, and April 23, the frequent words are affected by "Kentaro Iwata's posts about the cruise ship", "the news about Ken Shimura", and "the news about Kumiko Okae", respectively. In the graph shown in [Data Details](#Data Details), each of these dates belongs to a peak that lasts several days, which shows that many users remained interested in these topics for several days. However, interest in topics about individuals is likely to fade faster than interest in topics that affect many users.

Predict the number of RTs from the meaning of tweets

Here, we quantify the content of tweets and build a model that predicts the number of RTs (that is, how popular a tweet is), in order to verify whether the number of RTs differs depending on the content of a tweet and the words it contains.

Overview of the model to build

The content of each tweet is quantified with TF-IDF, and the number of RTs is predicted with LightGBM.

TF-IDF: Many explanatory articles are available on the web, so please refer to them for details [^ 1] [^ 2]. Qualitatively, TF-IDF quantifies the importance of each word in a document, and for each tweet it yields a vector whose number of dimensions equals the number of distinct words across all tweets. For this dataset, TF-IDF produced a matrix of 79562 tweets x 52366 dimensions.
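To make the shape of the TF-IDF matrix concrete, here is a small pure-Python sketch. It is not the implementation used in this article (which is not shown); the smoothed IDF formula and the L2 normalization are assumptions chosen for illustration, and `tfidf_matrix` and the toy tweets are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF weight of
    vocabulary[j] in docs[i]. Each doc is a list of words."""
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Document frequency: number of docs that contain each word
    df = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts; missing words count as 0
        row = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # L2 normalization
        rows.append([v / norm for v in row])
    return vocabulary, rows


# Toy tokenized tweets made up for illustration
docs = [
    ["mask", "shortage", "tokyo"],
    ["tokyo", "emergency", "declaration"],
    ["mask", "mask", "benefits"],
]
vocab, matrix = tfidf_matrix(docs)
print(len(matrix), "tweets x", len(vocab), "dimensions")
```

Each row has one entry per vocabulary word, most of them zero, which is why the real matrix has as many columns as there are distinct words in the dataset. In the first toy tweet, "shortage" receives a higher weight than "mask" because it occurs in fewer tweets.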

LightGBM: For details, please refer to the explanations on the web [^ 3] [^ 4]. LightGBM is a gradient-boosted decision tree model that can perform both regression and classification. It also makes it easy to output which dimensions contributed to the regression or classification, so combining it with TF-IDF makes it easy to see which content contributed to the number of RTs.

80% of the tweets in the dataset were used to train the model, and the remaining 20% were used as a test set to check the accuracy and the contribution of each dimension. Therefore, the problem LightGBM solves here can be interpreted as: "What content attracts users' interest in the new coronavirus regardless of when it is posted?"
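The 80/20 split can be sketched as below. The article does not say how the split was made, so the random shuffle, the fixed seed, and the helper name `train_test_split_indices` are assumptions made for illustration only.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle row indices and split them into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random split, fixed seed
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]


# 79562 is the dataset size reported in this article
train_idx, test_idx = train_test_split_indices(79562)
print(len(train_idx), "training tweets,", len(test_idx), "test tweets")
```

Because the indices are shuffled without regard to posting date, tweets from every period end up in both sets, which matches the "regardless of when it is posted" reading of the problem.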

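import lightgbm as lgb

# x_train / x_test: TF-IDF features, y_train / y_test: RT counts (80% / 20% split)
train_data = lgb.Dataset(x_train, label=y_train)
# test_data is not used during training here; it could be passed via valid_sets
test_data = lgb.Dataset(x_test, label=y_test, reference=train_data)

params = {
    'task': 'train',
    'boosting_type': 'gbdt',    # gradient-boosted decision trees
    'objective': 'regression'   # predict the RT count as a continuous value
}

# Train the model on the training set
gbm = lgb.train(
    params,
    train_data
)

# Predict the number of RTs for the held-out test tweets
preds = gbm.predict(x_test)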

Results

This section quantitatively evaluates how well the trained model predicts the number of RTs. We use the mean absolute error (MAE) and the root mean squared error (RMSE), two metrics that are often used to evaluate regression models. The details of each are left to other explanatory articles. Simply put, the smaller the value, the more accurate the prediction.

The values of these metrics alone do not tell us how good the model is, so they need to be compared with some other model. As a baseline, we use a model (BaseLine) that always predicts the RT count of every tweet to be the average RT count of the dataset (1429.36).
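The two metrics and the mean-value baseline can be sketched in a few lines of plain Python. The RT counts below are made up for illustration and are not taken from the dataset, and the helper names `mae` and `rmse` are introduced here only for this sketch.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up RT counts: most tweets are modest, one is very popular
rt_counts = [100, 150, 200, 350, 4200]

# Baseline: always predict the average RT count
mean_rt = sum(rt_counts) / len(rt_counts)
baseline_preds = [mean_rt] * len(rt_counts)

print("MAE :", mae(rt_counts, baseline_preds))
print("RMSE:", rmse(rt_counts, baseline_preds))
```

RMSE squares each error before averaging, so a single very popular tweet inflates it much more than it inflates MAE. A large gap between the two values is what one would expect for skewed RT counts.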

| Model | MAE | RMSE |
|---|---|---|
| LightGBM | 1462.95 | 4565.19 |
| BaseLine | 1429.36 | 4559.26 |

In the table above, BaseLine outperforms LightGBM on both metrics, which suggests that the number of RTs cannot be predicted accurately from tweet content alone with this TF-IDF + LightGBM setup.

Next, let's output the feature importance of each dimension, expressed as the number of times the dimension is used to split a node in LightGBM's trees.

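import pandas as pd

# x_condition is presumably the fitted TF-IDF vectorizer; get_feature_names_out()
# replaces get_feature_names(), which was removed in scikit-learn 1.2
importance = pd.DataFrame(gbm.feature_importance(),
                index=x_condition.get_feature_names_out(), columns=['importance'])
# display() is available in Jupyter / IPython environments
display(importance.sort_values("importance", ascending=False).head(100))

| Word | Feature Importance |
|---|---|
| Corona | 138 |
| Coronavirus | 111 |
| Comedian | 99 |
| Our company | 95 |
| rt | 93 |
| cov | 85 |
| beneficial | 81 |
| Ryokan | 78 |
| Uncle | 77 |
| just | 77 |
| New model | 71 |
| com | 64 |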

The higher a word appears in the table above, the more important LightGBM judged it to be for predicting the number of RTs. However, words such as "Corona", "rt", and "com" (from URLs?) appear in many tweets regardless of the number of RTs, which qualitatively confirms that LightGBM has not learned the task well. On the other hand, "comedian", for example, is probably a word extracted from tweets about Ken Shimura, so LightGBM may have captured the feature mentioned above that topics about prominent individuals produce a steep peak.

Based on the above, LightGBM could not successfully solve the problem of "what content attracts users' interest in the new coronavirus regardless of when it is posted?" Because users' interests shift over time ("Trends in Japan", "Postponement of events", ...), this suggests that, for the new coronavirus, there is no content that always attracts users' attention regardless of timing.

Summary

In this article, the following points were suggested.

- Regarding users' interests, from the frequent words in tweets and the change in the number of tweets per day:
  - For topics involving prominent individuals, the number of related tweets shows a local peak lasting several days, but the decline is fast.
  - Users are more interested in topics that are closer to them. Surprisingly, users are more concerned about the postponement of events than about the infectious disease itself or the government response.
- There is no content that reliably increases the number of RTs regardless of when it is posted.

In the future, I would like to run an analysis that uses user information, based on the hypothesis that tweets from well-known users (for example, those with many followers or official accounts) are seen by more users and therefore attract attention more easily.

[^ 1]: TF-IDF Reference (1): https://qiita.com/AwaJ/items/5937665d5a4152cc24cf
[^ 2]: TF-IDF Reference (2): https://dev.classmethod.jp/articles/yoshim_2017ad_tfidf_1-2/
[^ 3]: LightGBM Reference (1): https://www.codexa.net/lightgbm-beginner/
[^ 4]: LightGBM Reference (2): https://qiita.com/ryo_naka/items/f479e5b7cb49fb55f150
