[Python] Trying to predict whether tweets will go up in flames with machine learning

1. About this article

Content posted on Twitter occasionally draws criticism from a large number of people at once; these are the cases known as "flaming."

In this article, I describe the procedure and results of using machine learning to predict, from the characteristics of a tweet's text, whether the content about to be posted will go up in flames.

2. Verification flow

Data collection and preprocessing are performed as shown in the figure below, and the problem is framed as a text classification task using machine learning. The hypothesis is that tweets that receive many negative reactions and tweets that receive few may tend to differ in the characteristics of their text. Binary classification would work, but it seemed more interesting to be able to evaluate the degree of flaming in stages, so that is what I did.

[Figure: verification flow]

3. Verification procedure

3.1 Data preparation

3.1.1 Data collection

Using the Python code posted here, I collect tweets, and the set of replies attached to each tweet, as CSV files. This time I used the following data.

Period: 2020/5/20 – 2020/12/06
Number of tweets: approximately 160,000 tweets / 9 million replies
Target tweets:
・ Posted in Japanese
・ Have 30 or more replies as of two days after posting
・ Contain no images, videos, or URL links

The following search strings are passed as arguments.
For tweet collection: "lang:ja -filter:links exclude:retweets min_replies:30"
For reply collection: "lang:ja -filter:links filter:replies exclude:retweets"
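The collection script itself is the one linked above and is not reproduced in this article. As a rough illustration of how such a query might be run, here is a minimal sketch assuming Tweepy 3.x; the credential strings are placeholders, not values from the article.

import tweepy

# Placeholder credentials; substitute your own API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Run the tweet-collection query shown above
query = "lang:ja -filter:links exclude:retweets min_replies:30"
for status in tweepy.Cursor(api.search, q=query, tweet_mode="extended").items(100):
    print(status.id, status.created_at, status.full_text)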

Note that, because we ran into the Twitter API's rate limits, we were not able to collect every tweet that met these conditions within the period.

3.1.2 Data loading

Load the multiple CSV files into DataFrames: tweets into tweet_df and replies into reply_df.

import pandas as pd
import glob

# Path of the directory containing the csv files
DATA_PATH = "/csv_directory"
All_Files = glob.glob('{}/tweet*'.format(DATA_PATH))

# Merge all tweet csv files in the directory
tweets = []
for file in All_Files:
    tweets.append(pd.read_csv(file, engine='python'))
tweet_df = pd.concat(tweets, sort=False)

The DataFrame has the following column structure.
・ id (unique for each tweet)
・ to_id (ID of the destination tweet; replies only)
・ created_at (posting date and time)
・ user (posting user name)
・ text (tweet content)

3.1.3 Data cleaning

・ Drop missing values

tweet_df = tweet_df.dropna()

・ Remove emojis and @usernames from the tweet text

import emoji
import re

# Emoji removal function
# (written against emoji < 1.0; in emoji >= 2.0, use emoji.EMOJI_DATA
# instead of emoji.UNICODE_EMOJI)
def remove_emoji(src_str):
    return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)

# Add the cleaned text back to the DataFrame
tweet_text = []
for text in tweet_df["text"]:
    text = remove_emoji(text)  # remove emojis
    text = re.sub(r'@[0-9a-zA-Z_:]*', "", text)  # remove @usernames
    tweet_text.append(text)
tweet_df["text"] = tweet_text

3.1.4 Reply sentiment analysis

There are many packages and APIs that can analyze the sentiment of text, but most of them evaluate on just two axes, positive/negative. For this task, however, a negative emotion such as sadness has little to do with whether a tweet flames, so a plain positive/negative score is a poor fit. Instead, each reply is scored on separate emotion axes, including anger and disgust.

As with the data cleaning earlier, add the sentiment analysis scores for each reply to reply_df, giving it the structure below.

[Figure: reply_df with per-emotion score columns]
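The scoring code itself is not shown in the article. As a stand-in, here is a minimal sketch built around a hypothetical emotion_scores() helper; the keyword lists are placeholders, not the analyzer actually used.

# Purely hypothetical stand-in for the real emotion analyzer
ANGRY_WORDS = ["許せない", "ふざけるな", "最低"]
DISGUST_WORDS = ["気持ち悪い", "不快", "無理"]

def emotion_scores(text):
    # Count keyword hits per emotion axis (placeholder logic)
    return {
        "angry": sum(w in text for w in ANGRY_WORDS),
        "disgust": sum(w in text for w in DISGUST_WORDS),
    }

scores = reply_df["text"].apply(emotion_scores)
reply_df["angry"] = [s["angry"] for s in scores]
reply_df["disgust"] = [s["disgust"] for s in scores]

Then aggregate the sentiment scores for each destination tweet.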

reply_groupby_df = reply_df.groupby('to_id').sum()

3.1.5 Combining Tweet and Reply Data

Join each tweet's text with its aggregated reply emotion scores.

tweet_reply_df = pd.merge(tweet_df, reply_groupby_df, left_on='id',right_on='to_id', how='inner')

3.2 Data confirmation

Check the emotion scores for each tweet.

・ Statistics
[Figure: summary statistics of the emotion scores]

・ Distribution of angry and disgust
[Figure: scatter plot of angry vs. disgust]
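Since the original figures are not reproduced here, the same statistics and distribution can be inspected directly from the merged tweet_reply_df:

import matplotlib.pyplot as plt

# Summary statistics of the aggregated emotion scores
print(tweet_reply_df[["angry", "disgust"]].describe())

# Distribution of angry vs. disgust per tweet
tweet_reply_df.plot.scatter(x="angry", y="disgust", alpha=0.3)
plt.show()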

3.3 Classification

From here, we classify tweets based on their angry and disgust values. I first thought of taking the sum of angry and disgust and splitting at the top n%, but looking at the distribution of the values above, clustering seems the better approach. This time I will try clustering with k-means.

If we do nothing, however, a cluster would likely form around the few outliers in the upper right, so we normalize the values with min-max normalization.

tweet_reply_df["angry_mmn"] = (tweet_reply_df["angry"]-tweet_reply_df["angry"].min()) / (tweet_reply_df["angry"].max()-tweet_reply_df["angry"].min())
tweet_reply_df["disgust_mmn"] = (tweet_reply_df["disgust"]-tweet_reply_df["disgust"].min()) / (tweet_reply_df["disgust"].max()-tweet_reply_df["disgust"].min())

The number of clusters is estimated with the elbow method, based on the normalized angry and disgust values.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []

for i in range(1, 11):  # compute distortions for 1 to 10 clusters
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(tweet_reply_df[['disgust_mmn', 'angry_mmn']])
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

The result is as follows; around 5 or 6 looks like an appropriate number of clusters. This time we proceed with 5 clusters.

[Figure: elbow plot of distortion vs. number of clusters]

Cluster on the normalized angry and disgust values.

# Cluster with k-means based on disgust and angry
kmeans_model = KMeans(n_clusters=5, random_state=0).fit(tweet_reply_df[['disgust_mmn', 'angry_mmn']])
# Add the cluster labels to the DataFrame
tweet_reply_df["label"] = kmeans_model.labels_

Here is the clustering result.

[Figure: scatter plot of the five clusters]
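The cluster figure is not reproduced here; it can be redrawn from the labeled DataFrame like this:

import matplotlib.pyplot as plt

# Color each tweet by its cluster label
plt.scatter(tweet_reply_df["disgust_mmn"], tweet_reply_df["angry_mmn"],
            c=tweet_reply_df["label"], cmap="viridis", alpha=0.3)
plt.xlabel("disgust (min-max normalized)")
plt.ylabel("angry (min-max normalized)")
plt.show()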

・ Tweets in the cluster with the most critical replies (label = 5)

[Figure: sample tweets from the most-criticized cluster]

・ Tweets in the cluster with the fewest critical replies (label = 1)

[Figure: sample tweets from the least-criticized cluster]

There seems to be a considerable difference in the characteristics of the text. The tweets that appeared to flame heavily had the following characteristics.
・ Sweeping generalizations ("Japan is ~", "men are ~", etc.)
・ References to politics, gender, national character, and corona
・ Apologies for scandals

3.4 Vectorization of tweet text

To predict which of the above classes a tweet belongs to, we obtain a feature vector for the tweet text. This time I use BERT for vectorization, borrowing this code, which uses a BERT model pretrained on Japanese. I run it with the code's sample_df replaced by our tweet_reply_df.

from tqdm import tqdm
tqdm.pandas()  # enables progress_apply

# BertSequenceVectorizer comes from the borrowed code linked above
BSV = BertSequenceVectorizer()
tweet_reply_df['text_feature'] = tweet_reply_df['text'].progress_apply(lambda x: BSV.vectorize(x))
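For reference, here is a minimal sketch of what such a vectorizer might look like, assuming the transformers package and the cl-tohoku pretrained Japanese BERT model; the borrowed code may differ in its details.

import torch
from transformers import BertJapaneseTokenizer, BertModel

class BertSequenceVectorizer:
    def __init__(self, model_name='cl-tohoku/bert-base-japanese-whole-word-masking', max_len=128):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name).to(self.device).eval()
        self.max_len = max_len

    def vectorize(self, sentence):
        inputs = self.tokenizer(sentence, max_length=self.max_len,
                                truncation=True, return_tensors='pt').to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Use the [CLS] token embedding as the sentence vector
        return outputs.last_hidden_state[0][0].cpu().numpy()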

3.5 Machine learning modeling

Split the data into training and test sets 9:1.
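The split below also carries a weight_list, which the article never defines; presumably it holds per-sample weights to offset class imbalance. A minimal sketch, assuming inverse class-frequency weights:

# Hypothetical reconstruction of weight_list: weight each sample by the
# inverse frequency of its cluster label, so rare (heavily flamed)
# clusters count more during training
label_counts = tweet_reply_df["label"].value_counts()
weight_list = (1.0 / tweet_reply_df["label"].map(label_counts)).tolist()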

import numpy as np
from sklearn.model_selection import train_test_split

# np.stack turns the column of per-tweet vectors into a 2D feature matrix
feature_train, feature_test, label_train, label_test, weight_train, weight_test = train_test_split(
    np.stack(tweet_reply_df["text_feature"].values), tweet_reply_df["label"], weight_list,
    test_size=0.1, shuffle=True)

I tried several algorithms and settled on XGBoost, which was the most accurate.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Model definition
xgb_model = xgb.XGBClassifier()
# Hyperparameter search ('learning_rate' is the sklearn-API name for eta)
xgb_model_cv = GridSearchCV(xgb_model,
                            {'learning_rate': [0.01, 0.1, 0.3],
                             'max_depth': [5, 6, 7, 8, 9],
                             'gamma': [0.01, 0.1, 1.0, 10.0]},
                            verbose=1, n_jobs=-1)
# Training
xgb_model_cv.fit(feature_train, label_train, sample_weight=weight_train)
print(xgb_model_cv.best_params_, xgb_model_cv.best_score_)

3.6 Model evaluation

Let's look at the classification accuracy.

from sklearn.metrics import confusion_matrix, classification_report

pred = xgb_model_cv.predict(feature_test)
print(confusion_matrix(label_test, pred))
print(classification_report(label_test, pred))

[Figure: classification report]

The overall accuracy is 41%. I would have liked a higher recall for labels 4 and 5, the classes that appear to flame, but with only 35 samples of label 5 the data is too sparse to expect much.

In addition, since the classes here are ordered stages, I would like to give the model some credit when its prediction lands one class above or below the correct answer. Let's look at the accuracy when the immediately neighboring classes also count as correct.

result = sum(abs(p - t) <= 1 for p, t in zip(pred.tolist(), label_test)) / len(label_test)
print("Accuracy: {:.2f}".format(result))

・ Accuracy allowing one class up or down (overall)

Accuracy: 0.84

・ Accuracy allowing one class up or down (only data whose correct answer is label = 4 or 5)

Accuracy: 0.59

It seems the model can capture the tendency to flame to some extent, but it would be nice to have more samples of tweets that actually flared up. Twitter's search supports :( as a negative-sentiment operator, so I wondered whether adding it to the search string would let me collect flaming tweets efficiently, but it did not seem to work.

Afterword

This time I focused on flaming tweets and tried to predict them, but I want to raise one point: I do not think that flaming is necessarily bad. Still, being able to predict to some extent how people will react to a tweet before posting it seems worthwhile.

Also, this time I simply applied sentiment analysis to the reply text; the context between a tweet and its replies is not taken into account. As a result, there are cases where a tweet with many replies that agree with it while using negative words was judged to have a high chance of flaming.

That's all.
