[Python] Trying to predict whether tweets will go up in flames with machine learning

1. About this article

Content posted on Twitter occasionally draws criticism from a large number of people at once; these are the cases known as "flaming."

In this article, I describe the procedure and results of using machine learning to predict, from the characteristics of a tweet's text, whether the content about to be posted will go up in flames.

2. Verification flow

Data collection and preprocessing are performed as shown in the figure below, and the problem is framed as a text classification task using machine learning. The hypothesis is that tweets that receive many negative reactions and tweets that receive few may tend to differ in the characteristics of their text. Binary classification would work, but it seemed more interesting to be able to evaluate the degree of flaming in stages, so that is what I did.

[Figure: verification flow]

3. Verification procedure

3.1 Data preparation

3.1.1 Data collection

Using the Python code posted here, I collect tweets, and the set of replies attached to each tweet, as CSV files. This time I used the following data.

Period: 2020/5/20 – 2020/12/06
Number of tweets: approximately 160,000 tweets / 9 million replies
Target tweets:
・ Posted in Japanese
・ Have 30 or more replies as of two days after posting
・ Contain no images, videos, or URL links

The following search strings are passed as arguments.
For tweet collection: "lang:ja -filter:links exclude:retweets min_replies:30"
For reply collection: "lang:ja -filter:links filter:replies exclude:retweets"
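The collection script itself is the one linked above and is not reproduced in this article. As a rough illustration of how such a query might be run, here is a minimal sketch assuming Tweepy 3.x; the credential strings are placeholders, not values from the article.

import tweepy

# Placeholder credentials; substitute your own API keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Run the tweet-collection query shown above
query = "lang:ja -filter:links exclude:retweets min_replies:30"
for status in tweepy.Cursor(api.search, q=query, tweet_mode="extended").items(100):
    print(status.id, status.created_at, status.full_text)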

Note that, because we ran into the Twitter API's rate limits, we were not able to collect every tweet that met these conditions within the period.

3.1.2 Data loading

Load the multiple CSV files into DataFrames: tweets into tweet_df and replies into reply_df.

import pandas as pd
import glob

# Path of the directory containing the csv files
DATA_PATH = "/csv_directory"
All_Files = glob.glob('{}/tweet*'.format(DATA_PATH))

# Merge all tweet csv files in the directory
tweets = []
for file in All_Files:
    tweets.append(pd.read_csv(file, engine='python'))
tweet_df = pd.concat(tweets, sort=False)

The DataFrame has the following column structure.
・ id (unique for each tweet)
・ to_id (ID of the destination tweet; replies only)
・ created_at (posting date and time)
・ user (posting user name)
・ text (tweet content)

3.1.3 Data cleaning

・ Drop missing values

tweet_df = tweet_df.dropna()

・ Remove emojis and @usernames from the tweet text

import emoji
import re

# Emoji removal function
# (written against emoji < 1.0; in emoji >= 2.0, use emoji.EMOJI_DATA
# instead of emoji.UNICODE_EMOJI)
def remove_emoji(src_str):
    return ''.join(c for c in src_str if c not in emoji.UNICODE_EMOJI)

# Add the cleaned text back to the DataFrame
tweet_text = []
for text in tweet_df["text"]:
    text = remove_emoji(text)  # remove emojis
    text = re.sub(r'@[0-9a-zA-Z_:]*', "", text)  # remove @usernames
    tweet_text.append(text)
tweet_df["text"] = tweet_text

3.1.4 Reply sentiment analysis

There are many packages and APIs that can analyze the sentiment of text, but most of them evaluate on just two axes, positive/negative. For this task, however, a negative emotion such as sadness has little to do with whether a tweet flames, so a plain positive/negative score is a poor fit. Instead, each reply is scored on separate emotion axes, including anger and disgust.

As with the data cleaning earlier, add the sentiment analysis scores for each reply to reply_df, giving it the structure below.

[Figure: reply_df with per-emotion score columns]
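The scoring code itself is not shown in the article. As a stand-in, here is a minimal sketch built around a hypothetical emotion_scores() helper; the keyword lists are placeholders, not the analyzer actually used.

# Purely hypothetical stand-in for the real emotion analyzer
ANGRY_WORDS = ["許せない", "ふざけるな", "最低"]
DISGUST_WORDS = ["気持ち悪い", "不快", "無理"]

def emotion_scores(text):
    # Count keyword hits per emotion axis (placeholder logic)
    return {
        "angry": sum(w in text for w in ANGRY_WORDS),
        "disgust": sum(w in text for w in DISGUST_WORDS),
    }

scores = reply_df["text"].apply(emotion_scores)
reply_df["angry"] = [s["angry"] for s in scores]
reply_df["disgust"] = [s["disgust"] for s in scores]

Then aggregate the sentiment scores for each destination tweet.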

reply_groupby_df = reply_df.groupby('to_id').sum()

3.1.5 Combining Tweet and Reply Data

Join each tweet's text with its aggregated reply emotion scores.

tweet_reply_df = pd.merge(tweet_df, reply_groupby_df, left_on='id',right_on='to_id', how='inner')

3.2 Data confirmation

Check the emotion scores for each tweet.

・ Statistics
[Figure: summary statistics of the emotion scores]

・ Distribution of angry and disgust
[Figure: scatter plot of angry vs. disgust]
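Since the original figures are not reproduced here, the same statistics and distribution can be inspected directly from the merged tweet_reply_df:

import matplotlib.pyplot as plt

# Summary statistics of the aggregated emotion scores
print(tweet_reply_df[["angry", "disgust"]].describe())

# Distribution of angry vs. disgust per tweet
tweet_reply_df.plot.scatter(x="angry", y="disgust", alpha=0.3)
plt.show()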

3.3 Classification

From here, we classify tweets based on their angry and disgust values. I first thought of taking the sum of angry and disgust and splitting at the top n%, but looking at the distribution of the values above, clustering seems the better approach. This time I will try clustering with k-means.

If we do nothing, however, a cluster would likely form around the few outliers in the upper right, so we normalize the values with min-max normalization.

tweet_reply_df["angry_mmn"] = (tweet_reply_df["angry"]-tweet_reply_df["angry"].min()) / (tweet_reply_df["angry"].max()-tweet_reply_df["angry"].min())
tweet_reply_df["disgust_mmn"] = (tweet_reply_df["disgust"]-tweet_reply_df["disgust"].min()) / (tweet_reply_df["disgust"].max()-tweet_reply_df["disgust"].min())

The number of clusters is estimated with the elbow method, based on the normalized angry and disgust values.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

distortions = []

for i in range(1, 11):  # compute distortions for 1 to 10 clusters
    km = KMeans(n_clusters=i,
                init='k-means++',
                n_init=10,
                max_iter=300,
                random_state=0)
    km.fit(tweet_reply_df[['disgust_mmn', 'angry_mmn']])
    distortions.append(km.inertia_)

plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.show()

The result is as follows; around 5 or 6 looks like an appropriate number of clusters. This time we proceed with 5 clusters.

[Figure: elbow plot of distortion vs. number of clusters]

Cluster on the normalized angry and disgust values.

# Cluster with k-means based on disgust and angry
kmeans_model = KMeans(n_clusters=5, random_state=0).fit(tweet_reply_df[['disgust_mmn', 'angry_mmn']])
# Add the cluster labels to the DataFrame
tweet_reply_df["label"] = kmeans_model.labels_

Here is the clustering result.

[Figure: scatter plot of the five clusters]
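The cluster figure is not reproduced here; it can be redrawn from the labeled DataFrame like this:

import matplotlib.pyplot as plt

# Color each tweet by its cluster label
plt.scatter(tweet_reply_df["disgust_mmn"], tweet_reply_df["angry_mmn"],
            c=tweet_reply_df["label"], cmap="viridis", alpha=0.3)
plt.xlabel("disgust (min-max normalized)")
plt.ylabel("angry (min-max normalized)")
plt.show()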

・ Tweets in the cluster with the most critical replies (label = 5)

[Figure: sample tweets from the most-criticized cluster]

・ Tweets in the cluster with the fewest critical replies (label = 1)

[Figure: sample tweets from the least-criticized cluster]

There seems to be a considerable difference in the characteristics of the text. The tweets that appeared to flame heavily had the following characteristics.
・ Sweeping generalizations ("Japan is ~", "men are ~", etc.)
・ References to politics, gender, national character, and corona
・ Apologies for scandals

3.4 Vectorization of tweet text

To predict which of the above classes a tweet belongs to, we obtain a feature vector for the tweet text. This time I use BERT for vectorization, borrowing this code, which uses a BERT model pretrained on Japanese. I run it with the code's sample_df replaced by our tweet_reply_df.

from tqdm import tqdm
tqdm.pandas()  # enables progress_apply

# BertSequenceVectorizer comes from the borrowed code linked above
BSV = BertSequenceVectorizer()
tweet_reply_df['text_feature'] = tweet_reply_df['text'].progress_apply(lambda x: BSV.vectorize(x))
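For reference, here is a minimal sketch of what such a vectorizer might look like, assuming the transformers package and the cl-tohoku pretrained Japanese BERT model; the borrowed code may differ in its details.

import torch
from transformers import BertJapaneseTokenizer, BertModel

class BertSequenceVectorizer:
    def __init__(self, model_name='cl-tohoku/bert-base-japanese-whole-word-masking', max_len=128):
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.tokenizer = BertJapaneseTokenizer.from_pretrained(model_name)
        self.model = BertModel.from_pretrained(model_name).to(self.device).eval()
        self.max_len = max_len

    def vectorize(self, sentence):
        inputs = self.tokenizer(sentence, max_length=self.max_len,
                                truncation=True, return_tensors='pt').to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Use the [CLS] token embedding as the sentence vector
        return outputs.last_hidden_state[0][0].cpu().numpy()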

3.5 Machine learning modeling

Split the data into training and test sets 9:1.
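The split below also carries a weight_list, which the article never defines; presumably it holds per-sample weights to offset class imbalance. A minimal sketch, assuming inverse class-frequency weights:

# Hypothetical reconstruction of weight_list: weight each sample by the
# inverse frequency of its cluster label, so rare (heavily flamed)
# clusters count more during training
label_counts = tweet_reply_df["label"].value_counts()
weight_list = (1.0 / tweet_reply_df["label"].map(label_counts)).tolist()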

import numpy as np
from sklearn.model_selection import train_test_split

# np.stack turns the column of per-tweet vectors into a 2D feature matrix
feature_train, feature_test, label_train, label_test, weight_train, weight_test = train_test_split(
    np.stack(tweet_reply_df["text_feature"].values), tweet_reply_df["label"], weight_list,
    test_size=0.1, shuffle=True)

I tried several algorithms and settled on XGBoost, which was the most accurate.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

# Model definition
xgb_model = xgb.XGBClassifier()
# Hyperparameter search ('learning_rate' is the sklearn-API name for eta)
xgb_model_cv = GridSearchCV(xgb_model,
                            {'learning_rate': [0.01, 0.1, 0.3],
                             'max_depth': [5, 6, 7, 8, 9],
                             'gamma': [0.01, 0.1, 1.0, 10.0]},
                            verbose=1, n_jobs=-1)
# Training
xgb_model_cv.fit(feature_train, label_train, sample_weight=weight_train)
print(xgb_model_cv.best_params_, xgb_model_cv.best_score_)

3.6 Model evaluation

Let's look at the classification accuracy.

from sklearn.metrics import confusion_matrix, classification_report

pred = xgb_model_cv.predict(feature_test)
print(confusion_matrix(label_test, pred))
print(classification_report(label_test, pred))

[Figure: classification report]

The overall accuracy is 41%. I would have liked a higher recall for labels 4 and 5, the classes that appear to flame, but with only 35 samples of label 5 the data is too sparse to expect much.

In addition, since the classes here are ordered stages, I would like to give the model some credit when its prediction lands one class above or below the correct answer. Let's look at the accuracy when the immediately neighboring classes also count as correct.

result = sum(abs(p - t) <= 1 for p, t in zip(pred.tolist(), label_test)) / len(label_test)
print("Accuracy: {:.2f}".format(result))

・ Accuracy allowing one class up or down (overall)

Accuracy: 0.84

・ Accuracy allowing one class up or down (only data whose correct answer is label = 4 or 5)

Accuracy: 0.59

It seems the model can capture the tendency to flame to some extent, but it would be nice to have more samples of tweets that actually flared up. Twitter's search supports :( as a negative-sentiment operator, so I wondered whether adding it to the search string would let me collect flaming tweets efficiently, but it did not seem to work.

Afterword

This time I focused on flaming tweets and tried to predict them, but I want to raise one point: I do not think that flaming is necessarily bad. Still, being able to predict to some extent how people will react to a tweet before posting it seems worthwhile.

Also, this time I simply applied sentiment analysis to the reply text; the context between a tweet and its replies is not taken into account. As a result, there are cases where a tweet with many replies that agree with it while using negative words was judged to have a high chance of flaming.

That's all.
