[PYTHON] Trying to improve the accuracy of Twitter like-count estimation

Continued ...

This is a continuation of the previous article, Estimate the number of likes on Twitter, in which the number of likes was estimated from the content of a tweet.

This time we aim to improve the accuracy.

Review the dataset

The dataset has too few features, so let's collect more relevant ones.

In particular, the following two things seemed relevant to whether or not a tweet takes off:

- whether the tweet is a reply
- whether it is a quote retweet

Let's add these two.

get_twitter.py


    # TweetsGetter is defined in the previous article
    import pandas as pd
    import category_encoders as ce

    # Get tweets by specifying a user (screen_name)
    getter = TweetsGetter.byUser('hana_oba')

    columns = ['week_day', 'have_photo', 'have_video', 'tweet_time', 'text_len',
               'favorite_count', 'retweet_count', 'quoted_status', 'reply',
               'year_2018', 'year_2019', 'year_2020']
    rows = []

    for tweet in getter.collect(total=10000):
        # created_at looks like 'Mon Sep 24 03:35:21 +0000 2018'
        created_at = tweet['created_at'].split()
        week_day = created_at[0]
        tweet_time = created_at[3][:2]
        year = created_at[5]

        # Media attached to the tweet: photo or video
        photo = 0
        video = 0
        if 'media' in tweet['entities']:
            if 'photo' in tweet['entities']['media'][0]['expanded_url']:
                photo = 1
            else:
                video = 1

        # Whether the tweet is a quote retweet
        quoted_status = 1 if 'quoted_status_id' in tweet else 0
        # Whether the tweet is a reply
        reply = 0 if tweet['in_reply_to_user_id_str'] is None else 1

        # One-hot flags for the year
        year_2018 = 1 if year == '2018' else 0
        year_2019 = 1 if year == '2019' else 0
        year_2020 = 1 if year == '2020' else 0

        rows.append([week_day, photo, video, int(tweet_time), len(tweet['text']),
                     tweet['favorite_count'], tweet['retweet_count'],
                     quoted_status, reply, year_2018, year_2019, year_2020])

    df = pd.DataFrame(rows, columns=columns)

    # One-hot encode the day of the week once, after collecting all rows
    # (list_cols can name more than one column; handle_unknown controls
    # how nulls and unseen values are filled)
    list_cols = ['week_day']
    ce_ohe = ce.OneHotEncoder(cols=list_cols, handle_unknown='impute')
    df_session_ce_onehot = ce_ohe.fit_transform(df)

    df_session_ce_onehot.to_csv('oba_hana_data.csv', index=False)

Let's score the model with this data.

IhaveOBAHANAfullyunderstood.ipynb


    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    datapath = '/content/drive/My Drive/data_science/'
    df = pd.read_csv(datapath + 'oba_hana_data.csv')

    # 70/30 train/test split
    train_count = int(df.shape[0] * 0.7)
    df_train = df.sample(n=train_count, random_state=0)  # fixed seed for a reproducible split
    df_test = df.drop(df_train.index)

    # Outlier removal
    df_train = df_train[df_train['favorite_count'] < 4500]
    df_train.shape

    features = ['have_photo', 'have_video', 'tweet_time', 'text_len',
                'quoted_status', 'reply', 'year_2018', 'year_2019', 'year_2020']

    x_train = df_train.loc[:, features]
    t_train = df_train['favorite_count']
    x_test = df_test.loc[:, features]
    t_test = df_test['favorite_count']

    # Model declaration
    model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                                  min_samples_leaf=4, max_features=0.2, random_state=0)

    # Model training
    model.fit(x_train, t_train)
    # Model validation
    print(model.score(x_train, t_train))
    print(model.score(x_test, t_test))

0.7189988420451674
0.6471214647821018

(Figure: ダウンロード (5).png)

Accuracy has improved dramatically! Whether a tweet is a quote retweet does not contribute very much, but it is not as irrelevant as the day of the week.
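The figure presumably shows the model's feature importances. As a minimal sketch (an assumption, not code from the article), they can be plotted from the fitted random forest like this:

    # Sketch (assumption): visualize the fitted random forest's feature importances
    import matplotlib.pyplot as plt
    import pandas as pd

    importances = pd.Series(model.feature_importances_, index=x_train.columns)
    importances.sort_values().plot.barh(figsize=(8, 5))
    plt.xlabel('importance')
    plt.show()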

Another thing that interested me is the tweet time. It is currently treated as a plain number, so it seems better to split it into several time bands.
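For reference, here is a minimal sketch of what such banding could look like; the band boundaries are arbitrary assumptions, not values from the article:

    # Sketch (assumption): bin the hour into four coarse bands with pd.cut
    import pandas as pd

    time_band = pd.cut(df_train['tweet_time'], bins=[0, 6, 12, 18, 24],
                       labels=['night', 'morning', 'afternoon', 'evening'],
                       right=False)

Before deciding, let's look at the mean number of likes for each hour: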

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Mean number of likes for each hour of the day (0-23)
    rows = [[i, df_train[df_train['tweet_time'] == i].favorite_count.mean()]
            for i in range(24)]
    time_mean = pd.DataFrame(rows, columns=['time', 'favorite_mean'])

    sns.set_style('darkgrid')
    sns.catplot(x="time", y="favorite_mean", data=time_mean,
                height=6, kind="bar", palette="muted")
    plt.show()

(Figure: ダウンロード (6).png — mean likes by hour of day)

Eikolab members have a rule that SNS use is only allowed until 24:00 (running over a little is tolerated), so some hours sit at 0; apart from that, you can see that the mean is high around 24:00 Japan time (probably happy-birthday tweets). Above I said it would be better to split the time into bands, but target encoding is probably a better choice than banding, because it seems hard to draw sensible band boundaries by hand. Target encoding replaces each hour with a smoothed version of the mean like count observed for that hour in the training data.

    !pip install category_encoders

    from category_encoders.target_encoder import TargetEncoder

    # Treat the hour as a category rather than a number
    df_train["tweet_time"] = df_train["tweet_time"].astype(str)
    df_test["tweet_time"] = df_test["tweet_time"].astype(str)  # the test column needs the same conversion

    TE = TargetEncoder(smoothing=0.1)

    df_train["target_enc_tweet_time"] = TE.fit_transform(df_train["tweet_time"], df_train["favorite_count"])
    df_test["target_enc_tweet_time"] = TE.transform(df_test["tweet_time"])

Train using target_enc_tweet_time instead of tweet_time and check the score.
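This retraining step is not shown in the article; here is a minimal sketch, assuming the same RandomForestRegressor settings as above and a hypothetical feature list features_te:

    # Sketch (assumption): same random forest, with the target-encoded hour
    from sklearn.ensemble import RandomForestRegressor

    features_te = ['have_photo', 'have_video', 'target_enc_tweet_time', 'text_len',
                   'quoted_status', 'reply', 'year_2018', 'year_2019', 'year_2020']

    x_train = df_train.loc[:, features_te]
    x_test = df_test.loc[:, features_te]

    model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                                  min_samples_leaf=4, max_features=0.2, random_state=0)
    model.fit(x_train, t_train)
    print(model.score(x_train, t_train))
    print(model.score(x_test, t_test))

The scores come out as: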

0.6999237089367164
0.6574824327192588

The training score went down, but the validation score went up. Incidentally, when both tweet_time and target_enc_tweet_time are adopted, it looks like this:

0.7210047209796951
0.6457969793382683

The training score is the best of the three, but the validation score is not. It is hard to rank these options, so let's keep every possibility open and move on.

Model change

So far we have used a random forest without touching any of its settings. Next, I would like to switch the model to XGBoost and use optuna to find the best hyperparameters.

Install optuna

    !pip install optuna

Next, we wrap XGBoost in an objective function so that optuna can drive it. The ranges below are rough first guesses for each hyperparameter; they will be fine-tuned over many repetitions.

    # Import the XGBoost library
    import xgboost as xgb
    import optuna

    def objective(trial):
        # Candidate hyperparameter ranges.
        # Note: min_child_samples, num_leaves and subsample_for_bin are
        # LightGBM-style names; XGBoost does not use them, so only
        # max_depth, learning_rate, min_child_weight and n_estimators
        # actually influence this model.
        min_child_samples = trial.suggest_int('min_child_samples', 60, 75)
        max_depth = trial.suggest_int('max_depth', 4, 10)  # must be non-negative for XGBoost
        learning_rate = trial.suggest_uniform('learning_rate', 0.075, 0.076)
        min_child_weight = trial.suggest_uniform('min_child_weight', 0.1, 0.8)
        num_leaves = trial.suggest_int('num_leaves', 2, 3)
        n_estimators = trial.suggest_int('n_estimators', 100, 180)
        subsample_for_bin = trial.suggest_int('subsample_for_bin', 450000, 600000)

        model = xgb.XGBRegressor(max_depth=max_depth,
                                 min_child_samples=min_child_samples,
                                 min_child_weight=min_child_weight,
                                 num_leaves=num_leaves,
                                 subsample_for_bin=subsample_for_bin,
                                 learning_rate=learning_rate,
                                 n_estimators=n_estimators)

        # Training
        model.fit(x_train, t_train)

        # optuna minimizes the returned value, so return 1 - R^2 on the test data
        return 1 - model.score(x_test, t_test)

First, let's run it for 100 trials.

    # Specify the number of trials
    study = optuna.create_study()
    study.optimize(objective, n_trials=100)

    print('Hyperparameters:', study.best_params)
    print('accuracy:', 1 - study.best_value)

Let's look at the scores for each feature set:

① Only tweet_time is adopted

0.690093409305073
0.663908038217022

② Only target_enc_tweet_time is adopted

0.6966901697205284
0.667797061960107

③ Adopt both tweet_time and target_enc_tweet_time

0.6972461315076879
0.6669948080176482

Although the margin is slight, ② seems to be the most accurate.

We said we should try every possibility, but if we have to narrow it down to one from here, which is best? ② and ③ are almost identical: ② has the better validation score, ③ the better training score. When fine-tuning with optuna from here to improve accuracy, which view is correct:

- the higher the training score, the more room there is left to grow, or
- the higher the validation score, the less overfitting, and therefore the better the accuracy will get?

Perhaps either is defensible, but I would appreciate it if anyone who knows which is the right path in general could tell me.

This time we will proceed with ②.

Starting from the wide hyperparameter ranges, increase the number of trials to 1000. Then narrow each range around the best hyperparameters found and run another 1000 trials, as sketched below.
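That narrowing step is not spelled out in the article; here is a minimal sketch of the idea, assuming a hypothetical objective_narrow and ±10% windows around the best values found so far:

    # Sketch (assumption): re-search in narrow windows around the best parameters
    best = study.best_params

    def objective_narrow(trial):
        learning_rate = trial.suggest_uniform(
            'learning_rate', best['learning_rate'] * 0.9, best['learning_rate'] * 1.1)
        min_child_weight = trial.suggest_uniform(
            'min_child_weight', best['min_child_weight'] * 0.9, best['min_child_weight'] * 1.1)
        n_estimators = trial.suggest_int(
            'n_estimators', best['n_estimators'] - 10, best['n_estimators'] + 10)

        model = xgb.XGBRegressor(learning_rate=learning_rate,
                                 min_child_weight=min_child_weight,
                                 n_estimators=n_estimators)
        model.fit(x_train, t_train)
        return 1 - model.score(x_test, t_test)

    study = optuna.create_study()
    study.optimize(objective_narrow, n_trials=1000)

The result obtained this way is as follows: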

0.6962221939011508
0.6685252235753019

This is the best validation result so far. I have tried every accuracy-improvement method I know, so I will stop here for now.

Finally, let's compare histograms of the inferred and the actual values.
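A minimal sketch of how such a comparison could be drawn, assuming the final model is refit with the best parameters (best_model is a hypothetical name):

    # Sketch (assumption): refit with the best parameters and compare distributions
    import matplotlib.pyplot as plt
    import xgboost as xgb

    best_model = xgb.XGBRegressor(**study.best_params)
    best_model.fit(x_train, t_train)
    y_pred = best_model.predict(x_test)

    plt.hist(t_test, bins=50, alpha=0.5, label='actual likes')
    plt.hist(y_pred, bins=50, alpha=0.5, label='inferred likes')
    plt.legend()
    plt.show()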

(Figure: ダウンロード (7).png — histograms of inferred and actual like counts)

It would be nice to know where the actual-value peak around 1600 comes from. I also cannot tell what changed relative to the first, low-accuracy model, so perhaps I chose the wrong kind of graph to plot ...

It is hard to see, but I also plotted orange: actual likes, green: inferred likes, blue: error.

(Figure: ダウンロード (8).png — actual likes, inferred likes, and error)

The model basically predicts low: Oba Hana's tweets earn more likes than the AI predicted ... (though this model hardly deserves to be called AI). Fin
