[PYTHON] Try to estimate the number of likes on Twitter

Introduction

This started when I was learning scraping for a horse racing AI project and was looking for a practice task. I wanted something that would take me through the full pipeline of data acquisition → preprocessing → training → inference, so I decided on the task of ** estimating the number of likes from the tweet content of my favorite member, Oba Hana **. This time the aim is simply to get a basic end-to-end flow running; I will work on accuracy later.

Data collection

Data collection uses the Twitter API. For the registration procedure, I referred to the article below: Tips for passing Twitter API registration in 30 minutes (with Japanese translation). Writing out the reasons for your application is a pain, so be prepared. This step cost me a whole day. (Japanese applications are inconvenient ...)

After registering you can fetch tweets, so next comes the acquisition program. The Twitter API normally returns at most 200 tweets per request, but you can work around this by following the article below, so I wrote code to fetch all tweets of @hana_oba. Since most of the program is copied straight from the reference article, only the part I wrote myself is listed here. Get a lot of tweets with TwitterAPI. Consider server-side errors (in python)

get_twitter.py


    import pandas as pd
    import category_encoders as ce

    #Get tweets by specifying the user (screen_name)
    getter = TweetsGetter.byUser('hana_oba')
    df = pd.DataFrame(columns = ['week_day','have_photo','have_video','tweet_time','text_len','favorite_count','retweet_count'])

    for tweet in getter.collect(total = 10000):
        #created_at looks like "Wed Oct 10 20:19:24 +0000 2018"
        week_day = tweet['created_at'].split()[0]
        tweet_time = tweet['created_at'].split()[3][:2]

        #Presence or absence of a photo / video in the tweet
        photo = 0
        video = 0
        if 'media' in tweet['entities']:
            if 'photo' in tweet['entities']['media'][0]['expanded_url']:
                photo = 1
            else:
                video = 1

        df = df.append(pd.Series([week_day, photo, video, int(tweet_time), len(tweet['text']), tweet['favorite_count'], tweet['retweet_count']], index=df.columns), ignore_index=True)

    #One-hot encode the day of the week. Create and fit the encoder once, after
    #all rows are collected (fitting it inside the loop is wasteful).
    #Specify the columns to encode in the list; more than one is allowed.
    #Also specify how to handle Null or unknown values.
    list_cols = ['week_day']
    ce_ohe = ce.OneHotEncoder(cols=list_cols, handle_unknown='impute')
    df_session_ce_onehot = ce_ohe.fit_transform(df)

    df_session_ce_onehot.to_csv('oba_hana_data.csv', index=False)

I haven't done anything particularly clever here, but it is hard to handle the data well without understanding the Twitter API's specifications, so get the data you want through searching and trial and error. This dataset contains only:

--Day of the week
--Presence or absence of an image
--Presence or absence of a video
--Tweet time
--Number of characters in the tweet

The day of the week is split into seven columns by one-hot encoding. We will train with only this information for now.
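For reference, here is a minimal sketch of how the day-of-week and hour features come out of the API's created_at string. The sample value below is made up, but it follows the standard Twitter API v1.1 timestamp format:

```python
# A made-up sample in the Twitter API v1.1 created_at format
created_at = "Wed Oct 10 20:19:24 +0000 2018"

week_day = created_at.split()[0]        # day-of-week token, e.g. "Wed"
tweet_time = created_at.split()[3][:2]  # hour of the timestamp, e.g. "20"

print(week_day, int(tweet_time))
```

The hour is kept as the first two characters of the "HH:MM:SS" token and cast to int before it is stored in the dataframe.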

Preprocessing, learning

Up to this point I had been running the program on my own PC, but with future development in mind the machine specs are not sufficient, so from here on I will work in Google Colaboratory.

First, read the data output by the previous program.

import pandas as pd

datapath = '/content/drive/My Drive/data_science/'
df = pd.read_csv(datapath + 'oba_hana_data.csv')

The total number of rows is

df.shape

There were 2992 cases.

(2992, 13)

I will use 70% of the data for training and the remaining 30% for validation. The data are in chronological order, so simply taking the first 70% would bake in conditions that change over time, such as the follower count at the time of each tweet; instead I take 70% at random. It is a bit hand-wavy, but this time 2400 rows, which is about 70% of the total, become the training data and the rest the validation data.

df_train = df.sample(n=2400)
df_test = df.drop(df_train.index)

x_train = df_train.iloc[:,:11]
t_train = df_train['favorite_count']

x_test = df_test.iloc[:,:11]
t_test = df_test['favorite_count']
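The same random split can also be done with scikit-learn's train_test_split, which additionally makes it reproducible via random_state. A sketch using a dummy frame standing in for the real 2992-row dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Dummy frame standing in for the real 2992-row dataset
df = pd.DataFrame({'text_len': np.arange(2992),
                   'favorite_count': np.arange(2992)})

# Reproducible random 2400 / 592 split
df_train, df_test = train_test_split(df, train_size=2400, random_state=0)
print(len(df_train), len(df_test))
```

Note that df.sample(n=2400) without a random_state gives a different split on every run, which makes score comparisons between experiments harder.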

Let's first train a random forest on this data.

#Model declaration
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=2000, max_depth=10,
                                min_samples_leaf=4, max_features=0.2, random_state=0)

#Model learning
model.fit(x_train, t_train)

Let's see the score

#Model validation
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.5032870524389081
0.3102920436689621

The score here is the coefficient of determination (R²), since this is a regression problem. It is at most 1 (and can even go negative for a very poor fit); the closer it is to 1, the higher the accuracy.
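The definition behind that score can be written out in a few lines (this is the standard R² formula, not code from the article):

```python
import numpy as np

def r2(t, y):
    # Coefficient of determination: 1 - SS_res / SS_tot
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((t - y) ** 2)          # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

t = [100, 200, 300, 400]
print(r2(t, t))                     # perfect prediction -> 1.0
print(r2(t, [250, 250, 250, 250]))  # predicting the mean -> 0.0
```

A model that always predicts the mean of the targets scores 0, so the 0.31 on the validation set above is only modestly better than that baseline.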

Looking at this score, the accuracy is not good. Let's do some preprocessing to improve it. First, look at the contribution of each feature.

#Plot the features in descending order of contribution (importance)
import numpy as np
import matplotlib.pyplot as plt

feat_names = x_train.columns.values
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

plt.figure(figsize=(10, 10))
plt.title('Feature importances')
plt.barh(range(len(indices)), importances[indices])
plt.yticks(range(len(indices)), feat_names[indices], rotation='horizontal')
plt.show()

(Figure: feature importances — feature.png)

Think for a moment about what should matter when making an estimate. have_photo has by far the highest contribution: whether a tweet includes a photo matters a lot. Videos don't seem important, but that is probably because videos make up only around 3% of the tweets. The day of the week appears to have almost no effect at all, so those columns can be removed from the data.
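Later the day-of-week columns are dropped by position with iloc[:, 7:11], which works but is fragile if the column order ever changes. A sketch of dropping them by name instead (the week_day_1 ... week_day_7 names below are an assumption based on the encoder's naming convention):

```python
import pandas as pd

# Dummy frame with the assumed one-hot column names (week_day_1 ... week_day_7)
cols = [f'week_day_{i}' for i in range(1, 8)] + \
       ['have_photo', 'have_video', 'tweet_time', 'text_len', 'favorite_count']
df = pd.DataFrame([[0] * len(cols)], columns=cols)

# Drop every day-of-week column by name prefix instead of by position
x = df.drop(columns=[c for c in df.columns if c.startswith('week_day')])
print(list(x.columns))
```

Selecting by name keeps the code readable and robust to column reordering.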

We will also look at outliers.

#Visualization with graph
plt.figure(figsize=(8, 6))
plt.scatter(range(x_train.shape[0]), np.sort(t_train.values))
plt.xlabel('index', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.show()

(Figure: sorted favorite counts of the training data — glaph.png)

You can see that some of the training data clearly stand apart from the rest. I will remove these rows so that training is not dragged around by them.

#This is the reverse of the order explained above, but it has to be done in
#this order so the split is taken from the data with outliers removed
#Outlier removal
df_train = df_train[df_train['favorite_count'] < 4500]
df_train.shape

#Drop the day-of-week columns (keep columns 7 through 10)
x_train = df_train.iloc[:,7:11]
t_train = df_train['favorite_count']
x_test = df_test.iloc[:,7:11]
t_test = df_test['favorite_count']

Now let's learn again and see the score.

#Model learning (re-train on the cleaned data)
model.fit(x_train, t_train)

#Model validation
print(model.score(x_train, t_train))
print(model.score(x_test, t_test))
0.5175871090277164
0.34112337762190204

It's better than before. Finally, let's compare the distributions of the actual and estimated numbers of likes in a histogram.

(Figure: histogram of estimated vs. actual like counts — compare.png)

Blue is the estimated count and orange is the actual count of likes. There is certainly a gap, but I can't say much more without looking at the individual predictions.
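The comparison figure can be reproduced with a sketch like this. The original plotting code is not shown in the article, and the data below are random stand-ins for the real predictions and targets:

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt

# Random stand-ins for the 592 validation targets and model predictions
rng = np.random.default_rng(0)
t_test = rng.integers(0, 2000, size=592)         # "actual" like counts
y_pred = t_test + rng.normal(0, 300, size=592)   # "estimated" like counts

plt.figure(figsize=(8, 6))
plt.hist(y_pred, bins=50, alpha=0.5, label='estimated')  # blue (default color 1)
plt.hist(t_test, bins=50, alpha=0.5, label='actual')     # orange (default color 2)
plt.xlabel('favorite_count')
plt.ylabel('frequency')
plt.legend()
plt.savefig('compare.png')
```

With the real model, y_pred would come from model.predict(x_test) instead of random noise.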

Improving the accuracy is a topic for next time.

References

--How to register for the Twitter API

Tips for passing Twitter API registration in 30 minutes (with Japanese translation)

--Data acquisition using the Twitter API

Get a lot of tweets with TwitterAPI. Consider server-side errors (in python)
