[PYTHON] "The guy who predicts the number of views from the title of Jaru Jaru's video"

Introduction

This time, I will talk about building a model that predicts the number of views of a Jaru Jaru video from its title. I am a complete amateur at NLP, so I worked by imitating other people's articles.

What is Jaru Jaru?

Jaru Jaru is **a comedy duo** made up of Junpei Goto and Shusuke Fukutoku, who belong to Yoshimoto Kogyo's Tokyo headquarters. They currently post a new sketch video every day on the official Jaru Jaru YouTube channel as part of the JARU JARU TOWER project.

Background

Checking the sketches posted on YouTube every day takes a lot of time. Also, Jaru Jaru videos with nonsensical titles seem to get more views (a very subjective impression), for example "The guy who was made a bad customer by a bad clerk" and "[The dictator's egg guy](https://www.youtube.com/watch?v=RPXFYBRJVMw)". When the title contains words like "dangerous" or "crazy", the view count generally seems to be high. Conversely, sketches like "The guy who hears the story of Chara Bancho" tend to get fewer views, and it has become **an annual event for everyone to give videos with this kind of title a low rating without question**.

Development environment

Data collection

This time, I use the YouTube Data API to collect video titles and view counts as pairs. I referred to the articles "Get videos of a specific channel using YouTube Data API v3 from Python [^ 1]" and "Get the number of video views using YouTube Data API v3 from Python [^ 2]". An API key is required to use the YouTube API, so I obtained one by following "How to get a YouTube API key [^ 3]". First, the code below collects each video's title and video ID (the ID is needed to fetch the view count).

jarujaru_scraping1.py


import os
import time
import requests

import pandas as pd


API_KEY = os.environ['API_KEY']  # read the API key from an environment variable
CHANNEL_ID = 'UChwgNUWPM-ksOP3BbfQHS5Q'


base_url = 'https://www.googleapis.com/youtube/v3'
url = base_url + '/search?key=%s&channelId=%s&part=snippet,id&maxResults=50&order=date'
infos = []

while True:
    time.sleep(30)
    response = requests.get(url % (API_KEY, CHANNEL_ID))
    if response.status_code != 200:
        print('Terminated with an error')
        print(response)
        break
    result = response.json()
    infos.extend([
        [item['id']['videoId'], item['snippet']['title'], item['snippet']['description'], item['snippet']['publishedAt']]
        for item in result['items'] if item['id']['kind'] == 'youtube#video'
    ])

    if 'nextPageToken' in result.keys():
        if 'pageToken' in url:
            url = url.split('&pageToken')[0]
        url += f'&pageToken={result["nextPageToken"]}'
    else:
        print('Successful completion')
        break

videos = pd.DataFrame(infos, columns=['videoId', 'title', 'description', 'publishedAt'])
videos.to_csv('data/video1.csv', index=False)

After collecting the video titles and IDs, use the code below to collect the number of views.

jarujaru_scraping2.py


import os
import time
import requests

import pandas as pd


API_KEY = os.environ['API_KEY']
videos = pd.read_csv('data/video1.csv')  # the file saved by the previous script
base_url = 'https://www.googleapis.com/youtube/v3'
stat_url = base_url + '/videos?key=%s&id=%s&part=statistics'

len_block = 50
video_ids_per_block = []
video_ids = videos.videoId.values

count = 0
end_flag = False
while not end_flag:
    start = count * len_block
    end = (count + 1) * len_block
    if end >= len(video_ids):
        end = len(video_ids)
        end_flag = True

    video_ids_per_block.append(','.join(video_ids[start:end]))

    count += 1

stats = []
for block in video_ids_per_block:
    time.sleep(30)
    response = requests.get(stat_url % (API_KEY, block))
    if response.status_code != 200:
        print('Terminated with an error')
        break
    result = response.json()
    stats.extend([item['statistics'] for item in result['items']])

pd.DataFrame(stats).to_csv('data/stats.csv', index=False)
videos = pd.read_csv('data/video1.csv')
stats_df = pd.read_csv('data/stats.csv')
# Merge by row order: the statistics were fetched in the same order as the videos
pd.merge(videos, stats_df, left_index=True, right_index=True).to_csv('data/jarujaru_data.csv')

This produces a CSV that pairs each video's metadata with its view count and other statistics.

View-count prediction model

Labeling

This time, I divide the number of views into three tiers and treat this as a classification problem. The histogram of view counts looks as follows. Based on that graph, I labeled the classes with overwhelming subjectivity: fewer than 100,000 views, 100,000 to fewer than 250,000, and 250,000 or more.
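As a quick sanity check on these class boundaries, the bucket counts can be computed with `numpy.histogram`; a minimal sketch using made-up view counts (not the real data):

```python
import numpy as np

# Hypothetical view counts standing in for df['viewCount']
views = np.array([40_000, 80_000, 120_000, 150_000, 260_000, 400_000])

# Count videos falling into the same three buckets used for labeling
counts, _ = np.histogram(views, bins=[0, 100_000, 250_000, 10_000_000])
print(dict(zip(['<100k', '100k-<250k', '>=250k'], counts)))
```

With real data, these counts show whether the three classes are reasonably balanced before training.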

The code below assigns these labels and extracts only the sketch name from each video title. Jaru Jaru's sketch videos always enclose the sketch name in 『 』 brackets.

jarujaru_scraping3.py


import re
import pandas as pd
info = []
df = pd.read_csv('data/jarujaru_data.csv')
for row, item in df.iterrows():
    if '『' in item['title']:
        # Prepend a dummy character so the sketch name is always at index 1 after the split
        title = 'x' + item['title']
        title = re.split('[『』]', title)[1]
        if item['viewCount'] >= 250000:
            label = 2
        elif item['viewCount'] >= 100000:
            label = 1
        else:
            label = 0
        info.append([title, item['viewCount'], item['likeCount'], item['dislikeCount'], item['commentCount'], label])
        
pd.DataFrame(info, columns=['title', 'viewCount', 'likeCount', 'dislikeCount', 'commentCount', 'label']).to_csv('data/jarujaru_norm.csv')
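The same threshold labeling can also be written compactly with `pandas.cut`; a sketch with hypothetical view counts (not the real data):

```python
import pandas as pd

# Hypothetical view counts for illustration
views = pd.Series([50_000, 120_000, 300_000, 90_000, 250_000])

# Same three classes as above: <100k -> 0, 100k-<250k -> 1, >=250k -> 2
# right=False makes each bin closed on the left: [0, 100k), [100k, 250k), [250k, inf)
labels = pd.cut(views, bins=[0, 100_000, 250_000, float('inf')],
                labels=[0, 1, 2], right=False)
print(labels.tolist())  # [0, 1, 2, 0, 2]
```

This keeps the thresholds in one place, which is handy if the class boundaries are tweaked later.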

Morphological analysis and vectorization

Referring to this article [^ 4], I morphologically analyze the sketch titles and convert each title into a feature vector (bag-of-words format). Part of the code is shown below; the full implementation is posted on GitHub [^ 5].

jarujaru.py


import analysis  # my own module; posted on GitHub
import pandas as pd
from gensim import corpora
from gensim import matutils

def vec2dense(vec, num_terms):
    return list(matutils.corpus2dense([vec], num_terms=num_terms).T[0])

df = pd.read_csv('data/jarujaru_norm.csv')
words = analysis.get_words(df['title'])  # morphologically analyzed (tokenized) titles

# Build a dictionary
dictionary = corpora.Dictionary(words)
dictionary.filter_extremes(no_below=2, keep_tokens=['Chara','Man','Bancho'])
dictionary.save('data/jarujaru.dict')
corpus = [dictionary.doc2bow(word) for word in words]

# Convert to bag-of-words format
data_all = [vec2dense(dictionary.doc2bow(words[i]), len(dictionary)) for i in range(len(words))]
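To make the bag-of-words format concrete, here is what the `Dictionary` / `doc2bow` / dense-vector pipeline computes, sketched in plain Python with hypothetical English tokens (the real code tokenizes Japanese titles with a morphological analyzer):

```python
from collections import Counter

# Hypothetical tokenized titles (stand-ins for the analyzer's output)
docs = [['dangerous', 'customer'], ['dangerous', 'dictator', 'egg']]

# Vocabulary: token -> integer id (what gensim's Dictionary builds)
vocab = {tok: i for i, tok in enumerate(sorted({t for doc in docs for t in doc}))}

def to_dense(doc, vocab):
    """Dense bag-of-words vector: count of each vocabulary token in doc."""
    counts = Counter(doc)
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

data_all = [to_dense(doc, vocab) for doc in docs]
print(data_all)  # [[1, 1, 0, 0], [0, 1, 1, 1]]
```

Each title becomes a fixed-length vector of token counts, which is what the SVM consumes in the next step.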

Model training

This time, I adopted an SVM as the model because the amount of data is small. The data is split into training and test sets and fed to the model.

jarujaru.py


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

#Training / test data settings
train_data = data_all
X_train, X_test, y_train, y_test = train_test_split(train_data, df['label'], test_size=0.2, random_state=1)

#Data standardization
sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)


#Train the model
clf = SVC(C=1, kernel='rbf')
clf.fit(X_train_std, y_train)

import pickle
with open('data/model.pickle', mode='wb') as fp:
    pickle.dump(clf, fp)

Let's evaluate the model.

jarujaru.py


score = clf.score(X_test_std, y_test)
print("{:.3g}".format(score))
predicted = clf.predict(X_test_std)

The accuracy was 53%. Since random guessing over three classes would give 33%, the model has learned something (though it is still poor). Let's also look at the confusion matrix: it turns out to be a rough model that predicts 100,000 or more views for most videos.
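The confusion matrix itself can be computed with scikit-learn; a sketch with toy labels (not the actual predictions, which require the trained model):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: rows = true class, columns = predicted class
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 1, 2, 2, 2]

# In the real evaluation this would be confusion_matrix(y_test, predicted)
print(confusion_matrix(y_true, y_pred, labels=[0, 1, 2]))
```

A model that over-predicts the high-view classes shows up here as mass concentrated in the rightmost columns.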

In closing

This time I built a model that predicts the number of views of a Jaru Jaru video from its title. As an NLP amateur I knew little about vectorizing sentences, but I managed to build a model from start to finish. The full implementation is posted on GitHub [^ 5]. Next time, I will use this model to develop a "LINE bot that notifies you whether a newly posted Jaru Jaru video is worth watching". I also want to study methods for vectorizing sentences and models that handle time-series data (such as LSTMs).

[^ 1]: Get videos of a specific channel using YouTube Data API v3 from Python
[^ 2]: Get the number of video views using YouTube Data API v3 from Python
[^ 3]: How to get a YouTube API key
[^ 4]: Predict the classification of news articles by machine learning
[^ 5]: Source code for this article
