[Python] I tried using deep learning to analyze the characteristics of YouTube thumbnails that attract plays

Hello. This is Kushima, a first-year employee at NTT DoCoMo. In this 16th-day Advent calendar article, we explain in detail how to use deep learning to analyze the characteristics of YouTube thumbnails that attract plays. The programming language used is Python.

Introduction

This article consists of the following two parts.

- **How to use the YouTube Data API in Python**
- **Classifying video thumbnails with deep learning**

"How to use YouTube Data API in Python" describes from the preparation required to acquire the information of YouTube video to the explanation of the code to actually acquire. By referring to this article,

**you can actually retrieve the view counts and thumbnail images of YouTube videos.**

"Classifying video thumbnails based on deep learning" provides a brief description of the Convolutional Neural Network (CNN), a type of deep learning, to a description of the code that classifies video thumbnails based on it. By referring to this article,

**you can actually build a CNN classification model and apply it to video thumbnails.**

We hope this is helpful both for those who want to analyze YouTube video data using the YouTube Data API and for those who simply want to try image classification with deep learning.

Finally, based on the classification results, we considered what kinds of images tend to be viewed often.

Thumbnails that attract plays on YouTube had the following features.

**Features of YouTube thumbnails that attract plays**

- **High color saturation**
- **Many colors**
- **Lots of on-screen caption (telop) text**
- **Faces of people or characters are shown**

How to use the YouTube Data API in Python

What is the YouTube Data API?

The YouTube Data API is an API for retrieving information about videos posted on YouTube.

Official YouTube Data API documentation: https://developers.google.com/youtube/v3/getting-started?hl=ja

Below are examples of video information that can be obtained with the YouTube Data API.

- **Title**
- **Channel name**
- **View count**
- **Like count**
- **Thumbnail URL**

In this article, we use the view counts and thumbnails.

Preparing to use the YouTube Data API

The following preparations are required to use the YouTube Data API.

- **Create a Google account**
- **Create a new project**
- **Enable the API and service**
- **Get an API key**

Each step is explained in detail below.

Create a new project

Go to the following URL and click "Create Project" to create a new project with any name: http://console.developers.google.com/project

Once the project is created, you can access the project management screen from the notifications.

Enable the API and service

Once you are on the project management screen, click "Go to API Overview" and then "Enable APIs and Services" to open the API Library. In the API Library you can search for the API you want to enable. Search for "YouTube Data API", select "YouTube Data API v3" in the search results, and click "Enable" to finish enabling the YouTube Data API.

Get an API key

There is a "Credentials" tab on the left side of both the project management screen and the YouTube Data API management screen. Click it (selecting the YouTube Data API first if you came from the project management screen), then click "Create Credentials" and "API key" to create an API key.

This completes the preparation for using the YouTube Data API.

Preparation on the Python side

To use the YouTube Data API from Python, install the client library in advance with the following pip command.

python


pip install google-api-python-client

Get YouTube video information

You can retrieve YouTube video information by executing the following code.

python


# the modern import path; 'apiclient' is a legacy alias for 'googleapiclient'
from googleapiclient.discovery import build

YOUTUBE_API_KEY = '{Obtained API key}'

youtube = build('youtube', 'v3', developerKey=YOUTUBE_API_KEY)

search_response = youtube.search().list(
    part='snippet',
    # search query
    q='Game commentary',
    # sort by view count (most viewed first)
    order='viewCount',
    type='video',
).execute()

You can see the details of the acquired information by inspecting the contents of search_response as shown below.

python


search_response['items'][0]

Examples of elements you can check:

- videoId: video ID
- channelId: channel ID
- title: video title
- description: video description
- thumbnails: video thumbnail (URL information)
- channelTitle: channel name
- publishTime: posting date
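
As a minimal sketch, individual fields can be pulled out of search_response like this (field names follow the search result schema listed above):

python


# extract fields from the first search result
item = search_response['items'][0]
video_id = item['id']['videoId']
title = item['snippet']['title']
channel = item['snippet']['channelTitle']
# 'high' is one of the available thumbnail sizes (default / medium / high)
thumbnail_url = item['snippet']['thumbnails']['high']['url']
print(video_id, title, channel, thumbnail_url)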

You can also check a video's view count and like count by running the following code with the videoId above.

python


statistics = youtube.videos().list(
    # statistics part
    part='statistics',
    id='{videoId of the video}'
).execute()['items'][0]['statistics']

Examples of elements you can check:

- viewCount: view count
- likeCount: like count
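
For example, the two values can be read directly from the returned dictionary (note that the API returns them as strings):

python


# read the view count and like count from the statistics dict
print(statistics['viewCount'], statistics['likeCount'])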

If you want to check the available information and the usage of the API in more detail, see the official documentation. Official YouTube Data API documentation: https://developers.google.com/youtube/v3/getting-started?hl=ja

If you want to store information for multiple videos in a data frame and analyze it, you can proceed as follows.

First, specify the search conditions. The code below specifies several parameters: the search query "Game commentary", ordering by view count, 50 results per page, and an upload period from 2020/07/01 to 2020/12/01.

python


# build the search request (kept so it can also be passed to list_next later)
search_response = youtube.search().list(
    part='snippet',
    # search query
    q='Game commentary',
    # sort by view count (most viewed first)
    order='viewCount',
    type='video',
    # 50 results per page
    maxResults=50,
    # upload date on or after 2020/07/01
    publishedAfter='2020-07-01T00:00:00Z',
    # upload date before 2020/12/01
    publishedBefore='2020-12-01T00:00:00Z'
)
# execute the request to get the first page of results
output = search_response.execute()

Next, use a for loop to accumulate the search results in a list. Note that the YouTube Data API has usage limits on the free tier. There is a "Quotas" tab on the left side of the YouTube Data API management screen; it shows a limit of 10,000 queries/day. We have not verified exactly how much quota a single run of this code consumes, but be aware that free usage is limited.

python


# number of result pages to fetch
num = 20
# list to store video information
video_list = []
for i in range(num):
    video_list = video_list + output['items']
    # request for the next page (None when there are no more pages)
    search_response = youtube.search().list_next(search_response, output)
    if search_response is None:
        break
    output = search_response.execute()

Finally, convert the list created above into a data frame. The code below also filters videos by view count using a variable called HighViewCount.

python


import pandas as pd

# function to get a video's statistics
def get_statistics(id):
    statistics = youtube.videos().list(part='statistics', id=id).execute()['items'][0]['statistics']
    return statistics

# view-count threshold used for filtering
HighViewCount = 100000
df = pd.DataFrame(video_list)
# extract the video IDs
df1 = pd.DataFrame(list(df['id']))['videoId']
# extract basic video information
df2 = pd.DataFrame(list(df['snippet']))[['channelTitle', 'publishedAt', 'channelId', 'title', 'description']]
# extract the URL of the high-resolution thumbnail
df3 = pd.DataFrame(list(pd.DataFrame(list(pd.DataFrame(list(df['snippet']))['thumbnails']))['high']))['url']
ddf = pd.concat([df1, df2, df3], axis=1)
# fetch statistics for every video (note: this consumes API quota per video)
df_static = pd.DataFrame(list(ddf['videoId'].apply(lambda x: get_statistics(x))))
df_output = pd.concat([ddf, df_static], axis=1)
df_output['viewCount'] = df_output['viewCount'].astype(int)
# keep only videos with at least HighViewCount views
df_highview = df_output[df_output['viewCount'] >= HighViewCount]

Get thumbnails of YouTube videos

Use the data frame obtained in the previous section to download the thumbnail images themselves. Below is an example. **Please read the notes that follow the code.**

python


import time

import requests

df_highview = df_highview.drop_duplicates()
df_highview = df_highview.reset_index(drop=True)
df_loop = df_highview
for i in range(len(df_loop)):
    # request the image at the thumbnail URL
    response = requests.get(df_loop.loc[i, 'url'])
    image = response.content
    filename = './image_' + str(i) + '.jpg'
    with open(filename, "wb") as f:
        f.write(image)
    # wait between requests so as not to burden the server
    time.sleep(1)

This code extracts the thumbnail URLs from the information acquired in the previous section and downloads the images. **However, when writing code that accesses image URLs like this, take care not to burden the server. The code above is only an example; take appropriate measures such as spacing out requests (the time.sleep(1) above is one such measure).** If you want to learn more about this point, see the following articles.

- [For beginners] Download images by specifying a URL in Python
- Let's scrape images with Python

This completes the first goal of this article: "**actually retrieve the view counts and thumbnail images of YouTube videos**".

Classifying video thumbnails with deep learning

What is a convolutional neural network (CNN)?

A convolutional neural network (CNN) is a deep learning model that introduces convolution operations into a neural network. Its structure is well suited to image recognition and classification, and it is widely used in that field. In this article, we apply a CNN with a standard structure to video thumbnails to solve a classification problem.

Setting up the classification problem

In this article, we consider the problem of classifying thumbnails into high-view and low-view groups using the view counts and thumbnails. Specifically, among the videos retrieved with the search query "Game commentary", we treat those with at least 100,000 views as positive examples and those with at most 10,000 views as negative examples, and build a CNN model that classifies their thumbnail images. A sketch of this split is shown below.
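
For reference, here is a minimal sketch of how such a split could be produced from the df_output data frame built earlier. The LowViewCount variable and the assumption that low-view videos were collected with the same procedure are ours, not from the original code.

python


# hypothetical thresholds separating positive and negative examples
HighViewCount = 100000  # positive examples: at least 100,000 views
LowViewCount = 10000    # negative examples: at most 10,000 views

df_positive = df_output[df_output['viewCount'] >= HighViewCount]
df_negative = df_output[df_output['viewCount'] <= LowViewCount]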

Loading video thumbnails

Load the image data from the folders with the following code. This article uses 749 positive-example images and 748 negative-example images.

python


import glob
import PIL
import keras
from keras.preprocessing import image

# image size to resize to
input_shape = (256, 256, 3)
# number of classes
num_classes = 2
# image data
x = []
# labels (1: positive example, 0: negative example)
y = []
# image file names
z = []

# note: the pattern uses '*' (not '?') so that multi-digit indices also match
image_list_positive = glob.glob('{positive example image directory}/image_*.jpg')

for f in image_list_positive:
    x.append(image.img_to_array(image.load_img(f, target_size=input_shape[:2])))
    y.append(1)
    z.append(f)

image_list_negative = glob.glob('{negative example image directory}/image_*.jpg')

for f in image_list_negative:
    x.append(image.img_to_array(image.load_img(f, target_size=input_shape[:2])))
    y.append(0)
    z.append(f)

Image preprocessing

Apply preprocessing to the images with the following code.

python


import numpy as np
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

x = np.asarray(x)
# scale pixel values to [0, 1]
x /= 255
y = np.asarray(y)
# convert labels to categorical (one-hot) variables
y = to_categorical(y, num_classes)
# split the image dataset into training and test sets
x_train, x_test, y_train, y_test, z_train, z_test = train_test_split(x, y, z, test_size=0.33, random_state=3)
# split the training set into a portion used for training and a portion for validation
x_train_train, x_train_val, y_train_train, y_train_val, z_train_train, z_train_val = train_test_split(x_train, y_train, z_train, test_size=0.1, random_state=3)

The sizes of the datasets after splitting are shown below.

python


len(x_train), len(x_test)
# => (1002, 495)
len(x_train_train), len(x_train_val)
# => (901, 101)

Model building

Build the model with the following code.

python


from keras.models import Sequential
from keras.layers import Conv2D, MaxPool2D
from keras.layers import Dense, Activation, Dropout, Flatten
from keras.optimizers import Adam

model = Sequential()
# two convolutional layers followed by max pooling
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPool2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
# flatten, then classify with fully connected layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
# softmax output over the two classes
model.add(Dense(num_classes, activation='softmax'))
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=adam,
              metrics=['accuracy'])
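
To check the resulting layer structure and parameter counts, you can print a model summary:

python


# print the layer structure and parameter counts
model.summary()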

Verification of classification accuracy

Train the model on the training data.

python


# batch size
batch_size = 100
# number of epochs
epochs = 100
history = model.fit(x_train_train, y_train_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_train_val, y_train_val))
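
As an optional check, the returned history object can be used to plot learning curves. This is a minimal sketch assuming matplotlib is installed; note that older Keras versions use the keys 'acc' / 'val_acc' instead of 'accuracy' / 'val_accuracy'.

python


import matplotlib.pyplot as plt

# plot training and validation accuracy for each epoch
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='validation')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()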

Classify the test data using the trained model.

python


predictions = model.predict(x_test)

Check the classification accuracy by comparing the predictions with the true labels; one way to compute the metrics is sketched below.
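
The following is a minimal sketch of computing these metrics with scikit-learn (the exact code used in this article is not shown):

python


import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# convert one-hot labels and softmax outputs to class indices
y_true = np.argmax(y_test, axis=1)
y_pred = np.argmax(predictions, axis=1)

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F-measure:', f1_score(y_true, y_pred))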

- Accuracy

Accuracy: 0.76

- Confusion matrix

True Positive: 178, True Negative: 196, False Positive: 78, False Negative: 43


- Recall, precision, F-measure

Recall: 0.81, Precision: 0.70, F-measure: 0.75

Since this article is only a first attempt at applying a CNN, higher accuracy can be expected by tuning the model structure, batch size, and number of epochs. Also, since this is deep learning, we would ideally increase the number of training images, but the free-tier usage limits of the YouTube Data API prevented us from collecting as many images as we had hoped. Keep this in mind if you use the YouTube Data API.

This completes the second goal of this article: "**actually build a CNN classification model and apply it to video thumbnails**".

Qualitative evaluation

Finally, based on the classification results for the test images, let us consider what kinds of images tend to be viewed often.

The features shared by the correctly classified positive examples and the misclassified negative examples (that is, by thumbnails the model judged likely to attract plays) are listed below.

**Features of YouTube thumbnails that attract plays**

- **High color saturation**
- **Many colors**
- **Lots of on-screen caption (telop) text**
- **Faces of people or characters are shown**

This is ultimately a subjective evaluation, but we believe a slight tendency can be seen. Clearer features might emerge by visualizing the feature maps and increasing the number of training images; a sketch of feature-map visualization follows.
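
As a minimal sketch of feature-map visualization (one possible approach; model.layers[0] is the first Conv2D layer of the model built above):

python


from keras.models import Model
import matplotlib.pyplot as plt

# sub-model that outputs the activations of the first convolutional layer
feature_model = Model(inputs=model.input, outputs=model.layers[0].output)
# feature maps for a single test image; shape: (1, 254, 254, 32)
feature_maps = feature_model.predict(x_test[:1])

# display the first 8 channels
fig, axes = plt.subplots(1, 8, figsize=(16, 2))
for i, ax in enumerate(axes):
    ax.imshow(feature_maps[0, :, :, i], cmap='viridis')
    ax.axis('off')
plt.show()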

Summary

In this article, we introduced in detail how to use deep learning to analyze the characteristics of YouTube thumbnails that attract plays. Specifically, we described how to achieve the following two goals.

- **Actually retrieve the view counts and thumbnail images of YouTube videos**
- **Actually build a CNN classification model and apply it to video thumbnails**

We hope this article has been of some help to you.

The YouTube Data API can provide much more information beyond what we used in this article. In the future, we would like to use that other information for data analysis and model building. Also, since this article was only a first attempt at building a CNN, we would like to try searching for a more suitable model structure and adopting more recent methods.

Reference articles

- Get Youtube data in Python using Youtube Data API
- Try using YouTube Data API
- [For beginners] Download images by specifying a URL in Python
- Let's scrape images with Python
- Create a machine learning model for image classification (1) CNN from scratch
- Image classification with Keras: from preprocessing to classification test
