An introduction to data analysis using Python: increasing the number of video views

Introduction

Purpose of this article

When I first thought, "data science looks fun, I want to get started!", I found plenty of information on "methods" (packages such as pandas, models such as SVM), but very few introductions built around familiar, relatable examples. ~~Classifying irises just doesn't excite me.~~

Therefore, the purpose of this article is to let you experience a complete data analysis workflow, using my hobby project, an analysis of how to increase the view count of a VOCALOID producer's first posted work, as the running example. I hope it gives your own analyses a bit of inspiration and motivation.

Structure of this article

We will proceed with the analysis in the following order. I am not following any specific reference, but the structure is modeled on the flow of an empirical econometrics paper, with programming techniques woven in. [^ 1]

[^ 1]: This analysis follows the classical framework of "form a hypothesis first, then test it with data." I think this process differs from the data mining approach of "discovering useful knowledge in messy data."

  1. Theme setting
  2. Hypothesis setting
  3. Obtaining data using API
  4. Data analysis
  5. Examining the validity of the analysis

Technologies covered in this article

The main purpose of this article is to guide you through the process of data analysis, but if you are interested in the following topics used for analysis, please stop by!

- How to call Nico Nico Douga's "Snapshot Search API v2" from Python
- Grouping under arbitrary conditions with pandas groupby
- The Mann-Whitney U test (comparing the medians of two groups)
- How to rule out "spurious correlation"

Background knowledge helpful for this article

- Basic knowledge of libraries such as numpy, pandas, and matplotlib
- Introductory knowledge of statistics

1. Theme setting

Suppose you have just completed your first work as a new VOCALOID producer. When you post it to a video site, you want to earn as many views as possible. The quality of the song itself can no longer be changed, so you decide to work on the title and the uploader's comment instead.

2. Hypothesis setting

The first post of a VOCALOID song is almost always tagged "VOCALOID virgin work". On the other hand, few people put "first post" in the title of their work (for example, "[Hatsune Miku Original] ~Title~ [First Post]").

If you add "first post" to the title, what impression will viewers have when they see it? Two reactions seem possible: "Oh, a newcomer; let's see what they've got" and "A newcomer with no track record? I'll pass." If the former reaction dominates, adding the phrase to the title should catch viewers' eyes and increase the view count.

Therefore, in this article I would like to test the hypothesis that adding "first post" to the title of a work tagged "VOCALOID virgin work" increases its view count.

3. Obtaining data using API

The data required for this analysis is:

- works with the "VOCALOID virgin work" tag
- their "view count", "title", and "posting date and time" (the last is used in the second half)

We will handle four years of data (works posted from 2013 through 2016).

Let's fetch the video metadata from Nico Nico Douga's "Snapshot Search API v2". Usage is documented in the official guide: http://site.nicovideo.jp/search-api-docs/snapshot.html

The points are as follows.

- Defining the search conditions as a dictionary and encoding them with the urllib library keeps the code readable (see url_query in the sample code below).
- Send GET requests with the requests library, convert the response to JSON, and work with that.
- Since at most 100 works can be fetched per request, combine offset with a loop. [^ 2]
- An error may occur when the offset reaches 1601 or higher (perhaps a server-side load limit?). Keep the offset below 1601 by splitting the posting-year filter into single years (2013, ..., 2016) instead of querying 2013 through 2016 at once.

[^ 2]: How to use offset: for example, if you sort by view count and set offset = 30, you get the works from the 30th rank onward.


import urllib
import requests
import time
import pandas as pd


class NiconicoApi():

    def __init__(self, keyword):
        self.keyword = keyword

    def gen_url(self, year, offset):
        # Build the request URL: encode the search conditions,
        # defined as a dictionary, with urllib
        url_body = 'http://api.search.nicovideo.jp/api/v2/video/contents/search?'
        url_query = urllib.parse.urlencode( {
          'q': self.keyword,
          'filters[startTime][gte]': '%i-01-01T00:00:00' % year,
          'filters[startTime][lt]':  '%i-01-01T00:00:00' % (year + 1),
          '_offset': offset,
          'targets': 'tags',
          'fields': 'title,viewCounter,startTime',
          '_sort': '-viewCounter',
          '_limit': 100,
          '_context': 'apiguide'
        } )
        self.url_ = url_body + url_query
        return self

    def get_json(self):
        # Send a GET request and parse the response as JSON
        response = requests.get(self.url_)
        self.json_ = response.json()
        return self.json_


'''Data acquisition'''
data = []
nicoApi = NiconicoApi('VOCALOID virgin work')

for year in range(2013, 2017):
    offset = 0
    nicoApi.gen_url(year=year, offset=offset)
    json = nicoApi.get_json()

    while json['data']:
        data += json['data']
        offset += 100
        nicoApi.gen_url(year=year, offset=offset)
        json = nicoApi.get_json()
        time.sleep(1)


'''Conversion to DataFrame'''
df = pd.DataFrame(data)  # a list of dicts converts directly
df.shape  # => (4579, 3)

The sample size is 4579 [^ 3]. Since this is four years of data, that works out to more than 1,000 new VOCALOID producers born every year.

[^ 3]: As of 06/03/2017
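Incidentally, if you want to see the per-year breakdown behind that figure, one quick sketch (assuming startTime is still the raw ISO timestamp string returned by the API at this point) is:

pd.to_datetime(df['startTime']).dt.year.value_counts().sort_index()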

4. Data analysis

Visualization of ranking and number of views

Before we get into full-scale analysis, let's get a quick overview of the data.

First, plot the view count on the vertical axis against the ranking on the horizontal axis. Since we can expect a large gap between popular works and obscure ones, we take the vertical axis on a log scale.


import numpy as np
import matplotlib.pyplot as plt


df = df.sort_values('viewCounter', ascending=False)
ranking = np.arange(len(df.index))
view_counter = df['viewCounter'].values
plt.scatter(ranking, view_counter)
plt.yscale('log')
plt.grid()
plt.title('2013~2016 Ranking & ViewCounter')
plt.xlabel('Ranking')
plt.ylabel('ViewCounter')
plt.show()

result: pic1.png

- It seems that about three quarters of the works have between 100 and 1,000 views.
- Look how skewed the distribution is even with the vertical axis on a log scale; the disparity is brutal...
- Having confirmed the skewness of the distribution and the presence of outliers, we will keep them in mind for the analysis that follows.
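One quick way to check that eyeballed "three quarters" share (a one-liner I am adding here, not part of the original analysis):

share = ((df['viewCounter'] >= 100) & (df['viewCounter'] < 1000)).mean()
print(share)  # fraction of works with 100-999 views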

Grouping the DataFrame by whether the title contains "first post"

For the analysis, we need to separate the works whose titles contain "first post" from those that do not. We use pandas' groupby method.

groupby is usually given a column name as its argument, but you can also pass a function. Passing a function makes it easy to group under arbitrary conditions.

This time we want to group by whether the title contains "first post", so let's define a function for that condition. We will also treat "virgin work", which means the same thing as "first post", as part of the analysis.

As shown below, define a function whose return value differs depending on whether its argument contains 'First post', 'Virgin work', or neither.


def include_keyword(title):
    if 'First post' in title:
        return 'First post'
    elif 'Virgin work' in title:
        return 'Virgin work'
    else:
        return 'Control group'

Pass the defined function as the argument to groupby. The function is then applied to each element of df1's index, and rows are grouped according to the return value. It works much like passing a predicate to filter.


# Set the title column as the index
df1 = df[['viewCounter', 'title']].set_index('title')

# Group by the return value of include_keyword
df1_grouped = df1.groupby(include_keyword)

Computing descriptive statistics

Now that the grouping is complete, let's move on to a detailed analysis.

If we compute statistics such as the mean and median view count for each group and compare the works whose titles contain a keyword (hereafter, the treatment group) against the works without one (the control group), we can measure the effect of including the keyword (the treatment effect).

For each group, then, let's compute the count, mean, median, and standard deviation.

To compute multiple descriptive statistics at once, agg is convenient: pass an array of method names to agg, and all of them are applied to df1_grouped in one call.


functions = ['count', 'mean', 'median', 'std']
df1_summary = df1_grouped.agg(functions)

result: pic2.png

- Of the 4,579 samples, 109 works have 'First post' in the title and 144 have 'Virgin work'.
- Both the mean and the median show more views in the treatment groups than in the control group. In other words, adding a phrase like "first post" to the title does appear to have attracted attention.

Graph comparison

From here on, to simplify the analysis, let's consider just two groups, the "treatment group" and the "control group". Redefine df1_grouped as follows.


def include_keyword(title):
    if 'First post' in title or 'Virgin work' in title:
        return 'Treatment group'
    else:
        return 'Control group'


df1_grouped = df1.groupby(include_keyword)
df1_summary = df1_grouped.agg(functions)
df1_summary

result: pic3.png

Now, let's look at the difference between the two groups in a graph.

- Draw a histogram with the log-scaled view count on the horizontal axis and the (normalized) sample density on the vertical axis.
- Giving plt.hist an alpha value makes the bars translucent, so the two-group comparison is easier to read.


# Extract each group's view counts as 1-D arrays and take logs
X_treated   = np.log(df1_grouped.get_group('Treatment group')['viewCounter'].values)
X_untreated = np.log(df1_grouped.get_group('Control group')['viewCounter'].values)
# density=True replaces the deprecated normed=True
plt.hist(X_treated, density=True, bins=20, label='Treatment group', color='r', alpha=0.5)
plt.hist(X_untreated, density=True, bins=20, label='Control group', color='b', alpha=0.5)
plt.legend()
plt.show()

result: pic4.png

The graph also shows the difference in distribution between the treatment group and the control group.

Mann-Whitney U test

We now have a suggestion of the result we hoped for (adding "first post" to the title is associated with more views). Let's verify whether the difference in view counts between the treatment and control groups found in the previous section is statistically significant, that is, not a difference explainable by chance.

As mentioned earlier, the distribution of view counts is heavily skewed, so we should look at the median rather than the mean. We cannot directly compare the two groups' medians, so as an alternative we use the Mann-Whitney U test [^ 4].

The Mann-Whitney U test takes as its null hypothesis that the two groups' distributions have the same shape [^ 5]. If this null hypothesis is rejected, the difference in view counts between the treatment and control groups is unlikely to be explained by chance.

The U test is provided in scipy's stats module. Let's test it.


from scipy import stats

result = stats.mannwhitneyu(X_treated, X_untreated)
print(result.pvalue)  # => 0.00137327945838

Since the p-value is 0.0014, we can say that the distributions of the two groups differ significantly. Success!

[^ 5]: Strictly, the Mann-Whitney U test requires assuming the two groups have equal variances, but I gloss over that here for simplicity. I also tried the Brunner-Munzel test, which does not require the equal-variance assumption, and it was significant as well, so the result appears robust. (For the Brunner-Munzel test, see: http://oku.edu.mie-u.ac.jp/~okumura/stat/brunner-munzel.html)
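Incidentally, scipy (version 1.2 and later) also ships the Brunner-Munzel test, so the robustness check mentioned in the footnote can be reproduced with a minimal sketch like this:

# Brunner-Munzel test: does not assume equal variances
result_bm = stats.brunnermunzel(X_treated, X_untreated)
print(result_bm.pvalue)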

5. Examining the validity of the analysis

One thing to watch out for in this kind of analysis is "spurious correlation"; remember that correlation ≠ causation.

All we have shown so far is that there is a correlation between adding "first post" to the title and the view count. Let's consider whether this correlation can be interpreted as a causal relationship.

Omitted variable bias

For example, ice cream sales and water accidents are correlated. It would be rash to conclude from this that eating ice cream makes you more prone to water accidents, right?

The real driver is the "season". In econometric terms, producing a spurious correlation like the one above by failing to account for the season is called "omitted variable bias" (here, the omitted variable is the season).
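To make this concrete, here is a small simulation sketch (entirely hypothetical numbers, not real data): temperature drives both variables, neither causes the other, and a strong correlation appears anyway.

import numpy as np

rng = np.random.default_rng(0)

# The omitted variable: temperature, a stand-in for "season"
temperature = rng.normal(20, 8, size=1000)
# Both outcomes depend on temperature, but not on each other
ice_cream = 5 * temperature + rng.normal(0, 20, size=1000)
accidents = 2 * temperature + rng.normal(0, 20, size=1000)

print(np.corrcoef(ice_cream, accidents)[0, 1])  # strongly positive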

Let's return to VOCALOID songs. If there is an omitted variable in this analysis, what might it be?

One case I came up with is the "posting date" acting as an omitted variable. Suppose the following situation holds.

- VOCALOID culture has entered a period of decline, and view counts fall from 2013 to 2016.
- Adding "first post" to the title of a VOCALOID song is an old custom that has recently died out.

Under these circumstances, the two tendencies "older posting year, higher view count" and "older posting year, 'first post' in the title" overlap and produce a spurious correlation.

Let's verify whether a spurious correlation like this exists. We first create two columns representing the "posting year" and "treatment group or control group", then group on both.


df2 = df.copy()
df2['title'] = df2['title'].apply(include_keyword)
df2['startTime'] = pd.DatetimeIndex(df2['startTime'])
# To apply tz_localize, specify 'startTime' as the index
df2 = df2.set_index('startTime')
df2.index = df2.index.tz_localize('UTC').tz_convert('Asia/Tokyo')
df2['startTime'] = df2.index
df2['startTime'] = df2['startTime'].apply(lambda x: x.year)
df2_grouped = df2.groupby(['startTime', 'title']).viewCounter
df2_summary = df2_grouped.agg(functions)
df2_summary

result: pic5.png

Even when grouped by posting year, the treatment and control groups show different view counts in every year except 2013.
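If you prefer a picture to a table, here is an optional sketch (not in the original analysis) that draws the per-year medians as a grouped bar chart:

# Rows: posting year, columns: treatment/control group
medians = df2_grouped.median().unstack()
medians.plot(kind='bar')
plt.ylabel('Median ViewCounter')
plt.show()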

Multiple regression

Now that the table above has given us a rough confirmation that the posting year is not an omitted variable, let's analyze the question more rigorously with multiple regression.

Suppose the number of views is determined as follows:

$$\log(\text{viewCount}_i) = \beta_0 + \beta_1 \cdot \text{TreatmentDummy}_i + \beta_2 \cdot \text{TimeTrend}_i + \varepsilon_i$$

Here, the details of each variable are as follows.

- $\log(\text{viewCount}_i)$: the log of the view count, taken to suppress the influence of outliers
- $\text{TreatmentDummy}_i$: a dummy variable equal to 1 if sample $i$ is in the treatment group and 0 if it is in the control group
- $\text{TimeTrend}_i$: the number of days between the posting time of video $i$ and that of the most recent video

Next, the advantage of using multiple regression: it lets us estimate how much the view count differs between the treatment and control groups while holding the time trend fixed ($\beta_1$). This removes any apparent correlation driven by the time trend.

The downside is that $\beta_1$ is estimated from the mean of the (log) view counts, which makes it more susceptible to outliers.

Let's run it.


import statsmodels.formula.api as smf

def include_keyword(title):
    if 'First post' in title or 'Virgin work' in title:
        return 1
    else:
        return 0

df3 = df.copy()
df3['title'] = df3['title'].apply(include_keyword)
df3['startTime'] = pd.DatetimeIndex(df3['startTime'])
df3['timeTrend'] = df3['startTime'].apply(lambda x: (df3['startTime'].max() - x).days)
df3['lviewCounter'] = np.log(df3['viewCounter'])

mod = smf.ols('lviewCounter ~ title + timeTrend', data=df3).fit()
mod.summary()

result: pic6.png

- Judging from the p-value on the treatment dummy's coefficient $\beta_1$ (the variable title), the treatment effect is significant in the multiple regression as well.
- The time trend (posting date) does not appear to affect the view count.
- The estimate is $\beta_1 = 0.3582$. This can be read as the treatment group having roughly 36% more views than the control group (strictly, $100(e^{0.3582} - 1) \approx 43\%$; 36% is the small-coefficient approximation).

Having confirmed that the posting date does not create an omitted variable bias, we can say the analysis result is that much more robust. Success!
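As one more optional robustness check (my addition, not in the original analysis), median regression addresses the outlier sensitivity noted above; a minimal sketch using statsmodels' quantreg on the same df3:

# Median (0.5-quantile) regression is less sensitive to outliers than OLS
mod_median = smf.quantreg('lviewCounter ~ title + timeTrend', data=df3).fit(q=0.5)
print(mod_median.summary())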

That concludes the examination of validity, but if you think "shouldn't this factor be checked too?", please leave a comment!

In conclusion

This has been a long read; thank you very much for making it this far.

By the way, following this analysis I added "first post" to the title and posted my own work to Nico Nico Douga, but it got fewer than 200 views. Note that statistical trends do not necessarily apply to individual cases.

With that punch line delivered, this article is done. I would be happy if it conveyed something about the flow of an analysis, the points that need care, and the techniques involved.
