Analyzing the lifespan of technologies with Qiita article data ~ Survival analysis using content logs ~

This article is day 19 of the NTT DoCoMo SI Department Advent Calendar.

Hello! I'm Hashimoto from the look-ahead engine team. At work, we develop personal data analysis technologies for our agent service.

In this article, we introduce **a method to quantitatively evaluate the lifespan of content (continuous article posting) by applying survival analysis to Qiita article data**.

Survival analysis is a method for analyzing the time until an event occurs and its relationship to that event. It is typically used in medicine to analyze the time until a patient's death (human lifespan) and in engineering to analyze the time until a component fails (part lifespan). Here, using Qiita post data, **we analyze the lifespan of technical posts** by treating the moment a user stops continuously posting articles about a specific technology as the event! :muscle:

Survival analysis lets us evaluate whether a piece of content has a long or short lifespan, and whether its usage declines gradually or abruptly. The results can be used in various ways: as features for content classification tasks, as features for user classification tasks based on content usage history, or to inform content recommendation decisions.

For more details on survival analysis, please refer to the articles and books below.

Analysis flow

The rough flow of the analysis in this article is as follows.

  1. **Data preprocessing**: For every user, compute how long they kept posting Qiita articles with a specific tag, together with a flag indicating whether they stopped posting.
  2. **Fitting a Weibull model**: Use the above as input data to fit a Weibull model (estimate its parameters), and output the survival curve obtained from the model.
  3. **Comparing survival rates across tags**: Repeat the above for multiple tags (technologies) and compare the parameters of their survival curves.

The programs in this article were run with Python 3.6 on macOS 10.14.6, using the survival analysis library lifelines 0.22.9.

About the dataset

This article uses the Qiita dataset published in the post "I made a dataset from the articles posted to Qiita".

This dataset contains users' article posting histories, collected via the API provided by Qiita, and covers posts from 2011 through 2018. This analysis requires that:

- the data is a per-user content history, and
- users use the same content (here, the same article tag) repeatedly over time.

This dataset was adopted because it satisfies both conditions.

Reading the data into a pandas DataFrame looks like this.

import pandas as pd

df = pd.read_csv('qiita_data1113.tsv', sep='\t')
df.head()
created_at updated_at id title user likes_count comments_count page_views_count url tags
2011-09-30T22:15:42+09:00 2015-03-14T06:17:52+09:00 95c350bb66e94ecbe55f Gentoo is cute Gentoo {'description': ';-)',... 1 0 NaN https://... [{'name': 'Gentoo', 'versions': []}]
2011-09-30T21:54:56+09:00 2012-03-16T11:30:14+09:00 758ec4656f23a1a12e48 Earthquake early warning code {'description': 'Emi Tamak... 2 0 NaN https://... [{'name': 'ShellScript', 'versions': []}]
2011-09-30T20:44:49+09:00 2015-03-14T06:17:52+09:00 252447ac2ef7a746d652 parsingdirtyhtmlcodesiskillingmesoftly {'description': 'Don't call github... 1 0 NaN https://... [{'name': 'HTML', 'versions': []}]
2011-09-30T14:46:12+09:00 2012-03-16T11:30:14+09:00 d6be6e81aba24f39e3b3 Objective-How is the following variable x handled in the C class implementation?... {'description': 'Hello. Hatena... 2 1 NaN https://... [{'name': 'Objective-C', 'versions': []}]
2011-09-28T16:18:38+09:00 2012-03-16T11:30:14+09:00 c96f56f31667fd464d40 HTTP::Request->AnyEvent::HTTP->HTTP::Response {'description'... 1 0 NaN https://... [{'name': 'Perl', 'versions': []}]

Incidentally, extracting up to three tags per article from the tags column and ranking them by total count gives the following.

index tag count
0 JavaScript 14403
1 Ruby 14035
2 Python 13089
3 PHP 10246
4 Rails 9274
5 Android 8147
6 iOS 7663
7 Java 7189
8 Swift 6965
9 AWS 6232
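The article does not show the extraction code, but as a rough sketch, assuming the tags column holds a stringified list of `{'name': ..., 'versions': ...}` dicts (as in the sample above), the ranking could be computed like this. The toy data here is made up for illustration:

```python
import ast
import pandas as pd

# Toy stand-in for the real dataset: the tags column is a stringified list of dicts.
df = pd.DataFrame({
    'tags': [
        "[{'name': 'Python', 'versions': []}]",
        "[{'name': 'Python', 'versions': []}, {'name': 'Ruby', 'versions': []}]",
        "[{'name': 'Ruby', 'versions': []}]",
    ]
})

# Parse each string into a Python list, keep up to 3 tag names per article,
# flatten to one row per tag, and count occurrences.
tag_series = (
    df['tags']
    .apply(ast.literal_eval)
    .apply(lambda tags: [t['name'] for t in tags[:3]])
    .explode()
)
ranking = tag_series.value_counts()
print(ranking)
# Python and Ruby each appear twice in this toy data.
```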

Analytical processing

1. Preprocessing

Extract the necessary columns from the DataFrame loaded above.

df_base = <get tags>
df_base.head()
user_id time_stamp tag
kiyoya@github 2011-09-30 22:15:42+09:00 Gentoo
hoimei 2011-09-30 21:54:56+09:00 ShellScript
inutano 2011-09-30 20:44:49+09:00 HTML
hakobe 2011-09-30 14:46:12+09:00 Objective-C
motemen 2011-09-28 16:18:38+09:00 Perl
ichimal 2011-09-28 14:41:56+09:00 common-lisp
l_libra 2011-09-28 08:51:27+09:00 common-lisp
ukyo 2011-09-27 23:57:21+09:00 HTML
g000001 2011-09-27 22:29:04+09:00 common-lisp
suginoy 2011-09-27 10:20:28+09:00 Ruby

From each record, user_id, created_at (renamed time_stamp), and tag were extracted. For articles with multiple tags, up to five tags were taken, and each was concatenated as its own record. Note that tag spelling variants (golang vs. Go, Rails vs. RubyOnRails, etc.) are not normalized.

Next, convert the data into the two-column format (survival time and event flag) required as input to the lifelines Weibull model. Since this data does not directly reveal either the event (stopping article posting) or the survival time (the duration of continuous posting), we need to define them ourselves.

Here, an event is defined to occur when either of the following two conditions is met:

  1. The gap between two adjacent posts is θ days or more.
  2. The gap between the most recent post and the end of the observation period is θ days or more.

If the gap between the most recent post and the end of the observation period is less than θ days, the record is treated as censored (observation cut off).

This is a little hard to follow, so the figure below illustrates it. event_definition.png

The figure arranges the article posting times of three users in chronological order. For User A, the gap between the most recent post and the end of the observation period is θ days or more, so an event occurs after the final post. For User B, the gap between the last two posts is θ days or more, which is also judged as an event. For User C, every gap between adjacent posts is less than θ days, and the gap between the most recent post and the end of the observation period is also less than θ days, so the record is treated as censored. The survival time is defined as the period until the event occurs or, for censored records, until the end of the observation period.
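As a toy check of these rules with θ = 365, the three users above can be classified from their post gaps alone. This helper is only illustrative, not the implementation used later:

```python
def classify(gaps_days, theta=365):
    """Classify a posting history given the gaps (in days) between consecutive
    posts; the last element is the gap from the latest post to the end of the
    observation period. Returns (event_occurred, survival_days)."""
    survival = 0
    # Rule 1: a gap of theta days or more between two adjacent posts is an event.
    for gap in gaps_days[:-1]:
        if gap >= theta:
            return True, survival
        survival += gap
    # Rule 2: theta days or more from the latest post to the cutoff is also an event.
    if gaps_days[-1] >= theta:
        return True, survival
    # Otherwise the observation is censored at the cutoff.
    return False, survival + gaps_days[-1]

print(classify([30, 60, 400]))   # User A: event after the final post -> (True, 90)
print(classify([30, 400, 10]))   # User B: event between two posts    -> (True, 30)
print(classify([30, 60, 100]))   # User C: censored                   -> (False, 190)
```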

Having fixed these rules for determining event occurrence and survival time, we implement them as `make_survival_dataset()` below. Here θ = 365 days, and the observation cutoff is 2018-12-01. The function assumes a DataFrame already filtered to a specific tag as its argument.

import datetime
import pytz

def make_survival_dataset(df_qiita_hist, n=365):
    id_list = []
    duration_list = []
    event_flag_list = []

    for userid, df_user in df_qiita_hist.groupby('user_id'):
        # Append the observation cutoff (2018-12-01 JST) as a sentinel record.
        # localize() avoids the LMT-offset pitfall of passing a pytz zone as tzinfo.
        dt = pytz.timezone("Asia/Tokyo").localize(datetime.datetime(2018, 12, 1))
        last = pd.Series(['test', dt, 'last'], index=['user_id', 'time_stamp', 'tag'], name='last')
        df_user = df_user.append(last)

        # Gaps (in days) between adjacent posts; the first element is NaN.
        day_diff_list = df_user.time_stamp.diff().apply(lambda x: x.days).values

        # Skip users with two or fewer entries (only one real post).
        if len(day_diff_list) <= 2:
            continue

        # Scan the gaps for an event occurrence.
        event_flag = False
        # Gaps accumulated up to the event (or the cutoff).
        day_list = []

        for day in day_diff_list[1:]:
            if day >= n:
                event_flag = True
                break
            day_list.append(day)

        # Survival time = sum of the gaps before the event / cutoff.
        s = sum(day_list)

        # Skip zero-length durations.
        if s == 0:
            continue

        id_list.append(userid)
        duration_list.append(s)
        event_flag_list.append(event_flag)

    return pd.DataFrame({'userid': id_list, 'duration': duration_list, 'event_flag': event_flag_list})

Extract the records with the Python tag and pass them to make_survival_dataset.

df_python = df_base[df_base['tag'] == 'Python'].sort_values('time_stamp')
df_surv = make_survival_dataset(df_python, n=365)
df_surv.head()
userid duration event_flag
33yuki 154.0 False
5zm 432.0 False
AketiJyuuzou 57.0 True
AkihikoIkeda 308.0 False
Amebayashi 97.0 True

Now you have the data to input to the Weibull model.

2. Fitting the Weibull model

Input the data created above into the Weibull model, fit its parameters, and plot the survival curves. Here we also plot Ruby-tagged data alongside Python.

import lifelines
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'IPAexGothic'

_, ax = plt.subplots(figsize=(12, 8))

# Fit and plot a Weibull survival curve for each tag
for name in ['Python', 'Ruby']:
    df_surv = make_survival_dataset(df_base[df_base['tag'] == name].sort_values('time_stamp'), n=365)
    wf = lifelines.WeibullFitter().fit(df_surv['duration'], df_surv['event_flag'], label=name)
    wf.plot_survival_function(ax=ax, grid=True)

ax.set_ylim([0, 1])
ax.set_xlabel('Lifetime (days)')
ax.set_ylabel('Survival rate')

python_ruby_surv.png

The horizontal axis is the number of days, and the vertical axis is the survival rate (the fraction of users who keep posting). Overall, the survival rate declines as the days pass. Focusing on Python, the survival rate is just below 0.2 at around 1500 days: about 20% of users are still posting 1500 days after their first post, while the remaining 80% have stopped posting continuously. Comparing Python and Ruby at 1500 days shows a gap of roughly 10 percentage points. From this plot, it can be said that **overall, Python articles survive longer than Ruby articles and show a stronger tendency toward continuous posting.** Python's longevity may reflect its recent growth in demand as a machine learning / data analysis tool.

In this way, by defining event occurrence and survival time over a content log and running survival analysis, we can compare the survival times of different kinds of content.

3. Comparing survival curve parameters across tags

According to the lifelines documentation, the survival curve is plotted from the following formula:

S(t) = \exp(-(t/\lambda)^{\rho}) \ where\ \lambda > 0, \rho > 0

The survival curve is determined by the parameters λ and ρ, and WeibullFitter's fit method estimates them. So by plotting the (λ, ρ) values obtained from WeibullFitter on a two-dimensional chart, we can visually compare how similar the survival curves of different tags are.
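To make the formula concrete, here is a minimal evaluation of the Weibull survival function; the λ and ρ values are made up for illustration, not the fitted ones from the article:

```python
import math

def weibull_survival(t, lam, rho):
    # S(t) = exp(-(t / lam) ** rho), for lam > 0 and rho > 0
    return math.exp(-(t / lam) ** rho)

# Made-up parameters for illustration
lam, rho = 1000.0, 1.0

# With rho = 1 the Weibull reduces to the exponential distribution,
# so S(lam) = exp(-1), roughly 0.368
print(weibull_survival(1000, lam, rho))

# Median lifetime: solving S(t) = 0.5 gives t = lam * (ln 2) ** (1 / rho)
median = lam * math.log(2) ** (1 / rho)
print(median)  # roughly 693.1 days
```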

I narrowed the plot down to tags with at least 1,000 posting users in the dataset.

param_plot_1000.png

λ is plotted on the vertical axis and ρ on the horizontal axis.
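As a sketch of how such a parameter plot could be produced: after fitting, lifelines' WeibullFitter exposes the estimates as `lambda_` and `rho_` (to my knowledge), so one could collect those per tag and scatter them. The (ρ, λ) values below are placeholders, not the article's fitted results:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt

# Placeholder (rho, lambda) pairs per tag; in the real pipeline these would
# come from wf.rho_ and wf.lambda_ after fitting WeibullFitter per tag.
params = {
    'Python': (0.75, 1200.0),
    'Ruby': (0.80, 1000.0),
    'Git': (1.10, 600.0),
}

fig, ax = plt.subplots(figsize=(8, 6))
for tag, (rho, lam) in params.items():
    ax.scatter(rho, lam)
    ax.annotate(tag, (rho, lam))  # label each point with its tag name
ax.set_xlabel('rho')
ax.set_ylabel('lambda')
fig.savefig('param_plot_sketch.png')
```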

In general, the larger λ is, the longer the survival time; and the larger ρ is, the more steeply the survival curve drops as time passes. Roughly classifying tags by the size of λ and ρ gives the following :thinking:

- **Large λ: long survival time**
  - PHP (and Laravel), Ruby (and Rails), C#, iOS, Android, etc.
  - Impression: dominated by programming languages (and frameworks) and mobile development that are often used in products (many users?).
  - Because these languages and frameworks are used in products, there is plenty of material, making continuous posting easy.
  - Likely also related to the impact of functional changes from updates.

- **Small λ: short survival time**
  - CentOS, Ubuntu, PostgreSQL, Nginx, Git, Slack, etc.
  - Impression: concentrated on platform software such as OSes and middleware, and on development support tools such as Git and Slack.
  - Because these are foundational components, there is comparatively little material, so posting tends to be short-lived.

- **Large ρ: the survival curve drops more steeply over time**
  - ssh, Chrome, Git, Slack, Mac, Windows, etc.
  - Impression: concentrated on basic tools.
  - Many articles on basic tools are introductory; continuous posting lasts for a while after the start, then tapers off.

- **Small ρ: the survival curve flattens over time**
  - Programming languages, middleware, Linux OSes, etc.
  - Impression: concentrated on relatively specialized tools (technologies).
  - People tend to stop posting about highly specialized tools soon after starting, but some keep posting for a long time.

This is summarized in the figure below. param_plot_comment.png

For the most part, the results match my intuitive interpretation (?). It is interesting that the differences in parameters relate to the type of technology. Some cases, such as C# and Objective-C, are debatable, though... :thinking:

Summary

We ran survival analysis on Qiita article data and classified content from two viewpoints: the length of the survival time and how the slope of the survival curve changes. Although the interpretation was rough, we found that differences in the parameters seem to relate to the type of technology. The method introduced in this article should be applicable to other content logs as well, so if you have the chance to work with content usage logs, please give it a try. Finally, I'll share a few extra analysis results and wrap up. Have a good year! :raised_hand:

Bonus

iOS vs. Android ios_android_surv.png

Emacs vs. Vim emacs_vim_surv.png

Typical deep learning frameworks

deep_learning.png

Parameter plot for tags with 500 or more posting users

param_plot_500.png
