Survival time analysis learned in Python 2 -Kaplan-Meier estimator

Click here for the previous article Survival time analysis learned with Python 1-What is survival time data? https://qiita.com/Goriwaku/items/8d00696d853da73505bd

For the data used this time, please refer to the previous article or download (whas500.csv) from my git (https://github.com/goriwaku/survival_analysis).

What is a survival function?

I will explain the survival function using WHAS500, which was discussed in the previous article. For time t as a continuous variable, we define survival time as a random variable as T. At this time, the cumulative distribution function of T indicates that the randomly extracted subjects are less than or equal to a certain time t, and are defined as follows.

F(t)=Pr(T \leq t)

Here, the survival function can be expressed by the probability of obtaining a survival time T larger than a certain time t.

S(t)=Pr(T > t)

It will be. At this time, at any point in time t, the following equation holds by the axiom of probability by considering survival as a complementary event of death.

S(t)=1-F(t)

Now let's look at this survival function estimator, especially the Kaplan-Meier estimator, which is the survival function estimator in the presence of right-side censoring.

Survival function estimation

We will actually estimate the survival function. As an example, consider estimating the survival function from the survival time and survival status of 5 people as shown below.

Subject number time Censored due to death
1 6 1
2 44 1
3 21 0
4 14 1
5 62 1

At time 0, all 5 are alive, so $ \ hat {S} (t) = 1.0 $. Since subject number 1 dies on the 6th day, the condition for dying in this small interval is considered, considering the small interval $ (6- \ delta, 6] $ that starts shortly before the 6th day and ends on the 6th day. The estimated probability of attachment is $ 1/5 $ and the probability of survival is $ 4/5 $. Subjects who are alive at any given time are ** risk sets ** at risk of death. The number of people is expressed as number at risk. The estimated survival time can be expressed by ** (probability of living up to that point) x (conditional probability of living in the above minute interval) **. The interval $ I_i $ is divided by the time in the above table in ascending order (ex: $ I_0 = [0, 6) $), $ n_i $ is the number at risk at that time, and $ d_i $ is at that time. The number of people who died. At this time, the estimated mortality probability can be expressed as $ d_1 / n_1 $ and the survival probability can be expressed as $ (n_1 --d_1) / n_1 $ in a small interval close to the day when some event occurs. Therefore, when estimating the survival probabilities on the 6th and 14th days,

\hat{S}(6)=1 \times \frac{4}{5}=\frac{4}{5}\\
\hat{S}(14)=1 \times \frac{4}{5} \times \frac{3}{4} = \frac{3}{5}

It will be. In the next event, subject number 3 will drop out of the at risk set on day 21 with a non-death-independent censorship. In that minute interval, $ n_3 = 3 $ and $ d_3 = 0 $, so the estimated value of the survival function is

\hat{S}(21)=1 \times \frac{4}{5} \times \frac{3}{4} \times \frac{3}{3} = \frac{3}{5}

Not surprisingly, there are no deaths in the interval $ I_3 $, so the survival function estimates do not change. Estimates of the survival function can be obtained for the following intervals as well. The survival function estimate obtained by this method is called the ** Kaplan-Meier estimate **.

Now let's plot the Kaplan-Meier estimate as a step function.

test_data = [[6, 1],[44, 1],[21, 0],[14, 1],[62, 1]]
test_data = sorted(test_data, key=lambda x: x[0], reverse=False)

s = 1
n = 5
pre = 0
for t, censor in test_data:
    plt.plot((pre, t), (s, s), color='blue')
    if censor == 1:
        plt.plot((t, t), (s, s * (n - 1) / n), color='blue')
        s = s * (n - 1) / n
    n -= 1
    pre = t
plt.show()

I stored the table as a two-dimensional array and sorted it by t in ascending order for ease of use. When you want to sort a two-dimensional array, specify what to sort by keyword argument such as key = lambda x: x [0]. If you want to sort by censor, you can enter x [1]. After that, the survival probability is calculated with the for statement, and the plot is shown in the figure below. kaplan_meier_test.png Thus, the Kaplan = Meier curve is output as a step function. It decreases when death is observed and remains constant during that time. At $ t = 62 $, there are no survivors, so the end is $ \ hat {S} (62) = 0 $. In this example, multiple deaths did not occur in the same t (this is called ** tie **), but even if a tie occurs, random numbers are generated and the order is randomly assigned. You can develop a discussion. Furthermore, even if the tie data is treated uniformly in the KM estimator, the final estimator is the same, so there is basically no need to make adjustments. In practice, this is rare, but if you have an extremely large number of ties, you should consider using a discrete-time model instead of a KM estimator. It should also be noted that if the final observation time is censored on the right side, the estimated survival time after that cannot be defined.

Generalization of Kaplan-Meier estimator

The Kaplan-Meier estimator is so commonly used in survival analysis that we give a general formulation here. When t is finite, sorting does not lose generality, so sort t in ascending order and the binary variables $ c_i $, at risk at $ t_i $, whether or not it is a sensor due to death at $ t_i $. Under $ n_1 $, the number of deaths observed $ d_i $, the Kaplan-Meier estimator of the survival function at time point t is

\hat{S}(t)=\prod_{t_i \leq t}\frac{n_i - d_i}{n_i}

However, when $ t \ leq t_1 $, $ \ hat {S} (t) = 1 $.

Kaplan-Meier curve plot -lifelines

I plotted the Kaplan-Meier curve by myself earlier, but python has a library for survival time analysis called lifelines even if I don't bother to implement it myself. In this section, we will use that lifelines to draw the Kaplan-Meier curve of the previous WHAS500. Lifelines is not included by default, so if you have never used it, first do the following in a terminal or command prompt:

pip install lifelines

Now, here is the python code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lifelines as ll
from lifelines import KaplanMeierFitter


whas = pd.read_csv('whas500.csv')
kmf = KaplanMeierFitter()
kmf.fit(whas.LENFOL, event_observed=whas.FSTAT)
kmf.plot_survival_function()
print(kmf.survival_function_)
plt.show()

Put the time t in the first argument of fit of KaplanMeierFitter of lifelines, and put the binary variable of whether the event (death) you want to observe has occurred in the keyword argument event_observed, and let it fit. Then you can easily plot the KM curve with the model's .plot_survival_fuction () function. Also, use survival_function_ to get the estimated survival function. It is a DataFrame type of pandas and returns a pair of time point and estimated value. It is also possible to obtain the above-mentioned cumulative distribution function $ F (t) $ by replacing survival_function with cumulative_density. The actual plot of the KM curve is shown below. whas500_km.png The light blue width in the figure is the confidence interval.

This concludes the explanation of the Kaplan-Meier estimator and its implementation. Next time, I would like to interpret the Kaplan-Meier estimator and compare the two groups, so I hope you will continue reading.

References / Reference Links

Introduction to survival time analysis Hosmer DW, Lemeshow S, May S LIFELINES https://lifelines.readthedocs.io/en/latest/index.html Kyoto University OCW Kyoto University Graduate School of Medicine Audit Course Biostatistics for clinical researchers "Basics of survival time analysis" https://www.youtube.com/watch?v=NmZaY2tDKSA&feature=emb_title

Recommended Posts

Survival time analysis learned in Python 2 -Kaplan-Meier estimator
Python: Time Series Analysis
Association analysis in Python
Regression analysis in Python
Refactoring Learned in Python (Basic)
Axisymmetric stress analysis in Python
Python classes learned in chemoinformatics
Simple regression analysis in Python
What I learned in Python
Character code learned in Python
Python functions learned in chemoinformatics
[Understand in the shortest time] Python basics for data analysis
Python command in "Social network analysis learned by open source"
EEG analysis in Python: Python MNE tutorial
First simple regression analysis in Python
Python: Time Series Analysis: Preprocessing Time Series Data
Measure function execution time in Python
I learned about processes in Python
Elementary ITK usage learned in Python
Planar skeleton analysis in Python (2) Hotfix
Code tests around time in Python
Basic Linear Algebra Learned in Python (Part 1)
git / python> git log analysis (v0.1, v0.2)> calculate total working time in minutes from git log
Just print Python elapsed time in seconds
MongoDB for the first time in Python
Residual analysis in Python (Supplement: Cochrane rules)
Time comparison: Correlation coefficient calculation in Python
Introduction to Time Series Analysis ~ Seasonal Adjustment Model ~ Implemented in R and Python
Python: Time Series Analysis: Building a SARIMA Model
Get time series data from k-db.com in Python
Time variation analysis of black holes using python
3 ways to parse time strings in python [Note]
Python variables and data types learned in chemoinformatics
Python: Time Series Analysis: Stationarity, ARMA / ARIMA Model
Statistical test grade 2 probability distribution learned in Python ②
A clever way to time processing in Python
Perform entity analysis using spaCy / GiNZA in Python
Data analysis in Python: A note about line_profiler
[Environment construction] Dependency analysis using CaboCha in Python 2.7
100 language processing knocks Morphological analysis learned in Chapter 4
Statistical test grade 2 probability distribution learned in Python ①
A well-prepared record of data analysis in Python
TensorFlow: Run data learned in Python on Android
To represent date, time, time, and seconds in Python
Quadtree in Python --2
CURL in python
Metaprogramming in Python
First time python
Python 3.3 in Anaconda
Geocoding in python
SendKeys in Python
Meta-analysis in Python
Unittest in python
[Introduction to element decomposition] Let's arrange time series analysis methods in R and python ♬
Data analysis python
Discord in Python
python time measurement
DCI in Python
quicksort in python
nCr in python
N-Gram in Python