Introduction

Nice to meet you. My name is Goribaku, and I am a graduate student majoring in econometrics at a certain university. In this series, I would like to study survival time analysis by implementing it in Python, writing for readers who have a minimum knowledge of statistics (up to linear regression and logistic regression). I will explain things carefully so that even beginners can follow, so please bear with me if it feels redundant to knowledgeable readers. The code and files used are posted on my GitHub (https://github.com/goriwaku/survival_analysis), so you can download them. If you are new to Python, I recommend typing the code yourself and trying it out. My writing may be clumsy, but thank you for reading.

What is survival time analysis?

Survival analysis is a statistical technique, widely used in medical statistics, that identifies factors affecting the time it takes for an event (death, onset of illness, etc.) to occur. For example, for the survival time after hospitalization for acute myocardial infarction, the aim might be to examine (1) whether women survive longer than men and (2) how age affects survival time. I gave medical data as an example, but its use is not limited to the medical field. For example, one can examine what affects the time until a machine fails, or the time until a venture company goes bankrupt. Its most distinctive feature is that it can assess the effect of factors on the time until an event. In this series, I would like to study survival time analysis while actually implementing it in Python.

Data set to use

This analysis uses data from the Worcester Heart Attack Study (WHAS), which covers residents of Worcester, Massachusetts, who developed acute myocardial infarction. The data were obtained from UCLA's Introduction to Survival Analysis in SAS and are provided as a SAS file (sas7bdat), so I converted the file to csv using the sas7bdat package (see the reference links). The code and the csv file are in my GitHub repository, but the procedure is described below just in case. First, install the required module with the pip command.

pip install sas7bdat

After the installation is complete, run the following code in Python. If you are not running Python in the directory containing the sas7bdat file, specify the file's absolute path instead of a relative path.

from sas7bdat import SAS7BDAT

# Read the SAS file and convert it to a pandas DataFrame
with SAS7BDAT('whas500.sas7bdat', skip_header=False) as reader:
    df = reader.to_data_frame()

# Write the DataFrame out as a csv file
df.to_csv('whas500.csv')

Here, df.to_csv() is a commonly used pandas method that writes a data frame (df) to a csv file. Pass the name you want to give the generated csv file as the argument; here it is simply whas500.csv. After running the code, check that a csv file with this name has been created.
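As a small aside on to_csv(): by default pandas also writes the DataFrame index as an unnamed first column, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates this with a made-up two-row frame (not the WHAS data) and an in-memory buffer instead of a file on disk. Passing `index=False` is one way to leave the index out, assuming it carries no information you need.

```python
import io

import pandas as pd

# Toy frame standing in for the converted SAS data (values are made up).
toy = pd.DataFrame({"ID": [1, 2], "LENFOL": [100, 250]})

# With no path given, to_csv() returns the csv text as a string.
# Default behaviour: the index is written as an extra, unnamed first column.
with_index = toy.to_csv()

# index=False leaves the index out of the output.
without_index = toy.to_csv(index=False)

# Reading the default output back produces an 'Unnamed: 0' column.
reloaded = pd.read_csv(io.StringIO(with_index))
reloaded_clean = pd.read_csv(io.StringIO(without_index))

print(list(reloaded.columns))
print(list(reloaded_clean.columns))
```

Either approach works: keep the default and drop the extra column after loading, or write the file without the index in the first place.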

whas500 contains a total of 22 variables observed for 500 subjects; the variables and their meanings are summarized in the table below. (Note that this data set is a random sample of 500 subjects from the complete WHAS data, so results cannot be compared with analyses of the complete data.)

Variable name	Description	Code/value
id	Subject number	1-500
age	Age at admission	Years
gender	Sex	0=Male, 1=Female
hr	Baseline heart rate	Beats per minute
sysbp	Baseline systolic blood pressure	mmHg
diasbp	Baseline diastolic blood pressure	mmHg
bmi	Body mass index	kg/m^2
cvd	History of cardiovascular disease	0=No, 1=Yes
afb	Atrial fibrillation	0=No, 1=Yes
sho	Cardiogenic shock	0=No, 1=Yes
chf	Congestive heart failure	0=No, 1=Yes
av3	Complete atrioventricular block	0=No, 1=Yes
miord	Myocardial infarction order	0=First, 1=Recurrent
mitype	Type of myocardial infarction	0=Non-Q-wave, 1=Q-wave
year	Cohort year	1=1997, 2=1999, 3=2001
los	Length of hospital stay	Days
dstat	Status at discharge	0=Alive, 1=Dead
lenfol	Total follow-up period	Days between hospital admission date and date of last follow-up
fstat	Status at last follow-up	0=Alive, 1=Dead

Let's load whas500, saved as a csv file, into a pandas DataFrame and name it whas.

import pandas as pd

whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)

Once the file is loaded, try whas.head() or similar to make sure it contains the variables listed earlier. Note that .drop() is a method for removing unneeded rows or columns: it deletes the specified column when the keyword argument axis=1 is given, and the specified row when axis=0 (the default). Here it removes the Unnamed: 0 column, which holds the index that to_csv() wrote to the file. In the next section, I will explain survival time data using this data set.
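The difference between axis=1 and axis=0 in .drop() can be checked on a small example. The frame below is made up for illustration and only mimics the column names of the article's data; it is a minimal sketch, not the WHAS data itself.

```python
import pandas as pd

# Made-up frame; column names mimic those in the article's csv file.
toy = pd.DataFrame(
    {"Unnamed: 0": [0, 1, 2], "ID": [1, 2, 3], "FSTAT": [0, 1, 0]}
)

# axis=1 removes the column with the given label.
no_col = toy.drop("Unnamed: 0", axis=1)

# axis=0 (the default) removes the row with the given index label.
no_row = toy.drop(0, axis=0)

# drop() returns a new DataFrame; the original is left unchanged.
print(no_col.head())
print(no_row.head())
print(toy.shape)
```

Because .drop() returns a new DataFrame rather than modifying the original, the article's code chains it directly onto pd.read_csv().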

Survival time data

In this section, we discuss the data used in survival analysis. A feature specific to survival time data is **censoring**. A censored observation is a subject for whom the event of interest had not occurred by the last follow-up date. For example, in the case of survival time after hospitalization for acute myocardial infarction, if a subject is still alive at the end of observation, we know that the subject's survival time exceeds the observation period. However, when the event will actually occur is unknown (it may be the day after observation ends, or the subject may live for another 20 years). The observation for this subject is therefore incomplete. Such observations are called **right censored**. Here, we will look at censoring using matplotlib, a plotting library for Python.
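Before turning to the plot, the way a right-censored observation is recorded can be sketched in a few lines. The function name `observe` and the three subjects are hypothetical. In real data the true event day of a censored subject is unknown, so it appears here only to make the mechanism visible.

```python
def observe(true_event_day, followup_end_day):
    """Return (observed_time, event_indicator) under right censoring.

    true_event_day: day on which the event occurs, counted from t = 0
    followup_end_day: last day on which the subject is followed
    """
    if true_event_day <= followup_end_day:
        return true_event_day, 1  # event observed within follow-up
    return followup_end_day, 0  # right censored: event not yet observed


# Hypothetical subjects: (true event day, last follow-up day)
subjects = {"A": (120, 365), "B": (900, 365), "C": (365, 365)}

records = {name: observe(*days) for name, days in subjects.items()}
for name, (time, event) in records.items():
    print(name, time, "event" if event == 1 else "censored")
```

Subject B is recorded with time 365 and indicator 0: all we know is that B's survival time exceeds 365 days. This pair of values is the same kind of information that the lenfol and fstat variables carry in the data set.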

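import pandas as pd
import matplotlib.pyplot as plt

# Load the converted data, dropping the index column written by to_csv()
whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)

# Draw the follow-up period of the first 10 subjects as horizontal lines
for i in range(10):
    x = (0, whas.LENFOL[i])        # from t = 0 to the end of follow-up
    y = (whas.ID[i], whas.ID[i])   # one horizontal line per subject ID
    plt.plot(x, y, color='black')
    if whas.FSTAT[i] == 1:
        # Death observed: red cross
        plt.plot(whas.LENFOL[i], whas.ID[i], 'x', color='red')
    else:
        # Alive at last follow-up (right censored): blue dot
        plt.plot(whas.LENFOL[i], whas.ID[i], 'o', color='blue')
plt.xlabel('Follow-up time (days)')
plt.ylabel('Subject ID')
plt.show()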

In practice, survival time is obtained as the number of days between the start date and the last observation date. The calendar dates on which each subject's observation starts and ends are therefore not used directly; only the number of days counts, and the start of observation is defined as t = 0. In the graph drawn by the code above, the horizontal axis is the survival time and the vertical axis is the subject ID. The blue dot or red cross at the right end of each line indicates whether the subject was alive or dead at the end of follow-up. The blue dots are examples of right censoring.

There are two causes of incomplete observations in survival analysis: **censoring** and **truncation**. Censoring is a situation in which observation is discontinued before the event of interest is observed, for example because the subject moves away and follow-up cannot continue. Truncation, as used here, refers to subjects for whom the event of interest had not occurred by the observation end date when that end date is fixed by the design of the study. Because the start of observation is also random across subjects, it is considered appropriate to treat both cases as the same right censoring. Although I will not go into detail in this article, there is also left censoring, in which the event of interest has already occurred before observation starts, and interval censoring, in which subjects cannot be observed continuously and results are obtained only at fixed intervals, such as every two months.
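The idea of measuring every subject from t = 0 can be illustrated with the standard library alone. The dates below are hypothetical and the helper name `follow_up_days` is made up for this sketch; it only shows how two subjects admitted in different years end up on the same time axis.

```python
from datetime import date


def follow_up_days(start, last_observed):
    """Days from the start of observation (t = 0) to the last observation."""
    return (last_observed - start).days


# Hypothetical subjects: (id, admission date, date of last follow-up)
subjects = [
    ("A", date(1997, 1, 13), date(1997, 3, 14)),
    ("B", date(1999, 6, 1), date(2000, 6, 1)),
]

# Only the difference in days is kept; the calendar dates are discarded.
durations = {sid: follow_up_days(start, end) for sid, start, end in subjects}
print(durations)
```

Subject B's period spans 29 February 2000, so it comes out as 366 days rather than 365. Subtracting date objects takes care of details like this.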

In this article, we looked at survival time data by plotting it with Python. Next time, I would like to look at the Kaplan-Meier method, a univariate method for survival time analysis.

Next article: Survival time analysis learned in Python 2 - Kaplan-Meier estimator https://qiita.com/Goriwaku/items/767cdc2640e29fddf7cc

References / Reference Links

Hosmer DW, Lemeshow S, May S. Introduction to survival time analysis
Introduction to Survival Analysis in SAS https://stats.idre.ucla.edu/sas/seminars/sas-survival/
sas7bdat 2.2.3 https://pypi.org/project/sas7bdat/
Kyoto University OCW, Kyoto University Graduate School of Medicine Audit Course, Biostatistics for clinical researchers, "Basics of survival time analysis" https://www.youtube.com/watch?v=NmZaY2tDKSA&feature=emb_title

Survival time analysis learned with Python 1-What is survival time data?
