Nice to meet you. My name is Goribaku, and I am a graduate student majoring in econometrics at a certain university. In this series, I would like to learn survival analysis by implementing it in Python, writing for readers who have a minimum knowledge of statistics (up to linear regression and logistic regression). I will explain things carefully so that beginners can follow, so please bear with me if some parts feel redundant to more experienced readers. The code and files used are posted on my GitHub (https://github.com/goriwaku/survival_analysis), so you can download them; if you are new to Python, I recommend typing the code yourself as you go. My writing may be clumsy, but thank you for reading.
Survival analysis is a technique from medical statistics that identifies factors affecting the time until an event (death, onset of illness, etc.) occurs. For example, for the survival time after hospitalization for acute myocardial infarction, the aim might be to examine (1) whether women survive longer than men and (2) how age affects survival time. I used medical data as an example, but the technique is not limited to the medical field. It can also be used to study what affects the time until a machine fails or the time until a startup goes bankrupt. Its most distinctive feature is that it lets you examine how the effect of a factor plays out over time. In this article, I would like to study survival analysis by actually implementing it in Python.
This analysis uses data from the Worcester Heart Attack Study (WHAS), which covers residents of Worcester, Massachusetts, who developed acute myocardial infarction. The data was obtained from UCLA's Introduction to Survival Analysis in SAS, where it is provided as a SAS file (sas7bdat). I therefore converted it to a csv file using the sas7bdat package. The code and the csv file are in my GitHub repository, but just in case, the procedure is described below. First, install the required module with the pip command.
pip install sas7bdat
After the installation is complete, run the following code in Python. If the sas7bdat file is not in the directory where you run Python, give the file's absolute path instead of a relative path.
from sas7bdat import SAS7BDAT
# Read the SAS file and convert it to a pandas DataFrame
with SAS7BDAT('whas500.sas7bdat', skip_header=False) as reader:
    df = reader.to_data_frame()
# Write the DataFrame out as a csv file
df.to_csv('whas500.csv')
Here, df.to_csv() is a commonly used pandas method that writes a data frame (df) out as a csv file. Pass the name you want to give the generated csv file as the argument; here it is simply whas500.csv. After running the code, check that a csv file with this name has been created.
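As a small aside on what df.to_csv() writes, here is a minimal sketch. It uses a made-up three-row data frame (the values are hypothetical, not taken from WHAS) and an in-memory buffer instead of a real file. It shows that, by default, the row index is saved as an extra unnamed column, and that passing index=False leaves it out.

```python
import io
import pandas as pd

# Hypothetical three-row frame standing in for the converted data.
df = pd.DataFrame({"ID": [1, 2, 3], "LENFOL": [10, 25, 40], "FSTAT": [0, 1, 0]})

# Default behaviour: the row index is written as an extra, unnamed first column.
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
with_index = pd.read_csv(buf_default)

# index=False leaves the index out, so the csv holds only the data columns.
buf_clean = io.StringIO()
df.to_csv(buf_clean, index=False)
buf_clean.seek(0)
without_index = pd.read_csv(buf_clean)

print(list(with_index.columns))     # ['Unnamed: 0', 'ID', 'LENFOL', 'FSTAT']
print(list(without_index.columns))  # ['ID', 'LENFOL', 'FSTAT']
```

This article keeps the default and removes the 'Unnamed: 0' column when loading the file. If you write the csv with index=False instead, that later .drop('Unnamed: 0', axis=1) call is no longer needed and would raise a KeyError.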
whas500 contains a total of 22 variables observed for 500 subjects; the variables and their meanings are summarized in the table below. (Note that this data is a random sample of 500 subjects from the complete WHAS data, so results obtained from it are not comparable to analyses of the complete data.)
Variable name | Description | Code/value |
---|---|---|
id | Subject number | 1-500 |
age | Age at admission | Years |
gender | Sex | 0=Male, 1=Female |
hr | Baseline heart rate | Beats per minute |
sysbp | Baseline systolic blood pressure | mmHg |
diasbp | Baseline diastolic blood pressure | mmHg |
bmi | Body mass index | kg/m^2 |
cvd | History of cardiovascular disease | 0=No, 1=Yes |
afb | Atrial fibrillation | 0=No, 1=Yes |
sho | Cardiogenic shock | 0=No, 1=Yes |
chf | Congestive heart failure | 0=No, 1=Yes |
av3 | Complete atrioventricular block | 0=No, 1=Yes |
miord | Order of myocardial infarction | 0=First, 1=Recurrent |
mitype | Type of myocardial infarction | 0=Non-Q-wave, 1=Q-wave |
year | Cohort year | 1=1997, 2=1999, 3=2001 |
los | Length of hospital stay | Days from hospital admission to discharge |
dstat | Status at discharge | 0=Alive, 1=Dead |
lenfol | Total length of follow-up | Days from hospital admission to last follow-up |
fstat | Status at last follow-up | 0=Alive, 1=Dead |
Let's load the whas500 data saved as a csv file into a pandas DataFrame and name it whas.
import pandas as pd
whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)
Once the file is loaded, use whas.head() or similar to check that it contains the variables listed earlier. Note that .drop() removes unneeded rows or columns: it deletes the column with the given label when the keyword argument axis=1 is passed, and the row with the given label when axis=0 (the default). The 'Unnamed: 0' column dropped here is the row index that df.to_csv() wrote into the file. In the next section, we will explain survival time data using this data set.
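To make the axis argument of .drop() concrete, here is a minimal sketch on a hypothetical three-row frame shaped like the freshly loaded csv (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical frame that mimics the csv right after pd.read_csv().
raw = pd.DataFrame({"Unnamed: 0": [0, 1, 2], "ID": [1, 2, 3], "FSTAT": [0, 1, 0]})

# axis=1: remove the column labelled 'Unnamed: 0'.
no_col = raw.drop("Unnamed: 0", axis=1)

# axis=0 (the default): remove the row whose index label is 0.
no_row = raw.drop(0, axis=0)

print(no_col.columns.tolist())  # ['ID', 'FSTAT']
print(no_row.index.tolist())    # [1, 2]
print(raw.shape)                # (3, 3): drop() returns a new frame, raw is unchanged
```

Because .drop() returns a new data frame instead of modifying the original, the result has to be assigned to a variable, as in the loading code above.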
In this section, we will discuss the data used for survival analysis. A feature specific to survival time data is **censoring**. A censored observation is a subject for whom the event of interest had not occurred by the last follow-up date. For example, in the case of survival time after hospitalization for acute myocardial infarction, if a subject is still alive at the end of observation, we know that the subject's survival time exceeds the observation period. However, we do not know when the event will actually occur (it may be the day after observation ends, or the subject may live for another 20 years). The observation for this subject is therefore incomplete. Such observations are called **right censored**. Here, we will look at censoring using matplotlib, a plotting library for Python.
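Before drawing the real data, the idea of right censoring can be sketched with made-up numbers. The arrays below are hypothetical: true_time is the survival time we would see with unlimited follow-up, and followup_end is how long each subject was actually observed. What ends up in the data is only the smaller of the two, together with a flag saying whether the event was seen; these play the same roles as LENFOL and FSTAT in the WHAS data.

```python
import numpy as np

# Hypothetical true survival times in days (unknown in a real study).
true_time = np.array([120, 900, 45, 2500, 700])
# Hypothetical length of follow-up available for each subject, in days.
followup_end = np.array([1000, 600, 1000, 1000, 700])

# We only observe whichever comes first: the event or the end of follow-up.
observed_time = np.minimum(true_time, followup_end)
# 1 = event observed, 0 = right censored (event happens after follow-up ends).
event = (true_time <= followup_end).astype(int)

print(observed_time.tolist())  # [120, 600, 45, 1000, 700]
print(event.tolist())          # [1, 0, 1, 0, 1]
```

For the second and fourth subjects, the recorded time (600 and 1000 days) is only a lower bound on the true survival time, which is exactly what makes the observation incomplete.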
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)
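for i in range(10):  # plot the first 10 subjects
    # Horizontal line from t = 0 to the end of follow-up
    x = (0, whas.LENFOL[i])
    y = (whas.ID[i], whas.ID[i])
    plt.plot(x, y, color='black')
    if whas.FSTAT[i] == 1:
        # Red cross: death was observed
        plt.plot(whas.LENFOL[i], whas.ID[i], 'x', color='red')
    else:
        # Blue dot: still alive at last follow-up (right censored)
        plt.plot(whas.LENFOL[i], whas.ID[i], 'o', color='blue')
plt.show()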
In practice, survival time is obtained as the number of days between the start date and the last observation date. The calendar dates on which a subject's observation starts and ends are therefore ignored: only the number of days matters, and each subject's observation start date is defined as t = 0. The graph drawn by the code above is described below.
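The day-count arithmetic described above can be sketched as follows. The column names admit and last_followup and the dates themselves are hypothetical and only serve to illustrate the calculation.

```python
import pandas as pd

# Hypothetical admission and last follow-up dates for two subjects.
dates = pd.DataFrame({
    "admit": ["1997-01-13", "1997-03-01"],
    "last_followup": ["1997-02-12", "1998-03-01"],
})

admit = pd.to_datetime(dates["admit"])
last = pd.to_datetime(dates["last_followup"])

# Survival time = days between admission (t = 0) and last follow-up.
dates["lenfol_days"] = (last - admit).dt.days

print(dates["lenfol_days"].tolist())  # [30, 365]
```

Although the two subjects were admitted on different calendar dates, both are placed on the same time axis starting at t = 0.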
The horizontal axis is the survival time, and the vertical axis is the subject ID. The marker at the right end of each line shows the subject's status: a blue dot means the subject was still alive at the last follow-up, and a red cross means the subject died. The blue dots seen here are examples of right censoring.

There are two causes of incomplete observations in survival analysis: **censoring** and **truncation**. Censoring is a situation in which observation is discontinued before the event of interest is observed, for example because the subject moves away and can no longer be followed. Truncation, on the other hand, refers to subjects for whom the event of interest had not occurred by the observation end date, when that end date was fixed in advance by the design of the study. Because subjects also enter the study at random times, it is considered appropriate to treat both cases as the same right censoring.

Although I will not go into detail in this article, there are other kinds of censoring as well: left censoring, where the event of interest has already occurred before observation starts, and interval censoring, where subjects cannot be observed continuously and results are available only at fixed intervals, such as every 2 months.
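One common way to think about these kinds of censoring, offered here as an illustrative convention and not as something taken from the WHAS data, is to record for each subject an interval (lower, upper] that is known to contain the true event time. The helper censoring_type below is a hypothetical function written for this sketch.

```python
import math

def censoring_type(lower, upper):
    """Classify an observation from the interval (lower, upper] known to
    contain the true event time (times in days since t = 0)."""
    if lower == upper:
        return "exact"      # event time observed exactly
    if math.isinf(upper):
        return "right"      # event happens some time after `lower`
    if lower == 0:
        return "left"       # event already happened before `upper`
    return "interval"       # event happened between two visits

print(censoring_type(250, 250))        # exact
print(censoring_type(1000, math.inf))  # right
print(censoring_type(0, 60))           # left
print(censoring_type(60, 120))         # interval
```

In this picture, the blue dots in the graph correspond to the "right" case: all we know is that the event occurs after the last follow-up.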
In this article, we looked at survival time data by plotting it with Python. Next time, I would like to look at the Kaplan-Meier method, a univariate method for survival analysis.
Click here for the next article: Survival time analysis learned in Python 2 - Kaplan-Meier estimator https://qiita.com/Goriwaku/items/767cdc2640e29fddf7cc
References

- Hosmer DW, Lemeshow S, May S. Introduction to survival time analysis.
- Introduction to Survival Analysis in SAS. https://stats.idre.ucla.edu/sas/seminars/sas-survival/
- sas7bdat 2.2.3. https://pypi.org/project/sas7bdat/
- Kyoto University OCW, Kyoto University Graduate School of Medicine Audit Course, Biostatistics for clinical researchers, "Basics of survival time analysis". https://www.youtube.com/watch?v=NmZaY2tDKSA&feature=emb_title