Survival time analysis learned with Python 1: What is survival time data?

Introduction

Nice to meet you. My name is Goriwaku, and I am a graduate student majoring in econometrics. In this series I would like to study survival time analysis by implementing it in Python, for readers who have a minimum knowledge of statistics (up to linear regression and logistic regression). I will write carefully so that even beginners can follow, so please understand that it may feel redundant to knowledgeable readers. The code and files used are posted on my GitHub (https://github.com/goriwaku/survival_analysis), so you can download them. If you are new to Python, I recommend typing the code yourself as you read. My writing may be clumsy, but thank you for reading.

What is survival time analysis?

Survival time analysis is a technique from medical statistics that identifies factors affecting the time until an event (death, onset of illness, etc.) occurs. For example, for the survival time after hospitalization for acute myocardial infarction, the purpose is to examine (1) whether women survive longer than men and (2) how age affects survival time. I gave medical data as an example, but its use is not limited to the medical field. It can also be used to examine what affects the time until a machine fails, or the time until a venture company goes bankrupt. Its most distinctive feature is that it lets us verify the effect of factors on the time until an event. In this series, I would like to study survival time analysis while actually implementing it in Python.

Data set to use

This analysis uses data from the Worcester Heart Attack Study (WHAS), which covers residents of Worcester, Massachusetts, who developed acute myocardial infarction. The data was obtained from UCLA's Introduction to Survival Analysis in SAS and is distributed as a SAS file (sas7bdat), so I converted it to a csv file with the sas7bdat module. The code and the csv file are on my GitHub, but just in case, the procedure is described below. First, install the required module with the pip command.

pip install sas7bdat

After the installation is complete, run the following code in Python. If you are not running Python in the directory that contains the sas7bdat file, specify an absolute path to the file instead of a relative path.

from sas7bdat import SAS7BDAT
# Read the SAS file and convert it to a pandas DataFrame
with SAS7BDAT('whas500.sas7bdat', skip_header=False) as reader:
    df = reader.to_data_frame()
# Save the DataFrame as a csv file
df.to_csv('whas500.csv')

Here, df.to_csv() is one of the commonly used pandas methods; it writes a data frame (df) out as a csv file. Pass the name you want to give the generated csv file as the argument. Here it is simply whas500.csv. After running the code, check that a csv file with this name exists.
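One detail of to_csv() matters for the loading step later in this article: by default it also writes the row index as an extra, unnamed first column, and pandas names that column 'Unnamed: 0' when the file is read back. The following is a minimal sketch with a small made-up data frame (the values are not taken from whas500) that reproduces this in memory with io.StringIO instead of a real file. It also shows index=False as one possible alternative to dropping the column after loading.

```python
import io

import pandas as pd

# Toy data frame for illustration only (not the real whas500 data).
toy = pd.DataFrame({'ID': [1, 2, 3], 'LENFOL': [100, 200, 300]})

# Default to_csv(): the row index is written as an unnamed first column.
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)
print(list(with_index.columns))

# Cleanup after loading: drop the extra column.
cleaned = with_index.drop('Unnamed: 0', axis=1)

# Alternative: do not write the index in the first place.
buf2 = io.StringIO()
toy.to_csv(buf2, index=False)
buf2.seek(0)
without_index = pd.read_csv(buf2)
print(list(without_index.columns))
```

Either route ends with the same columns. This article keeps the default to_csv() call and drops 'Unnamed: 0' when loading.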

whas500 contains a total of 22 variables observed for 500 subjects; their names and meanings are summarized in the table below. (Note that this data is a random sample of 500 subjects from the complete WHAS data, so results obtained here are not comparable to analyses of the complete data.)

Variable  Contents                            Code / value
id        Subject number                      1-500
age       Age at admission                    Years
gender    Sex                                 0 = male, 1 = female
hr        Initial heart rate                  Beats per minute
sysbp     Initial systolic blood pressure     mmHg
diasbp    Initial diastolic blood pressure    mmHg
bmi       Body mass index                     kg/m^2
cvd       History of cardiovascular disease   0 = no, 1 = yes
afb       Atrial fibrillation                 0 = no, 1 = yes
sho       Cardiogenic shock                   0 = no, 1 = yes
chf       Congestive heart failure            0 = no, 1 = yes
av3       Complete atrioventricular block     0 = no, 1 = yes
miord     Order of myocardial infarction      0 = first, 1 = recurrent
mitype    Type of myocardial infarction       0 = non-Q-wave, 1 = Q-wave
year      Cohort year                         1 = 1997, 2 = 1999, 3 = 2001
los       Length of hospital stay             Days
dstat     Status at discharge                 0 = alive, 1 = dead
lenfol    Total length of follow-up           Days from hospital admission to last follow-up
fstat     Status at last follow-up            0 = alive, 1 = dead

Let's load whas500, saved as a csv file, into a pandas DataFrame and name it whas.

import pandas as pd

whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)

Once the file is loaded, use whas.head() or similar to make sure it contains the variables listed earlier. Note that .drop() is a method used to remove unnecessary rows or columns: it deletes the named column when the keyword argument axis=1 is given, and the named row when axis=0 (the default). In the next section, we will explain survival time data using this data.
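When checking the output of whas.head(), the 0/1 codes are easier to read once they are mapped to labels. The following is a minimal sketch with a few made-up rows (not real whas500 records). The uppercase column names are an assumption that follows the plotting code later in this article, and the label columns GENDER_LABEL and FSTAT_LABEL are names introduced here only for illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not real whas500 records).
toy = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'GENDER': [0, 1, 0, 1],
    'FSTAT': [1, 0, 0, 1],
})

# Code tables following the variable list (0/1 codes to readable labels).
gender_labels = {0: 'male', 1: 'female'}
fstat_labels = {0: 'alive', 1: 'dead'}

# Add label columns next to the original codes; the codes stay untouched.
labeled = toy.assign(
    GENDER_LABEL=toy['GENDER'].map(gender_labels),
    FSTAT_LABEL=toy['FSTAT'].map(fstat_labels),
)
print(labeled)

# Number of subjects who were still alive at the last follow-up.
n_alive = int((toy['FSTAT'] == 0).sum())
print(n_alive)
```

Keeping the numeric codes and adding labels as separate columns leaves the data usable for later analysis, which expects the 0/1 values.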

Survival time data

In this section, we will discuss the data used for survival time analysis. A feature specific to survival time data is **censoring**. A censored observation is a subject for whom the event of interest had not occurred by the last follow-up date. For example, in the case of survival time after hospitalization for acute myocardial infarction, if a subject is still alive at the end of observation, we know that the subject's survival time exceeds the observation period. However, we do not know when the event will actually occur (it may be the day after observation ends, or the subject may live for another 20 years). The observation for this subject is therefore incomplete. Such observations are called **right-censored**. Here, we will look at censoring using matplotlib, a plotting library for Python.

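import pandas as pd
import matplotlib.pyplot as plt

whas = pd.read_csv('whas500.csv').drop('Unnamed: 0', axis=1)

# Draw the follow-up of the first 10 subjects, one horizontal line per subject
for i in range(10):
    x = (0, whas.LENFOL[i])  # from the start of observation (t = 0) to the last follow-up
    y = (whas.ID[i], whas.ID[i])
    plt.plot(x, y, color='black')
    if whas.FSTAT[i] == 1:
        # Death was observed: red cross
        plt.plot(whas.LENFOL[i], whas.ID[i], 'x', color='red')
    else:
        # Still alive at the last follow-up (right-censored): blue dot
        plt.plot(whas.LENFOL[i], whas.ID[i], 'o', color='blue')
plt.xlabel('Survival time (days)')
plt.ylabel('Subject ID')
plt.show()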

In practice, survival time is obtained as the difference between the start date and the last observation date. The beginning and end of each subject's observation are therefore expressed only as a number of days, and the observation start date is defined as t = 0. The graph drawn by the above code is shown below.

(Figure: survival_data_plot.png)

The horizontal axis is the survival time, and the vertical axis is the subject ID. The marker at the right end of each line shows the subject's status at the last follow-up: a red cross means the subject died, and a blue dot means the subject was still alive. The blue dots seen here are examples of right censoring.

There are two causes of incomplete observations in survival time analysis: **censoring** and **truncation**. Censoring here means that observation is discontinued before the event of interest is observed for a reason specific to the subject, for example because the subject moves away and follow-up cannot continue. Truncation, as the term is used in this article, means that the study design fixes an observation end date and the event of interest had not occurred by that date. Because the start of observation is also random for each subject, it is considered appropriate to treat both cases as the same right censoring.

Although I will not go into detail in this article, there are other kinds of censoring as well. In left censoring, the event of interest has already occurred before observation starts. In interval censoring, the subject cannot be observed continuously and results are obtained only at fixed intervals, such as every two months, so the event is only known to have occurred between two observations.
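The calculation of survival time from dates, and the two ways an observation can end without the event (the subject drops out, or the planned end of the study arrives), can be sketched with a few made-up subjects. Everything in this example is hypothetical: the dates, the STUDY_END constant, and the column names admit_date, death_date, dropout_date and reason are introduced only for illustration and are not part of whas500. Only LENFOL and FSTAT mirror the variables used in this article.

```python
import pandas as pd

# Hypothetical study design: all dates below are made up for illustration.
STUDY_END = pd.Timestamp('2002-12-31')

subjects = pd.DataFrame({
    'ID': [1, 2, 3],
    'admit_date': pd.to_datetime(['2001-01-10', '2001-06-01', '2002-03-15']),
    # Date of death if it was observed (NaT = not observed).
    'death_date': pd.to_datetime(['2001-03-11', None, None]),
    # Date the subject left the study, if any (NaT = no dropout).
    'dropout_date': pd.to_datetime([None, '2001-12-01', None]),
})

# Date on which observation stops if no death is seen:
# the dropout date, or the study end date when there is no dropout.
censor_date = subjects['dropout_date'].fillna(STUDY_END)

# A death counts as an event only if it was observed before observation stopped.
died = subjects['death_date'].notna() & (subjects['death_date'] <= censor_date)

# Last observation date: the death date for events, the censoring date otherwise.
last_date = subjects['death_date'].where(died, censor_date)

# Survival time in days, measured from t = 0 at admission.
subjects['LENFOL'] = (last_date - subjects['admit_date']).dt.days
subjects['FSTAT'] = died.astype(int)

# Why observation ended (for illustration; the analysis itself only needs FSTAT).
subjects['reason'] = 'death'
subjects.loc[~died & subjects['dropout_date'].notna(), 'reason'] = 'dropout'
subjects.loc[~died & subjects['dropout_date'].isna(), 'reason'] = 'study end'

print(subjects[['ID', 'LENFOL', 'FSTAT', 'reason']])
```

Subjects 2 and 3 end for different reasons, but both are recorded the same way, as a follow-up length with FSTAT = 0, which is the right-censored form used in the plot.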

This time, we looked at survival time data by plotting it with Python. Next time, I would like to look at the Kaplan-Meier method, a univariate method for survival time analysis.

Click here for the next article: Survival time analysis learned in Python 2 - Kaplan-Meier estimator https://qiita.com/Goriwaku/items/767cdc2640e29fddf7cc

References / Reference Links

Hosmer DW, Lemeshow S, May S. Introduction to survival time analysis.
Introduction to Survival Analysis in SAS. https://stats.idre.ucla.edu/sas/seminars/sas-survival/
sas7bdat 2.2.3. https://pypi.org/project/sas7bdat/
Kyoto University OCW, Kyoto University Graduate School of Medicine, Audit Course: Biostatistics for clinical researchers, "Basics of survival time analysis". https://www.youtube.com/watch?v=NmZaY2tDKSA&feature=emb_title
