[CovsirPhy] COVID-19 Python Package for Data Analysis: Parameter estimation

Introduction

We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).

Introductory article:

** This time, we will introduce Parameter estimation (parameter estimation such as SIR-F model). ** **

The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).

1. Execution environment

CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.

--Stable version: pip install covsirphy --upgrade --Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "

import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment
OS Windows Subsystem for Linux / Ubuntu
Python version 3.8.5

2. Preparation

The tables and graphs in this article were created using the data as of 9/12/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:

data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()

In addition, please confirm the actual data and execute S-R trend analysis [^ 3] with the following code.

[^ 3]: [CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis

# (Optional)Acquisition of data from the Ministry of Health, Labor and Welfare
japan_data = data_loader.japan()
jhu_data.replace(japan_data)
print(japan_data.citation)
#Instantiation of analysis class
snl = cs.Scenario(jhu_data, population_data, country="Japan")
#Confirmation of actual data
snl.records(filename=None)
# S-R trend analysis
snl.trend(filename=None)
#Check Phase settings
snl.summary()

Actual data graph: records.jpg

S-R trend analysis: trend.jpg

Phase setting:

Type Start End Population
0th Past 06Feb2020 21Apr2020 126529100
1st Past 22Apr2020 04Jul2020 126529100
2nd Past 05Jul2020 23Jul2020 126529100
3rd Past 24Jul2020 01Aug2020 126529100
4th Past 02Aug2020 14Aug2020 126529100
5th Past 15Aug2020 29Aug2020 126529100
6th Past 30Aug2020 12Sep2020 126529100

3. Execution example

By S-R trend analysis, it was possible to divide into "Phase" (period during which the parameter is constant). In this article, the parameter values are estimated using the data of each "Phase" (for example, the data of 2020/2/6 --2020/4/21 for the 0th phase).

I will write another article about the estimation mechanism. The parameter values are proposed using the ʻoptunapackage, the numerical solution is calculated byscipy.integrate.solve_ivp`, and the parameter set with little error from the actual data is selected.

How to read the result will be described later, but you can execute and get the result list in the following two lines. This time, I used SIR-F model [^ 4].

[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model

# Parameter estimation with SIR-F model
snl.estimate(cs.SIRF)
#Get parameter list
snl.summary()
#Example of standard output (processing time varies depending on the number of CPUs, etc.)
#Details below: Latest Phase=After estimating including tau in the 6th phase, fix tau and 0-5th parameter estimation
<SIR-F model: parameter estimation>
Running optimization with 8 CPUs...
        6th phase (30Aug2020 - 12Sep2020): finished  704 trials in 0 min 25 sec
        5th phase (15Aug2020 - 29Aug2020): finished  965 trials in 1 min  0 sec
        3rd phase (24Jul2020 - 01Aug2020): finished  965 trials in 1 min  0 sec
        1st phase (22Apr2020 - 04Jul2020): finished  913 trials in 1 min  0 sec
        4th phase (02Aug2020 - 14Aug2020): finished  969 trials in 1 min  0 sec
        0th phase (06Feb2020 - 21Apr2020): finished  853 trials in 1 min  0 sec
        2nd phase (05Jul2020 - 23Jul2020): finished  964 trials in 1 min  0 sec
Completed optimization. Total: 1 min 27 sec
Type Start End Population ODE Rt theta kappa rho sigma tau 1/alpha2 [day] 1/gamma [day] alpha1 [-] 1/beta [day] RMSLE Trials Runtime
0th Past 06Feb2020 21Apr2020 126529100 SIR-F 5.54 0.0258495 0.0002422 0.0322916 0.00543343 480 1376 61 0.026 10 1.17429 804 1 min 0 sec
1st Past 22Apr2020 04Jul2020 126529100 SIR-F 0.41 0.0730748 0.000267108 0.0118168 0.0264994 480 1247 12 0.073 28 1.11459 861 1 min 0 sec
2nd Past 05Jul2020 23Jul2020 126529100 SIR-F 2.01 0.000344333 7.92419e-05 0.0467789 0.023201 480 4206 14 0 7 0.0331522 910 1 min 0 sec
3rd Past 24Jul2020 01Aug2020 126529100 SIR-F 1.75 0.00169155 4.05087e-05 0.0459332 0.0260965 480 8228 12 0.002 7 0.0201773 923 1 min 0 sec
4th Past 02Aug2020 14Aug2020 126529100 SIR-F 1.46 0.000634554 0.000116581 0.0325815 0.0221345 480 2859 15 0.001 10 0.0751473 909 1 min 0 sec
5th Past 15Aug2020 29Aug2020 126529100 SIR-F 0.8 0.00244294 9.30884e-05 0.0272693 0.0337857 480 3580 9 0.002 12 0.0420563 907 1 min 0 sec
6th Past 30Aug2020 12Sep2020 126529100 SIR-F 0.69 5.36037e-05 0.000467824 0.0219379 0.0312686 480 712 10 0 15 0.0132161 724 0 min 25 sec

4. Parameter estimates

It's long sideways, so let's look at it in order. First is the estimated value of the parameter.

# cs.SIRF.PARAMETERS: SIR-F model parameter name list
cols = ["Start", "End", "ODE", "tau", *cs.SIRF.PARAMETERS]
snl.summary(columns=cols)
Start End ODE tau theta kappa rho sigma
0th 06Feb2020 21Apr2020 SIR-F 480 0.0258495 0.0002422 0.0322916 0.00543343
1st 22Apr2020 04Jul2020 SIR-F 480 0.0730748 0.000267108 0.0118168 0.0264994
2nd 05Jul2020 23Jul2020 SIR-F 480 0.000344333 7.92419e-05 0.0467789 0.023201
3rd 24Jul2020 01Aug2020 SIR-F 480 0.00169155 4.05087e-05 0.0459332 0.0260965
4th 02Aug2020 14Aug2020 SIR-F 480 0.000634554 0.000116581 0.0325815 0.0221345
5th 15Aug2020 29Aug2020 SIR-F 480 0.000644851 0.000383424 0.0274104 0.0337156
6th 30Aug2020 12Sep2020 SIR-F 480 5.36037e-05 0.000467824 0.0219379 0.0312686

SIR-F model[^4]:

\begin{align*}
\mathrm{S} \overset{\beta I}{\longrightarrow} \mathrm{S}^\ast \overset{\alpha_1}{\longrightarrow}\ & \mathrm{F}    \\
\mathrm{S}^\ast \overset{1 - \alpha_1}{\longrightarrow}\ & \mathrm{I} \overset{\gamma}{\longrightarrow} \mathrm{R}    \\
& \mathrm{I} \overset{\alpha_2}{\longrightarrow} \mathrm{F}    \\
\end{align*}

$ \ Alpha_2 $, $ \ beta $, $ \ gamma $ has "time" as a dimension. This "time" is $ f (T + \ Delta T) --f (T) = x (T) \ Delta T $, which is a discretized equation of simultaneous ordinary differential equations $ f'(T) = x (T) $. It depends on \ Delta T $. Since data is acquired every other day, $ \ Delta T $ is less than one day (= 1440 min), but I do not know the specific value. It is difficult to compare the dimensional parameters $ \ alpha_2 $, $ \ beta $, $ \ gamma $ of different countries, which may vary by country or region.

Therefore, we set the variable $ \ tau $, which corresponds to $ \ Delta T $, to make the dimensional parameters dimensionless.

\begin{align*}
(S, I, R, F) = & N \times (x, y, z, w)    \\
(T, \alpha_1, \alpha_2, \beta, \gamma) = & (\tau t, \theta, \tau^{-1}\kappa, \tau^{-1}\rho, \tau^{-1}\sigma)    \\
1 \leq \tau & \leq 1440    \\
\end{align*}

At this time [^ 4],

\begin{align*}
& 0 \leq (x, y, z, w, \theta, \kappa, \rho, \sigma) \leq 1  \\
\end{align*}

After estimating the dimensionless parameters as $ \ tau $ using the latest Phase data so that the value of $ \ tau $ is constant in the same country, the same value is calculated as $ \ tau $ when estimating the parameters of other Phases. I am trying to use it as. This time it was 480 min as shown in the table. I'm programming to use a divisor of 1440 min per day to simplify the calculation.

Now, let's graph the transition of each dimensionless parameter.

\begin{align*}
& \frac{\mathrm{d}x}{\mathrm{d}t}= - \rho x y  \\
& \frac{\mathrm{d}y}{\mathrm{d}t}= \rho (1-\theta) x y - (\sigma + \kappa) y  \\
& \frac{\mathrm{d}z}{\mathrm{d}t}= \sigma y  \\
& \frac{\mathrm{d}w}{\mathrm{d}t}= \rho \theta x y + \kappa y  \\
\end{align*}

transition of rho

When Susceptible comes into contact with Infected, the probability of getting infected $ \ rho $ changes:

snl.history(target="rho", filename=None)

rho.jpg

It is interpreted that the effect of the state of emergency (area limited 2020/4/7, nationwide 2020/4/16, nationwide cancellation 2020/5/25) appeared in the latter half of April and was maintained until the beginning of July. I will. After that, it jumped up and gradually decreased. Details need to be discussed, but it is a parameter that directly reflects the effects of measures such as three-cs avoidance.

Transition of sigma

Probability of transition from Infected to Recovered $ \ sigma $ transition:

snl.history(target="sigma", filename=None)

sigma.jpg

After a large rise in the latter half of April, it is on an upward trend while repeating rising / falling. This parameter reflects the medical care provision system and the development / supply status of new drugs.

Transition of kappa

Changes in infected mortality rate $ \ kappa $:

snl.history(target="kappa", filename=None)

kappa.jpg

It seems that it fluctuates greatly, but since the absolute value of the value is small, it can be considered that it is kept constant to some extent (verification by comparison with other countries is necessary). It is necessary to put in place a medical system and bring $ \ kappa $ as close to 0 as possible with a sufficient supply of new drugs.

Transition of theta

Percentage of infected people who received a definitive diagnosis who died at the time of the definitive diagnosis Changes in $ \ theta $:

snl.history(target="theta", filename=None)

theta.jpg

In the early stages of infection, the medical care provision system itself was extremely tight, so it is difficult to interpret, but it is thought that the value will increase if the examination is delayed and appropriate treatment cannot be performed.

5. Transition of dimensional parameters

The dimensional parameters are also included in the table. To make it easier to interpret, the unit is [day], which is the reciprocal except for the dimensionless $ \ alpha_1 $.

cols = ["Start", "End", "ODE", "tau", *cs.SIRF.DAY_PARAMETERS]
fh.write(snl.summary(columns=cols).to_markdown())
Start End ODE tau alpha1 [-] 1/alpha2 [day] 1/beta [day] 1/gamma [day]
0th 06Feb2020 21Apr2020 SIR-F 480 0.026 1376 10 61
1st 22Apr2020 04Jul2020 SIR-F 480 0.073 1247 28 12
2nd 05Jul2020 23Jul2020 SIR-F 480 0 4206 7 14
3rd 24Jul2020 01Aug2020 SIR-F 480 0.002 8228 7 12
4th 02Aug2020 14Aug2020 SIR-F 480 0.001 2859 10 15
5th 15Aug2020 29Aug2020 SIR-F 480 0.002 3580 12 9
6th 30Aug2020 12Sep2020 SIR-F 480 0 712 15 10

The dimensional parameter is omitted because it shows the same progress as the dimensionless parameter, but the graph can be obtained with the following code.

#For beta->Graph omitted
snl.history(target="1/beta [day]", filename=None)

6. Changes in the number of effective reproductions

The basic / effective reproduction number Rt of SIR-F model [^ 4] is defined as follows.

\begin{align*}
R_t = \rho (1 - \theta) (\sigma + \kappa)^{-1} = \beta (1 - \alpha_1) (\gamma + \alpha_2)^{-1}
\end{align*}

Transition:

#List
cols = ["Start", "End", "ODE", "tau", "Rt"]
snl.summary(columns=cols)
#Graph
snl.history(target="Rt", filename="rt.jpg ")
Start End ODE Rt
0th 06Feb2020 21Apr2020 SIR-F 5.54
1st 22Apr2020 04Jul2020 SIR-F 0.41
2nd 05Jul2020 23Jul2020 SIR-F 2.01
3rd 24Jul2020 01Aug2020 SIR-F 1.75
4th 02Aug2020 14Aug2020 SIR-F 1.46
5th 15Aug2020 29Aug2020 SIR-F 0.8
6th 30Aug2020 12Sep2020 SIR-F 0.69

rt.jpg

Since $ Rt> 1 $ is one indicator of the spread of infection, the horizontal line $ Rt = 1 $ is displayed.

7. Parameter accuracy

The table also includes the RMSLE score, which measures the accuracy of the parameters, the number of parameter sets proposed by the ʻoptuna` package to estimate the parameters, and the execution time.

cols = ["Start", "End", "RMSLE", "Trials", "Runtime"]
snl.summary(columns=cols)
Start End RMSLE Trials Runtime
0th 06Feb2020 21Apr2020 1.17429 690 1 min 1 sec
1st 22Apr2020 04Jul2020 1.11459 764 1 min 0 sec
2nd 05Jul2020 23Jul2020 0.0331522 810 1 min 1 sec
3rd 24Jul2020 01Aug2020 0.0201773 816 1 min 1 sec
4th 02Aug2020 14Aug2020 0.0751473 808 1 min 0 sec
5th 15Aug2020 29Aug2020 0.0420563 804 1 min 0 sec
6th 30Aug2020 12Sep2020 0.0132161 658 0 min 25 sec

The definition formula of RMSLE score (Root Mean Squared Log Error) [^ 5] is as follows. It can be said that the closer it is to 0, the better the actual data is reflected. Although omitted, the verification of the estimation method itself is performed using theoretical data (data on the number of patients theoretically created from the SIR-F model formula and parameter set example).

\begin{align*}
& \sqrt{\cfrac{1}{n}\sum_{i=1}^{n}(log_{10}(A_{i} + 1) - log_{10}(P_{i} + 1))^2}
\end{align*}

$ A $ is the actual data and $ P $ is the predicted value. When $ i = 1, 2, 3, 4 (= n) $, $ A_i $ and $ P_i $ are the actual data / predicted values of $ S, I, R, F $, respectively.

Since it is difficult to get an image with just the value, I made a graph. First, about the 0th phase, which has the largest RMSLE value. The top graph shows the difference between the actual data and the predicted value, and the second, third and fourth show both the actual data and the predicted value for each variable.

snl.estimate_accuracy(phase="0th", filename=None)

accuracy_0th.jpg

There is some error. It seems better to split the 0th phase using Scenario.separate () etc. (a separate article will be created).

On the other hand, in the 6th phase, which has the lowest RMSLE score, the actual data and the predicted value often overlap.

snl.estimate_accuracy(phase="6th", filename=None)

accuracy_6th.jpg

8. Postscript

This time, I explained how to estimate the parameters of each Phase. More than half a year has passed since the epidemic started, and I think we are at a stage where parameter verification, future parameter prediction, and scenario analysis are important. It seems that the attention of COVID-19 as a theme of data science is decreasing, but as the data is accumulated, it has become possible to perform deeper analysis than in the early days when we focused on creating dashboards. It was.

Thank you for your hard work this time as well!

Recommended Posts

[CovsirPhy] COVID-19 Python Package for Data Analysis: Parameter estimation
[CovsirPhy] COVID-19 Python Package for Data Analysis: Data loading
[CovsirPhy] COVID-19 Python package for data analysis: SIR-F model
[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis
[CovsirPhy] COVID-19 Python Package for Data Analysis: SIR model
Python for Data Analysis Chapter 4
Python for Data Analysis Chapter 2
Python for Data Analysis Chapter 3
Preprocessing template for data analysis (Python)
Python visualization tool for data analysis work
Data analysis python
Data analysis with python 2
Data analysis using Python 0
Data analysis with Python
Let's analyze Covid-19 (Corona) data using Python [For beginners]
[For beginners] How to study Python3 data analysis exam
My python data analysis container
[Python] Notes on data analysis
Python data analysis learning notes
Data analysis using python pandas
Tips for data analysis ・ Notes
Which should I study, R or Python, for data analysis?
<Python> Build a dedicated server for Jupyter Notebook data analysis
HMM parameter estimation implementation in python
Python course for data science_useful techniques
Practice of data analysis by Python and pandas (Tokyo COVID-19 data edition)
Data analysis for improving POG 3-Regression analysis-
Data formatting for Python / color plots
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
Create a USB boot Ubuntu with a Python environment for data analysis
A summary of Python e-books that are useful for free-to-read data analysis
Detailed Python techniques required for data shaping (1)
[Python] First data analysis / machine learning (Kaggle)
Data analysis starting with python (data preprocessing-machine learning)
How to use "deque" for Python data
Detailed Python techniques required for data shaping (2)
I did Python data analysis training remotely
Python 3 Engineer Certified Data Analysis Exam Preparation
JupyterLab Basic Setting 2 (pip) for data analysis
JupyterLab Basic Setup for Data Analysis (pip)
Analysis for Data Scientists: Qiita Self-Article Summary 2020
Data analysis in Python Summary of sources to look at first for beginners
Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~
Python template for log analysis at explosive speed
Prepare a programming language environment for data analysis
[Examination Report] Python 3 Engineer Certified Data Analysis Exam
Python 3 Engineer Certification Data Analysis Exam Pre-Exam Learning
An introduction to statistical modeling for data analysis
[Python] Data analysis, machine learning practice (Kaggle) -Data preprocessing-
How to use data analysis tools for beginners
Display candlesticks for FX (forex) data in Python
Data analysis in Python: A note about line_profiler
[Python] Flow from web scraping to data analysis
A well-prepared record of data analysis in Python
Astro: Python modules / functions often used for analysis
[Updated from time to time] Python memos often used for data analysis [N division, etc.]