[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis

Introduction

We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).

Introductory article:

** This time, we would like to introduce S-R trend analysis (an analysis method for trends in the spread of infection). ** **

The English version of the document is Covsir Phy: COVID-19 analysis with phase-dependent SIRs, [Kaggle: COVID-19 data with SIR model]( Please refer to https://www.kaggle.com/lisphilar/covid-19-data-with-sir-model).

1. Execution environment

CovsirPhy can be installed by the following method! Please use Python 3.7 or above, or Google Colaboratory.

--Stable version: pip install covsirphy --upgrade --Development version: pip install" git + https://github.com/lisphilar/covid19-sir.git#egg=covsirphy "

import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment
OS Windows Subsystem for Linux
Python version 3.8.5

The tables and graphs in this article were created using the data as of 9/11/2020. Click here for the code [^ 2] to download the actual data from COVID-19 Data Hub [^ 1]:

data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()

Regarding the number of infected people in Japan, a value different from the value [^ 3] of Ministry of Health, Labor and Welfare HP is output. It seems. If you want to use the value of MHLW, download the data from COVID-19 dataset in Japan with the following code. Please replace. If ʻImportError / ModuleNotFoundError: requests and aiohttp occur at that time, please install by pip install requests aiohttp` (during the cause investigation, requests should be included in the dependent package ...).

[^ 3]: When the "number of positives", "number of people discharged or canceled" and "number of deaths" defined by the Ministry of Health, Labor and Welfare are treated as confirmed cases, recovered, recovered, and fatal. ..

japan_data = data_loader.japan()
jhu_data.replace(japan_data)
# Citation -> str
print(japan_data.citation)

2. Check the data

Before going into the explanation of S-R trend analysis, let's look at the actual data (Japanese data as an example) using Scenario.records ().

snl = cs.Scenario(jhu_data, population_data, country="Japan")
#When not displaying the graph
snl.records(show_figure=False)
Date Confirmed Infected Fatal Recovered
209 2020-09-07 71856 7957 1363 62536
210 2020-09-08 72234 7575 1377 63282
211 2020-09-09 72726 7233 1393 64100
212 2020-09-10 73221 6980 1406 64835
213 2020-09-11 73901 6899 1412 65590
#Display the graph
# filename:Default value=None (Do not save as an image file)
_ = snl.records(filename=None)

records.jpg

3. Purpose of S-R trend analysis

When dealing with infectious disease data using simultaneous ordinary differential equations such as the SIR model, it is common to perform analysis under one of the following assumptions.

--Model parameters are constant from the beginning to the end of the epidemic --Model parameters fluctuate daily and follow Poisson distribution etc.

However, when analyzing COVID-19, I thought that it was necessary to customize rather than adopt either one.

First of all, in this epidemic, each country and each individual has taken some measures to control the epidemic from around February and March 2020 when the infection began to spread to the world. Parameters such as infection rate fluctuate from time to time and cannot be assumed to be constant until the end (clear from the above-mentioned actual data and SIR-F model waveform [^ 4]).

[^ 4]: [CovsirPhy] COVID-19 Python package for data analysis: SIR-F model

On the other hand, if the parameters fluctuate every day, the analysis will depend on each value of the acquired data. Especially in the case of COVID-19, it is necessary to grasp the outline of the epidemic instead of the daily data because the inspection system was not established at the initial stage and there is a possibility that it has infectivity even if it is asymptomatic. I thought.

Therefore, CovsirPhy decided to take advantage of both by subdividing the parameters into periods that seem to be constant. The "period during which the parameters are constant" is called "Phase". From the start to the end of the epidemic, there are multiple Phases with different parameters. The code will be described in another article, but I think that the execution reproduction number $ R_ {\ mathrm {t}} $ of the SIR-F model, for example, changes stepwise as shown in the graph.

rt.jpg

So how do you set the Phase? Since it takes some time to calculate the parameters [^ 5], I would like to find the branch point (date when the parameter value changes) of Phase without calculating the parameters.

[^ 5]: It takes about 2 minutes for parallel calculation of 10 phases and 8 CPUs. You can wait for the analysis of one country, but it is time-consuming and difficult considering that the data increases every day and the data of many countries may be analyzed.

4. How it works

Using S-R trend analysis, it is possible to find the turning point of the phase from the transition of the number of susceptibility holders $ S $ and the number of recoverers $ R $. I will omit the process until I come up with it because it will be long, but [Kaggle: COVID-19 data with SIR model, SR trend analysis](https://www.kaggle.com/lisphilar/covid-19-data-with- sir-model # SR-trend-analysis), so please refer to it.

In the simultaneous ordinary differential equations of SIR model [^ 6] and SIR-F model [^ 4], it is as follows for $ S $ and $ R $.

[^ 6]: [CovsirPhy] COVID-19 Python package for data analysis: SIR model

\begin{align*}
& \frac{\mathrm{d}S}{\mathrm{d}T}= - N^{-1}\beta S I  \\
& \frac{\mathrm{d}R}{\mathrm{d}T}= \gamma I  \\
\end{align*}

Let's divide $ \ frac {\ mathrm {d} S} {\ mathrm {d} T} $ by $ \ frac {\ mathrm {d} R} {\ mathrm {d} T} $.

\begin{align*}
\cfrac{\mathrm{d}S}{\mathrm{d}R} &= - \cfrac{\beta}{N \gamma} S  \\
\end{align*}

If you integrate this and put $ a = \ frac {\ beta} {N \ gamma} $ [^ 7],

[^ 7]: Pioneer in analyzing SR planes in SIR model: Balkew, Teshome Mogessie, "The SIR Model When S (t) is a Multi-Exponential Function." (2010) .Electronic Theses and Dissertations.Paper 1747 .

S_{(R)} = N e^{-a R}

Divide both sides by logarithm,

\log S_{(R)} = - a R + \log N

** When the model parameters are constant, draw a semi-log graph with the x-axis as the number of recoverers $ R $ and the y-axis as the number of sensitivity holders $ S $, and it will be a straight line! ** **

(Graph: From the Kaggle Notebook mentioned above, the difference in the slope of the straight line when the parameters are changed)

__results___208_0.png

5. Code example

It can be executed in one line by the Scenario.trend () method. A semi-logarithmic graph of S-R is displayed. Phase is named Initial (0th) phase, 1st phase, 2nd phase, ... [^ 8].

[^ 8]: Related technology: [Python] Convert natural numbers to ordinal numbers

# show_figure:Display the graph
# filename:Default value=None (Do not save as an image file)
snl.trend(show_figure=True, filename=None)

trend.jpg

A list of start and end dates for each Phase can be obtained in data frame format using the Scenario.summary () method.

snl.summary()
Type Start End Population
0th Past 06Feb2020 21Apr2020 126529100
1st Past 22Apr2020 03Jul2020 126529100
2nd Past 04Jul2020 23Jul2020 126529100
3rd Past 24Jul2020 31Jul2020 126529100
4th Past 01Aug2020 10Aug2020 126529100
5th Past 11Aug2020 21Aug2020 126529100
6th Past 22Aug2020 29Aug2020 126529100
7th Past 30Aug2020 11Sep2020 126529100

Since the value of the number of susceptibility holders $ S $ changes depending on the value of the total population, it is displayed together as an important analysis condition. In addition, although not shown in this article, parameter estimates can be listed using the Scenario.summary () method. If there is a Phase for which the actual data is not registered, the Type column of the corresponding line will be" Future ".

Although omitted this time, the above Phase setting is performed by Scenario.add (), Scenario.combine (), Scenario.delete (), Scenario.separate (), Scenario.clear (). It is also possible to edit.

6. Implementation

I'm using the ruptures package to find the number of branch points in Phase and calculate the number of recoverers that will be the branch points. I originally used fbprophet, which is famous as a time series analysis package, but there was a problem that the number of branch points had to be set manually. Ilyass Tabial of Colaborator told me [^ 9], and I made it possible to automatically estimate the number of branch points using ruptures.

From covsirphy.ChangeFinder implementation code, the main steps are as follows.

  1. Get the S, R values for each date
  2. Calculate log10 (S)
  3. Set the value of log10 (S) for R = 1, 2, 3. .: If it is not in the data, fill in the values before and after.
  4. Sampling to reduce the amount of calculation: Sampling number = number of original data
  5. Using the Ruptures package, find the set (R, log10 (S)) of points where log10 (S) changes significantly with respect to the amount of change in the number of recoverers.
  6. Collate log10 (S) of the set of change points with the original data to obtain the date of each change point (approximate value if it is not in the original data. Log10 (S) because the number of confirmed cases has a larger variation than R. It was used)
  7. Calculate the start and end dates of each Phase from the date of change

Afterword

Next time, I will explain how to estimate the parameters for each Phase.

Thank you for your hard work!

Recommended Posts

[CovsirPhy] COVID-19 Python package for data analysis: S-R trend analysis
[CovsirPhy] COVID-19 Python package for data analysis: SIR-F model
[CovsirPhy] COVID-19 Python Package for Data Analysis: SIR model
[CovsirPhy] COVID-19 Python Package for Data Analysis: Parameter estimation
[CovsirPhy] COVID-19 Python Package for Data Analysis: Scenario Analysis (Parameter Comparison)
Python for Data Analysis Chapter 4
Python for Data Analysis Chapter 2
Python for Data Analysis Chapter 3
Preprocessing template for data analysis (Python)
Python visualization tool for data analysis work
Data analysis with python 2
Data analysis using Python 0
Data analysis overview python
Python data analysis template
Data analysis with Python
Let's analyze Covid-19 (Corona) data using Python [For beginners]
Data analysis for improving POG 1 ~ Web scraping with Python ~
My python data analysis container
[Python] Notes on data analysis
Data analysis using python pandas
Tips for data analysis ・ Notes
Analyzing Twitter Data | Trend Analysis
[Understand in the shortest time] Python basics for data analysis
Which should I study, R or Python, for data analysis?
<Python> Build a dedicated server for Jupyter Notebook data analysis
Python: Time Series Analysis: Preprocessing Time Series Data
Python course for data science_useful techniques
Practice of data analysis by Python and pandas (Tokyo COVID-19 data edition)
Data analysis for improving POG 3-Regression analysis-
Data formatting for Python / color plots
Data analysis starting with python (data visualization 1)
Data analysis starting with python (data visualization 2)
Create a USB boot Ubuntu with a Python environment for data analysis
A summary of Python e-books that are useful for free-to-read data analysis
Program for Twitter Trend Analysis (Personal Note)
Detailed Python techniques required for data shaping (1)
[Python] First data analysis / machine learning (Kaggle)
Data analysis starting with python (data preprocessing-machine learning)
How to use "deque" for Python data
Detailed Python techniques required for data shaping (2)
I did Python data analysis training remotely
Python 3 Engineer Certified Data Analysis Exam Preparation
JupyterLab Basic Setting 2 (pip) for data analysis
JupyterLab Basic Setup for Data Analysis (pip)
Analysis for Data Scientists: Qiita Self-Article Summary 2020
Data analysis in Python Summary of sources to look at first for beginners
Data analysis for improving POG 2 ~ Analysis with jupyter notebook ~
Python template for log analysis at explosive speed
[Examination Report] Python 3 Engineer Certified Data Analysis Exam
Analysis for Data Scientists: Qiita Self-Article Summary 2020 (Practice)
Python3 Engineer Certification Data Analysis Exam Self-made Questions
Python 3 Engineer Certification Data Analysis Exam Pre-Exam Learning
An introduction to statistical modeling for data analysis
[Python] Data analysis, machine learning practice (Kaggle) -Data preprocessing-
How to use data analysis tools for beginners
Display candlesticks for FX (forex) data in Python
Data analysis in Python: A note about line_profiler
[Python] Flow from web scraping to data analysis
A well-prepared record of data analysis in Python
Astro: Python modules / functions often used for analysis
[Updated from time to time] Python memos often used for data analysis [N division, etc.]