[Notice] The sequel article estimates the case fatality rate of COVID-19.
- Verify whether the mortality rate of COVID-19 is really that high
- Try out the probabilistic programming language Stan
Stan is a probabilistic programming language: a language for describing statistical relationships (models) and solving them cleanly. Its implementation appears to work by sampling with the Markov chain Monte Carlo (MCMC) method.
It is not a replacement for procedural programming languages; the two have completely different roles and approaches.
I will use the infection data from the Diamond Princess. Nowhere else in the world has the infection status been investigated in such detail, so this data should allow the most accurate verification.
Item | Value | Date | Source |
---|---|---|---|
Number of infected people | 696 people | 3/5 | Jiji.com[1] |
Number of infected people (symptomatic) | 301 people | 2/20 | National Institute of Infectious Diseases[2] |
Number of infected people (asymptomatic) | 318 people | 2/20 | National Institute of Infectious Diseases[2:1] |
Number of deaths | 7 people | 3/7 | Yomiuri Shimbun[3] |
I also want to compare with influenza by age, so I converted the CDC statistics [^4] into the same age classes as the data from the National Institute of Infectious Diseases [^2]. The class conversion was done simply in proportion to class widths. Since the CDC statistics only give the number of symptomatic influenza infections, the symptomatic infection counts are used for COVID-19 as well.
Age group | COVID-19 symptomatic infections | Influenza case fatality rate (estimate) |
---|---|---|
0-9 | 0 | 0.0050% |
10-19 | 2 | 0.0063% |
20-29 | 25 | 0.0206% |
30-39 | 27 | 0.0206% |
40-49 | 19 | 0.0206% |
50-59 | 28 | 0.0614% |
60-69 | 76 | 0.4465% |
70-79 | 95 | 0.8315% |
80-89 | 27 | 0.8315% |
90-99 | 2 | 0.8315% |
Total | 301 | 0.0962% |
Install Stan. Stan can be run from the command line on its own, but it is much easier to use a wrapper. This time we will use PyStan, a Python interface that can be installed easily with pip. You also need `numpy` and `cython`, plus `scipy` and `matplotlib` to display graphs, so install them all together.
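For example, one way to install everything at once with pip:
$ pip install numpy cython scipy matplotlib pystan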
When using Anaconda
$ conda create -n dp-corona-stan python=3.7 numpy cython scipy matplotlib pystan
$ conda activate dp-corona-stan
The case fatality rate is estimated with Stan from the number of infected people (696) and the number of deaths (7).
import pystan
model = pystan.StanModel(model_code='''
data {
int N; //Number of infected people
int D; //Number of deaths
}
parameters {
real<lower=0, upper=1> p;
}
model {
D ~ binomial(N, p);
}
''')
data = {
'N': 696,
'D': 7
}
fit = model.sampling(data=data, chains=4)
print(fit)
The string passed to `StanModel` is Stan code.
This article will not go into much detail about how to write Stan, but in a nutshell, the code takes the inputs declared in the `data` block and estimates (by sampling) the variables declared in the `parameters` block so that they are consistent with the `model` block.
How many of the infected people die should statistically follow a binomial distribution. That is, the line in `model`,
`D ~ binomial(N, p);`
models the situation where `D` people died out of `N` infected people, each of whom dies with probability `p`.
Running this code estimates the case fatality rate `p` over all infected individuals.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
p 0.01 9.8e-5 4.1e-3 5.0e-3 8.4e-3 0.01 0.01 0.02 1724 1.0
lp__ -44.22 0.02 0.73 -46.3 -44.4 -43.95 -43.76 -43.7 1531 1.0
`p` is estimated to be around 1%.
The figure commonly quoted in public is 2%, but with this data, 2% sits right at the upper edge of the 95% credible interval (`5.0e-3` to `0.02`); it cannot be rejected outright, but it probably is not that high.
However, about half of the infected were asymptomatic, so if the case fatality rate is redefined with only symptomatic infections as the denominator, it does indeed come out to roughly 2%. (Incidentally, about a third of influenza infections are reportedly asymptomatic [^5].)
However, more than half of the Diamond Princess crew and passengers were over 60 years old. This differs from the age distribution of the world's population, so it has to be taken into account to obtain a realistic case fatality rate.
Therefore, assuming the case fatality rate of COVID-19 were equivalent to that of influenza, I will estimate how many deaths a Diamond Princess-scale outbreak would produce and compare that with the actual number of deaths (7).
Save the data prepared above to a .py file.
`S_xx` is the number of symptomatic COVID-19 infections per age group, and `p_flu_xx` is the estimated influenza case fatality rate for symptomatic infections per age group.
# data.py
data = {
'S_0x': 0,
'S_1x': 2,
'S_2x': 25,
'S_3x': 27,
'S_4x': 19,
'S_5x': 28,
'S_6x': 76,
'S_7x': 95,
'S_8x': 27,
'S_9x': 2,
'p_flu_0x': 0.000050,
'p_flu_1x': 0.000063,
'p_flu_2x': 0.000206,
'p_flu_3x': 0.000206,
'p_flu_4x': 0.000206,
'p_flu_5x': 0.000614,
'p_flu_6x': 0.004465,
'p_flu_7x': 0.008315,
'p_flu_8x': 0.008315,
'p_flu_9x': 0.008315
}
NumPy has the ability to generate random numbers that follow a binomial distribution. If you use that function, you can easily perform a simulation.
import numpy as np
from data import data
N = 10000       # number of simulation trials
MIN_DEATH = 7   # actual number of deaths on the Diamond Princess
sample = (np.random.binomial(data['S_0x'], data['p_flu_0x'], N) +
np.random.binomial(data['S_1x'], data['p_flu_1x'], N) +
np.random.binomial(data['S_2x'], data['p_flu_2x'], N) +
np.random.binomial(data['S_3x'], data['p_flu_3x'], N) +
np.random.binomial(data['S_4x'], data['p_flu_4x'], N) +
np.random.binomial(data['S_5x'], data['p_flu_5x'], N) +
np.random.binomial(data['S_6x'], data['p_flu_6x'], N) +
np.random.binomial(data['S_7x'], data['p_flu_7x'], N) +
np.random.binomial(data['S_8x'], data['p_flu_8x'], N) +
np.random.binomial(data['S_9x'], data['p_flu_9x'], N))
probability = float(sum(sample >= MIN_DEATH)) / N
print('Average # of deaths: %.2f' % (sample.mean()))
print('# of deaths >= %d: %.2f%%' % (MIN_DEATH, probability * 100))
I will not go into this in detail either, but briefly:
`np.random.binomial(n, p, N)` repeats the experiment "attempt an event with probability `p` a total of `n` times and record the number of occurrences" `N` times. For example, `np.random.binomial(2, 0.5, 3)` might return `array([2, 0, 1])`.
Adding NumPy arrays adds them element by element, which here gives the total number of deaths across all age groups for each trial.
Also, applying a comparison operator to a NumPy array returns an array of element-wise truth values. For example, `np.array([2, 0, 1]) > 1` returns `array([ True, False, False])`.
Python's `sum` function treats `True` as `1` and `False` as `0`, so in the end `float(sum(sample >= MIN_DEATH)) / N` gives the fraction of the `N` trials in which the number of deaths reaches `MIN_DEATH` or more.
The result was as follows:
Average # of deaths: 1.67
# of deaths >= 7: 0.21%
If the case fatality rate of COVID-19 were similar to that of influenza, it seems quite unlikely that as many as 7 deaths would occur among 301 symptomatic infected individuals.
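Incidentally, the ten nearly identical `np.random.binomial` lines above could also be written as a loop over the age groups; a minimal equivalent sketch reusing the same `data.py`:

```python
import numpy as np
from data import data

N = 10000
ages = ['0x', '1x', '2x', '3x', '4x', '5x', '6x', '7x', '8x', '9x']

# Sum the simulated deaths of all age groups, trial by trial.
sample = sum(np.random.binomial(data['S_' + a], data['p_flu_' + a], N)
             for a in ages)
```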
You should be able to write something similar to the Stan code above; the only differences are that the quantity being estimated changes from the probability to the number of deaths, and that age groups are taken into account.
Writing the estimated number of deaths per age group as `d_xx`:
data {
int S_0x;
int S_1x;
int S_2x;
int S_3x;
int S_4x;
int S_5x;
int S_6x;
int S_7x;
int S_8x;
int S_9x;
real p_flu_0x;
real p_flu_1x;
real p_flu_2x;
real p_flu_3x;
real p_flu_4x;
real p_flu_5x;
real p_flu_6x;
real p_flu_7x;
real p_flu_8x;
real p_flu_9x;
}
parameters {
// upper = S_xx + 1 so that S_xx can be 0
int<lower=0, upper=S_0x+1> d_0x;
int<lower=0, upper=S_1x+1> d_1x;
int<lower=0, upper=S_2x+1> d_2x;
int<lower=0, upper=S_3x+1> d_3x;
int<lower=0, upper=S_4x+1> d_4x;
int<lower=0, upper=S_5x+1> d_5x;
int<lower=0, upper=S_6x+1> d_6x;
int<lower=0, upper=S_7x+1> d_7x;
int<lower=0, upper=S_8x+1> d_8x;
int<lower=0, upper=S_9x+1> d_9x;
}
transformed parameters {
int d;
d = d_0x + d_1x + d_2x + d_3x + d_4x + d_5x + d_6x + d_7x + d_8x + d_9x;
}
model {
d_0x ~ binomial(S_0x, p_flu_0x);
d_1x ~ binomial(S_1x, p_flu_1x);
d_2x ~ binomial(S_2x, p_flu_2x);
d_3x ~ binomial(S_3x, p_flu_3x);
d_4x ~ binomial(S_4x, p_flu_4x);
d_5x ~ binomial(S_5x, p_flu_5x);
d_6x ~ binomial(S_6x, p_flu_6x);
d_7x ~ binomial(S_7x, p_flu_7x);
d_8x ~ binomial(S_8x, p_flu_8x);
d_9x ~ binomial(S_9x, p_flu_9x);
}
When you run this, the following error appears:
ValueError: Failed to parse Stan model 'anon_model_fecb1e77228fe372ef4eb9bc4bcc8086'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Parameters or transformed parameters cannot be integer or integer array; found int variable declaration, name=d_0x
error in 'unknown file name' at line 26, column 35
-------------------------------------------------
24: parameters {
25: // upper = S_xx + 1 so that S_xx can be 0
26: int<lower=0, upper=S_0x+1> d_0x;
^
27: int<lower=0, upper=S_1x+1> d_1x;
-------------------------------------------------
The parser complains that integers cannot be used here.
Page 20 of the reading-group materials for Ch. 9 of "Bayesian Statistical Modeling with Stan and R" [^6] indeed lists the fact that `int` cannot be used in `parameters` as one of Stan's weaknesses.
It really cannot be used.
Ω＼ζ°)ﾁｰﾝ♪
Since `binomial` requires an `int` on the left-hand side, the trick of declaring the `d_xx`s as `real` instead of `int` cannot be used either.
Ω＼ζ°)ﾁｰﾝ♪
Since the binomial distribution did not work out, I tried the beta distribution instead.
At the risk of drawing fire from experts: roughly speaking, the beta distribution is the "reverse" of the binomial distribution, in that it is the probability distribution over the occurrence probability `p` of an event that follows a binomial distribution.
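Concretely (an added note, writing `S` for `S_xx` and `d` for `d_xx`): for fixed `S` and `d`, the binomial likelihood and the corresponding beta density have the same shape as functions of the probability `p`,

```math
\mathrm{Binomial}(d \mid S,\, p) \;\propto\; p^{d}(1-p)^{S-d},
\qquad
\mathrm{Beta}(p \mid d+1,\, S-d+1) \;\propto\; p^{d}(1-p)^{S-d},
```

which is why the `binomial` statement can be flipped into a `beta` statement.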
The lines that were written as
`d_xx ~ binomial(S_xx, p_flu_xx);`
are rewritten as
`p_flu_xx ~ beta(d_xx + 1, S_xx - d_xx + 1);`
and the type of `d_xx` is changed to `real`.
import pystan
from data import data
model = pystan.StanModel(model_code="""
data {
int S_0x;
int S_1x;
int S_2x;
int S_3x;
int S_4x;
int S_5x;
int S_6x;
int S_7x;
int S_8x;
int S_9x;
real p_flu_0x;
real p_flu_1x;
real p_flu_2x;
real p_flu_3x;
real p_flu_4x;
real p_flu_5x;
real p_flu_6x;
real p_flu_7x;
real p_flu_8x;
real p_flu_9x;
}
parameters {
// upper = S_xx + 1 so that S_xx can be 0
real<lower=0, upper=S_0x+1> d_0x;
real<lower=0, upper=S_1x+1> d_1x;
real<lower=0, upper=S_2x+1> d_2x;
real<lower=0, upper=S_3x+1> d_3x;
real<lower=0, upper=S_4x+1> d_4x;
real<lower=0, upper=S_5x+1> d_5x;
real<lower=0, upper=S_6x+1> d_6x;
real<lower=0, upper=S_7x+1> d_7x;
real<lower=0, upper=S_8x+1> d_8x;
real<lower=0, upper=S_9x+1> d_9x;
}
transformed parameters {
real d;
d = d_0x + d_1x + d_2x + d_3x + d_4x + d_5x + d_6x + d_7x + d_8x + d_9x;
}
model {
p_flu_0x ~ beta(d_0x + 1, S_0x - d_0x + 1);
p_flu_1x ~ beta(d_1x + 1, S_1x - d_1x + 1);
p_flu_2x ~ beta(d_2x + 1, S_2x - d_2x + 1);
p_flu_3x ~ beta(d_3x + 1, S_3x - d_3x + 1);
p_flu_4x ~ beta(d_4x + 1, S_4x - d_4x + 1);
p_flu_5x ~ beta(d_5x + 1, S_5x - d_5x + 1);
p_flu_6x ~ beta(d_6x + 1, S_6x - d_6x + 1);
p_flu_7x ~ beta(d_7x + 1, S_7x - d_7x + 1);
p_flu_8x ~ beta(d_8x + 1, S_8x - d_8x + 1);
p_flu_9x ~ beta(d_9x + 1, S_9x - d_9x + 1);
}
""")
fit = model.sampling(data=data, chains=4)
print(fit)
Here are the results of running:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
d_0x 0.1 1.4e-3 0.09 2.3e-3 0.03 0.07 0.14 0.35 4361 1.0
d_1x 0.12 1.7e-3 0.12 2.5e-3 0.03 0.08 0.16 0.43 4474 1.0
d_2x 0.19 2.6e-3 0.18 5.0e-3 0.06 0.14 0.27 0.67 4976 1.0
d_3x 0.2 3.1e-3 0.19 4.4e-3 0.06 0.14 0.27 0.71 3641 1.0
d_4x 0.19 2.5e-3 0.19 3.5e-3 0.05 0.13 0.26 0.69 5315 1.0
d_5x 0.25 3.2e-3 0.23 8.0e-3 0.08 0.18 0.34 0.85 5100 1.0
d_6x 0.93 0.01 0.73 0.04 0.36 0.77 1.33 2.8 4387 1.0
d_7x 1.06 0.01 0.8 0.04 0.43 0.9 1.51 3.01 4028 1.0
d_8x 0.54 7.0e-3 0.48 0.01 0.18 0.42 0.76 1.82 4729 1.0
d_9x 0.17 2.4e-3 0.16 4.3e-3 0.05 0.12 0.24 0.58 4580 1.0
d 3.73 0.02 1.25 1.66 2.84 3.59 4.51 6.57 4367 1.0
lp__ -1.64 0.08 2.58 -7.78 -3.13 -1.28 0.24 2.37 1048 1.0
`d` is the total number of deaths, but its median (`50%`) is `3.59`, clearly higher than the `1.67` obtained with NumPy.
Ω＼ζ°)ﾁｰﾝ♪
I wonder why.
This is probably because `d` is defined as `real`: each `d_xx`, which in reality can only take integer values, is allowed to take fractional values, which deviates from the actual problem.
In fact, I tried converting `d_xx` to integers with `floor` when computing `d`.
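A sketch of what that change might look like (only the `transformed parameters` block is shown; whether this is exactly the form used is an assumption):

```stan
transformed parameters {
  real d;
  // Truncate each real-valued d_xx down to an integer before summing.
  d = floor(d_0x) + floor(d_1x) + floor(d_2x) + floor(d_3x) + floor(d_4x)
    + floor(d_5x) + floor(d_6x) + floor(d_7x) + floor(d_8x) + floor(d_9x);
}
```

With that change, the sampling result looks like this: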
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
...
d 1.82 0.03 1.29 0.01 1.01 1.69 2.5 4.93 1710 1.0
This is much closer to the NumPy estimate (1.67).
If you use `floor`, though, the gradient disappears and sampling no longer converges well, so you would need to smooth it with something like `inv_logit`. But if it takes that much work, you might as well just use NumPy; it is hard to call it proper statistical modeling at that point.
Is there any good way to solve this problem with Stan?
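One approach that might be worth trying (not attempted in the original article, so treat this as an untested sketch) is to move the simulation into Stan's `generated quantities` block, where `binomial_rng` can draw integer counts directly; since nothing is left to sample, PyStan's `algorithm='Fixed_param'` would be needed:

```python
import pystan
from data import data

model = pystan.StanModel(model_code="""
data {
  int S_0x; int S_1x; int S_2x; int S_3x; int S_4x;
  int S_5x; int S_6x; int S_7x; int S_8x; int S_9x;
  real p_flu_0x; real p_flu_1x; real p_flu_2x; real p_flu_3x; real p_flu_4x;
  real p_flu_5x; real p_flu_6x; real p_flu_7x; real p_flu_8x; real p_flu_9x;
}
generated quantities {
  int d;
  // Draw an integer death count for each age group and add them up.
  d = binomial_rng(S_0x, p_flu_0x) + binomial_rng(S_1x, p_flu_1x)
    + binomial_rng(S_2x, p_flu_2x) + binomial_rng(S_3x, p_flu_3x)
    + binomial_rng(S_4x, p_flu_4x) + binomial_rng(S_5x, p_flu_5x)
    + binomial_rng(S_6x, p_flu_6x) + binomial_rng(S_7x, p_flu_7x)
    + binomial_rng(S_8x, p_flu_8x) + binomial_rng(S_9x, p_flu_9x);
}
""")

# With no parameters to sample, only the fixed-parameter algorithm applies.
fit = model.sampling(data=data, chains=1, algorithm='Fixed_param')
print(fit)
```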
In part, we were able to use Stan to estimate the risk of the novel coronavirus.
Judging from the Diamond Princess data, it seems fairly certain that COVID-19 has a considerably worse case fatality rate than seasonal influenza. Another threat is that while many people already have antibodies against influenza, hardly anyone has antibodies against COVID-19. Vigilance is warranted for the sake of society as a whole, not only for individuals.
On the other hand, an influenza outbreak of this scale would still be expected to cause about 1.67 deaths, so, personally, I would add that anyone this afraid of COVID-19 should be at least as cautious about ordinary influenza.
In the sequel, I also take on the task of estimating the case fatality rate of COVID-19. Please take a look.
Source code: https://github.com/akeyhero/dp-corona-stan