[Notice] The sequel article estimates the case fatality rate of COVID-19.
- Verify whether the mortality rate of COVID-19 is really that high
- Try out the probabilistic programming language Stan
Stan is a probabilistic programming language: a language for describing statistical relationships (models) and solving them cleanly. Its implementation appears to work by sampling with the Markov chain Monte Carlo (MCMC) method.
It is not a replacement for procedural programming languages; the two have completely different roles and approaches.
I will use the infection data from the Diamond Princess. Nowhere else in the world has the infection status been investigated in such detail, so this data should allow the most accurate verification.
Item | Value | Date | Source |
---|---|---|---|
Number of infected people | 696 people | 3/5 | Jiji.com[1] |
Number of infected people (symptomatic) | 301 people | 2/20 | National Institute of Infectious Diseases[2] |
Number of infected people (asymptomatic) | 318 people | 2/20 | National Institute of Infectious Diseases[2:1] |
Number of deaths | 7 people | 3/7 | Yomiuri Shimbun[3] |
I also want to compare with influenza by age, so I converted the CDC statistics [^4] into the same age classes as the data from the National Institute of Infectious Diseases [^2]. The class conversion was done simply in proportion to class widths. Since the CDC statistics only give the number of symptomatic influenza infections, the symptomatic infection counts are used for COVID-19 as well.
Age group | COVID-19 symptomatic infections | Influenza case fatality rate (estimate) |
---|---|---|
0-9 | 0 | 0.0050% |
10-19 | 2 | 0.0063% |
20-29 | 25 | 0.0206% |
30-39 | 27 | 0.0206% |
40-49 | 19 | 0.0206% |
50-59 | 28 | 0.0614% |
60-69 | 76 | 0.4465% |
70-79 | 95 | 0.8315% |
80-89 | 27 | 0.8315% |
90-99 | 2 | 0.8315% |
Total | 301 | 0.0962% |
Install Stan. Stan can be run from the command line on its own, but it is much easier to use a wrapper. This time we will use PyStan, a Python interface that can be installed easily with pip. You also need `numpy` and `cython`, plus `scipy` and `matplotlib` to display graphs, so install them all together.
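For example, one way to install everything at once with pip:
$ pip install numpy cython scipy matplotlib pystan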
When using Anaconda
$ conda create -n dp-corona-stan python=3.7 numpy cython scipy matplotlib pystan
$ conda activate dp-corona-stan
The case fatality rate is estimated with Stan from the number of infected people (696) and the number of deaths (7).
import pystan
model = pystan.StanModel(model_code='''
data {
int N; //Number of infected people
int D; //Number of deaths
}
parameters {
real<lower=0, upper=1> p;
}
model {
D ~ binomial(N, p);
}
''')
data = {
'N': 696,
'D': 7
}
fit = model.sampling(data=data, chains=4)
print(fit)
The string passed to `StanModel` is Stan code.
This article will not go into much detail about how to write Stan, but in a nutshell, the code takes the inputs declared in the `data` block and estimates (by sampling) the variables declared in the `parameters` block so that they are consistent with the `model` block.
How many of the infected people die should statistically follow a binomial distribution. That is, the line in `model`,
`D ~ binomial(N, p);`
models the situation where `D` people died out of `N` infected people, each of whom dies with probability `p`.
Running this code estimates the case fatality rate `p` over all infected individuals.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
p 0.01 9.8e-5 4.1e-3 5.0e-3 8.4e-3 0.01 0.01 0.02 1724 1.0
lp__ -44.22 0.02 0.73 -46.3 -44.4 -43.95 -43.76 -43.7 1531 1.0
`p` is estimated to be around 1%.
The figure commonly quoted in public is 2%, but with this data, 2% sits right at the upper edge of the 95% credible interval (`5.0e-3` to `0.02`); it cannot be rejected outright, but it probably is not that high.
However, about half of the infected were asymptomatic, so if the case fatality rate is redefined with only symptomatic infections as the denominator, it does indeed come out to roughly 2%. (Incidentally, about a third of influenza infections are reportedly asymptomatic [^5].)
However, more than half of the Diamond Princess crew and passengers were over 60 years old. This differs from the age distribution of the world's population, so it has to be taken into account to obtain a realistic case fatality rate.
Therefore, assuming the case fatality rate of COVID-19 were equivalent to that of influenza, I will estimate how many deaths a Diamond Princess-scale outbreak would produce and compare that with the actual number of deaths (7).
Save the data prepared above to a .py file.
`S_xx` is the number of symptomatic COVID-19 infections per age group, and `p_flu_xx` is the estimated influenza case fatality rate for symptomatic infections per age group.
# data.py
data = {
'S_0x': 0,
'S_1x': 2,
'S_2x': 25,
'S_3x': 27,
'S_4x': 19,
'S_5x': 28,
'S_6x': 76,
'S_7x': 95,
'S_8x': 27,
'S_9x': 2,
'p_flu_0x': 0.000050,
'p_flu_1x': 0.000063,
'p_flu_2x': 0.000206,
'p_flu_3x': 0.000206,
'p_flu_4x': 0.000206,
'p_flu_5x': 0.000614,
'p_flu_6x': 0.004465,
'p_flu_7x': 0.008315,
'p_flu_8x': 0.008315,
'p_flu_9x': 0.008315
}
NumPy has the ability to generate random numbers that follow a binomial distribution. If you use that function, you can easily perform a simulation.
import numpy as np
from data import data
N = 10000       # number of simulation trials
MIN_DEATH = 7   # actual number of deaths on the Diamond Princess
sample = (np.random.binomial(data['S_0x'], data['p_flu_0x'], N) +
np.random.binomial(data['S_1x'], data['p_flu_1x'], N) +
np.random.binomial(data['S_2x'], data['p_flu_2x'], N) +
np.random.binomial(data['S_3x'], data['p_flu_3x'], N) +
np.random.binomial(data['S_4x'], data['p_flu_4x'], N) +
np.random.binomial(data['S_5x'], data['p_flu_5x'], N) +
np.random.binomial(data['S_6x'], data['p_flu_6x'], N) +
np.random.binomial(data['S_7x'], data['p_flu_7x'], N) +
np.random.binomial(data['S_8x'], data['p_flu_8x'], N) +
np.random.binomial(data['S_9x'], data['p_flu_9x'], N))
probability = float(sum(sample >= MIN_DEATH)) / N
print('Average # of deaths: %.2f' % (sample.mean()))
print('# of deaths >= %d: %.2f%%' % (MIN_DEATH, probability * 100))
I will not go into this in detail either, but briefly:
`np.random.binomial(n, p, N)` repeats the experiment "attempt an event with probability `p` a total of `n` times and record the number of occurrences" `N` times. For example, `np.random.binomial(2, 0.5, 3)` might return `array([2, 0, 1])`.
Adding NumPy arrays adds them element by element, which here gives the total number of deaths across all age groups for each trial.
Also, applying a comparison operator to a NumPy array returns an array of element-wise truth values. For example, `np.array([2, 0, 1]) > 1` returns `array([ True, False, False])`.
Python's `sum` function treats `True` as `1` and `False` as `0`, so in the end `float(sum(sample >= MIN_DEATH)) / N` gives the fraction of the `N` trials in which the number of deaths reaches `MIN_DEATH` or more.
The result was as follows:
Average # of deaths: 1.67
# of deaths >= 7: 0.21%
If the case fatality rate of COVID-19 were similar to that of influenza, it seems quite unlikely that as many as 7 deaths would occur among 301 symptomatic infected individuals.
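Incidentally, the ten nearly identical `np.random.binomial` lines above could also be written as a loop over the age groups; a minimal equivalent sketch reusing the same `data.py`:

```python
import numpy as np
from data import data

N = 10000
ages = ['0x', '1x', '2x', '3x', '4x', '5x', '6x', '7x', '8x', '9x']

# Sum the simulated deaths of all age groups, trial by trial.
sample = sum(np.random.binomial(data['S_' + a], data['p_flu_' + a], N)
             for a in ages)
```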
You should be able to write something similar to the Stan code above; the only differences are that the quantity being estimated changes from the probability to the number of deaths, and that age groups are taken into account.
Writing the estimated number of deaths per age group as `d_xx`:
data {
int S_0x;
int S_1x;
int S_2x;
int S_3x;
int S_4x;
int S_5x;
int S_6x;
int S_7x;
int S_8x;
int S_9x;
real p_flu_0x;
real p_flu_1x;
real p_flu_2x;
real p_flu_3x;
real p_flu_4x;
real p_flu_5x;
real p_flu_6x;
real p_flu_7x;
real p_flu_8x;
real p_flu_9x;
}
parameters {
// upper = S_xx + 1 so that S_xx can be 0
int<lower=0, upper=S_0x+1> d_0x;
int<lower=0, upper=S_1x+1> d_1x;
int<lower=0, upper=S_2x+1> d_2x;
int<lower=0, upper=S_3x+1> d_3x;
int<lower=0, upper=S_4x+1> d_4x;
int<lower=0, upper=S_5x+1> d_5x;
int<lower=0, upper=S_6x+1> d_6x;
int<lower=0, upper=S_7x+1> d_7x;
int<lower=0, upper=S_8x+1> d_8x;
int<lower=0, upper=S_9x+1> d_9x;
}
transformed parameters {
int d;
d = d_0x + d_1x + d_2x + d_3x + d_4x + d_5x + d_6x + d_7x + d_8x + d_9x;
}
model {
d_0x ~ binomial(S_0x, p_flu_0x);
d_1x ~ binomial(S_1x, p_flu_1x);
d_2x ~ binomial(S_2x, p_flu_2x);
d_3x ~ binomial(S_3x, p_flu_3x);
d_4x ~ binomial(S_4x, p_flu_4x);
d_5x ~ binomial(S_5x, p_flu_5x);
d_6x ~ binomial(S_6x, p_flu_6x);
d_7x ~ binomial(S_7x, p_flu_7x);
d_8x ~ binomial(S_8x, p_flu_8x);
d_9x ~ binomial(S_9x, p_flu_9x);
}
When you run this, the following error appears:
ValueError: Failed to parse Stan model 'anon_model_fecb1e77228fe372ef4eb9bc4bcc8086'. Error message:
SYNTAX ERROR, MESSAGE(S) FROM PARSER:
Parameters or transformed parameters cannot be integer or integer array; found int variable declaration, name=d_0x
error in 'unknown file name' at line 26, column 35
-------------------------------------------------
24: parameters {
25: // upper = S_xx + 1 so that S_xx can be 0
26: int<lower=0, upper=S_0x+1> d_0x;
^
27: int<lower=0, upper=S_1x+1> d_1x;
-------------------------------------------------
The parser complains that integers cannot be used here.
Page 20 of the reading-group materials for Ch. 9 of "Bayesian Statistical Modeling with Stan and R" [^6] indeed lists the fact that `int` cannot be used in `parameters` as one of Stan's weaknesses.
It really cannot be used.
Ω＼ζ°)ﾁｰﾝ♪
Since `binomial` requires an `int` on the left-hand side, the trick of declaring the `d_xx`s as `real` instead of `int` cannot be used either.
Ω＼ζ°)ﾁｰﾝ♪
Since the binomial distribution did not work out, I tried the beta distribution instead.
At the risk of drawing fire from experts: roughly speaking, the beta distribution is the "reverse" of the binomial distribution, in that it is the probability distribution over the occurrence probability `p` of an event that follows a binomial distribution.
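Concretely (an added note, writing `S` for `S_xx` and `d` for `d_xx`): for fixed `S` and `d`, the binomial likelihood and the corresponding beta density have the same shape as functions of the probability `p`,

```math
\mathrm{Binomial}(d \mid S,\, p) \;\propto\; p^{d}(1-p)^{S-d},
\qquad
\mathrm{Beta}(p \mid d+1,\, S-d+1) \;\propto\; p^{d}(1-p)^{S-d},
```

which is why the `binomial` statement can be flipped into a `beta` statement.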
The lines that were written as
`d_xx ~ binomial(S_xx, p_flu_xx);`
are rewritten as
`p_flu_xx ~ beta(d_xx + 1, S_xx - d_xx + 1);`
and the type of `d_xx` is changed to `real`.
import pystan
from data import data
model = pystan.StanModel(model_code="""
data {
int S_0x;
int S_1x;
int S_2x;
int S_3x;
int S_4x;
int S_5x;
int S_6x;
int S_7x;
int S_8x;
int S_9x;
real p_flu_0x;
real p_flu_1x;
real p_flu_2x;
real p_flu_3x;
real p_flu_4x;
real p_flu_5x;
real p_flu_6x;
real p_flu_7x;
real p_flu_8x;
real p_flu_9x;
}
parameters {
// upper = S_xx + 1 so that S_xx can be 0
real<lower=0, upper=S_0x+1> d_0x;
real<lower=0, upper=S_1x+1> d_1x;
real<lower=0, upper=S_2x+1> d_2x;
real<lower=0, upper=S_3x+1> d_3x;
real<lower=0, upper=S_4x+1> d_4x;
real<lower=0, upper=S_5x+1> d_5x;
real<lower=0, upper=S_6x+1> d_6x;
real<lower=0, upper=S_7x+1> d_7x;
real<lower=0, upper=S_8x+1> d_8x;
real<lower=0, upper=S_9x+1> d_9x;
}
transformed parameters {
real d;
d = d_0x + d_1x + d_2x + d_3x + d_4x + d_5x + d_6x + d_7x + d_8x + d_9x;
}
model {
p_flu_0x ~ beta(d_0x + 1, S_0x - d_0x + 1);
p_flu_1x ~ beta(d_1x + 1, S_1x - d_1x + 1);
p_flu_2x ~ beta(d_2x + 1, S_2x - d_2x + 1);
p_flu_3x ~ beta(d_3x + 1, S_3x - d_3x + 1);
p_flu_4x ~ beta(d_4x + 1, S_4x - d_4x + 1);
p_flu_5x ~ beta(d_5x + 1, S_5x - d_5x + 1);
p_flu_6x ~ beta(d_6x + 1, S_6x - d_6x + 1);
p_flu_7x ~ beta(d_7x + 1, S_7x - d_7x + 1);
p_flu_8x ~ beta(d_8x + 1, S_8x - d_8x + 1);
p_flu_9x ~ beta(d_9x + 1, S_9x - d_9x + 1);
}
""")
fit = model.sampling(data=data, chains=4)
print(fit)
Here are the results of running:
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
d_0x 0.1 1.4e-3 0.09 2.3e-3 0.03 0.07 0.14 0.35 4361 1.0
d_1x 0.12 1.7e-3 0.12 2.5e-3 0.03 0.08 0.16 0.43 4474 1.0
d_2x 0.19 2.6e-3 0.18 5.0e-3 0.06 0.14 0.27 0.67 4976 1.0
d_3x 0.2 3.1e-3 0.19 4.4e-3 0.06 0.14 0.27 0.71 3641 1.0
d_4x 0.19 2.5e-3 0.19 3.5e-3 0.05 0.13 0.26 0.69 5315 1.0
d_5x 0.25 3.2e-3 0.23 8.0e-3 0.08 0.18 0.34 0.85 5100 1.0
d_6x 0.93 0.01 0.73 0.04 0.36 0.77 1.33 2.8 4387 1.0
d_7x 1.06 0.01 0.8 0.04 0.43 0.9 1.51 3.01 4028 1.0
d_8x 0.54 7.0e-3 0.48 0.01 0.18 0.42 0.76 1.82 4729 1.0
d_9x 0.17 2.4e-3 0.16 4.3e-3 0.05 0.12 0.24 0.58 4580 1.0
d 3.73 0.02 1.25 1.66 2.84 3.59 4.51 6.57 4367 1.0
lp__ -1.64 0.08 2.58 -7.78 -3.13 -1.28 0.24 2.37 1048 1.0
`d` is the total number of deaths, but its median (`50%`) is `3.59`, clearly higher than the `1.67` obtained with NumPy.
Ω＼ζ°)ﾁｰﾝ♪
I wonder why.
This is probably because `d` is defined as `real`: each `d_xx`, which in reality can only take integer values, is allowed to take fractional values, which deviates from the actual problem.
In fact, I tried converting `d_xx` to integers with `floor` when computing `d`.
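A sketch of what that change might look like (only the `transformed parameters` block is shown; whether this is exactly the form used is an assumption):

```stan
transformed parameters {
  real d;
  // Truncate each real-valued d_xx down to an integer before summing.
  d = floor(d_0x) + floor(d_1x) + floor(d_2x) + floor(d_3x) + floor(d_4x)
    + floor(d_5x) + floor(d_6x) + floor(d_7x) + floor(d_8x) + floor(d_9x);
}
```

With that change, the sampling result looks like this: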
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
...
d 1.82 0.03 1.29 0.01 1.01 1.69 2.5 4.93 1710 1.0
This is much closer to the NumPy estimate (1.67).
If you use `floor`, though, the gradient disappears and sampling no longer converges well, so you would need to smooth it with something like `inv_logit`. But if it takes that much work, you might as well just use NumPy; it is hard to call it proper statistical modeling at that point.
Is there any good way to solve this problem with Stan?
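One approach that might be worth trying (not attempted in the original article, so treat this as an untested sketch) is to move the simulation into Stan's `generated quantities` block, where `binomial_rng` can draw integer counts directly; since nothing is left to sample, PyStan's `algorithm='Fixed_param'` would be needed:

```python
import pystan
from data import data

model = pystan.StanModel(model_code="""
data {
  int S_0x; int S_1x; int S_2x; int S_3x; int S_4x;
  int S_5x; int S_6x; int S_7x; int S_8x; int S_9x;
  real p_flu_0x; real p_flu_1x; real p_flu_2x; real p_flu_3x; real p_flu_4x;
  real p_flu_5x; real p_flu_6x; real p_flu_7x; real p_flu_8x; real p_flu_9x;
}
generated quantities {
  int d;
  // Draw an integer death count for each age group and add them up.
  d = binomial_rng(S_0x, p_flu_0x) + binomial_rng(S_1x, p_flu_1x)
    + binomial_rng(S_2x, p_flu_2x) + binomial_rng(S_3x, p_flu_3x)
    + binomial_rng(S_4x, p_flu_4x) + binomial_rng(S_5x, p_flu_5x)
    + binomial_rng(S_6x, p_flu_6x) + binomial_rng(S_7x, p_flu_7x)
    + binomial_rng(S_8x, p_flu_8x) + binomial_rng(S_9x, p_flu_9x);
}
""")

# With no parameters to sample, only the fixed-parameter algorithm applies.
fit = model.sampling(data=data, chains=1, algorithm='Fixed_param')
print(fit)
```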
In part, we were able to use Stan to estimate the risk of the novel coronavirus.
Judging from the Diamond Princess data, it seems fairly certain that COVID-19 has a considerably worse case fatality rate than seasonal influenza. Another threat is that while many people already have antibodies against influenza, hardly anyone has antibodies against COVID-19. Vigilance is warranted for the sake of society as a whole, not only for individuals.
On the other hand, an influenza outbreak of this scale would still be expected to cause about 1.67 deaths, so, personally, I would add that anyone this afraid of COVID-19 should be at least as cautious about ordinary influenza.
In the sequel, I also take on the task of estimating the case fatality rate of COVID-19. Please take a look.
Source code: https://github.com/akeyhero/dp-corona-stan