[PYTHON] [Sequel] Is the new corona really a threat? Case fatality rate calculated with Stan

The last post "Is the new corona really a threat? I verified it with Stan (was)" was indigestive in the following two points:

  1. After all, I don't know what the case fatality rate of COVID-19 is.
  2. Finally Stan is not working well

So, this time, as a revenge, I will try to find the "true case fatality rate" using Stan.

Previous article I have not repeatedly explained what I explained, so please refer to that as well: bow:

Disclaimer

I'm not an expert in Bayesian modeling or Stan, so there may be mistakes and better techniques in the content of this article.

Also, if you find something wrong, I would appreciate it if you could let me know. : bow:

Assumption

In my last post, I argued that the infection survey [^ 1] on the Diamond Princess was the most accurate in the world. We will follow that policy this time as well.

However, the number of cases is small with the Diamond Princess alone, and it is not possible to know the case fatality rate by age. Therefore, we will use the data [^ 2] centered on the Chinese Center for Disease Control and Prevention.

Since the populations of these two data are completely different, I set assumptions and somehow integrate them. Here, we assume that the number of infections in China's data was obtained from all true infected individuals with a certain probability of $ c $, regardless of age group or case fatality rate.

This assumption is clearly inaccurate, as age groups vary in symptom tendencies and, as a result, the proportion of infected individuals captured may vary. Therefore, the conclusions in this article are subject to error due to this assumption. However, according to data from the National Institute of Infectious Diseases [^ 1], the proportion of asymptomatic confirmed cases to asymptomatic confirmed cases is rather lower in younger people. There are also reports [^ 3] that the severity rate of young people is unexpectedly high. From this, I think that it is a more absurd assumption than intuition.

We combine the two data by assuming that the case fatality rate for the Diamond Princess follows the case fatality rate for each age group calculated using the case fatality rate of $ c $ assumed here.

On the other hand, it is assumed that the capture of the dead is done without exception in the Chinese statistics.

Next, the death toll on the Diamond Princess was six. The seventh death was confirmed on March 7, but it is possible that he was not infected at that time, as more than two weeks have passed since the data from the National Institute of Infectious Diseases [^ 1]. It is based on the fact that it is high, that it has not been announced whether it is a death related to COVID-19, and that the number of infected people was higher than the time of the data based on March 7th.

Finally, differences by race, differences in the distribution of infected people (actually, the distribution of infected people in China and the distribution of infected people in the world should be different), and differences in the medical environment (especially when the number of patients increases rapidly, medical collapse occurs I decided not to consider (which should increase the case fatality rate). However, the difference in the number of infected people between China and the Diamond Princess is exceptionally taken into consideration.

The assumptions can be summarized as follows.

--The $ c $ capture rate for infected individuals in China is constant regardless of age group. --The death rate in China is 100% --Diamond Princess statistics are accurate (= capture rate 100%) --The case fatality rate on the Diamond Princess matches the case fatality rate in China multiplied by the capture rate $ c $ for each age group. --The death toll on the Diamond Princess is six --When calculating the case fatality rate, it is not necessary to consider differences due to race, differences due to the distribution of infected persons, and differences in the medical environment.

model

Below, $ N $ is the number of infected people, $ D $ is the number of deaths, and $ p $ is the case fatality rate. Unless otherwise noted, superscripts indicate age group and subscripts indicate location ($ {\ rm dp} $ = Diamond Princess, $ {\ rm ch} $ = China). The absence of superscripts represents all age groups.

For example, $ N_ {{\ rm dp}} ^ {60-69} $ indicates "the number of infected people aged 60-69 as identified on the Diamond Princess". (Actually, I will not mention a specific age group from now on, but write $ N_ {{\ rm dp}} ^ {a} $, "Infection of the age group $ a $ confirmed on the Diamond Princess" It is used to represent the number of people.)

Let's assume that the estimated case fatality rate is $ p $ when the assumption is correct. Also, this estimated case fatality rate $ p $ is divided by age group and expressed as $ p ^ {60-69} $ (without subscripts).

The number of deaths among infected people in the age group $ a $ in China follows a binomial distribution.

D_{{\rm ch}}^a \sim {\rm binomial}(N_{{\rm ch}}^a, p_{{\rm ch}}^a)

Can be modeled as.

Similarly, the number of deaths among infected people in the Diamond Princess is also the same.

D_{{\rm dp}} \sim {\rm binomial}(N_{{\rm dp}}, p_{{\rm dp}})

It will be. Please note that this is not modeled by age group as we do not have the age distribution of the Diamond Princess deaths.

Connect the two with the capture rate $ c $ mentioned in the Assumptions chapter (#Assumptions).

Since there is no information about the capture rate, we assume a uniform distribution.

c \sim {\rm uniform}(0, 1)

We have assumed that $ c $ is constant regardless of age group and case fatality rate, so it is independent of $ p_ {{\ rm ch}} ^ a $. So the estimated case fatality rate for the age group $ a $ is

p^a = c \times p_{{\rm ch}}^a

At all ages

p = c \times {1 \over N_{{\rm ch}}} \sum_a{p_{{\rm ch}}^a \times N_{{\rm ch}}^a} = {1 \over N_{{\rm ch}}} \sum_a{p^a \times N_{{\rm ch}}^a}

It will be.

We also assumed that the case fatality rate of the Diamond Princess follows the estimated case fatality rate ($ p_ {{\ rm dp}} ^ a = p ^ a $), so $ p_ {{\ rm dp}} $ is

p_{{\rm dp}} = {1 \over N_{{\rm dp}}} \sum_a{p^a \times N_{{\rm dp}}^a}

It will be.

Implementation

Now that we have all the actors, we'll implement it in Stan.

Input data

The variable name is _6x (ie $ N_ {{\ rm ch}} ^ {60-69} $ = N_ch_6x) if the age group is 60-69 years old, and _80_up (ie $ N_) if the age group is 80 years old or older. It is almost the same as the symbol of Model Chapter except that it is expressed as {{\ rm ch}} ^ {80-} $ = N_ch_80_up).

Please note that the Diamond Princess data [^ 1] has data for the 90s, but the Chinese data [^ 2] is for people over 80 years old.

data {
    //Number of deaths on the Diamond Princess
    int D_dp;

    //Number of infected people by age group on the Diamond Princess
    int N_dp_0x;
    int N_dp_1x;
    int N_dp_2x;
    int N_dp_3x;
    int N_dp_4x;
    int N_dp_5x;
    int N_dp_6x;
    int N_dp_7x;
    int N_dp_8x;
    int N_dp_9x;

    //Number of deaths by age group in China
    int D_ch_0x;
    int D_ch_1x;
    int D_ch_2x;
    int D_ch_3x;
    int D_ch_4x;
    int D_ch_5x;
    int D_ch_6x;
    int D_ch_7x;
    int D_ch_80_up;

    //Number of infected people by age group in China
    int N_ch_0x;
    int N_ch_1x;
    int N_ch_2x;
    int N_ch_3x;
    int N_ch_4x;
    int N_ch_5x;
    int N_ch_6x;
    int N_ch_7x;
    int N_ch_80_up;
}

Since Stan can define processed data for convenience, we will calculate the values for all age groups.

transformed data {
    //Number of infected people on the Diamond Princess
    int N_dp;

    //Number of infected people in China
    int N_ch;

    N_dp = N_dp_0x + N_dp_1x + N_dp_2x + N_dp_3x + N_dp_4x + N_dp_5x + N_dp_6x + N_dp_7x + N_dp_8x + N_dp_9x;
    N_ch = N_ch_0x + N_ch_1x + N_ch_2x + N_ch_3x + N_ch_4x + N_ch_5x + N_ch_6x + N_ch_7x + N_ch_80_up;
}

Estimated parameters

Specify the parameters to be estimated.

parameters {
    //Capture rate c
    real<lower=0, upper=1> c;

    //Case fatality rate in China
    real<lower=0, upper=1> p_ch_0x;
    real<lower=0, upper=1> p_ch_1x;
    real<lower=0, upper=1> p_ch_2x;
    real<lower=0, upper=1> p_ch_3x;
    real<lower=0, upper=1> p_ch_4x;
    real<lower=0, upper=1> p_ch_5x;
    real<lower=0, upper=1> p_ch_6x;
    real<lower=0, upper=1> p_ch_7x;
    real<lower=0, upper=1> p_ch_80_up;
}

If you write it again, some people think that the case fatality rate in China can be calculated by $ D_ {{\ rm ch}} \ over N_ {{\ rm ch}} $, so it may be an estimation target. I think.

In fact, the case fatality rate in China has not yet been determined with this model. For example, if you roll the dice twice and get a 6 on both times, you can't say that the dice is a dice with a 100% 6 on the dice. You can't do it.

The good thing about Bayesian modeling is that you can easily handle such uncertain things without being uncertain, which is the good point of Stan.

Now, just like data, parameters can convert values, so let's do it.

transformed parameters {
    //Estimated case fatality rate by age
    real<lower=0, upper=1> p_0x;
    real<lower=0, upper=1> p_1x;
    real<lower=0, upper=1> p_2x;
    real<lower=0, upper=1> p_3x;
    real<lower=0, upper=1> p_4x;
    real<lower=0, upper=1> p_5x;
    real<lower=0, upper=1> p_6x;
    real<lower=0, upper=1> p_7x;
    real<lower=0, upper=1> p_80_up;

    //Estimated case fatality rate
    real<lower=0, upper=1> p;

    //Case fatality rate on the Diamond Princess
    real<lower=0, upper=1> p_dp;

    p_0x    = c * p_ch_0x;
    p_1x    = c * p_ch_1x;
    p_2x    = c * p_ch_2x;
    p_3x    = c * p_ch_3x;
    p_4x    = c * p_ch_4x;
    p_5x    = c * p_ch_5x;
    p_6x    = c * p_ch_6x;
    p_7x    = c * p_ch_7x;
    p_80_up = c * p_ch_80_up;

    p = (p_0x    * N_ch_0x +
         p_1x    * N_ch_1x +
         p_2x    * N_ch_2x +
         p_3x    * N_ch_3x +
         p_4x    * N_ch_4x +
         p_5x    * N_ch_5x +
         p_6x    * N_ch_6x +
         p_7x    * N_ch_7x +
         p_80_up * N_ch_80_up
         ) / N_ch;

    p_dp = (p_0x    * N_dp_0x +
            p_1x    * N_dp_1x +
            p_2x    * N_dp_2x +
            p_3x    * N_dp_3x +
            p_4x    * N_dp_4x +
            p_5x    * N_dp_5x +
            p_6x    * N_dp_6x +
            p_7x    * N_dp_7x +
            p_80_up * (N_dp_8x + N_dp_9x)
            ) / N_dp;
}

model

There is nothing special to say because it is as written in Model Chapter. This expressiveness is what makes Stan so good.

model {
    c ~ uniform(0, 1);

    D_ch_0x    ~ binomial(N_ch_0x,    p_ch_0x);
    D_ch_1x    ~ binomial(N_ch_1x,    p_ch_1x);
    D_ch_2x    ~ binomial(N_ch_2x,    p_ch_2x);
    D_ch_3x    ~ binomial(N_ch_3x,    p_ch_3x);
    D_ch_4x    ~ binomial(N_ch_4x,    p_ch_4x);
    D_ch_5x    ~ binomial(N_ch_5x,    p_ch_5x);
    D_ch_6x    ~ binomial(N_ch_6x,    p_ch_6x);
    D_ch_7x    ~ binomial(N_ch_7x,    p_ch_7x);
    D_ch_80_up ~ binomial(N_ch_80_up, p_ch_80_up);

    D_dp ~ binomial(N_dp, p_dp);
}

Run

data

I entered the following data. These are based on the data described in the Assumptions chapter (#Assumptions).

data = {
    #Number of deaths on the Diamond Princess
    'D_dp': 6,

    #Number of infected people by age group on the Diamond Princess
    'N_dp_0x':   1,
    'N_dp_1x':   5,
    'N_dp_2x':  28,
    'N_dp_3x':  34,
    'N_dp_4x':  27,
    'N_dp_5x':  59,
    'N_dp_6x': 177,
    'N_dp_7x': 234,
    'N_dp_8x':  52,
    'N_dp_9x':   2,

    #Number of deaths in China
    'D_ch_0x':      0,
    'D_ch_1x':      1,
    'D_ch_2x':      7,
    'D_ch_3x':     18,
    'D_ch_4x':     38,
    'D_ch_5x':    130,
    'D_ch_6x':    309,
    'D_ch_7x':    312,
    'D_ch_80_up': 208,

    #Number of infected people by age group in China
    'N_ch_0x':     416,
    'N_ch_1x':     549,
    'N_ch_2x':    3619,
    'N_ch_3x':    7600,
    'N_ch_4x':    8571,
    'N_ch_5x':   10008,
    'N_ch_6x':    8583,
    'N_ch_7x':    3918,
    'N_ch_80_up': 1408,
}

Execution environment

Like previous, using PyStan I ran it.

Click here for the entire source code

result

I've set a lot of parameters, so when I run it, I get some huge results.

             mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
c            0.21  8.7e-4   0.08   0.08   0.15    0.2   0.25   0.38   7792    1.0
p_ch_0x    2.4e-3  3.0e-5 2.3e-3 7.1e-5 7.1e-4 1.7e-3 3.2e-3 8.3e-3   5850    1.0
p_ch_1x    3.6e-3  3.3e-5 2.6e-3 4.3e-4 1.7e-3 3.0e-3 4.9e-3   0.01   6364    1.0
p_ch_2x    2.2e-3  9.7e-6 8.0e-4 9.5e-4 1.6e-3 2.1e-3 2.7e-3 4.0e-3   6894    1.0
p_ch_3x    2.5e-3  6.3e-6 5.7e-4 1.5e-3 2.1e-3 2.5e-3 2.9e-3 3.7e-3   8234    1.0
p_ch_4x    4.6e-3  8.7e-6 7.4e-4 3.2e-3 4.1e-3 4.5e-3 5.0e-3 6.1e-3   7259    1.0
p_ch_5x      0.01  1.3e-5 1.1e-3   0.01   0.01   0.01   0.01   0.02   7562    1.0
p_ch_6x      0.04  2.4e-5 2.0e-3   0.03   0.03   0.04   0.04   0.04   7096    1.0
p_ch_7x      0.08  5.2e-5 4.3e-3   0.07   0.08   0.08   0.08   0.09   6650    1.0
p_ch_80_up   0.15  1.1e-4 9.2e-3   0.13   0.14   0.15   0.15   0.17   7552    1.0
p_0x       4.8e-4  7.1e-6 5.3e-4 1.2e-5 1.3e-4 3.1e-4 6.5e-4 1.9e-3   5530    1.0
p_1x       7.4e-4  8.5e-6 6.3e-4 7.0e-5 3.0e-4 5.8e-4 9.8e-4 2.4e-3   5462    1.0
p_2x       4.5e-4  3.2e-6 2.4e-4 1.3e-4 2.8e-4 4.0e-4 5.7e-4 1.1e-3   5637    1.0
p_3x       5.2e-4  2.7e-6 2.3e-4 1.8e-4 3.4e-4 4.8e-4 6.4e-4 1.1e-3   7336    1.0
p_4x       9.4e-4  4.6e-6 3.9e-4 3.6e-4 6.5e-4 8.8e-4 1.2e-3 1.9e-3   7248    1.0
p_5x       2.7e-3  1.2e-5 1.0e-3 1.1e-3 1.9e-3 2.5e-3 3.3e-3 5.0e-3   7484    1.0
p_6x       7.4e-3  3.1e-5 2.8e-3 3.0e-3 5.4e-3 7.0e-3 9.0e-3   0.01   7796    1.0
p_7x         0.02  6.9e-5 6.1e-3 6.6e-3   0.01   0.02   0.02   0.03   7770    1.0
p_80_up      0.03  1.3e-4   0.01   0.01   0.02   0.03   0.04   0.06   7937    1.0
p          4.7e-3  2.0e-5 1.8e-3 1.9e-3 3.5e-3 4.5e-3 5.8e-3 8.6e-3   7853    1.0
p_dp         0.01  4.7e-5 4.2e-3 4.5e-3 8.3e-3   0.01   0.01   0.02   7865    1.0
lp__        -4215    0.06   2.35  -4220  -4216  -4214  -4213  -4211   1641    1.0

If you extract the important part

             mean se_mean     sd   2.5%    25%    50%    75%  97.5%  n_eff   Rhat
c            0.21  8.7e-4   0.08   0.08   0.15    0.2   0.25   0.38   7792    1.0
p          4.7e-3  2.0e-5 1.8e-3 1.9e-3 3.5e-3 4.5e-3 5.8e-3 8.6e-3   7853    1.0

It will be.

That is, ** if the assumptions are correct **, the case fatality rate in China is around 20%, and the case fatality rate is around 0.5%, with a 95% confidence interval of 0.19% to 0.86%. Was estimated.

As mentioned earlier, in reality ** this assumption is not accurate **, so we can't trust anything in the 95% confidence interval, but the case fatality rate on the Diamond Princess, where the majority of passengers were over 60 years old. Is about 1% of the actual measurement, so I expect that this estimate is not largely missed.

Summary

This time, unlike previous, we were able to estimate the mortality rate of the new coronavirus using Stan.

I think I was able to express the goodness of Stan more than before. (I couldn't do it too much)

As a result, we obtained an estimated case fatality rate of 0.19% to 0.86%, which corresponds to the infection situation in China, although we had to spit on the eyebrows because we made a strong assumption.

Digression

In the previous post, [I made an estimate by converting the binomial distribution to a beta distribution](https://qiita.com/akeyhero/items/894dd3b5c206325191ce#%E3%83%99%E3%83%BC% E3% 82% BF% E5% 88% 86% E5% B8% 83% E3% 81% A7% E7% 84% A1% E7% 90% 86% E3% 81% 8F% E3% 82% 8A% E8% A7% A3% E3% 81% 84% E3% 81% A6% E3% 81% BF% E3% 82% 8B), but again, the number of deaths among infected people of the age group $ a $ in China Modeling

D_{{\rm ch}}^a \sim {\rm binomial}(N_{{\rm ch}}^a, p_{{\rm ch}}^a)

not,

p_{{\rm ch}}^a \sim {\rm beta}(D_{{\rm ch}}^a + 1, N_{{\rm ch}}^a - D_{{\rm ch}}^a + 1)

You can infer in the same way by replacing it with. If you are interested, please give it a try.

Recommended Posts

[Sequel] Is the new corona really a threat? Case fatality rate calculated with Stan
Is the new corona really a threat? Validated with Stan (was)
[New Corona] Is the next peak in December? I tried trend analysis with Python!