[PYTHON] Considering the situation in Japan by statistician Nate Silver, "The number of people infected with coronavirus is meaningless"

Introduction

American statistician Nate Silver famous for the election prediction of the US presidential election % E3% 83% BB% E3% 82% B7% E3% 83% AB% E3% 83% 90% E3% 83% BC) on a site called FiveThirtyEight , "Coronavirus Case Counts Are Meaningless" I will.

The claim in this article is ** "Unless you know enough about how the test is done, the reported number of people infected with COVID-19 is not a useful indicator [^ 1]." * It's called *. While I intuitively think "that's right", virus infection is an exponential phenomenon, so even if the number of tests is narrowed down (possibly ** arbitrarily **), the reproduction ratio I also see and hear claims that (explained later) should be observable. From a completely different point of view, there are claims that medical care will collapse if the tests are not narrowed down. Nate's article includes a simple simulation that readers can try out as an Excel Spreadsheet (https://fivethirtyeight.com/wp-content/uploads/2020/04/covidia_1.05_updated.xlsx). So that you can experience how the number of infected people is reported and the true number of infected people is different depending on the test policy (selection of test targets and total number of tests) and infection control (social distancing and lockdown). It has become. You can also observe the difference between the true reproduction rate and the observed apparent reproduction rate.

The spreadsheet model [^ 2] is convenient to easily play with the parameters and see the difference in the results, but each cell in the spreadsheet says $ \ text {BH32} $ "column" It is represented by a combination of "= alphabet" and "line = number", and their relationship is written in a BASIC-like formula, so it is not very suitable for understanding what is calculated and how. ..

[^ 2]: Nate Silver didn't read the spreadsheet as a model. It's not meant to predict anything.

So I converted the spreadsheet model into a Python program (manually) to understand what I was doing. In addition, I gave a parameter to the current situation in Japan (April 22, 2020) and tried it. Then, although the number of reported infections has just exceeded 10,000, the actual number of cases actually exceeds ** 7.5 million , which means that about 6% of the total population is infected. I got it. With this, you should think, " If you go outside, the next person is infected **".

program

You can find the jupyter notebook on github.

If you think of the infection phenomenon as a phenomenon that occurs continuously, it becomes a differential equation and it is difficult for an amateur to handle it, so it is regarded as a very rough discrete phenomenon. In other words, consider the chain of infection as a chain from the previous generation [^ 3] to the next generation, and further simplify the calculation by fixing the time difference between generations to a certain constant (for example, 5 days).

The most important parameter is the reproduction rate ($ R $), which specifies how many $ n + 1 $ infected individuals in the $ n $ generation will pass the disease. .. This depends on the infectivity of the disease, but also on the behavior of people in society. The reproduction rate of cruise ships would have been very high, and the reproduction rate of Wuhan would have been low since it was locked down and all outings were banned. If $ R $ is greater than 1, the infection will spread, and if it is less than 1, the disease will eventually disappear.

[^ 3]: Of course, this generation is not the baby boomer generation or the millennial generation, but the first infected people are called the 0th generation, the people infected from there are called the 1st generation, and so on.

parameter settings

Parameters related to disease spread are expressed by Disease_spread_params class.

Population parameters are represented by Population_params class.

Parameters related to PCR testing are expressed by Testing_params class.

The values of these parameters are given as default values by selecting plausible ones from various papers.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time, re
import datetime

iround = lambda x: int((x * 2 + 1) // 2)

class Disease_spread_params:
    def __init__(self,
                 R_uncontrolled = 2.7,
                 R_intermediate = 1.4,
                 R_lockdown = 0.7,
                 cluster = 0.5,
                 begin_intermediate = 11,
                 begin_lockdown = 15,
                 mild = 0.6,
                 asymptomatic = 0.3,
                 zero_date = '2020-01-01',
                 serial = 5):
        self.R_uncontrolled = R_uncontrolled
        self.R_intermediate = R_intermediate
        self.R_lockdown = R_lockdown
        self.cluster = cluster
        self.begin_intermediate = begin_intermediate
        self.begin_lockdown = begin_lockdown
        self.severe = 1 - mild - asymptomatic
        self.mild = mild
        self.asymptomatic = asymptomatic
        self.zero_date = zero_date
        self.serial = serial

class Population_params:
    def __init__(self,
                 population = 10_000_000,
                 initial_infection = 1,
                 faux_severe = 0.001,
                 faux_mild = 0.025,
                 desire_severe = 1.0,
                 desire_mild = 0.5,
                 desire_asymptomatic = 0.02):
        self.population = population
        self.initial_infection = initial_infection
        self.faux_severe = faux_severe
        self.faux_mild = faux_mild
        self.desire_severe = desire_severe
        self.desire_mild = desire_mild
        self.desire_asymptomatic = desire_asymptomatic

class Testing_params:
    def __init__(self,
                 initial_tests = 1_000,
                 ramp_period = 3,
                 test_growth_rate = 0.5,
                 test_max = 10_000_000,
                 rationed_test_ratio = 0.75,
                 false_negative = 0.20,
                 false_positive = 0.002,
                 delay = 2):
        self.initial_tests = initial_tests
        self.ramp_period = ramp_period
        self.test_growth_rate = test_growth_rate
        self.test_max = test_max
        self.rationed_test_ratio = rationed_test_ratio
        self.false_negative = false_negative
        self.false_positive = false_positive
        self.delay = delay

Here, as a function to round a real number to an integer, I am using my own function called ʻiround ()instead of the Python built-inround (). This is for compatibility with Excel's ROUND` function [^ 4].

[^ 4]: Python's round () is rounded to an even number, while EXCEL's round () is rounded.

Simulator

While looking at the cell-by-cell dependencies of the spreadsheet I referred to, I understood how it was calculated and reimplemented it. To be honest, it's dirty (not humble).

The Simulator class takes a parameter, instantiates it, calls therun ()method, and returns the result as pandas.DataFrame.

class Simulator:
    def __init__(self,
                 disease_spread_params,
                 population_params,
                 testing_params):
        self.disease_spread_params = disease_spread_params
        self.population_params = population_params
        self.testing_params = testing_params
        
        self.columns = [ 'date', 'actual.R', 'doubling time in days' ]
        self.date_regex = re.compile('(\d+)-(\d+)-(\d+)')

    def decode_date(self, date):
            match = self.date_regex.fullmatch(date)
            if match:
                y, m, d = match.group(1, 2, 3)
                timestamp = time.mktime((int(y), int(m), int(d), 0, 0, 0, -1, -1, -1))
                return timestamp
            return None
    
    def encode_date(self, timestamp):
        t = time.localtime(timestamp)
        return '{0:04d}-{1:02d}-{2:02d}'.format(t[0], t[1], t[2])

    def run(self, zero_day = '2020-01-01', generations = 40):
        self.generations = generations
        self.df = pd.DataFrame(index = range(0, self.generations))
        t0 = self.decode_date(zero_day)
        self.df['date'] = [ self.encode_date(t0 + d * self.disease_spread_params.serial * 24 * 60 * 60) for d in range(0, self.generations) ]

        self.set_target_R()
        self.compute_actual_infection()
        self.compute_tests()
        
        return self.df

    def set_target_R(self):
        begin_lockdown = self.disease_spread_params.begin_lockdown
        begin_intermediate = self.disease_spread_params.begin_intermediate
        self.df['target_R'] = np.NaN
        
        for i in range(0, self.generations):
            if begin_lockdown != None and i >= begin_lockdown:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_lockdown
            elif  begin_intermediate != None and i >= begin_intermediate:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_intermediate
            else:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_uncontrolled

    def compute_actual_infection(self):
        population = self.population_params.population
        initial_infection = self.population_params.initial_infection
        cluster = self.disease_spread_params.cluster
        serial = self.disease_spread_params.serial

        df = self.df
        df['susceptible'] = np.NaN
        df.at[0, 'susceptible'] = population - initial_infection
        df['new_infection'] = np.NaN
        df.at[0, 'new_infection'] = initial_infection
        df['cumulative_infection'] = np.NaN
        df.at[0, 'cumulative_infection'] = initial_infection
        df['actual_R'] = np.NaN
        df['actual_doubling_time'] = np.NaN
        
        for i in range(1, self.generations):
            df.at[i, 'new_infection'] = iround(df.at[i-1, 'susceptible']*(1-(1-((df.at[i-1, 'target_R']*(df.at[i-1, 'susceptible']/population)**cluster)/population))**df.at[i-1, 'new_infection']))
            df.at[i, 'cumulative_infection'] = df.at[i-1, 'cumulative_infection'] + df.at[i, 'new_infection']
            df.at[i, 'susceptible'] = population - df.at[i, 'cumulative_infection']
            df.at[i-1, 'actual_R'] = df.at[i, 'new_infection'] / df.at[i-1, 'new_infection']
            if df.at[i-1, 'cumulative_infection'] != 0:
                df.at[i-1, 'actual_doubling_time'] = np.inf if df.at[i, 'new_infection'] == 0 else serial * np.log(2) / np.log(df.at[i, 'cumulative_infection']/df.at[i-1, 'cumulative_infection'])

    def compute_tests(self):
        population = self.population_params.population
        ramp_period = self.testing_params.ramp_period
        tests_max = self.testing_params.test_max
        test_growth_rate = self.testing_params.test_growth_rate
        rationed_test_ratio = self.testing_params.rationed_test_ratio
        mild = self.disease_spread_params.mild
        asymptomatic = self.disease_spread_params.asymptomatic
        faux_severe = self.population_params.faux_severe
        faux_mild = self.population_params.faux_mild
        desire_severe = self.population_params.desire_severe
        desire_mild = self.population_params.desire_mild
        desire_asymptomatic = self.population_params.desire_asymptomatic
        false_negative = self.testing_params.false_negative
        false_positive = self.testing_params.false_positive
        delay = self.testing_params.delay
        serial = self.disease_spread_params.serial        

        cumulative_tests_conducted = 0
        cumulative_detected_cases = 0
        
        df = self.df
        df['tests_available'] = 0
        df['new_detected_cases'] = 0
        df['cumulative_detected_cases'] = 0
        
        for i in range(0, self.generations):
            if i  == 0:
                df.at[i, 'tests_available'] = 0
            elif i == 1:
                df.at[i, 'tests_available'] = self.testing_params.initial_tests
            elif i < ramp_period:
                df.at[i, 'tests_available'] = df.at[i-1, 'tests_available']
            else:
                df.at[i, 'tests_available'] = iround(min(tests_max, df.at[i-1, 'tests_available'] * (1 + test_growth_rate)))
            tests_available = df.at[i, 'tests_available']
            rationed_tests =  iround(tests_available * rationed_test_ratio)
            on_demand_tests = tests_available - rationed_tests

            new_infection_severe = iround(df.at[i, 'new_infection'] * (1 - mild - asymptomatic))
            new_infection_mild = iround(df.at[i, 'new_infection'] * mild)
            new_infection_asymptomatic = df.at[i, 'new_infection'] - new_infection_severe - new_infection_mild
            
            population_severe = iround((population - df.at[i, 'new_infection']) * faux_severe) + new_infection_severe
            population_mild = iround((population - df.at[i, 'new_infection']) * faux_mild) + new_infection_mild
            population_asymptomatic = population - population_severe - population_mild
            
            desiring_tests_severe = iround(population_severe * desire_severe * (1 - cumulative_tests_conducted/population))
            desiring_tests_mild = iround(population_mild * desire_mild * (1 - cumulative_tests_conducted/population))
            desiring_tests_asymptomatic = iround(population_asymptomatic * desire_asymptomatic * (1 - cumulative_tests_conducted/population))
            
            alloc_rationed_tests_severe = min(rationed_tests, desiring_tests_severe)
            alloc_rationed_tests_mild = min(desiring_tests_mild, rationed_tests-alloc_rationed_tests_severe)
            alloc_rationed_tests_asymptomatic = min(desiring_tests_asymptomatic, rationed_tests-alloc_rationed_tests_severe-alloc_rationed_tests_mild)

            unfilled_test_demand_severe = desiring_tests_severe - alloc_rationed_tests_severe
            unfilled_test_demand_mild = desiring_tests_mild - alloc_rationed_tests_mild
            unfilled_test_demand_asymptomatic = desiring_tests_asymptomatic - alloc_rationed_tests_asymptomatic
            unfilled_test_demand = unfilled_test_demand_severe + unfilled_test_demand_mild + unfilled_test_demand_asymptomatic
            
            alloc_on_demand_tests_severe = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_severe / unfilled_test_demand)
            alloc_on_demand_tests_mild = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_mild / unfilled_test_demand)
            alloc_on_demand_tests_asymptomatic = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_asymptomatic / unfilled_test_demand)
            
            tests_conducted_severe = alloc_rationed_tests_severe + alloc_on_demand_tests_severe
            tests_conducted_mild = alloc_rationed_tests_mild + alloc_on_demand_tests_mild
            tests_conducted_asymptomatic = alloc_rationed_tests_asymptomatic + alloc_on_demand_tests_asymptomatic
            df.at[i, 'tests_conducted_severe'] = tests_conducted_severe
            df.at[i, 'tests_conducted_mild'] = tests_conducted_mild
            df.at[i, 'tests_conducted_asymptomatic'] = tests_conducted_asymptomatic
            tests_conducted = tests_conducted_severe + tests_conducted_mild + tests_conducted_asymptomatic
            
            cumulative_tests_conducted += tests_conducted
            
            positive_tests_severe = iround(tests_conducted_severe * new_infection_severe / population_severe * (1 - false_negative)) + \
              iround(tests_conducted_severe * (1 - new_infection_severe / population_severe) * false_positive)
            positive_tests_mild = iround(tests_conducted_mild * new_infection_mild / population_mild * (1 - false_negative)) + \
              iround(tests_conducted_mild * (1 - new_infection_mild / population_mild) * false_positive)
            positive_tests_asymptomatic = iround(tests_conducted_asymptomatic * new_infection_asymptomatic / population_asymptomatic * (1 - false_negative)) + \
              iround(tests_conducted_asymptomatic * (1 - new_infection_asymptomatic / population_asymptomatic) * false_positive)
            if i+delay < self.generations:
                df.at[i+delay, 'new_detected_cases'] = positive_tests_severe + positive_tests_mild + positive_tests_asymptomatic

            cumulative_detected_cases += df.at[i, 'new_detected_cases']
            df.at[i, 'cumulative_detected_cases'] = cumulative_detected_cases

            if i > 0 and df.at[i-1, 'new_detected_cases'] > 0:
                df.at[i-1, 'observed_R'] = df.at[i, 'new_detected_cases'] / df.at[i-1, 'new_detected_cases']
                df.at[i-1, 'observed_doubling_time'] = np.inf if df.at[i, 'new_detected_cases'] == 0 else serial * np.log(2) / np.log(df.at[i, 'cumulative_detected_cases']/df.at[i-1, 'cumulative_detected_cases'])

simulation

Covidia scenario 1

Nate Silver's article features a simulation of the virtual country Covidia. The graph below is a visualization of the results. Covidia has a population of 10 million and will begin to spread with the arrival of the first infected person on January 1, 2020. We assume an initial reproduction rate of $ 2.7 $, an intermediate stage of $ 1.4 $, and a lockdown of $ 0.7 $.

fig1.png

The blue line shows the number of true new infections, and the yellow line shows the number of new infections reported by the test. In this scenario, the initial test capacity is small at 1,000 per generation, but by increasing the number of tests indefinitely, it seems that the number of reports is increasing following the true number of infected people, although there is a time lag. ..

However, if you look closely, you can see that in the phase where the number of infected people is increasing, it is underestimated by 20 times or more, and when it tends to decrease, it is overestimated.

Japanese scenario?

Japan has been criticized for keeping the number of PCR tests low. The Ministry of Health, Labor and Welfare reports daily [^ 5], but it is said that the number of inspections is still not keeping up with demand. Therefore, I tried what would happen if the number of PCR tests was set low for the Japanese population of 126 million. By the way, I also plotted the number of positive confirmed persons that have been actually reported so far. Whereas the graph above shows the number of newly infected people, this is the cumulative number of infected people [^ 6].

[^ 5]: For example, in Latest Report, 7826 new PCR tests were performed from the previous day's report. You can see that. Unlike the previous figure, it is a plot of the cumulative number of infected people. [^ 6]: Actual data is given cumulatively, and I thought that was more familiar to me.

![fig2.png] (https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/223481/193b7bce-f37f-4b9e-727b-223c95b838f5.png)

Here, the green line is the actual number of infected people in Japan [^ 6].

[^ 6]: I am using the data published by John Hopkins University on github.

I do not intend to predict anything with this simulation, but first I would like you to see the magnitude of the difference between the true value (blue line) and the simulation result and the reported value (green line). The following can be observed.

But what if we had increased our testing capacity early on and had unlimited testing? The following graph shows the results when the initial value of the inspection capacity is set to 10,000 and it is gradually but unlimitedly increased.

fig3.png

Even in this case, it is still not possible to capture the true number of infected people, but we were able to observe about 10 times the current number of reporters. In addition, although the rise in the number of reporters (yellow) is a little slow, the slope is close to the slope of the true number of infected people, so the difference between the two can be regarded as the difference in the left-right direction (time delay), which is correct. It seems that the production rate of $ R $ can be observed.

Summary

I think it is true that there is no difference in terms of treatment because there are no hospitals or quarantine facilities where the infected person is known. But I think that people's daily behavior may change a little if they find out that the true number of infected people is 10 times higher than reported, and in some cases 100 times higher.

I wonder if the infectious disease experts on TV are making a much more detailed version of this model to predict the future.

I don't know.

Recommended Posts

Considering the situation in Japan by statistician Nate Silver, "The number of people infected with coronavirus is meaningless"
I tried to predict the number of people infected with coronavirus in Japan by the method of the latest paper in China
Let's visualize the number of people infected with coronavirus with matplotlib
Predict the number of people infected with COVID-19 with Prophet
I tried to predict the number of people infected with coronavirus in consideration of the effect of refraining from going out
Create a BOT that displays the number of infected people in the new corona
A server that returns the number of people in front of the camera with bottle.py and OpenCV
[Homology] Count the number of holes in data with Python
I tried to predict the number of domestically infected people of the new corona with a mathematical model
Create a bot that posts the number of people positive for the new coronavirus in Tokyo to Slack
Generate a list packed with the number of days in the current month.
Display the status of COVID 19 infection in Japan with Splunk (GitHub version)
Let's calculate the transition of the basic reproduction number of the new coronavirus by prefecture