American statistician Nate Silver, famous for his US presidential election forecasts, published an article titled "Coronavirus Case Counts Are Meaningless" on the site FiveThirtyEight.
The claim of the article is that **"unless you know quite a lot about how testing is being done, the reported number of COVID-19 cases is not a useful indicator"** [^1]. Intuitively I agree, but viral infection is an exponential phenomenon, so I also see and hear claims that even if the number of tests is restricted (possibly **arbitrarily**), the reproduction rate (explained later) should still be observable. From a completely different angle, there are also claims that medical care will collapse unless testing is restricted. Nate's article includes a simple simulation that readers can try out as an [Excel spreadsheet](https://fivethirtyeight.com/wp-content/uploads/2020/04/covidia_1.05_updated.xlsx), so that you can see for yourself how the reported number of infections and the true number of infections diverge depending on the testing policy (who gets tested and how many tests are run) and on infection-control measures (social distancing and lockdown). You can also observe the difference between the true reproduction rate and the apparent reproduction rate seen in the reports.
The spreadsheet model [^2] is handy for playing with the parameters and seeing how the results change, but each cell in a spreadsheet is identified by a combination of column (a letter) and row (a number), such as $\text{BH32}$, and the relationships between cells are written as BASIC-like formulas, so it is not well suited to understanding what is being computed and how.
[^2]: Nate Silver does not call the spreadsheet a model; it is not meant to predict anything.
So, to understand what it is doing, I converted the spreadsheet model into a Python program (by hand). I then gave it parameters roughly matching the current situation in Japan (as of April 22, 2020) and ran it. The result: although the number of reported cases has only just passed 10,000, the true number of cases comes out to more than **7.5 million**, which would mean that about 6% of the total population is infected. With numbers like that, you have to assume that **if you go outside, the person next to you may well be infected**.
The Jupyter notebook is available on GitHub.
If you treat infection as a continuous phenomenon, you end up with differential equations, which are hard for an amateur to handle, so here it is treated as a very coarse discrete phenomenon. That is, the chain of infection is viewed as a chain from one generation [^3] to the next, and the calculation is simplified further by fixing the interval between generations to a constant (for example, 5 days).
The most important parameter is the reproduction rate ($R$), which specifies how many people in generation $n+1$ each infected person in generation $n$ passes the disease on to. It depends on how infectious the disease is, but also on how people in society behave. The reproduction rate on a cruise ship would have been very high, while the reproduction rate in Wuhan would have been low once the city was locked down and all outings were banned. If $R$ is greater than 1 the infection spreads; if it is less than 1 the disease eventually dies out.
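In the simplest picture, ignoring the depletion of susceptible people (which the `cluster` parameter below adjusts for), the number of new infections grows geometrically from one generation to the next, and the doubling time follows from $R$ and the generation interval. This is essentially the formula the code below uses for `actual_doubling_time` (with $R$ estimated from the growth of the cumulative count):

```math
I_{n+1} = R \, I_n, \qquad I_n = I_0 R^n, \qquad T_{\text{double}} = \text{serial} \times \frac{\ln 2}{\ln R}
```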
[^3]: This generation has nothing to do with baby boomers or millennials, of course: the first infected people are generation 0, the people they infect are generation 1, and so on.
Parameters related to the spread of the disease are represented by the `Disease_spread_params` class.

- `R_uncontrolled`: reproduction rate when no measures are taken
- `R_intermediate`: intermediate reproduction rate once people have started being careful (social distancing, hand washing, wearing masks)
- `R_lockdown`: reproduction rate once a lockdown is enforced
- `cluster`: parameter that adjusts the reproduction rate as the proportion of infected people in the population grows (0 = no adjustment, 0.5 = slight, 1 = moderate, 2 = significant)
- `begin_intermediate`: generation number at which the intermediate stage begins, after the no-measures period ends
- `begin_lockdown`: generation number at which the lockdown begins
- `mild`: proportion of infections that are mild or moderate
- `asymptomatic`: proportion of infections that are asymptomatic (the severe proportion is `1 - mild - asymptomatic`)

Parameters about the population are represented by the `Population_params` class.

- `population`: total population of the country
- `faux_severe`: proportion of people with severe COVID-19-like symptoms from causes other than the coronavirus
- `faux_mild`: proportion of people with mild COVID-19-like symptoms from causes other than the coronavirus
- `desire_severe`: proportion of severely ill people who seek a PCR test
- `desire_mild`: proportion of mildly ill people who seek a PCR test
- `desire_asymptomatic`: proportion of asymptomatic people who seek a PCR test

Parameters related to PCR testing are represented by the `Testing_params` class.

- `ramp_period`: generation number at which the number of tests starts to be increased
- `test_growth_rate`: growth rate of the number of tests once it starts increasing
- `test_max`: maximum number of tests per generation
- `rationed_test_ratio`: proportion of tests reserved so that the priority order severe > mild > asymptomatic is respected
- `false_negative`: false-negative rate of the PCR test
- `false_positive`: false-positive rate of the PCR test
- `delay`: delay from testing to getting the result (in generations)

Plausible values taken from various papers are used as the defaults for these parameters.
```python
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import time, re
import datetime

iround = lambda x: int((x * 2 + 1) // 2)

class Disease_spread_params:
    def __init__(self,
                 R_uncontrolled = 2.7,
                 R_intermediate = 1.4,
                 R_lockdown = 0.7,
                 cluster = 0.5,
                 begin_intermediate = 11,
                 begin_lockdown = 15,
                 mild = 0.6,
                 asymptomatic = 0.3,
                 zero_date = '2020-01-01',
                 serial = 5):
        self.R_uncontrolled = R_uncontrolled
        self.R_intermediate = R_intermediate
        self.R_lockdown = R_lockdown
        self.cluster = cluster
        self.begin_intermediate = begin_intermediate
        self.begin_lockdown = begin_lockdown
        self.severe = 1 - mild - asymptomatic
        self.mild = mild
        self.asymptomatic = asymptomatic
        self.zero_date = zero_date
        self.serial = serial

class Population_params:
    def __init__(self,
                 population = 10_000_000,
                 initial_infection = 1,
                 faux_severe = 0.001,
                 faux_mild = 0.025,
                 desire_severe = 1.0,
                 desire_mild = 0.5,
                 desire_asymptomatic = 0.02):
        self.population = population
        self.initial_infection = initial_infection
        self.faux_severe = faux_severe
        self.faux_mild = faux_mild
        self.desire_severe = desire_severe
        self.desire_mild = desire_mild
        self.desire_asymptomatic = desire_asymptomatic

class Testing_params:
    def __init__(self,
                 initial_tests = 1_000,
                 ramp_period = 3,
                 test_growth_rate = 0.5,
                 test_max = 10_000_000,
                 rationed_test_ratio = 0.75,
                 false_negative = 0.20,
                 false_positive = 0.002,
                 delay = 2):
        self.initial_tests = initial_tests
        self.ramp_period = ramp_period
        self.test_growth_rate = test_growth_rate
        self.test_max = test_max
        self.rationed_test_ratio = rationed_test_ratio
        self.false_negative = false_negative
        self.false_positive = false_positive
        self.delay = delay
```
Here, instead of Python's built-in `round()`, I use my own function `iround()` to round a real number to an integer. This is for compatibility with Excel's `ROUND` function [^4].

[^4]: Python's `round()` rounds halves to the nearest even number (banker's rounding), while Excel's `ROUND` rounds halves away from zero.
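A quick check of the difference (using `iround()` as defined above):

```python
print(round(2.5), round(3.5))    # -> 2 4 : Python rounds halves to the nearest even integer
print(iround(2.5), iround(3.5))  # -> 3 4 : iround rounds halves up, like Excel's ROUND(x, 0)
                                 #          for the non-negative quantities used in this model
```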
I worked out how the spreadsheet computes things by following its cell-by-cell dependencies, and reimplemented it. To be honest, the code is messy (and that is not false modesty).
The `Simulator` class is instantiated with the parameter objects above; calling its `run()` method runs the simulation and returns the result as a `pandas.DataFrame`.
```python
class Simulator:
    def __init__(self,
                 disease_spread_params,
                 population_params,
                 testing_params):
        self.disease_spread_params = disease_spread_params
        self.population_params = population_params
        self.testing_params = testing_params
        self.columns = [ 'date', 'actual.R', 'doubling time in days' ]
        self.date_regex = re.compile(r'(\d+)-(\d+)-(\d+)')

    def decode_date(self, date):
        match = self.date_regex.fullmatch(date)
        if match:
            y, m, d = match.group(1, 2, 3)
            timestamp = time.mktime((int(y), int(m), int(d), 0, 0, 0, -1, -1, -1))
            return timestamp
        return None

    def encode_date(self, timestamp):
        t = time.localtime(timestamp)
        return '{0:04d}-{1:02d}-{2:02d}'.format(t[0], t[1], t[2])

    def run(self, zero_day = '2020-01-01', generations = 40):
        self.generations = generations
        self.df = pd.DataFrame(index = range(0, self.generations))
        t0 = self.decode_date(zero_day)
        self.df['date'] = [ self.encode_date(t0 + d * self.disease_spread_params.serial * 24 * 60 * 60) for d in range(0, self.generations) ]
        self.set_target_R()
        self.compute_actual_infection()
        self.compute_tests()
        return self.df

    def set_target_R(self):
        # Target reproduction rate per generation: uncontrolled, then intermediate, then lockdown
        begin_lockdown = self.disease_spread_params.begin_lockdown
        begin_intermediate = self.disease_spread_params.begin_intermediate
        self.df['target_R'] = np.NaN
        for i in range(0, self.generations):
            if begin_lockdown is not None and i >= begin_lockdown:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_lockdown
            elif begin_intermediate is not None and i >= begin_intermediate:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_intermediate
            else:
                self.df.at[i, 'target_R'] = self.disease_spread_params.R_uncontrolled

    def compute_actual_infection(self):
        population = self.population_params.population
        initial_infection = self.population_params.initial_infection
        cluster = self.disease_spread_params.cluster
        serial = self.disease_spread_params.serial
        df = self.df
        df['susceptible'] = np.NaN
        df.at[0, 'susceptible'] = population - initial_infection
        df['new_infection'] = np.NaN
        df.at[0, 'new_infection'] = initial_infection
        df['cumulative_infection'] = np.NaN
        df.at[0, 'cumulative_infection'] = initial_infection
        df['actual_R'] = np.NaN
        df['actual_doubling_time'] = np.NaN
        for i in range(1, self.generations):
            # New infections: each of the previous generation's infected gets a chance to infect
            # susceptibles; the (susceptible/population)**cluster factor slows spread as the
            # infected share of the population grows.
            df.at[i, 'new_infection'] = iround(df.at[i-1, 'susceptible']*(1-(1-((df.at[i-1, 'target_R']*(df.at[i-1, 'susceptible']/population)**cluster)/population))**df.at[i-1, 'new_infection']))
            df.at[i, 'cumulative_infection'] = df.at[i-1, 'cumulative_infection'] + df.at[i, 'new_infection']
            df.at[i, 'susceptible'] = population - df.at[i, 'cumulative_infection']
            df.at[i-1, 'actual_R'] = df.at[i, 'new_infection'] / df.at[i-1, 'new_infection']
            if df.at[i-1, 'cumulative_infection'] != 0:
                df.at[i-1, 'actual_doubling_time'] = np.inf if df.at[i, 'new_infection'] == 0 else serial * np.log(2) / np.log(df.at[i, 'cumulative_infection']/df.at[i-1, 'cumulative_infection'])

    def compute_tests(self):
        population = self.population_params.population
        ramp_period = self.testing_params.ramp_period
        tests_max = self.testing_params.test_max
        test_growth_rate = self.testing_params.test_growth_rate
        rationed_test_ratio = self.testing_params.rationed_test_ratio
        mild = self.disease_spread_params.mild
        asymptomatic = self.disease_spread_params.asymptomatic
        faux_severe = self.population_params.faux_severe
        faux_mild = self.population_params.faux_mild
        desire_severe = self.population_params.desire_severe
        desire_mild = self.population_params.desire_mild
        desire_asymptomatic = self.population_params.desire_asymptomatic
        false_negative = self.testing_params.false_negative
        false_positive = self.testing_params.false_positive
        delay = self.testing_params.delay
        serial = self.disease_spread_params.serial
        cumulative_tests_conducted = 0
        cumulative_detected_cases = 0
        df = self.df
        df['tests_available'] = 0
        df['new_detected_cases'] = 0
        df['cumulative_detected_cases'] = 0
        for i in range(0, self.generations):
            # Testing capacity: none in generation 0, then initial_tests, then growth up to test_max
            if i == 0:
                df.at[i, 'tests_available'] = 0
            elif i == 1:
                df.at[i, 'tests_available'] = self.testing_params.initial_tests
            elif i < ramp_period:
                df.at[i, 'tests_available'] = df.at[i-1, 'tests_available']
            else:
                df.at[i, 'tests_available'] = iround(min(tests_max, df.at[i-1, 'tests_available'] * (1 + test_growth_rate)))
            tests_available = df.at[i, 'tests_available']
            rationed_tests = iround(tests_available * rationed_test_ratio)
            on_demand_tests = tests_available - rationed_tests
            new_infection_severe = iround(df.at[i, 'new_infection'] * (1 - mild - asymptomatic))
            new_infection_mild = iround(df.at[i, 'new_infection'] * mild)
            new_infection_asymptomatic = df.at[i, 'new_infection'] - new_infection_severe - new_infection_mild
            population_severe = iround((population - df.at[i, 'new_infection']) * faux_severe) + new_infection_severe
            population_mild = iround((population - df.at[i, 'new_infection']) * faux_mild) + new_infection_mild
            population_asymptomatic = population - population_severe - population_mild
            desiring_tests_severe = iround(population_severe * desire_severe * (1 - cumulative_tests_conducted/population))
            desiring_tests_mild = iround(population_mild * desire_mild * (1 - cumulative_tests_conducted/population))
            desiring_tests_asymptomatic = iround(population_asymptomatic * desire_asymptomatic * (1 - cumulative_tests_conducted/population))
            # Rationed tests go to severe cases first, then mild, then asymptomatic
            alloc_rationed_tests_severe = min(rationed_tests, desiring_tests_severe)
            alloc_rationed_tests_mild = min(desiring_tests_mild, rationed_tests-alloc_rationed_tests_severe)
            alloc_rationed_tests_asymptomatic = min(desiring_tests_asymptomatic, rationed_tests-alloc_rationed_tests_severe-alloc_rationed_tests_mild)
            unfilled_test_demand_severe = desiring_tests_severe - alloc_rationed_tests_severe
            unfilled_test_demand_mild = desiring_tests_mild - alloc_rationed_tests_mild
            unfilled_test_demand_asymptomatic = desiring_tests_asymptomatic - alloc_rationed_tests_asymptomatic
            unfilled_test_demand = unfilled_test_demand_severe + unfilled_test_demand_mild + unfilled_test_demand_asymptomatic
            # The remaining (on-demand) tests are distributed in proportion to the unfilled demand
            alloc_on_demand_tests_severe = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_severe / unfilled_test_demand)
            alloc_on_demand_tests_mild = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_mild / unfilled_test_demand)
            alloc_on_demand_tests_asymptomatic = 0 if unfilled_test_demand == 0 else iround(on_demand_tests * unfilled_test_demand_asymptomatic / unfilled_test_demand)
            tests_conducted_severe = alloc_rationed_tests_severe + alloc_on_demand_tests_severe
            tests_conducted_mild = alloc_rationed_tests_mild + alloc_on_demand_tests_mild
            tests_conducted_asymptomatic = alloc_rationed_tests_asymptomatic + alloc_on_demand_tests_asymptomatic
            df.at[i, 'tests_conducted_severe'] = tests_conducted_severe
            df.at[i, 'tests_conducted_mild'] = tests_conducted_mild
            df.at[i, 'tests_conducted_asymptomatic'] = tests_conducted_asymptomatic
            tests_conducted = tests_conducted_severe + tests_conducted_mild + tests_conducted_asymptomatic
            cumulative_tests_conducted += tests_conducted
            # Positive results = true positives (1 - false_negative) + false positives among the uninfected
            positive_tests_severe = iround(tests_conducted_severe * new_infection_severe / population_severe * (1 - false_negative)) + \
                                    iround(tests_conducted_severe * (1 - new_infection_severe / population_severe) * false_positive)
            positive_tests_mild = iround(tests_conducted_mild * new_infection_mild / population_mild * (1 - false_negative)) + \
                                  iround(tests_conducted_mild * (1 - new_infection_mild / population_mild) * false_positive)
            positive_tests_asymptomatic = iround(tests_conducted_asymptomatic * new_infection_asymptomatic / population_asymptomatic * (1 - false_negative)) + \
                                          iround(tests_conducted_asymptomatic * (1 - new_infection_asymptomatic / population_asymptomatic) * false_positive)
            # Positive results are reported `delay` generations later
            if i+delay < self.generations:
                df.at[i+delay, 'new_detected_cases'] = positive_tests_severe + positive_tests_mild + positive_tests_asymptomatic
            cumulative_detected_cases += df.at[i, 'new_detected_cases']
            df.at[i, 'cumulative_detected_cases'] = cumulative_detected_cases
            if i > 0 and df.at[i-1, 'new_detected_cases'] > 0:
                df.at[i-1, 'observed_R'] = df.at[i, 'new_detected_cases'] / df.at[i-1, 'new_detected_cases']
                df.at[i-1, 'observed_doubling_time'] = np.inf if df.at[i, 'new_detected_cases'] == 0 else serial * np.log(2) / np.log(df.at[i, 'cumulative_detected_cases']/df.at[i-1, 'cumulative_detected_cases'])
```
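A minimal sketch of how the simulator can be driven (this usage code is mine, not from the original spreadsheet). The default parameter values correspond to the Covidia scenario described next (population 10 million, $R$ of 2.7/1.4/0.7, an initial capacity of 1,000 tests), so running with the defaults reproduces that setting.

```python
# Sketch: run the model with the default parameters and plot the true vs. reported
# new infections per generation from the returned DataFrame.
sim = Simulator(Disease_spread_params(), Population_params(), Testing_params())
df = sim.run(zero_day = '2020-01-01', generations = 40)

ax = df.plot(x = 'date', y = ['new_infection', 'new_detected_cases'],
             logy = True, figsize = (10, 6))
ax.set_ylabel('people per generation')
plt.show()
```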
Nate Silver's article features a simulation of a fictional country called Covidia. The graph below visualizes the results. Covidia has a population of 10 million, and the epidemic begins with the arrival of the first infected person on January 1, 2020. The reproduction rate is assumed to be $2.7$ initially, $1.4$ in the intermediate stage, and $0.7$ under lockdown.
The blue line shows the true number of new infections, and the yellow line shows the number of new infections reported through testing. In this scenario the testing capacity starts small, at 1,000 tests per generation, but because the number of tests is allowed to grow without limit, the reported count eventually follows the true number of infections, albeit with a time lag.
If you look closely, however, you can see that in the phase where infections are growing the reported count underestimates the true count by a factor of 20 or more, and once infections start to decline it overestimates them.
Japan has been criticized for keeping the number of PCR tests low. The Ministry of Health, Labour and Welfare reports the counts daily [^5], but the number of tests is said to still not be keeping up with demand. So I tried what happens when the number of PCR tests is kept low for Japan's population of 126 million. I also plotted the number of confirmed positive cases actually reported so far. Whereas the earlier graph showed the number of new infections, this one shows the cumulative number of infections [^6].
[^5]: For example, the latest report shows that 7,826 new PCR tests were performed since the previous day's report.
[^6]: The actual data is published cumulatively, and I found that form more familiar.
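For reference, here is a sketch of how such a run can be set up. The exact values below (the low testing cap in particular) are my own illustrative choices, not necessarily the settings used for the figure; the "expanded testing" scenario later in the article differs only in the testing parameters (an initial capacity of 10,000 tests that keeps growing).

```python
# Illustrative Japan-like run: 126 million people, testing capped at a low level.
# These specific values are assumptions for illustration only.
japan = Simulator(Disease_spread_params(),
                  Population_params(population = 126_000_000),
                  Testing_params(initial_tests = 1_000, test_max = 50_000))
df_japan = japan.run(zero_day = '2020-01-01', generations = 25)

df_japan.plot(x = 'date', y = ['cumulative_infection', 'cumulative_detected_cases'],
              figsize = (10, 6))
plt.show()
```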
![fig2.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/223481/193b7bce-f37f-4b9e-727b-223c95b838f5.png)
Here, the green line is the actually reported number of cases in Japan [^7].
[^7]: I am using the data published by Johns Hopkins University on GitHub.
I do not intend to predict anything with this simulation, but first I would like you to see the sheer size of the gap between the true values in the simulation result (blue line) and the reported values (green line). The following can be observed.
But what if testing capacity had been ramped up early on and without limit? The following graph shows the result when the initial testing capacity is set to 10,000 and then increased gradually but without bound.
Even in this case the true number of infections still cannot be captured, but we would observe roughly 10 times as many reported cases as we do now. Moreover, although the rise in reported cases (yellow) lags a little, its slope is close to that of the true infection curve, so the difference between the two can be regarded mostly as a horizontal shift (a time delay), which means the correct reproduction rate $R$ can be observed.
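In the DataFrame this comparison can be made directly by plotting the true and the apparent reproduction rates side by side (a sketch, using the `df` from the Covidia run above; the column names are those produced by `Simulator`):

```python
# True reproduction rate vs. the one apparent from reported cases
ax = df.plot(x = 'date', y = ['actual_R', 'observed_R'], figsize = (10, 4))
ax.axhline(1.0, color = 'gray', linestyle = ':')   # R = 1 separates spread from containment
plt.show()
```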
It is probably true that it would make no difference in terms of treatment, since there are neither hospitals nor quarantine facilities to take in the infected people even if we knew who they were. But I think people's everyday behavior might change a little if they learned that the true number of infections is 10 times, and in some cases 100 times, what is being reported.
I wonder whether the infectious disease experts appearing on TV are building a much more detailed version of this kind of model to predict the future.
I don't know.