[PYTHON] Efficient PCR testing by the pool method

Introduction

This article was inspired by an Asahi Shimbun Digital piece, "Can PCR test numbers be increased by testing many samples 'all together'?" I found the pool-based PCR testing method described there interesting, so I ran various experiments on my own and wrote up the results and observations. ** I have no expertise in PCR testing itself. Please treat this article as a thought experiment in an extremely idealized situation. **

What is a PCR test?

According to Wikipedia, a PCR test is:

A series of reactions, or a technique, that amplifies a specific region of a DNA sample millions to billions of times.

By exponentially amplifying copies of an arbitrary gene region or genomic region through a series of temperature-change cycles using an enzyme called DNA polymerase, a small amount of DNA sample is amplified to a quantity sufficient to study in detail.

So that is the kind of technique it is. As an outsider I can't comment on the specialized details, but a single test apparently takes quite a long time, and since one is run per patient, the overall cost is high.

Pool method PCR test

According to the Asahi Shimbun Digital article mentioned above

Theoretically, if even one of the 500 samples contains the virus, the nucleic acid of the virus should be amplified and positive.

As this suggests, instead of performing PCR tests individually, the idea is to improve efficiency by first screening for positive samples by testing several samples at once. How bold. ** Is that really okay? ** is how it feels, but the following can be inferred from the article's description:

- Mixing samples is easier than performing a PCR test
- If at least one sample in a mixture contains the virus, it can still be detected by PCR after being mixed with other samples
- Even when samples are mixed, the time required for one PCR test does not change much

If the positive rate is low enough, rather than testing samples one by one, it should be more efficient to test the mixture once and retest the individual samples only when the mixture is positive. A method like this is apparently called the ** pool method **.

Problem definition

Let's simulate how efficient this can actually be. ** There are no doubt various problems in reality **, but in what follows I will set those aside and define the problem in an extremely idealized setting. Suppose *N* samples are obtained, and we want to determine whether each sample is positive or negative ** with as few tests as possible ** using the following operations.

- The *N* samples are divided into groups of *M* and mixed (pooled); this can be done in a much shorter time than a PCR test.
- Each pooled sample is PCR tested; the time required for one test is assumed to be constant no matter how many samples are pooled.
- If the test result of a pooled sample is negative, testing of that pool is complete.
- If it is positive, all *M* samples used in that pool are retested one by one.

Obviously, if every sample is PCR tested individually, *N* tests complete the job. The following simulates how much pooling can improve on this.
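Before simulating, it is worth noting (my own addition, not in the original article) that this scheme has a simple closed-form expectation: a pool of size *M* with per-sample positive rate *p* always costs one pooled test, plus *M* retests with probability 1 - (1-p)^M. A minimal sketch:

# Expected number of tests per pool of size M at positive rate p
# (my own sanity check, not code from the original article)
def expected_tests_per_pool(M, p):
    return 1 + M * (1 - (1 - p) ** M)

# e.g. M=10, p=0.01: about 1.96 tests per 10 samples, i.e. ~0.2 tests per sample
print(expected_tests_per_pool(10, 0.01))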

Simulation

I simulated with a sample size of *N = 100,000* and *1000* sample sets (trials), varying the pooling size *M* and the per-sample positive rate. Let's look at the results along with the Python code used. First, import the required libraries.

import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib  # enables Japanese fonts in matplotlib plots
from numpy.random import rand

Set the various parameters (the case of a positive rate of 0.01 is shown here).

sample_number = 10**5  # sample size (number of specimens)
positive_rate = 0.01   # positive rate
sample_size = 1000     # number of simulation trials
# candidate pooling sizes
pool_sizes = [2,3,4,5,8,10,15,20,25,50,75,100,200,500,1000,5000]

Next, define a function that returns the total number of tests given a sample and a pooling size.

# A function that returns the total number of tests
def pcr(pool_size, sample):
    # pool_size: number of specimens pooled into one test
    # sample: cumulative count of positives (length N+1, starting at 0)
    N = len(sample) - 1
    count = (N + pool_size - 1) // pool_size  # number of pooled tests
    # a pool is positive iff the cumulative count increases across it, so the
    # number of positive pools is the number of distinct boundary values minus 1
    boundaries = np.append(sample[::pool_size], sample[-1])
    count += (len(np.unique(boundaries)) - 1) * pool_size  # retests for positive pools
    return count
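As a quick sanity check (my own toy example, not from the original article), consider 6 specimens where only the 3rd is positive; with a pooling size of 3 we expect 2 pooled tests plus 3 retests of the positive pool:

toy = np.array([0, 0, 0, 1, 1, 1, 1])  # cumulative counts for 6 specimens, 3rd positive
print(pcr(3, toy))  # 2 pooled tests + 3 retests = 5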

Now that we are ready, we will actually generate data with random numbers and experiment.

# Data generation: each trial is the cumulative sum (cumsum) of positives,
# with a leading 0, which makes the later pool checks efficient
samples = [np.insert((rand(sample_number)<positive_rate).cumsum(),0,0) for _ in range(sample_size)]

# Run the simulation and save the results
rslt = dict()
for pool_size in pool_sizes:
    rslt[pool_size] = [pcr(pool_size,sample) for sample in samples]
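As a side note (my own illustration, not from the article), the cumulative-sum representation lets any pool be checked in constant time: the pool covering specimens i+1 through j is positive exactly when the cumulative count increases across it.

cum = samples[0]
i, j = 0, 10  # the pool covering specimens 1..10
print(cum[j] > cum[i])  # True iff that pool contains at least one positive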

Finally, we visualize the results for each pooling size with a box plot.

# Draw the box plot
plt.rcParams["font.size"] = 18
points = tuple(rslt.values())
fig, ax = plt.subplots()
bp = ax.boxplot(points)
ax.set_xticklabels(tuple(rslt.keys()))
plt.title("Positive rate "+str(int(100*positive_rate))+"%")
plt.xlabel("Pooling size")
plt.xticks(rotation=45)
plt.ylabel("Total number of tests")
plt.axhline(sample_number,color='black',ls='--')  # baseline: testing one by one
plt.show()

Simulation results

The results for positive rates of 1%, 5%, 10%, and 30% are as follows.

[Figure: スクリーンショット 2020-07-15 0.37.25.png]

In each graph you can see that efficiency improves as the pooling size increases at first, but degrades when it is increased too much. The pooling size can therefore be neither too large nor too small, and there appears to be an optimal value that depends on the positive rate. Moreover, the lower the per-sample positive rate, the greater the gain: at a positive rate of 1%, at most about 20,000 tests suffice for 100,000 samples (roughly a 5x efficiency gain). On the other hand, when the per-sample positive rate is high, pooling was confirmed not to improve efficiency, and it can even be worse than testing one by one.
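As a rough analytic cross-check (my own addition, not in the original article), minimizing the expected tests per specimen, 1/M + 1 - (1-p)^M, over M shows the same pattern; for small p the optimum is near M ≈ 1/√p:

# find the pooling size minimizing expected tests per specimen
# (my own cross-check, not code from the original article)
for p in [0.01, 0.05, 0.10, 0.30]:
    M = np.arange(2, 201)
    cost = 1/M + 1 - (1 - p)**M
    print(p, M[np.argmin(cost)], round(cost.min(), 3))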

A more efficient pooling-size algorithm

Can the testing be done even more efficiently? In the method above, the pooling size was fixed regardless of the intermediate test results, but it seems we can test more efficiently by changing the pooling size dynamically according to the results (for example, starting with a large pooling size and then decreasing it). Here, we improve efficiency further by changing the pooling size in a binary-search fashion. The concrete algorithm is as follows.

  1. Divide the samples exactly in half into a first half and a second half (if the number is odd, put the extra one in the second half)
  2. Pool and test the first half and the second half separately
  3. For each of the first half and the second half:
    - If negative, testing of that half ends
    - If positive, divide it into a first half and a second half again and repeat from step 2

Let's call this algorithm the ** bisection pooling test **. In the bisection pooling test, there is no need to fix the pooling size in advance; by pooling the remaining samples in half each time according to the test results, the test proceeds while the pooling size changes dynamically. Intuitively, this should be more efficient than a fixed pooling size. Let's implement it. The bisection pooling test can be written fairly cleanly with a recursive function.

def binary_pcr(sample):
    # sample: cumulative count of positives (length = number of specimens + 1)
    N = len(sample) - 1  # number of specimens in this range
    if N <= 1:  # a single specimen is simply tested individually
        return N
    if sample[0] == sample[-1]:  # the pooled test of this range is negative: done
        return 1
    # otherwise count this positive pooled test, then split the range in half
    # (the halves share the boundary value of the cumulative counts)
    mid = N // 2
    return 1 + binary_pcr(sample[:mid+1]) + binary_pcr(sample[mid:])
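A quick check on toy inputs (my own examples, not from the original article): an all-negative batch is cleared in a single pooled test, while a batch containing a positive is resolved by repeated halving:

toy_neg = np.zeros(7, dtype=int)           # 6 specimens, all negative
toy_pos = np.array([0, 0, 0, 1, 1, 1, 1])  # 6 specimens, only the 3rd positive
print(binary_pcr(toy_neg))  # 1: one pooled test clears everything
print(binary_pcr(toy_pos))  # 7: isolating a lone positive in a small batch costs extra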

Results of the bisection pooling test

The simulation results of the bisection pooling test are as follows.

[Figure: bi.png]

You can see that the testing is more efficient than the earlier method with a fixed pooling size. (In particular, at a positive rate of 1%, the total number of tests is under 6,000 for 100,000 samples, roughly a 16x efficiency gain.)

Summary

The simulations confirmed that, in an idealized problem setting, the number of costly PCR tests can be reduced by testing samples together. However, with a fixed pooling size, an appropriate value must be chosen according to the positive rate. The bisection pooling test, on the other hand, dynamically decides how many samples to pool next based on each pooled result, so no such tuning is needed, and its performance is high. With both methods, the lower the positive rate, the greater the efficiency gain.

Notes

  1. ** First of all, as a major premise, it seems that pool-method PCR testing is not actually performed in Japan **
  2. PCR tests produce false positives and false negatives with some probability in the first place, so even the pool method cannot yield a perfect test.
  3. In reality, a positive sample diluted in a pool may apparently slip through the test.
  4. The assumption that samples can be pooled without limit is also quite dubious in reality.

In reality, there may well be reasons, clear only to specialists and people in the field, why pool-based PCR testing is not carried out.
