[Python] Pre-processing when graphing open data regarding COVID-19 infection status

Introduction

Before reading this article, I recommend reading the following one: "COVID-19 infection status graph often seen on Twitter recently".

The most important point of that article is this: **the images generated by my program do not accurately visualize the COVID-19 infection status.**

The purpose of this article is not to visualize the exact infection status but to describe the details of the data preprocessing. Please read it with this in mind.

The source code is available at the link below: https://github.com/takumi13/COVID19_DataVisualizer

Table of contents

  1. What I did

  2. Operating environment

  3. Program execution result

  4. Preprocessing details
    4.1. CSV file structure
    4.2. CSV file preprocessing
    4.3. Processing of moving average
    4.4. Graphing

  5. Summary

1. What I did

Using the data source on the number of COVID-19 infections and PCR tests published on GitHub by Kazuki Ogiwara of the Toyo Keizai Online editorial department (https://github.com/kaz-ogiwara/covid19/), I created a program that graphs the number of newly infected persons, the number of new PCR tests (persons), and the positive rate (%) in Tokyo, and saves the result as image data. Before generating graphs, please clone the latest version of the data source from the above URL as needed.

git clone https://github.com/kaz-ogiwara/covid19.git

This article describes in detail one way to process the CSV data and to apply a moving average.

2. Operating environment

Windows 10, Python 3.7.5

3. Program execution result

The program can be run with the following commands. Simply executing main.py as is should be enough.

git clone https://github.com/takumi13/COVID19_DataVisualizer.git
cd COVID19_DataVisualizer
python main.py

The program generates the image shown below. As a reminder, this graph, especially the positive rate before May (red line), is far from the actual situation (the actual rate is about 20 to 30%).

tokyo_pcr_all.png

In addition to the above, running the program generates several more graph images in the img folder.

4. Preprocessing details

4.1. CSV file structure

prefectures.csv


"year","month","date","prefectureNameJ","prefectureNameE","testedPositive","peopleTested","discharged","deaths"
2020,3,11,"Hokkaido","Hokkaido",118,1069,"",""
2020,3,11,"Aomori Prefecture","Aomori",0,58,"",""
2020,3,11,"Iwate Prefecture","Iwate",0,20,"",""
...
2020,7,14,"Miyazaki prefecture","Miyazaki",20,1716,17,0
2020,7,14,"Kagoshima prefecture","Kagoshima",155,7150,30,0
2020,7,14,"Okinawa Prefecture","Okinawa",148,3326,141,7

From left to right: year, month, day, prefecture, prefecture (English notation), cumulative number of infected people, cumulative number of PCR tests, cumulative number of discharged patients, cumulative number of deaths.
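As a minimal sketch of how these columns map to values, the following reads two of the sample rows above with Python's standard csv module. This is an alternative to the manual split used later in this article, not the method of the original program. The helper to_int and the output field names are hypothetical and exist only for this illustration.

```python
import csv
import io

# Two rows copied from the sample above; a real run would open prefectures.csv instead.
SAMPLE = '''"year","month","date","prefectureNameJ","prefectureNameE","testedPositive","peopleTested","discharged","deaths"
2020,3,11,"Hokkaido","Hokkaido",118,1069,"",""
2020,7,14,"Okinawa Prefecture","Okinawa",148,3326,141,7
'''


def to_int(value):
    """Treat an empty field as 0 (hypothetical helper for this sketch)."""
    return int(value) if value else 0


rows = []
for record in csv.DictReader(io.StringIO(SAMPLE)):
    rows.append({
        # Zero-padded 'MMDD' string, e.g. '0311'
        "date": f"{int(record['month']):02d}{int(record['date']):02d}",
        "prefecture": record["prefectureNameE"],
        "positives": to_int(record["testedPositive"]),
        "tested": to_int(record["peopleTested"]),
        "discharged": to_int(record["discharged"]),
        "deaths": to_int(record["deaths"]),
    })

for row in rows:
    print(row)
```

The csv module strips the surrounding quotes, so empty fields arrive as empty strings rather than as the two-character string '""'.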

4.2. CSV file preprocessing

The CSV file is preprocessed by passing the data through the following data structures in order.

  1. List [year, month, day, "prefecture", "prefecture (English notation)", cumulative number of positives, cumulative number of PCR tests, cumulative number of discharged patients, cumulative number of deaths]
  2. List ['month and day', cumulative number of positives in a certain prefecture, cumulative number of tests in a certain prefecture]
  3. Dictionary {'month and day': [cumulative number of positives, cumulative number of PCR tests]}
  4. List ['month and day', cumulative number of positives, cumulative number of PCR tests]
  5. Dictionary {'month and day': [number of new positives, number of new PCR tests]}

The code below is split into stages, but treat it as one continuous script from start to finish.


  1. List [year, month, day, "prefecture", "prefecture (English notation)", cumulative number of infected people, cumulative number of PCR tests, cumulative number of discharged patients, cumulative number of deaths]
  2. List ['month and day', cumulative number of positives in a certain prefecture, cumulative number of tests in a certain prefecture]

with open('prefectures.csv', encoding='utf-8') as f:  # Read the CSV file
    lines = f.readlines()
lines = lines[1:]   # Drop the header row
data_japan = []     # Data for all prefectures
data_tokyo = []     # Data for Tokyo only
for line in lines:
    line = line.split(',')
    # 1. [year, month, day, prefecture, cumulative positives, cumulative PCR tests]
    line = [line[0], line[1], line[2], line[3], line[5], line[6]]
    if line[5] == '""':               # If the number of PCR tests is missing,
        line[5] = '0'                 # substitute '0' (normalization)
    line[1] = line[1].rjust(2, '0')   # Normalize the month to '01'-'12'
    line[2] = line[2].rjust(2, '0')   # Normalize the day to '01'-'31'
    # 2. ['MMDD', cumulative positives in the prefecture, cumulative tests in the prefecture]
    if 'Tokyo' in line[3]:
        data_tokyo.append([line[1]+line[2], int(line[4]), int(line[5])])
    data_japan.append([line[1]+line[2], int(line[4]), int(line[5])])

  3. Dictionary {'month and day': [cumulative number of positives, cumulative number of PCR tests]}
# d[0] holds the 'month and day' string, so it is used as the dictionary key
# Here, the positives and PCR tests of all 47 prefectures are summed for each date

dic_japan = {}                      # Japan 3. {'MMDD': [cumulative positives, cumulative PCR tests]}
dic_tokyo = {}                      # Tokyo 3. {'MMDD': [cumulative positives, cumulative PCR tests]}

for d in data_tokyo:                # Initialize the totals for each date (assumes Tokyo has a row for every date)
    dic_japan[d[0]] = [0, 0]
    dic_tokyo[d[0]] = [0, 0]

for d in data_japan:                # Sum over all prefectures, keyed by date
    dic_japan[d[0]][0] += d[1]      # Add this prefecture's cumulative positives for date d[0]
    dic_japan[d[0]][1] += d[2]      # Add this prefecture's cumulative PCR tests for date d[0]

for d in data_tokyo:                # Same for Tokyo only, keyed by date
    dic_tokyo[d[0]][0] += d[1]      # Add Tokyo's cumulative positives for date d[0]
    dic_tokyo[d[0]][1] += d[2]      # Add Tokyo's cumulative PCR tests for date d[0]
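
To make the aggregation step concrete, here is a small self-contained sketch. It uses collections.defaultdict instead of the explicit initialization loop above, which is a substitution made for this illustration. The rows for '0312' are made-up values, not real data.

```python
from collections import defaultdict

# Rows in the form of structure 2: ['MMDD', cumulative positives, cumulative tests].
# The '0311' rows come from the sample CSV; the '0312' rows are illustrative only.
data_japan = [
    ["0311", 118, 1069],  # prefecture A
    ["0311", 0, 58],      # prefecture B
    ["0312", 120, 1100],  # prefecture A
    ["0312", 1, 70],      # prefecture B
]

# Each new date starts at [0, 0], so no separate initialization pass is needed.
totals = defaultdict(lambda: [0, 0])
for date, positives, tested in data_japan:
    totals[date][0] += positives
    totals[date][1] += tested

dic_japan = dict(totals)
print(dic_japan)  # {'0311': [118, 1127], '0312': [121, 1170]}
```

With defaultdict, a date that is missing for one prefecture does not raise a KeyError, which is the main design difference from the two-pass version.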

  4. List ['month and day', cumulative number of positives, cumulative number of PCR tests]
# Convert the dictionaries to lists so that they are easier to index in the next step

list_japan = []   # Japan 4. ['MMDD', cumulative positives, cumulative PCR tests]
list_tokyo = []   # Tokyo 4. ['MMDD', cumulative positives, cumulative PCR tests]
for day in dic_japan:
    list_japan.append([day, dic_japan[day][0], dic_japan[day][1]])
    list_tokyo.append([day, dic_tokyo[day][0], dic_tokyo[day][1]])

  5. Dictionary {'month and day': [number of new positives, number of new PCR tests]}
dic_day_japan = {}   # Japan 5. {'MMDD': [new positives, new PCR tests]}
dic_day_tokyo = {}   # Tokyo 5. {'MMDD': [new positives, new PCR tests]}

for i, day in enumerate(dic_japan):
    if i > 0:
        # New count = today's cumulative value minus the previous day's cumulative value
        dic_day_japan[day] = [list_japan[i][1] - list_japan[i-1][1], list_japan[i][2] - list_japan[i-1][2]]
        dic_day_tokyo[day] = [list_tokyo[i][1] - list_tokyo[i-1][1], list_tokyo[i][2] - list_tokyo[i-1][2]]

In the end, this produces a dictionary whose keys are month-and-day strings and whose values are lists of [number of new positives, number of new PCR tests] for that date.
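
The conversion from cumulative to daily values can be sketched on its own as follows. The numbers are illustrative, not real data, and the pairing with zip is a variation on the index-based loop above rather than the original program's code.

```python
# Illustrative cumulative values keyed by 'MMDD' (not real data):
# [cumulative positives, cumulative PCR tests]
cumulative = {
    "0311": [118, 1127],
    "0312": [121, 1170],
    "0313": [121, 1170],   # nothing newly reported on this day
    "0314": [130, 1400],
}

dates = list(cumulative)   # dicts keep insertion order in Python 3.7+
daily = {}
for prev, cur in zip(dates, dates[1:]):
    # Daily value = cumulative value on this date minus the previous date's value
    daily[cur] = [
        cumulative[cur][0] - cumulative[prev][0],
        cumulative[cur][1] - cumulative[prev][1],
    ]

print(daily)
# {'0312': [3, 43], '0313': [0, 0], '0314': [9, 230]}
```

The first date has no previous value to subtract, so it does not appear in the result, just as the `if i > 0` check skips it above.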

4.3. Processing of moving average

For the nationwide PCR test counts, several days of data are sometimes reported together at a later date, so graphing the raw data as is produces days such as "0 tests, 30 positives," which is very inconvenient. Applying a moving average mitigates this.

japan_plus  = []    # Number of new positives nationwide
japan_all   = []    # Number of new PCR tests nationwide
tokyo_plus  = []    # Number of new positives in Tokyo
tokyo_all   = []    # Number of new PCR tests in Tokyo

for key in dic_day_tokyo:
    tokyo_plus.append(dic_day_tokyo[key][0])
    tokyo_all.append(dic_day_tokyo[key][1])

for key in dic_day_japan:
    japan_plus.append(dic_day_japan[key][0])
    japan_all.append(dic_day_japan[key][1])

move_ave_n = 7                      # Moving average window width
k = move_ave_n//2                   # Half-width: number of days on each side of the center

japan_plus_ave = japan_plus[k:-k]   # Drop the oldest and newest 3 days so the list lines up with the averaged tests
tokyo_plus_ave = tokyo_plus[k:-k]   # Same as above

japan_all_ave = []                  # Japan: moving average of new PCR tests
tokyo_all_ave = []                  # Tokyo: moving average of new PCR tests
for i in range(k, len(tokyo_all)-k):
    japan_all_ave.append(sum(japan_all[i-k : i+k+1]) / move_ave_n)
    tokyo_all_ave.append(sum(tokyo_all[i-k : i+k+1]) / move_ave_n)

japan_ratio_ave = []                # Japan: positive rate (%) based on the moving average of new PCR tests
tokyo_ratio_ave = []                # Tokyo: positive rate (%) based on the moving average of new PCR tests
for i in range(len(japan_all_ave)):
    japan_ratio_ave.append(round(100 * japan_plus_ave[i] / japan_all_ave[i], 1))
    tokyo_ratio_ave.append(round(100 * tokyo_plus_ave[i] / tokyo_all_ave[i], 1))

This part is especially important:

japan_all_ave = []                  # Japan: moving average of new PCR tests
tokyo_all_ave = []                  # Tokyo: moving average of new PCR tests
for i in range(k, len(tokyo_all)-k):
    japan_all_ave.append(sum(japan_all[i-k : i+k+1]) / move_ave_n)
    tokyo_all_ave.append(sum(tokyo_all[i-k : i+k+1]) / move_ave_n)

`sum(japan_all[i-k : i+k+1])` computes the total over day `i` and the three days before and after it. Dividing that total by `move_ave_n` gives the 7-day moving average.
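
As a minimal sketch of the same centered moving average in isolation, the function below wraps the slicing logic shown above. The name centered_moving_average is hypothetical, and the input values are illustrative rather than real data: they imitate test results for several days arriving in one batch.

```python
def centered_moving_average(values, window=7):
    """Return the centered moving average of values.

    window is assumed to be odd. The first and last window // 2 entries
    have no full window around them, so they are dropped from the result.
    """
    k = window // 2
    return [
        sum(values[i - k:i + k + 1]) / window
        for i in range(k, len(values) - k)
    ]


# Illustrative daily test counts (not real data): results arrive in batches.
daily_tests = [0, 0, 700, 0, 0, 0, 700, 0, 0]
print(centered_moving_average(daily_tests))  # [200.0, 200.0, 200.0]
```

The raw series alternates between 0 and 700, while every averaged value is 200.0, which is why the averaged test counts give a more stable denominator for the positive rate. Note that the output is shorter than the input by `2 * k` entries.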

4.4. Graphing

After the steps above, the normalized data is finally graphed. The matplotlib code is based on the articles below, so please refer to them.

Make a 2-axis graph with Python / matplotlib
[Matplotlib] Display multiple column charts side by side
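
For orientation only, here is a rough sketch of a two-axis graph with matplotlib: bars for the number of tests on the left axis and a line for the positive rate on the right axis. It is not the code of the original program. The values are illustrative, the output file name sketch_two_axes.png is hypothetical, and the Agg backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # Render off-screen so no display is needed
import matplotlib.pyplot as plt

# Illustrative values only (not real data).
dates = ["0401", "0402", "0403", "0404", "0405"]
new_tests = [200.0, 250.0, 300.0, 280.0, 320.0]   # averaged number of new PCR tests
positive_rate = [5.0, 6.0, 4.5, 5.5, 5.0]         # positive rate in %

x = list(range(len(dates)))

fig, ax_count = plt.subplots(figsize=(8, 4))
ax_count.bar(x, new_tests, color="tab:blue", label="New PCR tests")
ax_count.set_ylabel("People")
ax_count.set_xticks(x)
ax_count.set_xticklabels(dates)

# twinx() creates a second y-axis that shares the same x-axis.
ax_rate = ax_count.twinx()
ax_rate.plot(x, positive_rate, color="tab:red", label="Positive rate")
ax_rate.set_ylabel("Positive rate (%)")

fig.tight_layout()
fig.savefig("sketch_two_axes.png")  # hypothetical output file name
```

Keeping counts and percentages on separate axes avoids squashing the rate line against the bottom of the chart, since the two series differ by orders of magnitude.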

5. Summary

This article described the series of steps from reading a CSV file to preprocessing the data for graphing. I will describe the details of the graphing part (4.4) in a separate article when I have time.

Thank you for reading to the end.
