We recommend reading the following article before this one: "COVID-19 infection status graph often seen on Twitter recently".
The most important point of that article is that **the images generated by my program do not accurately visualize the COVID-19 infection status**.
The purpose of this article is not to visualize the exact infection status but to describe the details of the data preprocessing. Please read it with this in mind.
If you want to see the source code, follow this link: https://github.com/takumi13/COVID19_DataVisualizer
1. What I did
2. Operating environment
3. Program execution result
4. Preprocessing details
   4.1. CSV file structure
   4.2. CSV file preprocessing
   4.3. Moving average processing
   4.4. Graphing
5. Summary
I obtained the data on the number of COVID-19 infections and the number of PCR tests from the GitHub repository published by Kazuki Ogiwara of the Toyo Keizai Online editorial department (https://github.com/kaz-ogiwara/covid19/). I then wrote a program that graphs the number of newly infected people, the number of people newly tested by PCR, and the positive rate (%) in Tokyo, and saves the result as image data. To generate the graphs, clone the latest version of the data source from the URL above as needed.
git clone https://github.com/kaz-ogiwara/covid19.git
This article describes in detail one way to process the CSV data and to compute the moving average.
Windows 10, Python 3.7.5
The program can be run with the following commands. It works if you simply run main.py as is.
git clone https://github.com/takumi13/COVID19_DataVisualizer.git
cd COVID19_DataVisualizer
python main.py
Running the program generates a graph image. As a reminder, this graph is far from the actual situation, especially the positive rate before May (red line); the actual figure is about 20 to 30%. In addition, multiple graph images are written to the img folder when the program runs.
prefectures.csv
"year","month","date","prefectureNameJ","prefectureNameE","testedPositive","peopleTested","discharged","deaths"
2020,3,11,"Hokkaido","Hokkaido",118,1069,"",""
2020,3,11,"Aomori Prefecture","Aomori",0,58,"",""
2020,3,11,"Iwate Prefecture","Iwate",0,20,"",""
...
2020,7,14,"Miyazaki prefecture","Miyazaki",20,1716,17,0
2020,7,14,"Kagoshima prefecture","Kagoshima",155,7150,30,0
2020,7,14,"Okinawa Prefecture","Okinawa",148,3326,141,7
From left to right, the columns are: year, month, day, prefecture, prefecture (English notation), cumulative number of infected people, cumulative number of people tested by PCR, cumulative number of people discharged, and cumulative number of deaths.
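As a side note, the column layout above can also be read with Python's standard `csv` module, which strips the surrounding quotes so that an empty field arrives as `''` rather than `'""'`. The following is only a minimal sketch, not the code used in the repository; the helper name `parse_rows` and the inline sample rows are assumptions made for illustration.

```python
import csv
import io

# Inline sample in the same layout as prefectures.csv (illustrative rows only).
SAMPLE = '''"year","month","date","prefectureNameJ","prefectureNameE","testedPositive","peopleTested","discharged","deaths"
2020,3,11,"Hokkaido","Hokkaido",118,1069,"",""
2020,7,14,"Okinawa Prefecture","Okinawa",148,3326,141,7
'''


def parse_rows(text):
    """Return one dict per data row, reading empty numeric fields as 0."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            # Zero-padded month and day, e.g. '0311'
            'time': rec['month'].zfill(2) + rec['date'].zfill(2),
            'prefecture': rec['prefectureNameE'],
            'positive': int(rec['testedPositive'] or 0),
            'tested': int(rec['peopleTested'] or 0),
            'discharged': int(rec['discharged'] or 0),
            'deaths': int(rec['deaths'] or 0),
        })
    return rows


rows = parse_rows(SAMPLE)
print(rows[0])
```

Reading the columns by name avoids depending on their positions, at the cost of a slightly longer loop than plain `split(',')`.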
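The CSV file is preprocessed by passing the data through the following data structures in order. The numbers match the numbered comments in the code, and 'time' is the zero-padded month and day (for example '0311').
1. [year, month, day, prefecture, cumulative positives, cumulative PCR tests]
2. [time, cumulative positives, cumulative PCR tests] (one entry per prefecture and date)
3. {'time': [cumulative positives, cumulative PCR tests]}
4. ['time', cumulative positives, cumulative PCR tests]
5. {'time': [new positives, new PCR tests]}
The code below is split into stages, but treat it as one continuous script from start to finish.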
with open('prefectures.csv', encoding='utf-8') as f:  # Read the CSV file
    lines = f.readlines()
lines = lines[1:]  # Drop the header row
data_japan = []  # Data for all prefectures
data_tokyo = []  # Data for Tokyo only
for line in lines:
    line = line.split(',')
    # 1. [year, month, day, prefecture, cumulative positives, cumulative PCR tests]
    line = [line[0], line[1], line[2], line[3], line[5], line[6]]
    if line[5] == '""':  # If the number of PCR tests is missing,
        line[5] = '0'    # substitute '0' (normalization)
    line[1] = line[1].rjust(2, '0')  # Normalize the month to '01'-'12'
    line[2] = line[2].rjust(2, '0')  # Normalize the day to '01'-'31'
    if 'Tokyo' in line[3]:
        data_tokyo.append([line[1]+line[2], int(line[4]), int(line[5])])
    # 2. [time, cumulative positives in one prefecture, cumulative tests in that prefecture]
    data_japan.append([line[1]+line[2], int(line[4]), int(line[5])])
# d[0] holds the 'time', so it is used as the dictionary key.
# For each date, add up the positives and PCR tests of all 47 prefectures.
dic_japan = {}  # Japan 3. {'time': [cumulative positives, cumulative PCR tests]}
dic_tokyo = {}  # Tokyo 3. {'time': [cumulative positives, cumulative PCR tests]}
for d in data_tokyo:  # Initialize the cumulative positives and tests for each date
    dic_japan[d[0]] = [0, 0]
    dic_tokyo[d[0]] = [0, 0]
for d in data_japan:  # Sum the cumulative figures of all prefectures, keyed by date
    dic_japan[d[0]][0] += d[1]  # Add this prefecture's cumulative positives for date d[0]
    dic_japan[d[0]][1] += d[2]  # Add this prefecture's cumulative PCR tests for date d[0]
for d in data_tokyo:  # Same for Tokyo only, keyed by date
    dic_tokyo[d[0]][0] += d[1]  # Add Tokyo's cumulative positives for date d[0]
    dic_tokyo[d[0]][1] += d[2]  # Add Tokyo's cumulative PCR tests for date d[0]
# Convert the dictionaries to lists to make the subsequent processing easier
list_japan = []  # Japan 4. ['time', cumulative positives, cumulative PCR tests]
list_tokyo = []  # Tokyo 4. ['time', cumulative positives, cumulative PCR tests]
for day in dic_japan:
    list_japan.append([day, dic_japan[day][0], dic_japan[day][1]])
    list_tokyo.append([day, dic_tokyo[day][0], dic_tokyo[day][1]])
dic_day_japan = {}  # Japan 5. {'time': [new positives, new PCR tests]}
dic_day_tokyo = {}  # Tokyo 5. {'time': [new positives, new PCR tests]}
for i, day in enumerate(dic_japan):
    if i > 0:  # The first date has no previous day to subtract
        # New figures = today's cumulative value minus the previous day's
        dic_day_japan[day] = [list_japan[i][1] - list_japan[i-1][1], list_japan[i][2] - list_japan[i-1][2]]
        dic_day_tokyo[day] = [list_tokyo[i][1] - list_tokyo[i-1][1], list_tokyo[i][2] - list_tokyo[i-1][2]]
In the end, this yields dictionary-type variables whose keys are the month and day and whose values are the list [number of new positives, number of new PCR tests] for that date.
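The last stage, converting cumulative totals into daily figures, is simply the difference between consecutive entries. Here is a compact sketch of that step alone, assuming a date-ordered list of [time, cumulative positives, cumulative PCR tests] like the lists marked 4 in the code comments. The function name `daily_increments` and the sample numbers are made up for illustration.

```python
def daily_increments(cumulative):
    """Turn [time, cum_positive, cum_tested] rows into {time: [new_positive, new_tested]}.

    The first row has no predecessor, so it produces no entry.
    """
    daily = {}
    # Pair each row with the row of the previous date
    for prev, cur in zip(cumulative, cumulative[1:]):
        daily[cur[0]] = [cur[1] - prev[1], cur[2] - prev[2]]
    return daily


# Made-up cumulative figures for three consecutive days
sample = [['0311', 10, 100], ['0312', 15, 180], ['0313', 15, 180]]
print(daily_increments(sample))
# {'0312': [5, 80], '0313': [0, 0]}
```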
For the nationwide PCR test counts, several days of data are sometimes reported together at a later date, so graphing the raw data as is produces days such as "0 tests, 30 positives", which is very inconvenient. Applying a moving average mitigates this.
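To make the problem concrete, here is a small sketch of what happens when the positive rate is computed from raw daily figures. The function name `raw_positive_rate` and the numbers are hypothetical and only illustrate the "0 tests" case described above.

```python
def raw_positive_rate(new_positive, new_tested):
    """Daily positive rate in percent, or None when no tests were reported that day."""
    if new_tested == 0:
        return None  # The rate is undefined; dividing would raise ZeroDivisionError
    return round(100 * new_positive / new_tested, 1)


# Made-up figures: the tests of the second day are reported late, on the third day
new_positive = [20, 30, 25]
new_tested = [400, 0, 900]
print([raw_positive_rate(p, t) for p, t in zip(new_positive, new_tested)])
# [5.0, None, 2.8]
```

The second day has no usable rate, and the third day looks artificially low because it also contains the late-reported tests.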
japan_plus = []  # Number of new positives nationwide
japan_all = []   # Number of new tests nationwide
tokyo_plus = []  # Number of new positives in Tokyo
tokyo_all = []   # Number of new tests in Tokyo
for key in dic_day_tokyo:
    tokyo_plus.append(dic_day_tokyo[key][0])
    tokyo_all.append(dic_day_tokyo[key][1])
for key in dic_day_japan:
    japan_plus.append(dic_day_japan[key][0])
    japan_all.append(dic_day_japan[key][1])
move_ave_n = 7       # Width of the moving-average window
k = move_ave_n // 2  # Half-width of the window (3 days on each side of the center)
# The averaged series loses k days at each end, so drop the oldest and newest 3 days here too
japan_plus_ave = japan_plus[k:-k]
tokyo_plus_ave = tokyo_plus[k:-k]  # Same as above
japan_all_ave = []  # Japan: moving average of the number of new PCR tests
tokyo_all_ave = []  # Tokyo: moving average of the number of new PCR tests
for i in range(k, len(tokyo_all)-k):
    japan_all_ave.append(sum(japan_all[i-k : i+k+1]) / move_ave_n)
    tokyo_all_ave.append(sum(tokyo_all[i-k : i+k+1]) / move_ave_n)
japan_ratio_ave = []  # Japan: positive rate based on the moving average of new PCR tests
tokyo_ratio_ave = []  # Tokyo: positive rate based on the moving average of new PCR tests
for i in range(len(japan_all_ave)):
    japan_ratio_ave.append(round(100 * japan_plus_ave[i] / japan_all_ave[i], 1))
    tokyo_ratio_ave.append(round(100 * tokyo_plus_ave[i] / tokyo_all_ave[i], 1))
The following part is especially important:
japan_all_ave = []  # Japan: moving average of the number of new PCR tests
tokyo_all_ave = []  # Tokyo: moving average of the number of new PCR tests
for i in range(k, len(tokyo_all)-k):
    japan_all_ave.append(sum(japan_all[i-k : i+k+1]) / move_ave_n)
    tokyo_all_ave.append(sum(tokyo_all[i-k : i+k+1]) / move_ave_n)
`sum(japan_all[i-k : i+k+1])` computes the sum over day `i` and the three days before and after it. Dividing that sum by `move_ave_n` gives the 7-day moving average.
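The same centered moving average can be written as a small reusable function. This is only a sketch under the assumption that the window width is odd, as with the 7-day window above. The name `centered_moving_average` and the sample counts are made up for illustration.

```python
def centered_moving_average(values, width=7):
    """Centered moving average for an odd width.

    Each output value averages one day together with the width // 2 days
    before and after it, so the first and last width // 2 days are dropped.
    """
    k = width // 2
    return [sum(values[i - k:i + k + 1]) / width
            for i in range(k, len(values) - k)]


# Made-up daily test counts, including days with 0 reported tests
tests = [70, 0, 140, 70, 210, 0, 0, 490, 70]
print(centered_moving_average(tests))
# [70.0, 130.0, 140.0]
```

Nine input days yield three averaged days, because three days are lost at each end of the series.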
Following the steps above, the normalized data is finally graphed. The matplotlib code is based on the articles below, so please refer to them.
- Make a 2-axis graph with Python / matplotlib
- [Matplotlib] Display multiple column charts side by side
This article described the series of steps from reading the CSV file to the preprocessing needed to display a graph. I will cover the details of the graphing part (4.4) in a separate article when I have time.
Thank you for reading to the end.