I graphed Tokyo's public data on new coronavirus (COVID-19) positive patients with Python (plus a link to nationwide data for practice)

Purpose

I used Python to graph the data on patients who tested positive for the new coronavirus (COVID-19), published by the Tokyo Metropolitan Government.

I kept the code to the minimum necessary, so I hope it will be useful to anyone who is about to start doing data analysis with Python.

The code reads the CSV data that the Tokyo Metropolitan Government updates daily directly from the web, so there is no need to download the CSV file each time.

If you copy the Python code below into your own environment (Jupyter Notebook, etc.), you can draw graphs of the latest data every time you run it.

I have also added a link to nationwide CSV data for Japan later in this article. Practicing with it should help the techniques stick.

Python runtime environment

The Python code in this article has been tested using Jupyter Lab on a Windows 10 machine with Anaconda installed.

Data source

CSV data

This article graphs the following CSV data, which is updated daily with the results up to the previous day: Tokyo Metropolitan Government, Details of Announced New Coronavirus Positive Patients (CSV format)

Home page

The page below links to the CSV data: Tokyo Metropolitan Government, Details of Announced New Coronavirus Positive Patients

image.png

Graphing with Python

Now let's draw a graph in Python using csv data.

First, read the data

First, use the Python code below to connect to the Tokyo Metropolitan Government site, fetch the latest data (CSV format), and convert it to a pandas DataFrame.

The point here is that the CSV file is not saved to a local folder but converted directly to a pandas DataFrame (df). Just by running the code below, you avoid having to open a browser and download the latest version of the CSV file, which is updated daily.

import requests
import pandas as pd
import io

#Import csv directly into pandas dataframe
url = 'https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv'
r = requests.get(url).content
df = pd.read_csv(io.StringIO(r.decode('utf-8')))
df

When the data has loaded successfully, the contents of the DataFrame (df) should be displayed.

image.png
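The decode-then-parse step above can also be written as a small helper, which makes it easy to try out without a network connection. This is only a minimal sketch: the function name `csv_bytes_to_frame` and the sample bytes are made up for illustration, and the choice of `'utf-8-sig'` (which reads plain UTF-8 and also accepts a file that begins with a byte-order mark) is my assumption, not something the data source specifies. In practice `raw` would be `requests.get(url).content`.

```python
import io

import pandas as pd


def csv_bytes_to_frame(raw: bytes, encoding: str = "utf-8-sig") -> pd.DataFrame:
    """Decode raw CSV bytes and parse them without writing a file to disk."""
    text = raw.decode(encoding)              # bytes -> str
    return pd.read_csv(io.StringIO(text))    # read_csv accepts a file-like object


# Made-up sample standing in for the downloaded bytes (with a byte-order mark).
sample = (
    "\ufeffNo,Published_date,patient_sex\n"
    "1,2020-01-24,Male\n"
    "2,2020-01-25,Female\n"
).encode("utf-8")

df_sample = csv_bytes_to_frame(sample)
print(df_sample.columns.tolist())
```

Keeping the download and the parsing separate also makes it easier to see which of the two failed when something goes wrong.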

Transition graph of newly infected people

Let's draw a graph from the DataFrame (df) loaded above. First, a bar graph with the date on the horizontal axis and the number of infected people on the vertical axis. Continue by running the following code.

((5/15 postscript)) The rows of the original CSV data are no longer in chronological order, so I added a line near the middle of the code below that sorts the data by Published_date.

#Import matplotlib.pyplot and seaborn for drawing graphs
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#Draw a graph
plt.figure(figsize=(13,7))                   #Define the size of the graph

sns.set(font='Yu Gothic', font_scale = 1.2)  #Specify a Japanese font; otherwise Japanese characters are garbled

df = df.sort_values('Published_date')           #Sort the data by Published_date (5/15 postscript)

sns.countplot(data=df, x='Published_date')      #Count the rows for each date with Seaborn and draw them as bars

plt.title('COVID-19 Changes in the number of newly infected people @ Tokyo')
plt.xticks(rotation=90, fontsize=10)         #Rotate the x-axis date labels 90° so that they do not overlap
plt.ylabel('Number of infected people')      #Label the y-axis

Did you get a graph like the one below?

countplot_H.png

The numbers look as though they have settled down, but I wonder what will happen from here.
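If you want the numbers behind the bars rather than the picture, the same per-day count can be computed with pandas alone. The following is a minimal sketch on made-up rows (the dates are invented for illustration); it also converts the date strings to real dates, which is my addition rather than part of the code above.

```python
import pandas as pd

# Made-up rows, deliberately out of chronological order.
df_demo = pd.DataFrame(
    {"Published_date": ["2020-04-02", "2020-04-01", "2020-04-02", "2020-04-03", "2020-04-02"]}
)

# Parse the strings so that they sort and behave as dates.
df_demo["Published_date"] = pd.to_datetime(df_demo["Published_date"])

# Count rows per date, then order by date: one number per bar.
daily_counts = df_demo["Published_date"].value_counts().sort_index()
print(daily_counts)
```

`value_counts()` orders by frequency by default, so `sort_index()` is what puts the result back in date order.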

Graph of the number of infected people by day of the week

Using the same kind of bar graph, let's now put the day of the week on the horizontal axis.

#Draw a graph of the number of infected people by day of the week
sns.countplot(data=df, x="Day of the week")            #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people')                #Show title on vertical axis

countplot_weekday.png

It was easy to draw, but the days of the week are in a strange order.

To sort the days of the week, rewrite as follows.

#Rearrange the horizontal axis of the graph and draw the graph again
list_weekday = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']     #List the weekdays in display order; each entry must match a value in the column exactly
sns.countplot(data=df, x="Day of the week",order=list_weekday)     #Draw a graph
plt.title('Number of newly infected people by day of the week @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people')                #Show title on vertical axis

countplot_weekday_sort.png

The bars are now sorted by day of the week. The counts look high on Fridays and Saturdays, toward the weekend, and low on Sundays and Mondays.
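The same ordering idea works on the counts themselves, without drawing anything. Below is a minimal sketch on made-up weekday labels (the strings in the real column may differ, so treat both the data and the `order` list as assumptions).

```python
import pandas as pd

# Made-up weekday labels; the real column may use different strings.
weekdays = pd.Series(["Fri", "Mon", "Sat", "Fri", "Sun", "Fri", "Sat"])

# The order we want on the horizontal axis.
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Count each label, then rearrange the result into the desired order.
# Days that never occur get an explicit count of 0.
counts = weekdays.value_counts().reindex(order, fill_value=0)
print(counts)
```

If a label in `order` is misspelled, it silently shows up as 0 here, so it is worth comparing the list against `weekdays.unique()` once.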

Graph of the number of infected men and women

Next, the male-female ratio is ...

#Draw a graph of the number of infected people by gender
sns.countplot(data=df, x="patient_sex")      #Draw a graph
plt.title('Number of new infections by gender @ Tokyo')    #Show graph title
plt.ylabel('Number of infected people')      #Show title on vertical axis

countplot_sex.png

As is reported every day, there are more men. More to the point, the graph reveals that the data contains the values "under investigation" and "unknown" in addition to "male" and "female". Unexpected discoveries like this are common in data analysis, so, just to be sure, let's aggregate the patient_sex column with pivot_table. One line is enough to aggregate the original data.

#Aggregate the patient_sex data
df.pivot_table(index='patient_sex',aggfunc='size').sort_values(ascending=False)

You should get a tabulation like the following (the number of rows for each value).

image.png

In other words, besides "male" and "female", the patient_sex column contains six "unknown" entries and one "under investigation" entry.

Unexpected values turn up all the time when analyzing data, so it is very important to learn not only graph visualization but also data aggregation and preprocessing techniques.
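As one example of such preprocessing, the unexpected labels can be listed and set aside before plotting. This is a minimal sketch on made-up values; which labels count as "expected" is an assumption you would decide for your own analysis.

```python
import pandas as pd

# Made-up values, including two labels that were not expected.
sex = pd.Series(["Male", "Female", "Male", "Unknown", "Under investigation", "Male"])

# Labels we intend to keep (an assumption for this illustration).
expected = {"Male", "Female"}

# Count every label, including missing values, to see what is really there.
print(sex.value_counts(dropna=False))

# Boolean mask: True where the label is one of the expected ones.
is_expected = sex.isin(expected)

unexpected_labels = sorted(sex[~is_expected].unique())
sex_clean = sex[is_expected]
print(unexpected_labels)
```

Whether to drop such rows or keep them as their own category depends on the question being asked, so it is safer to look at `unexpected_labels` before discarding anything.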

Graph of the number of infected people by age

Next, by age group ...

list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown']   #Age groups in display order; each entry must match a value in the column exactly
sns.countplot(data=df, x="patient_Age", order=list_age)
plt.xticks(rotation=90)
plt.ylabel('Number of infected people')

countplot_age.png

Seen this way, quite apart from the share of elderly people aged 60 and over, the number of infected people in their 20s and 30s looks large relative to the population. (It might be more accurate to say that the share of people in their 40s and 50s is small relative to the population.)

Graph of population by age group in Tokyo (reference)

For reference, the graph [^1] of Tokyo's population by age group (as of January 1, 2020, Reiwa 2) is shown below.

[^1]: From "Tokyo's households and population (by town and age) based on the Basic Resident Register"

barplot_people.png

| Age | Total population | Male population | Female population |
|---|---|---|---|
| Under 10 years old | 1,048,921 | 536,920 | 512,001 |
| 10s | 1,029,680 | 526,065 | 503,615 |
| 20s | 1,557,966 | 779,053 | 778,913 |
| 30s | 1,842,086 | 939,710 | 902,376 |
| 40s | 2,177,935 | 1,108,561 | 1,069,374 |
| 50s | 1,832,946 | 946,158 | 886,788 |
| 60s | 1,373,395 | 688,654 | 684,741 |
| 70s | 1,414,012 | 645,774 | 768,238 |
| 80s | 794,805 | 304,309 | 490,496 |
| 90s and over | 185,849 | 47,609 | 138,240 |
| Unknown | 1 | 0 | 1 |

Number of infected people per population by age group (total for men and women)

The graph below compares the number of infected people per 100,000 population, obtained by dividing the number of infected people in each age group by that group's population.

countplot_age_ratio_all.png

I was a little surprised. People in their 90s and above stand out by a wide margin, followed by those in their 20s and 30s, and then those in their 40s to 80s.
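The per-100,000 calculation can be sketched in a few lines of pandas. The population figures are the totals from the table above; the case counts are hypothetical numbers chosen only to make the example run, not the real figures. I also assume that the "90s" and "100 years and over" groups in the patient data have already been added together so that they match the "90s and over" row of the population table.

```python
import pandas as pd

# Population by age group, taken from the table above.
population = pd.Series({
    "Under 10": 1_048_921, "10s": 1_029_680, "20s": 1_557_966, "30s": 1_842_086,
    "40s": 2_177_935, "50s": 1_832_946, "60s": 1_373_395, "70s": 1_414_012,
    "80s": 794_805, "90s and over": 185_849,
})

# Hypothetical case counts, for illustration only (not the real figures).
cases = pd.Series({
    "Under 10": 50, "10s": 60, "20s": 800, "30s": 900, "40s": 800,
    "50s": 750, "60s": 500, "70s": 500, "80s": 350, "90s and over": 150,
})

# Cases per 100,000 residents; the two Series are matched on their index labels.
rate_per_100k = (cases / population * 100_000).round(1)
print(rate_per_100k.sort_values(ascending=False))
```

Because pandas aligns the two Series by label, a group that is missing from either side shows up as NaN instead of being divided by the wrong population.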

Number of infected people per population by age group (by gender)

And here is the same comparison split into men and women.

Men: countplot_age_ratio_male.png
Women: countplot_age_ratio_female.png

This result was also surprising. I had expected to see many infected people in their 20s, but it is among women that the 20s and 30s tend to show more infections. I don't know the cause, but it is a slightly worrying result.
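The split by sex can be sketched in the same way with a cross-tabulation. The population figures for the 20s and 30s come from the table above; the patient rows are made up for illustration, and the labels "Male" and "Female" are assumptions about how the values are spelled.

```python
import pandas as pd

# Made-up patient rows, for illustration only.
patients = pd.DataFrame({
    "patient_Age": ["20s", "20s", "20s", "30s", "30s", "20s"],
    "patient_sex": ["Female", "Female", "Male", "Male", "Female", "Female"],
})

# Population by age group and sex, taken from the table above.
population = pd.DataFrame(
    {"Male": [779_053, 939_710], "Female": [778_913, 902_376]},
    index=["20s", "30s"],
)

# Cross-tabulate: rows are age groups, columns are sexes, cells are case counts.
cases = pd.crosstab(patients["patient_Age"], patients["patient_sex"])

# Element-wise division, matched on both the row and the column labels.
rate_per_100k = cases / population * 100_000
print(rate_per_100k.round(2))
```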

Heatmap by age and date

Next, a heat map by age group and date.

#Create a pivot table with Published_date as rows and patient_Age as columns
df_pivot = df[['Published_date','patient_Age']].pivot_table(index='Published_date',columns='patient_Age',aggfunc='size')

#List the patient_Age categories in display order (used to order the heat map columns)
list_age = ['Under 10 years old','10s','20s','30s','40s','50s','60s','70s','80s','90s','100 years and over','unknown']

plt.figure(figsize=(6,16))                         #Define the size of the graph
plt.yticks(fontsize = 10)                          #Define y-axis font size

sns.heatmap(df_pivot[list_age], annot = True, annot_kws={"size": 10}, linewidths = .1)    #Draw heatmap

heatmap.png

It looks the part, but it leaves a bit of a "so what?" feeling. (-_-;) There seems to be more information to extract, so I will continue the analysis little by little.
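One thing to watch with the table behind the heat map: selecting columns with a list, as in `df_pivot[list_age]`, raises a KeyError if any listed age group never appears in the data. The following minimal sketch, on made-up rows, builds the same kind of date-by-age count table with `groupby` and uses `reindex` so that missing groups become zero-filled columns instead. It is an alternative I am suggesting, not the method used above.

```python
import pandas as pd

# Made-up rows; note that nobody is in the "10s" group.
df_demo = pd.DataFrame({
    "Published_date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02", "2020-04-02"],
    "patient_Age": ["20s", "30s", "20s", "20s", "Under 10"],
})

age_order = ["Under 10", "10s", "20s", "30s"]

# Count rows for each (date, age group) pair and spread age groups into columns.
pivot = df_demo.groupby(["Published_date", "patient_Age"]).size().unstack(fill_value=0)

# Put the columns in the desired order; groups that never occurred become all-zero columns.
pivot = pivot.reindex(columns=age_order, fill_value=0)
print(pivot)
```

The resulting table can be passed to `sns.heatmap` in the same way as `df_pivot[list_age]`.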

About data trimming

Up to this point I have not checked the contents of the raw data (CSV) at all. Since the code at the beginning already converted the CSV data to a DataFrame (df), let's display its contents again with the following command.

df

image.png

There are 4,883 rows of data (as of May 12, 2020), but there appear to be many NaN values, which indicate blanks. To be on the safe side, let's look at the unique values contained in each column. Try running the code below.

#For each column of the DataFrame (df) holding the CSV data, print the column name and its unique values
for i in df.columns:                                       #Repeat for each column
    print('Column name: ' + i)                                    #Print the name of the column
    print('Number of unique values: ' + str(len(df[i].unique())))    #Count the unique values in the column
    print('Unique value: ' + str(df[i].unique()))             #Print the unique values in the column
    print('///////////////////////////////////////////')   #Separator

The result is long, so it is shown separately below.

Execution result

Column name: No
Number of unique values: 4987
Unique value: [1 2 3 ... 10109 10110 10111]
///////////////////////////////////////////
Column name: National local government code
Number of unique values: 1
Unique value: [130001]
///////////////////////////////////////////
Column name: Prefecture name
Number of unique values: 1
Unique value: ['Tokyo']
///////////////////////////////////////////
Column name: City name
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Published_date
Number of unique values: 84
Unique value: ['2020-01-24' '2020-01-25' '2020-01-30' '2020-02-13' '2020-02-14' '2020-02-15'
 '2020-02-16' '2020-02-18' '2020-02-19' '2020-02-21' '2020-02-22' '2020-02-24'
 '2020-02-26' '2020-02-27' '2020-02-29' '2020-03-01' '2020-03-03' '2020-03-04'
 '2020-03-05' '2020-03-06' '2020-03-07' '2020-03-10' '2020-03-11' '2020-03-12'
 '2020-03-13' '2020-03-14' '2020-03-15' '2020-03-17' '2020-03-18' '2020-03-19'
 '2020-03-20' '2020-03-21' '2020-03-22' '2020-03-23' '2020-03-24' '2020-03-25'
 '2020-03-26' '2020-03-27' '2020-03-28' '2020-03-29' '2020-03-30' '2020-03-31'
 '2020-04-01' '2020-04-02' '2020-04-03' '2020-04-04' '2020-04-05' '2020-04-06'
 '2020-04-07' '2020-04-08' '2020-04-09' '2020-04-10' '2020-04-11' '2020-04-12'
 '2020-04-13' '2020-04-14' '2020-04-15' '2020-04-16' '2020-04-17' '2020-04-18'
 '2020-04-19' '2020-04-20' '2020-04-21' '2020-04-22' '2020-04-23' '2020-04-24'
 '2020-04-25' '2020-04-26' '2020-04-27' '2020-04-28' '2020-04-29' '2020-04-30'
 '2020-05-01' '2020-05-02' '2020-05-03' '2020-05-04' '2020-05-05' '2020-05-06'
 '2020-05-07' '2020-05-08' '2020-05-09' '2020-05-10' '2020-05-11' '2020-05-12']
///////////////////////////////////////////
Column name: Day of the week
Number of unique values: 7
Unique value: ['Fri' 'Sat' 'Thu' 'Sun' 'Tue' 'Wed' 'Mon']
///////////////////////////////////////////
Column name: Onset_date
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: patient_residence
Number of unique values: 7
Unique value: ['Wuhan City, Hubei Province' 'Changsha City, Hunan Province' 'Tokyo' 'Outside Tokyo' nan 'Under investigation' '―']
///////////////////////////////////////////
Column name: patient_Age
Number of unique values: 13
Unique value: ['40s' '30s' '70s' '50s' '80s' '60s' '20s' 'Under 10 years old' '90s' '10s' '100 years and over' 'unknown' '-']
///////////////////////////////////////////
Column name: patient_sex
Number of unique values: 4
Unique value: ['Male' 'Female' 'Under investigation' 'Unknown']
///////////////////////////////////////////
Column name: Patient_attribute
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Patient_Status
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Patient_Symptoms
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Patient_Travel history flag
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Remarks
Number of unique values: 1
Unique value: [nan]
///////////////////////////////////////////
Column name: Discharged flag
Number of unique values: 2
Unique value: [1. nan]
///////////////////////////////////////////
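When the printed output gets this long, a compact per-column summary can be easier to scan. The sketch below is a possible alternative on a small made-up frame (the column values are invented for illustration); it reports, for each column, how many distinct values it has and how many cells are missing.

```python
import numpy as np
import pandas as pd

# Made-up frame with one empty column and one constant column.
df_demo = pd.DataFrame({
    "No": [1, 2, 3],
    "Prefecture name": ["Tokyo", "Tokyo", "Tokyo"],
    "City name": [np.nan, np.nan, np.nan],
    "patient_sex": ["Male", "Female", "Male"],
})

# One row per column: distinct values (NaN counted as a value) and missing cells.
summary = pd.DataFrame({
    "n_unique": df_demo.nunique(dropna=False),
    "n_missing": df_demo.isna().sum(),
})
print(summary)
```

A column whose `n_missing` equals the number of rows is entirely blank, and a column with `n_unique` of 1 holds the same value everywhere.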

At least the following columns appear to be entirely blank (nan): City name, Onset_date, Patient_attribute, Patient_Status, Patient_Symptoms, Patient_Travel history flag, and Remarks.

In addition, "National local government code" and "Prefecture name" hold the same value in every row, so they carry no information for the analysis. It is better to remove such unnecessary data in advance. Create a new DataFrame (df_extract) that contains only the necessary columns by executing the following code.

#Trim unnecessary columns (extract only necessary columns)
df_extract = df[['No','Published_date','Day of the week','patient_residence','patient_Age','patient_sex','Discharged flag']]
df_extract = df_extract.set_index('No')     #Set the "No" column to index.
df_extract

image.png

That is much tidier. I think trimming, that is, correctly judging which data is unnecessary and excluding it before analysis, is also a very important skill.
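The columns to trim were picked by eye above, but candidates can also be found programmatically. This is a minimal sketch on a made-up frame, offered as a starting point rather than a replacement for looking at the data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one empty column and one constant column.
df_demo = pd.DataFrame({
    "No": [1, 2, 3],
    "Prefecture name": ["Tokyo", "Tokyo", "Tokyo"],
    "City name": [np.nan, np.nan, np.nan],
    "patient_sex": ["Male", "Female", "Male"],
})

# Columns in which every cell is missing.
all_missing = df_demo.columns[df_demo.isna().all()].tolist()

# Columns with exactly one distinct non-missing value.
constant = df_demo.columns[df_demo.nunique() == 1].tolist()

df_trimmed = df_demo.drop(columns=all_missing + constant)
print(df_trimmed.columns.tolist())
```

Note that a column such as "Discharged flag", which holds only 1 and blanks, would also be flagged as constant by this rule even though the blanks are meaningful, so review the two lists before dropping anything.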

Practice data

This article used data for Tokyo, but Jag Japan Co., Ltd. publishes a nationwide version as CSV data.

https://dl.dropboxusercontent.com/s/6mztoeb6xf78g5w/COVID-19.csv

The dataset is large, which I think makes it just right for practicing data analysis with Python. The procedure is almost the same, so if you are interested, why not try it yourself?

Bonus

In fact, you can do almost the same thing with Excel (and relatively easily)

Below is a pivot chart drawn in Excel from the same data. In fact, Excel can easily do almost everything introduced in this article, including the heat map. I love Python and am very attached to it, but when I think about what data analysis is for and whom it is for, I find myself believing every day that the basic approach should be to do in Excel what can be done in Excel, rather than in Python.

Transition graph of newly infected people drawn in Excel

image.png

Heat map (of sorts) drawn in Excel

Thank you for reading through to the end.

I will continue to update it to improve my skills.
