Every time I saw reports on the number of new coronavirus infections, I found myself wanting more detail on the breakdown by age group, and then I learned that the Tokyo Metropolitan Government publishes data on positive patients.
This article shows how to analyze and visualize the data published by Tokyo using Python with pandas, seaborn, and Matplotlib.
The point of this article is not prediction or recommendation ("this is what will happen" or "these measures should be taken"), but rather that the data is this easy to visualize, so everyone should give it a try. Working through it yourself will deepen your understanding.
Note that if you care about details such as graph layout and axis formatting, Matplotlib requires some tedious work, so I won't go into that here (only a brief touch at the end). The goal is not to produce polished graphs for publication, but to visualize the data and see trends for yourself.
Sample code is also available on GitHub. The Jupyter Notebook (.ipynb) is easier to read, so please refer to it as well.
The positive patient data for Tokyo is published below.
-Tokyo Metropolitan New Coronavirus Positive Patient Announcement Details --Dataset --Tokyo Open Data Catalog Site
-CSV file: https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv
You can reach it from the Get open data link on the Tokyo Metropolitan Government's new coronavirus infection control site.
Looking at its history, the file appears to be updated between 10:00 and 15:00 on weekdays.
A list of sites forked from the Tokyo countermeasures site is given below. Like Tokyo's, some of these sites link to their open data. Hokkaido and Kanagawa Prefecture are examples:
-Data on new coronavirus infections [Hokkaido] --Hokkaido Open Data Portal Site
-Countermeasures against new coronavirus infections: number of positive patients and attribute data of positive patients --Kanagawa Prefecture homepage
Even if there is no link on the site, the data itself should be published somewhere, so you may find it by searching.
The following sample code uses data from Tokyo. Data of other prefectures may have different items, but the basic treatment is the same.
In addition, the Ministry of Health, Labour and Welfare publishes nationwide aggregate data related to the new coronavirus, such as the number of PCR tests performed, positive cases, hospitalized patients, and deaths.
-Open Data | Ministry of Health, Labour and Welfare
The versions of each library and Python itself in the sample code below are as follows. Note that different versions may behave differently.
import math
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
print(pd.__version__)
# 1.0.5
print(mpl.__version__)
# 3.3.0
print(sns.__version__)
# 0.10.1
print(sys.version)
# 3.8.5 (default, Jul 21 2020, 10:48:26)
# [Clang 11.0.3 (clang-1103.0.32.62)]
Specify the path to the downloaded CSV file in pd.read_csv() and read it as a DataFrame. Data up to July 31, 2020 is used as an example.
df = pd.read_csv('data/130001_tokyo_covid19_patients_20200731.csv')
You can also pass the URL directly to pd.read_csv(), but since the file will be read many times during trial and error, it is better to download it locally first.
# df = pd.read_csv('https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv')
The number of rows / columns and the data at the beginning and end are as follows.
print(df.shape)
# (12691, 16)
print(df.head())
#   No National Local Public Organization Code Prefecture name City name Published_Date Day of the week Onset_Date patient_Place of residence patient_Age patient_Gender  \
# 0  1  130001  Tokyo  NaN  2020-01-24  Fri  NaN  Wuhan City, Hubei Province     40s    Male
# 1  2  130001  Tokyo  NaN  2020-01-25  Sat  NaN  Wuhan City, Hubei Province     30s  Female
# 2  3  130001  Tokyo  NaN  2020-01-30  Thu  NaN  Changsha City, Hunan Province  30s  Female
# 3  4  130001  Tokyo  NaN  2020-02-13  Thu  NaN  In Tokyo                       70s    Male
# 4  5  130001  Tokyo  NaN  2020-02-14  Fri  NaN  In Tokyo                       50s  Female
#
#   patient_Attribute patient_State patient_Symptom patient_Travel history flag Remark Discharged flag
# 0               NaN           NaN             NaN                         NaN    NaN             1.0
# 1               NaN           NaN             NaN                         NaN    NaN             1.0
# 2               NaN           NaN             NaN                         NaN    NaN             1.0
# 3               NaN           NaN             NaN                         NaN    NaN             1.0
# 4               NaN           NaN             NaN                         NaN    NaN             1.0
print(df.tail())
#          No National Local Public Organization Code Prefecture name City name Published_Date Day of the week Onset_Date patient_Place of residence patient_Age  \
# 12686  12532  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12687  12558  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12688  12563  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12689  12144  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  80s
# 12690  12517  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  80s
#
#       patient_Gender patient_Attribute patient_State patient_Symptom patient_Travel history flag Remark Discharged flag
# 12686           Male               NaN           NaN             NaN                         NaN    NaN             NaN
# 12687           Male               NaN           NaN             NaN                         NaN    NaN             NaN
# 12688           Male               NaN           NaN             NaN                         NaN    NaN             NaN
# 12689         Female               NaN           NaN             NaN                         NaN    NaN             NaN
# 12690           Male               NaN           NaN             NaN                         NaN    NaN             NaN
Since this data is mainly categorical, methods such as count(), nunique(), unique(), and value_counts() make it easy to get an overview.
count() returns the number of elements that are not the missing value NaN. You can see that detailed information such as municipality names, symptoms, and attributes is not disclosed (there is no data), presumably for privacy reasons.
print(df.count())
# No 12691
#National Local Public Organization Code 12691
#Prefecture name 12691
#City name 0
#Published_Date 12691
#Day of the week 12691
#Onset_Date 0
#patient_Place of residence 12228
#patient_Age 12691
#patient_Gender 12691
#patient_Attribute 0
#patient_State 0
#patient_Symptom 0
#patient_Travel history flag 0
#Remark 0
#Discharged flag 7186
# dtype: int64
nunique() returns the number of unique values. Since this is Tokyo's data, the national local government code and the prefecture name each have only one value.
print(df.nunique())
# No 12691
#National Local Public Organization Code 1
#Prefecture name 1
#City name 0
#Published_Date 164
#Day of the week 7
#Onset_Date 0
#patient_Place of residence 8
#patient_Age 13
#patient_Gender 5
#patient_Attribute 0
#patient_State 0
#patient_Symptom 0
#patient_Travel history flag 0
#Remark 0
#Discharged flag 1
# dtype: int64
For each column (= Series), you can check the unique values and their counts (occurrence frequencies) with unique() and value_counts().
print(df['patient_residence'].unique())
# ['Wuhan City, Hubei Province' 'Changsha City, Hunan Province' 'In Tokyo' 'Outside Tokyo' '―' 'Under investigation' '-' "'-" nan]
print(df['patient_residence'].value_counts(dropna=False))
# In Tokyo                         11271
# Outside Tokyo                      531
# NaN                                463
# ―                                  336
# Under investigation                 85
# Wuhan City, Hubei Province           2
# Changsha City, Hunan Province        1
# '-                                   1
# -                                    1
# Name: patient_residence, dtype: int64
print(df['patient_sex'].unique())
# ['Male' 'Female' "'-" '―' 'Unknown']
print(df['patient_sex'].value_counts())
# Male       7550
# Female     5132
# '-            7
# Unknown       1
# ―             1
# Name: patient_sex, dtype: int64
Here the analysis is narrowed down to the publication date, the patient's age, and the discharged flag. For convenience, the columns are renamed with rename().
df = df[['Published_Date', 'patient_Age', 'Discharged flag']].copy()
df.rename(columns={'Published_Date': 'date_str', 'patient_Age': 'age_org', 'Discharged flag': 'discharged'},
          inplace=True)
print(df)
# date_str age_org discharged
# 0 2020-01-24 40s 1.0
# 1 2020-01-25 30s 1.0
# 2 2020-01-30 30s 1.0
# 3 2020-02-13 70s 1.0
# 4 2020-02-14 50s 1.0
# ... ... ... ...
# 12686 2020-07-31 70s NaN
# 12687 2020-07-31 70s NaN
# 12688 2020-07-31 70s NaN
# 12689 2020-07-31 80s NaN
# 12690 2020-07-31 80s NaN
#
# [12691 rows x 3 columns]
copy() is used here to avoid SettingWithCopyWarning. In this case the original data is not reused afterwards, so the warning could simply be ignored, but copy() keeps things clean (a minimal sketch of the warning scenario follows the link below).
-How to deal with Pandas SettingWithCopyWarning
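As a minimal sketch of the warning scenario (my own illustration, not from the original; sub and flag are hypothetical names), selecting columns without copy() and then assigning to the result is the typical trigger:
sub = df[['date_str', 'age_org']]  # slice taken without copy()
sub['flag'] = 0  # may emit SettingWithCopyWarning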
Looking at the age column, it contains values such as 'Unknown' and "'-".
print(df['age_org'].unique())
# ['40s' '30s' '70s' '50s' '60s' '80s' '20s' 'Under 10 years old' '90s' '10s'
#  '100 years and over' 'Unknown' "'-"]
print(df['age_org'].value_counts())
#20s 4166
#30s 2714
#40s 1741
#50s 1362
#60s 832
#70s 713
#80s 455
#10s 281
#90s 214
#Under 10 years old 200
#Unknown 6
#100 years and over 5
# '- 2
# Name: age_org, dtype: int64
Since there are only a few such rows, they are excluded here.
df = df[~df['age_org'].isin(['Unknown', "'-"])]
print(df)
# date_str age_org discharged
# 0 2020-01-24 40s 1.0
# 1 2020-01-25 30s 1.0
# 2 2020-01-30 30s 1.0
# 3 2020-02-13 70s 1.0
# 4 2020-02-14 50s 1.0
# ... ... ... ...
# 12686 2020-07-31 70s NaN
# 12687 2020-07-31 70s NaN
# 12688 2020-07-31 70s NaN
# 12689 2020-07-31 80s NaN
# 12690 2020-07-31 80s NaN
#
# [12683 rows x 3 columns]
print(df['age_org'].unique())
# ['40s' '30s' '70s' '50s' '60s' '80s' '20s' 'Under 10 years old' '90s' '10s' '100 years and over']
The age bins are fine-grained, so combine them into coarser groups. The entire right-hand side is wrapped in parentheses () so the method chain can be broken across lines.
-Write a method chain with line breaks in Python
df['age'] = (
    df['age_org'].replace(['Under 10 years old', '10s'], '0-19')
    .replace(['20s', '30s'], '20-39')
    .replace(['40s', '50s'], '40-59')
    .replace(['60s', '70s', '80s', '90s', '100 years and over'], '60-')
)
print(df['age'].unique())
# ['40-59' '20-39' '60-' '0-19']
print(df['age'].value_counts())
# 20-39 6880
# 40-59 3103
# 60- 2219
# 0-19 481
# Name: age, dtype: int64
The date (publication date) column date_str is a string. For later processing, add a column date converted to the datetime64[ns] type.
df['date'] = pd.to_datetime(df['date_str'])
print(df.dtypes)
# date_str object
# age_org object
# discharged float64
# age object
# date datetime64[ns]
# dtype: object
This is the end of preprocessing. From here, an example of actually analyzing and visualizing data is shown.
Here, we look at changes in the number of new positive patients by age group. The total number of new positive patients will be described at the end as an example of processing with Matplotlib.
Use pd.crosstab() to cross-tabulate the date (publication date) and the age group.
df_ct = pd.crosstab(df['date'], df['age'])
print(df_ct)
# age 0-19 20-39 40-59 60-
# date
# 2020-01-24 0 0 1 0
# 2020-01-25 0 1 0 0
# 2020-01-30 0 1 0 0
# 2020-02-13 0 0 0 1
# 2020-02-14 0 0 1 1
# ... ... ... ... ...
# 2020-07-27 5 79 34 13
# 2020-07-28 13 168 65 20
# 2020-07-29 9 160 56 25
# 2020-07-30 11 236 83 37
# 2020-07-31 10 332 82 39
#
# [164 rows x 4 columns]
print(type(df_ct.index))
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
The column converted to datetime64[ns] becomes the new index and is treated as a DatetimeIndex. Note that if you instead pass the string-type dates, the index will not be a DatetimeIndex even though the printed output looks the same.
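For comparison, a quick check (my addition, not in the original) suggests that cross-tabulating the string column date_str instead yields a plain Index rather than a DatetimeIndex:
print(type(pd.crosstab(df['date_str'], df['age']).index))
# expected: a plain (object) Index, not DatetimeIndex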
Aggregate weekly with resample(). resample() can only be used with a DatetimeIndex.
df_ct_week = df_ct.resample('W', label='left').sum()
print(df_ct_week)
# age 0-19 20-39 40-59 60-
# date
# 2020-01-19 0 1 1 0
# 2020-01-26 0 1 0 0
# 2020-02-02 0 0 0 0
# 2020-02-09 0 2 5 9
# 2020-02-16 0 1 3 6
# 2020-02-23 0 2 3 5
# 2020-03-01 2 5 9 9
# 2020-03-08 0 5 10 11
# 2020-03-15 0 10 27 12
# 2020-03-22 7 100 88 102
# 2020-03-29 16 244 198 148
# 2020-04-05 21 421 369 271
# 2020-04-12 30 350 375 280
# 2020-04-19 32 286 267 264
# 2020-04-26 29 216 165 260
# 2020-05-03 7 105 69 120
# 2020-05-10 2 46 16 46
# 2020-05-17 3 22 10 15
# 2020-05-24 4 43 16 21
# 2020-05-31 2 89 34 22
# 2020-06-07 5 113 17 26
# 2020-06-14 6 177 29 28
# 2020-06-21 10 236 65 23
# 2020-06-28 34 460 107 51
# 2020-07-05 79 824 191 68
# 2020-07-12 66 1006 295 117
# 2020-07-19 78 1140 414 171
# 2020-07-26 48 975 320 134
Visualize with plot(). A stacked bar chart is easy to create.
df_ct_week[:-1].plot.bar(stacked=True)
The last row (the most recent week) is excluded with [:-1] because that week does not include Saturday (August 1, 2020) and so is not comparable with the other weeks.
In Jupyter Notebook the chart appears in the output cell. To save it as an image file, use plt.savefig(). You can also save the Jupyter Notebook output image by right-clicking on it.
plt.figure()
df_ct_week[:-1].plot.bar(stacked=True)
plt.savefig('image/bar_chart.png', bbox_inches='tight')
plt.close('all')
I'm not sure under what conditions it occurs, but the X-axis labels were cut off when saving; setting bbox_inches='tight' as described below solved it.
If you create a bar chart as in the example above, the time of day appears in the X-axis labels. The simplest workaround is to convert the index to strings in whatever format you like.
df_ct_week_str = df_ct_week.copy()
df_ct_week_str.index = df_ct_week_str.index.strftime('%Y-%m-%d')
df_ct_week_str[:-1].plot.bar(stacked=True, figsize=(8, 4))
Normalize the data to see how the age composition changes over time. T transposes the DataFrame (swaps rows and columns); dividing the transposed data by the row totals and then transposing back gives the normalized values.
Since June, young people (20s-30s) have accounted for the majority, but recently the share of middle-aged and older people (40s and over) has been increasing.
df_ct_week_str_norm = (df_ct_week_str.T / df_ct_week_str.sum(axis=1)).T
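The original snippet stops at the calculation; as a sketch (assuming the same plotting style as the earlier charts), the normalized ratios can be drawn in the same way:
df_ct_week_str_norm[:-1].plot.bar(stacked=True, figsize=(8, 4))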
The changes for young people (20s-30s) and older people (60 and over) are shown below. The absolute number of older people has also risen back to the level of late March.
df_ct_week_str[:-1][['20-39', '60-']].plot.bar(figsize=(8, 4))
To gauge the momentum of the spread of infection, calculate the week-over-week ratio. shift() shifts the data by one row so that each week can be divided by the previous week.
df_week_ratio = df_ct_week / df_ct_week.shift()
print(df_week_ratio)
# age 0-19 20-39 40-59 60-
# date
# 2020-01-19 NaN NaN NaN NaN
# 2020-01-26 NaN 1.000000 0.000000 NaN
# 2020-02-02 NaN 0.000000 NaN NaN
# 2020-02-09 NaN inf inf inf
# 2020-02-16 NaN 0.500000 0.600000 0.666667
# 2020-02-23 NaN 2.000000 1.000000 0.833333
# 2020-03-01 inf 2.500000 3.000000 1.800000
# 2020-03-08 0.000000 1.000000 1.111111 1.222222
# 2020-03-15 NaN 2.000000 2.700000 1.090909
# 2020-03-22 inf 10.000000 3.259259 8.500000
# 2020-03-29 2.285714 2.440000 2.250000 1.450980
# 2020-04-05 1.312500 1.725410 1.863636 1.831081
# 2020-04-12 1.428571 0.831354 1.016260 1.033210
# 2020-04-19 1.066667 0.817143 0.712000 0.942857
# 2020-04-26 0.906250 0.755245 0.617978 0.984848
# 2020-05-03 0.241379 0.486111 0.418182 0.461538
# 2020-05-10 0.285714 0.438095 0.231884 0.383333
# 2020-05-17 1.500000 0.478261 0.625000 0.326087
# 2020-05-24 1.333333 1.954545 1.600000 1.400000
# 2020-05-31 0.500000 2.069767 2.125000 1.047619
# 2020-06-07 2.500000 1.269663 0.500000 1.181818
# 2020-06-14 1.200000 1.566372 1.705882 1.076923
# 2020-06-21 1.666667 1.333333 2.241379 0.821429
# 2020-06-28 3.400000 1.949153 1.646154 2.217391
# 2020-07-05 2.323529 1.791304 1.785047 1.333333
# 2020-07-12 0.835443 1.220874 1.544503 1.720588
# 2020-07-19 1.181818 1.133201 1.403390 1.461538
# 2020-07-26 0.615385 0.855263 0.772947 0.783626
df_week_ratio['2020-05-03':'2020-07-25'].plot(grid=True)
In July, the week-on-week rate has been declining in each age group.
Also, unlike bar charts, line charts created with plot() (or plot.line()) format the date labels on the X axis nicely, as in the example above. Note that this may not happen depending on the date data, as described later.
As another way to look at how new positive cases shift across age groups, create a heat map.
Here the detailed age bins are used as they are. Cross-tabulate with pd.crosstab() as in the stacked bar chart example; since resample() is not used this time, the string-type date column date_str is specified. The result is transposed with T so that the dates run along the horizontal axis, and the row order is reversed with [::-1] so that the youngest age group ends up at the bottom of the vertical axis.
df['age_detail'] = df['age_org'].replace(
    {'Under 10 years old': '0-9', '10s': '10-19', '20s': '20-29', '30s': '30-39', '40s': '40-49',
     '50s': '50-59', '60s': '60-69', '70s': '70-79', '80s': '80-89', '90s': '90-', '100 years and over': '90-'}
)
df_ct_hm = pd.crosstab(df['date_str'], df['age_detail']).T[::-1]
The seaborn function heatmap() is handy for creating heat maps.
plt.figure(figsize=(15, 5))
sns.heatmap(df_ct_hm, cmap='hot')
It can be confirmed that the infection has gradually spread to the elderly since June.
For a log-scale heat map, see below. I got a warning, but it worked well enough. Note that a value of 0 in the data causes an error, so as a rough workaround 0 is replaced with 0.1.
df_ct_hm_re = df_ct_hm.replace({0: 0.1})
min_value = df_ct_hm_re.values.min()
max_value = df_ct_hm_re.values.max()
log_norm = mpl.colors.LogNorm(vmin=min_value, vmax=max_value)
cbar_ticks = [math.pow(10, i) for i in range(math.floor(math.log10(min_value)),
1 + math.ceil(math.log10(max_value)))]
plt.figure(figsize=(15, 5))
sns.heatmap(df_ct_hm_re, norm=log_norm, cbar_kws={"ticks": cbar_ticks})
By the way, I got the idea of visualizing this as a heat map from the Florida example in the following article.
-There is no evidence that the new coronavirus has attenuated (Kutsuna) --Individual --Yahoo! News
@Zorinaq, who created the Florida graph, has released the code to create various graphs such as future forecasts in addition to the heat map. It seems difficult without some knowledge of Python, but if you are interested, you may want to take a look.
As the count() result above shows, there are 7,186 rows in the public data with the discharged flag set to 1, whereas the [Tokyo Metropolitan Government's new coronavirus infection control site](https://stopcovid19.metro.tokyo.lg.jp/) reports 9,615 for "Discharged, etc. (including those whose recuperation period has ended)" (as of the July 31, 2020, 20:30 update).
I don't know whether the open data is simply delayed or there is some other reason, but keep in mind that the discharged flag in the public data may not reflect the current status.
As with the changes in positives by age group, look at the change in the discharged flag as a stacked bar chart. As preprocessing, the missing value NaN is replaced with 0.
print(df['discharged'].unique())
# [ 1. nan]
df['discharged'] = df['discharged'].fillna(0).astype('int')
print(df['discharged'].unique())
# [1 0]
pd.crosstab(df['date'], df['discharged']).resample('W', label='left').sum()[:-1].plot.bar(stacked=True)
If the time of day shown in the X-axis labels bothers you, you can convert the index dates to strings as in the age-group example; here it is left as is (a sketch of the conversion follows). The same applies to the examples below.
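For reference, a sketch of that conversion (not in the original; df_dcw is a hypothetical name):
df_dcw = pd.crosstab(df['date'], df['discharged']).resample('W', label='left').sum()
df_dcw.index = df_dcw.index.strftime('%Y-%m-%d')
df_dcw[:-1].plot.bar(stacked=True)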
This chart shows the breakdown of the discharged flag by publication date. Naturally, most people who tested positive long ago (= older publication dates) have been discharged (= the discharged flag is 1).
Check it by age group. If you pass a list of multiple columns to pd.crosstab(), the result has hierarchical (MultiIndex) columns.
df_dc = pd.crosstab(df['date'], [df['age'], df['discharged']]).resample('W', label='left').sum()
print(df_dc)
# age 0-19 20-39 40-59 60-
# discharged 0 1 0 1 0 1 0 1
# date
# 2020-01-19 0 0 0 1 0 1 0 0
# 2020-01-26 0 0 0 1 0 0 0 0
# 2020-02-02 0 0 0 0 0 0 0 0
# 2020-02-09 0 0 0 2 0 5 0 9
# 2020-02-16 0 0 0 1 0 3 0 6
# 2020-02-23 0 0 0 2 0 3 0 5
# 2020-03-01 0 2 0 5 0 9 0 9
# 2020-03-08 0 0 0 5 0 10 1 10
# 2020-03-15 0 0 0 10 0 27 0 12
# 2020-03-22 0 7 0 100 0 88 2 100
# 2020-03-29 0 16 1 243 4 194 9 139
# 2020-04-05 0 21 5 416 1 368 11 260
# 2020-04-12 1 29 0 350 6 369 10 270
# 2020-04-19 2 30 3 283 6 261 17 247
# 2020-04-26 1 28 8 208 4 161 33 227
# 2020-05-03 1 6 6 99 5 64 23 97
# 2020-05-10 0 2 7 39 3 13 8 38
# 2020-05-17 2 1 10 12 2 8 9 6
# 2020-05-24 3 1 18 25 8 8 5 16
# 2020-05-31 0 2 13 76 8 26 9 13
# 2020-06-07 1 4 17 96 7 10 12 14
# 2020-06-14 3 3 84 93 13 16 17 11
# 2020-06-21 4 6 75 161 18 47 8 15
# 2020-06-28 4 30 37 423 19 88 20 31
# 2020-07-05 44 35 211 613 92 99 46 22
# 2020-07-12 62 4 803 203 250 45 113 4
# 2020-07-19 78 0 1140 0 414 0 171 0
# 2020-07-26 48 0 975 0 320 0 134 0
The graphs of young people and elderly people are as follows.
df_dc[:-1]['20-39'].plot.bar(stacked=True)
df_dc[:-1]['60-'].plot.bar(stacked=True)
As you might expect, the share of rows where the discharged flag is not yet 1 is higher for older people even for older publication dates, suggesting that their hospital stays tend to be longer.
Normalize to make the ratios easier to see.
x_young = df_dc[9:-1]['20-39']
x_young_norm = (x_young.T / x_young.sum(axis=1)).T
print(x_young_norm)
# discharged 0 1
# date
# 2020-03-22 0.000000 1.000000
# 2020-03-29 0.004098 0.995902
# 2020-04-05 0.011876 0.988124
# 2020-04-12 0.000000 1.000000
# 2020-04-19 0.010490 0.989510
# 2020-04-26 0.037037 0.962963
# 2020-05-03 0.057143 0.942857
# 2020-05-10 0.152174 0.847826
# 2020-05-17 0.454545 0.545455
# 2020-05-24 0.418605 0.581395
# 2020-05-31 0.146067 0.853933
# 2020-06-07 0.150442 0.849558
# 2020-06-14 0.474576 0.525424
# 2020-06-21 0.317797 0.682203
# 2020-06-28 0.080435 0.919565
# 2020-07-05 0.256068 0.743932
# 2020-07-12 0.798211 0.201789
# 2020-07-19 1.000000 0.000000
x_young_norm.plot.bar(stacked=True)
x_old = df_dc[9:-1]['60-']
x_old_norm = (x_old.T / x_old.sum(axis=1)).T
print(x_old_norm)
# discharged 0 1
# date
# 2020-03-22 0.019608 0.980392
# 2020-03-29 0.060811 0.939189
# 2020-04-05 0.040590 0.959410
# 2020-04-12 0.035714 0.964286
# 2020-04-19 0.064394 0.935606
# 2020-04-26 0.126923 0.873077
# 2020-05-03 0.191667 0.808333
# 2020-05-10 0.173913 0.826087
# 2020-05-17 0.600000 0.400000
# 2020-05-24 0.238095 0.761905
# 2020-05-31 0.409091 0.590909
# 2020-06-07 0.461538 0.538462
# 2020-06-14 0.607143 0.392857
# 2020-06-21 0.347826 0.652174
# 2020-06-28 0.392157 0.607843
# 2020-07-05 0.676471 0.323529
# 2020-07-12 0.965812 0.034188
# 2020-07-19 1.000000 0.000000
x_old_norm.plot.bar(stacked=True)
The shares of young and older people whose discharged flag is not yet 1 are compared below. As expected, older people tend to be hospitalized for longer.
pd.DataFrame({'20-39': x_young_norm[0], '60-': x_old_norm[0]}).plot.bar()
The examples so far could be handled fairly easily with the DataFrame plot() method and seaborn functions, but in some cases you need to work with Matplotlib directly.
Take the transition of the total number of new positives as an example.
Here, value_counts() is applied to the date column to get the total number of new positives for each publication date. Note that without sort_index() the result is ordered by count in descending order, not by date.
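As a quick illustration (my addition, not in the original), without sort_index() the dates with the most cases come first:
print(df['date'].value_counts().head(3))
# ordered by count, not chronologically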
s_total = df['date'].value_counts().sort_index()
print(s_total)
# 2020-01-24 1
# 2020-01-25 1
# 2020-01-30 1
# 2020-02-13 1
# 2020-02-14 2
# ...
# 2020-07-27 131
# 2020-07-28 266
# 2020-07-29 250
# 2020-07-30 367
# 2020-07-31 463
# Name: date, Length: 164, dtype: int64
print(type(s_total))
# <class 'pandas.core.series.Series'>
print(type(s_total.index))
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>
Unlike the previous examples this is a Series rather than a DataFrame, but the idea is the same either way.
When a bar chart is generated with plot.bar(), the X-axis labels overlap as shown below.
s_total.plot.bar()
In the week-over-week example I wrote that a line chart created with plot() formats the dates nicely, but here that does not happen.
s_total.plot()
The reason plot() does not format the dates here seems to be that the dates in the index are not at a regular frequency (apologies if this is wrong; I have not looked into it in detail). In the week-over-week example the weekly data had no gaps, whereas here there are no rows for dates with zero new positives, such as 2020-01-26 and 2020-01-27.
Using reindex() and pd.date_range(), add rows with the value 0 for the dates on which there were no new positives.
s_total_re = s_total.reindex(
index=pd.date_range(s_total.index[0], s_total.index[-1]),
fill_value=0
)
print(s_total_re)
# 2020-01-24 1
# 2020-01-25 1
# 2020-01-26 0
# 2020-01-27 0
# 2020-01-28 0
# ...
# 2020-07-27 131
# 2020-07-28 266
# 2020-07-29 250
# 2020-07-30 367
# 2020-07-31 463
# Freq: D, Name: date, Length: 190, dtype: int64
Now plot() formats the dates nicely. Incidentally, pass logy=True if you want a log scale.
s_total_re.plot()
s_total_re.plot(logy=True)
Even in this case, plot.bar() still does not handle the dates well.
s_total_re.plot.bar()
For chart types other than line charts, you have to handle this with Matplotlib. Set a Formatter and a Locator, and draw the chart with Matplotlib's bar().
fig, ax = plt.subplots(figsize=(12, 4))
ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_tick_params(rotation=90)
ax.bar(s_total.index, s_total)
Use set_yscale('log') if you want a log scale.
fig, ax = plt.subplots(figsize=(12, 4))
ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_tick_params(rotation=90)
ax.set_yscale('log')
ax.bar(s_total.index, s_total)
If you want to add a moving average, use rolling().
print(s_total.rolling(7).mean())
# 2020-01-24 NaN
# 2020-01-25 NaN
# 2020-01-30 NaN
# 2020-02-13 NaN
# 2020-02-14 NaN
# ...
# 2020-07-27 252.285714
# 2020-07-28 256.428571
# 2020-07-29 258.142857
# 2020-07-30 258.285714
# 2020-07-31 287.285714
# Name: date, Length: 164, dtype: float64
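One caveat (my own note, not from the original): because s_total skips dates with zero new positives, rolling(7) here averages the last seven recorded dates rather than a strict seven-calendar-day window. Applying it to the gap-filled s_total_re would give a calendar-based seven-day average:
print(s_total_re.rolling(7).mean().tail())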
fig, ax = plt.subplots(figsize=(12, 4))
ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_minor_locator(mpl.dates.DayLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))
ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)
ax.grid(linestyle='--')
ax.margins(x=0)
ax.bar(s_total.index, s_total, width=1, color='#c0e0c0', edgecolor='black')
ax.plot(s_total.index, s_total.rolling(7).mean(), color='red')
A few extra settings are included in the example above for reference. margins(x=0) removes the left and right margins. Colors can be specified by name or by color code.
The data available is limited, but I think you can deepen your understanding by playing around with it yourself. Please try it out.