Practical data analysis with Python and pandas (Tokyo COVID-19 data edition)

Introduction

Every time I saw reports on the number of new coronavirus infections, I found myself thinking, "I wish they would tell me more about the breakdown by age group." Then I learned that the Tokyo Metropolitan Government has released data on positive patients.

Here, I will introduce how to analyze and visualize the data published by Tokyo using Python with pandas, seaborn, and Matplotlib.

The main point of this article is not predictions or recommendations such as "this will happen next" or "these measures should be taken", but rather "visualizing the data is this easy, so try it yourself". Working through it yourself deepens your understanding, so please give it a try.

Note that if you are particular about details such as graph layout and axis formatting, Matplotlib requires some tedious work, so I will not go into much detail here (only a light touch at the end). The goal is not to create polished graphs for wide publication, but to visualize the data and see trends for yourself.

Sample code is also available on GitHub. The Jupyter Notebook (.ipynb) is easier to read, so please refer to it as well.

Data overview

Tokyo

The positive patient data for Tokyo is published below.

- Tokyo Metropolitan New Coronavirus Positive Patient Announcement Details - Dataset - Tokyo Open Data Catalog Site
- CSV file: https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv

You can reach it from the "Get open data" link on the Tokyo Metropolitan Government's new coronavirus infection control site.

Looking at the update history, the data seems to be updated between 10:00 and 15:00 on weekdays.

Other prefectures

Below is a list of sites that have forked Tokyo's countermeasure site.

Like Tokyo's, some of these sites link to open data. The following are examples for Hokkaido and Kanagawa Prefecture.

- Data on new coronavirus infections [Hokkaido] - Hokkaido Open Data Portal Site
- Countermeasures against new coronavirus infections: number of positive patients and attribute data of positive patients - Kanagawa Prefecture homepage

Even if a site has no such link, the data itself should be published somewhere, so you may be able to find it by searching.

The following sample code uses the Tokyo data. Data from other prefectures may have different columns, but the basic handling is the same.

Other data

In addition, the Ministry of Health, Labour and Welfare publishes aggregated data related to the new coronavirus, such as the number of PCR tests performed nationwide, the number of positives, the number of hospitalized persons, and the number of deaths.

- Open Data | Ministry of Health, Labour and Welfare

Library version

The versions of Python and each library used in the sample code below are as follows. Note that behavior may differ with other versions.

import math
import sys

import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

print(pd.__version__)
# 1.0.5

print(mpl.__version__)
# 3.3.0

print(sns.__version__)
# 0.10.1

print(sys.version)
# 3.8.5 (default, Jul 21 2020, 10:48:26) 
# [Clang 11.0.3 (clang-1103.0.32.62)]

Data confirmation and preprocessing

Specify the path to the downloaded CSV file in pd.read_csv() and read it as a DataFrame. Data up to July 31, 2020 is used as an example.

df = pd.read_csv('data/130001_tokyo_covid19_patients_20200731.csv')

You can also pass the URL directly to pd.read_csv(), but during trial and error the file would be downloaded over and over, so it is safer to download it locally first.

# df = pd.read_csv('https://stopcovid19.metro.tokyo.lg.jp/data/130001_tokyo_covid19_patients.csv')

The number of rows and columns and the first and last rows of the data are as follows.

print(df.shape)
# (12691, 16)

print(df.head())
#    No  National Local Public Organization Code Prefecture name Municipality name Published_Date Day of the week Onset_Date patient_Place of residence patient_Age patient_Gender \
# 0   1  130001  Tokyo  NaN  2020-01-24  Fri  NaN  Wuhan City, Hubei Province     40s  Male
# 1   2  130001  Tokyo  NaN  2020-01-25  Sat  NaN  Wuhan City, Hubei Province     30s  Female
# 2   3  130001  Tokyo  NaN  2020-01-30  Thu  NaN  Changsha City, Hunan Province  30s  Female
# 3   4  130001  Tokyo  NaN  2020-02-13  Thu  NaN  In Tokyo                       70s  Male
# 4   5  130001  Tokyo  NaN  2020-02-14  Fri  NaN  In Tokyo                       50s  Female
#
#   patient_Attribute patient_Condition patient_Symptom patient_Travel history flag Remarks Discharged flag
# 0  NaN  NaN  NaN  NaN  NaN  1.0
# 1  NaN  NaN  NaN  NaN  NaN  1.0
# 2  NaN  NaN  NaN  NaN  NaN  1.0
# 3  NaN  NaN  NaN  NaN  NaN  1.0
# 4  NaN  NaN  NaN  NaN  NaN  1.0

print(df.tail())
#        No     National Local Public Organization Code Prefecture name Municipality name Published_Date Day of the week Onset_Date patient_Place of residence patient_Age \
# 12686  12532  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12687  12558  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12688  12563  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  70s
# 12689  12144  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  80s
# 12690  12517  130001  Tokyo  NaN  2020-07-31  Fri  NaN  NaN  80s
#
#       patient_Gender patient_Attribute patient_Condition patient_Symptom patient_Travel history flag Remarks Discharged flag
# 12686  Male    NaN  NaN  NaN  NaN  NaN  NaN
# 12687  Male    NaN  NaN  NaN  NaN  NaN  NaN
# 12688  Male    NaN  NaN  NaN  NaN  NaN  NaN
# 12689  Female  NaN  NaN  NaN  NaN  NaN  NaN
# 12690  Male    NaN  NaN  NaN  NaN  NaN  NaN

Since this dataset consists mainly of categorical data, methods such as `count()`, `nunique()`, `unique()`, and `value_counts()` make it easy to get an overview.

`count()` returns the number of elements that are not the missing value NaN. You can see that detailed information such as municipality names, symptoms, and attributes is not disclosed (no data), probably for privacy protection.

print(df.count())
# No                                        12691
# National Local Public Organization Code   12691
# Prefecture name                           12691
# Municipality name                             0
# Published_Date                            12691
# Day of the week                           12691
# Onset_Date                                    0
# patient_Place of residence                12228
# patient_Age                               12691
# patient_Gender                            12691
# patient_Attribute                             0
# patient_Condition                             0
# patient_Symptom                               0
# patient_Travel history flag                   0
# Remarks                                       0
# Discharged flag                            7186
# dtype: int64

`nunique()` returns the number of unique values. Since this is Tokyo data, the national local government code and prefecture name are the same in every row.

print(df.nunique())
# No                                        12691
# National Local Public Organization Code       1
# Prefecture name                               1
# Municipality name                             0
# Published_Date                              164
# Day of the week                               7
# Onset_Date                                    0
# patient_Place of residence                    8
# patient_Age                                  13
# patient_Gender                                5
# patient_Attribute                             0
# patient_Condition                             0
# patient_Symptom                               0
# patient_Travel history flag                   0
# Remarks                                       0
# Discharged flag                               1
# dtype: int64

For each column (= Series), you can check the unique values and their occurrence counts with `unique()` and `value_counts()`.

print(df['patient_residence'].unique())
# ['Wuhan City, Hubei Province' 'Changsha City, Hunan Province' 'In Tokyo' 'Outside Tokyo' '―' 'investigating' '-' "'-" nan]

print(df['patient_residence'].value_counts(dropna=False))
# In Tokyo                          11271
# Outside Tokyo                       531
# NaN                                 463
# ―                                   336
# investigating                        85
# Wuhan City, Hubei Province            2
# Changsha City, Hunan Province         1
# '-                                    1
# -                                     1
# Name: patient_residence, dtype: int64

print(df['patient_sex'].unique())
# ['Male' 'Female' "'-" '―' 'unknown']

print(df['patient_sex'].value_counts())
# Male       7550
# Female     5132
# '-            7
# unknown       1
# ―             1
# Name: patient_sex, dtype: int64

This time, the analysis is narrowed down to the publication date, the patient's age, and the discharged flag. For convenience, rename the columns with rename().

df = df[['Published_date', 'patient_Age', 'Discharged flag']].copy()

df.rename(columns={'Published_date': 'date_str', 'patient_Age': 'age_org', 'Discharged flag': 'discharged'},
          inplace=True)

print(df)
#          date_str age_org  discharged
# 0      2020-01-24 40s 1.0
# 1      2020-01-25 30s 1.0
# 2      2020-01-30 30s 1.0
# 3      2020-02-13 70s 1.0
# 4      2020-02-14 50s 1.0
# ...           ...     ...         ...
# 12686  2020-07-31 70s NaN
# 12687  2020-07-31 70s NaN
# 12688  2020-07-31 70s NaN
# 12689  2020-07-31 80s NaN
# 12690  2020-07-31 80s NaN
# 
# [12691 rows x 3 columns]

copy() is used here to prevent SettingWithCopyWarning. In this case the original data is not updated afterward, so strictly speaking it could be omitted.

-How to deal with Pandas SettingWithCopyWarning
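As a toy sketch (with hypothetical data, not the Tokyo CSV) of what `copy()` buys: a column subset taken with `.copy()` is independent of the original, so it can be modified freely without a SettingWithCopyWarning and without touching the source.

```python
import pandas as pd

src = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Without .copy(), sub may be a view of src, and assigning into it
# can raise SettingWithCopyWarning; with .copy() it is independent.
sub = src[['a']].copy()
sub['a'] = sub['a'] * 10

print(sub['a'].tolist())  # [10, 20, 30]
print(src['a'].tolist())  # original untouched: [1, 2, 3]
```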

Looking at the age column, it contains values such as 'unknown' and "'-".

print(df['age_org'].unique())
# ['40s' '30s' '70s' '50s' '60s' '80s' '20s' 'Under 10 years old' '90s' '10s' '100 years and over'
#  'unknown' "'-"]

print(df['age_org'].value_counts())
# 20s                   4166
# 30s                   2714
# 40s                   1741
# 50s                   1362
# 60s                    832
# 70s                    713
# 80s                    455
# 10s                    281
# 90s                    214
# Under 10 years old     200
# unknown                  6
# 100 years and over       5
# '-                       2
# Name: age_org, dtype: int64

Since these values are few, exclude them here.

df = df[~df['age_org'].isin(['unknown', "'-"])]

print(df)
#          date_str age_org  discharged
# 0      2020-01-24 40s 1.0
# 1      2020-01-25 30s 1.0
# 2      2020-01-30 30s 1.0
# 3      2020-02-13 70s 1.0
# 4      2020-02-14 50s 1.0
# ...           ...     ...         ...
# 12686  2020-07-31 70s NaN
# 12687  2020-07-31 70s NaN
# 12688  2020-07-31 70s NaN
# 12689  2020-07-31 80s NaN
# 12690  2020-07-31 80s NaN
# 
# [12683 rows x 3 columns]

print(df['age_org'].unique())
# ['40s' '30s' '70s' '50s' '60s' '80s' '20s' 'Under 10 years old' '90s' '10s' '100 years and over']

Since the age bins are fine-grained, coarsen them a little. The entire right-hand side is wrapped in parentheses () because the method chain spans multiple lines.

-Write a method chain with line breaks in Python

df['age'] = (
    df['age_org'].replace(['Under 10 years old', '10s'], '0-19')
    .replace(['20s', '30s'], '20-39')
    .replace(['40s', '50s'], '40-59')
    .replace(['60s', '70s', '80s', '90s', '100 years and over'], '60-')
)

print(df['age'].unique())
# ['40-59' '20-39' '60-' '0-19']

print(df['age'].value_counts())
# 20-39    6880
# 40-59    3103
# 60-      2219
# 0-19      481
# Name: age, dtype: int64
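The chained replace() above can equivalently be written as a single replace() with a mapping dict; a toy sketch using the same age labels on a small hypothetical Series:

```python
import pandas as pd

# One dict instead of four chained replace() calls.
mapping = {
    'Under 10 years old': '0-19', '10s': '0-19',
    '20s': '20-39', '30s': '20-39',
    '40s': '40-59', '50s': '40-59',
    '60s': '60-', '70s': '60-', '80s': '60-', '90s': '60-',
    '100 years and over': '60-',
}

s = pd.Series(['20s', '80s', 'Under 10 years old', '40s'])
print(s.replace(mapping).tolist())  # ['20-39', '60-', '0-19', '40-59']
```

The dict form makes the mapping easier to review and extend when labels change.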

The publication date column date_str is a string. For the processing that follows, add a column date converted to the datetime64[ns] type.

df['date'] = pd.to_datetime(df['date_str'])

print(df.dtypes)
# date_str              object
# age_org               object
# discharged           float64
# age                   object
# date          datetime64[ns]
# dtype: object
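A toy sketch of pd.to_datetime(): strings become datetime64[ns]. The errors='coerce' option (not needed for this dataset) turns unparsable values into NaT instead of raising.

```python
import pandas as pd

s = pd.Series(['2020-01-24', '2020-07-31', 'not a date'])
dt = pd.to_datetime(s, errors='coerce')  # unparsable values become NaT

print(dt.dtype)              # datetime64[ns]
print(int(dt.isna().sum()))  # 1
```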

This is the end of preprocessing. From here, an example of actually analyzing and visualizing data is shown.

Changes in the number of new positive patients by age group

Here, we look at changes in the number of new positive patients by age group. The total number of new positive patients will be described at the end as an example of processing with Matplotlib.

Stacked bar graph

Use pd.crosstab() to cross-tabulate the publication date against age.

df_ct = pd.crosstab(df['date'], df['age'])

print(df_ct)
# age         0-19  20-39  40-59  60-
# date                               
# 2020-01-24     0      0      1    0
# 2020-01-25     0      1      0    0
# 2020-01-30     0      1      0    0
# 2020-02-13     0      0      0    1
# 2020-02-14     0      0      1    1
# ...          ...    ...    ...  ...
# 2020-07-27     5     79     34   13
# 2020-07-28    13    168     65   20
# 2020-07-29     9    160     56   25
# 2020-07-30    11    236     83   37
# 2020-07-31    10    332     82   39
# 
# [164 rows x 4 columns]

print(type(df_ct.index))
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

The column converted to datetime64[ns] becomes the new index and is treated as a DatetimeIndex. Note that if the string-type date column were specified instead, the printed output would look the same but the index would not be a DatetimeIndex.

Aggregate weekly with resample(). resample() can only be used with a datetime-like index such as DatetimeIndex.

df_ct_week = df_ct.resample('W', label='left').sum()

print(df_ct_week)
# age         0-19  20-39  40-59  60-
# date                               
# 2020-01-19     0      1      1    0
# 2020-01-26     0      1      0    0
# 2020-02-02     0      0      0    0
# 2020-02-09     0      2      5    9
# 2020-02-16     0      1      3    6
# 2020-02-23     0      2      3    5
# 2020-03-01     2      5      9    9
# 2020-03-08     0      5     10   11
# 2020-03-15     0     10     27   12
# 2020-03-22     7    100     88  102
# 2020-03-29    16    244    198  148
# 2020-04-05    21    421    369  271
# 2020-04-12    30    350    375  280
# 2020-04-19    32    286    267  264
# 2020-04-26    29    216    165  260
# 2020-05-03     7    105     69  120
# 2020-05-10     2     46     16   46
# 2020-05-17     3     22     10   15
# 2020-05-24     4     43     16   21
# 2020-05-31     2     89     34   22
# 2020-06-07     5    113     17   26
# 2020-06-14     6    177     29   28
# 2020-06-21    10    236     65   23
# 2020-06-28    34    460    107   51
# 2020-07-05    79    824    191   68
# 2020-07-12    66   1006    295  117
# 2020-07-19    78   1140    414  171
# 2020-07-26    48    975    320  134

Visualize with plot.bar(). A stacked bar graph can be created simply by passing stacked=True.

df_ct_week[:-1].plot.bar(stacked=True)

bar_ct_week.png

The last row (the final week) is excluded with [:-1], because that week does not include Saturday (August 1, 2020) and is not appropriate to compare with the other weeks.

In Jupyter Notebook the graph is displayed in the output cell. To save it as an image file, use plt.savefig(). You can also save the image from the Jupyter Notebook output by right-clicking.

plt.figure()
df_ct_week[:-1].plot.bar(stacked=True)
plt.savefig('image/bar_chart.png', bbox_inches='tight')
plt.close('all')

The exact conditions are unclear, but there was a problem where the X-axis labels were cut off when saving; setting bbox_inches='tight' as above solves it.
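A minimal self-contained sketch of saving a plot without clipped tick labels, assuming a non-interactive backend (the data and filename here are arbitrary placeholders):

```python
import os

import matplotlib
matplotlib.use('Agg')  # render without a display (e.g. when run as a script)
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([3, 1, 4], index=['2020-07-05', '2020-07-12', '2020-07-19'])
ax = s.plot.bar()

# bbox_inches='tight' keeps the rotated tick labels inside the saved image.
ax.figure.savefig('bar_demo.png', bbox_inches='tight')
plt.close('all')

saved = os.path.exists('bar_demo.png')
print(saved)  # True
```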

If you create a bar graph as-is, as in the example above, times are shown in the X-axis labels. The simplest workaround is to convert the index to strings in any format you like.

df_ct_week_str = df_ct_week.copy()
df_ct_week_str.index = df_ct_week_str.index.strftime('%Y-%m-%d')

df_ct_week_str[:-1].plot.bar(stacked=True, figsize=(8, 4))

bar_ct_week_str.png

Normalize the rows to see how the age composition changes over time. T is transpose (swapping rows and columns): transpose, divide by the row totals, and transpose back, so that each row is normalized.

Since June, young people (20s-30s) have accounted for the majority, but recently the share of middle-aged and older people (40s and above) has been increasing.

df_ct_week_str_norm = (df_ct_week_str.T / df_ct_week_str.sum(axis=1)).T

df_ct_week_str_norm[:-1].plot.bar(stacked=True, figsize=(8, 4))

bar_ct_week_str_norm.png
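A toy sketch (hypothetical numbers) of the transpose trick used above: transpose, divide by the row totals, transpose back, and each row then sums to 1.

```python
import pandas as pd

df_x = pd.DataFrame({'a': [1, 2], 'b': [3, 2]}, index=['r1', 'r2'])

# Transpose, divide by row totals, transpose back: row-wise normalization.
norm = (df_x.T / df_x.sum(axis=1)).T

print(norm['a'].tolist())         # [0.25, 0.5]
print(norm.sum(axis=1).tolist())  # [1.0, 1.0]
```

The same result can also be obtained without transposing, via `df_x.div(df_x.sum(axis=1), axis=0)`.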

The changes for young people (20s-30s) and older people (60 and over) are as follows. The absolute number among older people has also climbed back to the level of late March.

df_ct_week_str[:-1][['20-39', '60-']].plot.bar(figsize=(8, 4))

bar_ct_week_str_young_old.png

Line graph (compared to the previous week)

To see the momentum of the spread of infection, calculate the change from the previous week.

shift() shifts the data by one row, so dividing the original data by the shifted data gives the week-over-week ratio.

df_week_ratio = df_ct_week / df_ct_week.shift()

print(df_week_ratio)
# age             0-19      20-39     40-59       60-
# date                                               
# 2020-01-19       NaN        NaN       NaN       NaN
# 2020-01-26       NaN   1.000000  0.000000       NaN
# 2020-02-02       NaN   0.000000       NaN       NaN
# 2020-02-09       NaN        inf       inf       inf
# 2020-02-16       NaN   0.500000  0.600000  0.666667
# 2020-02-23       NaN   2.000000  1.000000  0.833333
# 2020-03-01       inf   2.500000  3.000000  1.800000
# 2020-03-08  0.000000   1.000000  1.111111  1.222222
# 2020-03-15       NaN   2.000000  2.700000  1.090909
# 2020-03-22       inf  10.000000  3.259259  8.500000
# 2020-03-29  2.285714   2.440000  2.250000  1.450980
# 2020-04-05  1.312500   1.725410  1.863636  1.831081
# 2020-04-12  1.428571   0.831354  1.016260  1.033210
# 2020-04-19  1.066667   0.817143  0.712000  0.942857
# 2020-04-26  0.906250   0.755245  0.617978  0.984848
# 2020-05-03  0.241379   0.486111  0.418182  0.461538
# 2020-05-10  0.285714   0.438095  0.231884  0.383333
# 2020-05-17  1.500000   0.478261  0.625000  0.326087
# 2020-05-24  1.333333   1.954545  1.600000  1.400000
# 2020-05-31  0.500000   2.069767  2.125000  1.047619
# 2020-06-07  2.500000   1.269663  0.500000  1.181818
# 2020-06-14  1.200000   1.566372  1.705882  1.076923
# 2020-06-21  1.666667   1.333333  2.241379  0.821429
# 2020-06-28  3.400000   1.949153  1.646154  2.217391
# 2020-07-05  2.323529   1.791304  1.785047  1.333333
# 2020-07-12  0.835443   1.220874  1.544503  1.720588
# 2020-07-19  1.181818   1.133201  1.403390  1.461538
# 2020-07-26  0.615385   0.855263  0.772947  0.783626

df_week_ratio['2020-05-03':'2020-07-25'].plot(grid=True)

line_week_ratio.png

In July, the week-on-week rate has been declining in each age group.

Also, unlike with the bar graph, when creating a line graph with plot() (or plot.line()), datetime values on the X-axis are formatted appropriately, as in the example above. Note that, as described later, they may not be formatted depending on the datetime index.
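The shift() ratio above can be sketched on a toy Series: dividing a series by itself shifted one step gives the period-over-period ratio, with NaN for the first period (and inf wherever the previous value was 0, as seen in the table above).

```python
import math

import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 3.0])
ratio = s / s.shift()  # each value divided by the previous one

print(ratio.tolist())  # [nan, 2.0, 1.5, 0.5]
```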

Heat map

A heat map is created as another approach to grasp the transition of the number of new positive patients by age group.

Here the detailed age groups are used as-is. Cross-tabulate with pd.crosstab() as in the stacked bar chart example; since resample() is not used, the string-type date column date_str is specified. Transpose with T so that the dates run along the horizontal axis, and reverse the row order after transposing with [::-1] so that the younger age groups appear at the bottom.

df['age_detail'] = df['age_org'].replace(
    {'Under 10 years old': '0-9', '10s': '10-19', '20s': '20-29', '30s': '30-39', '40s': '40-49', '50s': '50-59',
     '60s': '60-69', '70s': '70-79', '80s': '80-89', '90s': '90-', '100 years and over': '90-'}
)

df_ct_hm = pd.crosstab(df['date_str'], df['age_detail']).T[::-1]

The seaborn function heatmap () is useful for creating heatmaps.

plt.figure(figsize=(15, 5))
sns.heatmap(df_ct_hm, cmap='hot')

heatmap.png

It can be confirmed that the infection has gradually spread to the elderly since June.

A log-scale heat map can be created as follows. A warning was emitted, but it worked for the time being.

Note that a value of 0 in the data causes an error (log of zero), so as a rough workaround 0 is replaced with 0.1.

df_ct_hm_re = df_ct_hm.replace({0: 0.1})

min_value = df_ct_hm_re.values.min()
max_value = df_ct_hm_re.values.max()

log_norm = mpl.colors.LogNorm(vmin=min_value, vmax=max_value)
cbar_ticks = [math.pow(10, i) for i in range(math.floor(math.log10(min_value)),
                                             1 + math.ceil(math.log10(max_value)))]

plt.figure(figsize=(15, 5))
sns.heatmap(df_ct_hm_re, norm=log_norm, cbar_kws={"ticks": cbar_ticks})

heatmap_log.png

By the way, I learned the idea of visualizing with a heat map by looking at the Florida example in the following article.

- There is no evidence that the new coronavirus has attenuated (Satoshi Kutsuna) - Individual - Yahoo! News

@Zorinaq, who created the Florida graph, has published the code for creating various graphs, including future forecasts in addition to heat maps. It looks difficult without some Python knowledge, but if you are interested, take a look.

Discharged flag

Important note

As shown in the count() result above, there are 7186 rows with the discharged flag set to 1 in the public data, but on the Tokyo Metropolitan Government's new coronavirus infection control site (https://stopcovid19.metro.tokyo.lg.jp/), "Discharged, etc. (including those whose medical treatment period has elapsed)" is 9615 (as of the July 31, 2020, 20:30 update).

tokyo_stopcovid.png

Whether the public data is simply delayed or there is some other reason is unclear, but keep in mind that the discharged flag in the public data may differ from the current status.

Stacked bar graph

As with the changes in positives by age group, view the changes in the discharged flag as a stacked bar graph. As preprocessing, replace the missing value NaN with 0.

print(df['discharged'].unique())
# [ 1. nan]

df['discharged'] = df['discharged'].fillna(0).astype('int')

print(df['discharged'].unique())
# [1 0]

pd.crosstab(df['date'], df['discharged']).resample('W', label='left').sum()[:-1].plot.bar(stacked=True)

bar_discharged_all.png

If the times shown in the X-axis labels bother you, convert the datetime index to strings as in the age-group example. Here it is left as-is; the same applies to the following examples.

This graph shows, for each publication date, how many cases have the discharged flag set. Naturally, many of those who tested positive long ago (= older publication dates) have been discharged (= discharged flag is 1).

Check by age group. If you pass a list of multiple columns to pd.crosstab(), the result has MultiIndex columns.

df_dc = pd.crosstab(df['date'], [df['age'], df['discharged']]).resample('W', label='left').sum()

print(df_dc)
# age        0-19     20-39      40-59       60-     
# discharged    0   1     0    1     0    1    0    1
# date                                               
# 2020-01-19    0   0     0    1     0    1    0    0
# 2020-01-26    0   0     0    1     0    0    0    0
# 2020-02-02    0   0     0    0     0    0    0    0
# 2020-02-09    0   0     0    2     0    5    0    9
# 2020-02-16    0   0     0    1     0    3    0    6
# 2020-02-23    0   0     0    2     0    3    0    5
# 2020-03-01    0   2     0    5     0    9    0    9
# 2020-03-08    0   0     0    5     0   10    1   10
# 2020-03-15    0   0     0   10     0   27    0   12
# 2020-03-22    0   7     0  100     0   88    2  100
# 2020-03-29    0  16     1  243     4  194    9  139
# 2020-04-05    0  21     5  416     1  368   11  260
# 2020-04-12    1  29     0  350     6  369   10  270
# 2020-04-19    2  30     3  283     6  261   17  247
# 2020-04-26    1  28     8  208     4  161   33  227
# 2020-05-03    1   6     6   99     5   64   23   97
# 2020-05-10    0   2     7   39     3   13    8   38
# 2020-05-17    2   1    10   12     2    8    9    6
# 2020-05-24    3   1    18   25     8    8    5   16
# 2020-05-31    0   2    13   76     8   26    9   13
# 2020-06-07    1   4    17   96     7   10   12   14
# 2020-06-14    3   3    84   93    13   16   17   11
# 2020-06-21    4   6    75  161    18   47    8   15
# 2020-06-28    4  30    37  423    19   88   20   31
# 2020-07-05   44  35   211  613    92   99   46   22
# 2020-07-12   62   4   803  203   250   45  113    4
# 2020-07-19   78   0  1140    0   414    0  171    0
# 2020-07-26   48   0   975    0   320    0  134    0

The graphs of young people and elderly people are as follows.

df_dc[:-1]['20-39'].plot.bar(stacked=True)

bar_discharged_young.png

df_dc[:-1]['60-'].plot.bar(stacked=True)

bar_discharged_old.png

As you might expect, the share of rows whose discharged flag is not yet 1 is higher among older people even for older publication dates, suggesting that hospitalization tends to be prolonged.

Normalize to make the ratios easier to see.

x_young = df_dc[9:-1]['20-39']
x_young_norm = (x_young.T / x_young.sum(axis=1)).T

print(x_young_norm)
# discharged         0         1
# date                          
# 2020-03-22  0.000000  1.000000
# 2020-03-29  0.004098  0.995902
# 2020-04-05  0.011876  0.988124
# 2020-04-12  0.000000  1.000000
# 2020-04-19  0.010490  0.989510
# 2020-04-26  0.037037  0.962963
# 2020-05-03  0.057143  0.942857
# 2020-05-10  0.152174  0.847826
# 2020-05-17  0.454545  0.545455
# 2020-05-24  0.418605  0.581395
# 2020-05-31  0.146067  0.853933
# 2020-06-07  0.150442  0.849558
# 2020-06-14  0.474576  0.525424
# 2020-06-21  0.317797  0.682203
# 2020-06-28  0.080435  0.919565
# 2020-07-05  0.256068  0.743932
# 2020-07-12  0.798211  0.201789
# 2020-07-19  1.000000  0.000000

x_young_norm.plot.bar(stacked=True)

bar_discharged_young_norm.png

x_old = df_dc[9:-1]['60-']
x_old_norm = (x_old.T / x_old.sum(axis=1)).T

print(x_old_norm)
# discharged         0         1
# date                          
# 2020-03-22  0.019608  0.980392
# 2020-03-29  0.060811  0.939189
# 2020-04-05  0.040590  0.959410
# 2020-04-12  0.035714  0.964286
# 2020-04-19  0.064394  0.935606
# 2020-04-26  0.126923  0.873077
# 2020-05-03  0.191667  0.808333
# 2020-05-10  0.173913  0.826087
# 2020-05-17  0.600000  0.400000
# 2020-05-24  0.238095  0.761905
# 2020-05-31  0.409091  0.590909
# 2020-06-07  0.461538  0.538462
# 2020-06-14  0.607143  0.392857
# 2020-06-21  0.347826  0.652174
# 2020-06-28  0.392157  0.607843
# 2020-07-05  0.676471  0.323529
# 2020-07-12  0.965812  0.034188
# 2020-07-19  1.000000  0.000000

x_old_norm.plot.bar(stacked=True)

bar_discharged_old_norm.png

The shares of young and older people whose discharged flag is not yet 1 are compared below. As expected, older people tend to be hospitalized for longer.

pd.DataFrame({'20-39': x_young_norm[0], '60-': x_old_norm[0]}).plot.bar()

bar_discharged_young_old_norm.png

Processing with Matplotlib

The examples so far could be handled relatively easily with the DataFrame plot() method and seaborn functions, but in some cases you need to work with Matplotlib directly.

Take the transition of the total number of new positives as an example.

Here, value_counts() is applied to the datetime column to get the total number of new positives per publication date. Note that without sort_index() the result is ordered by count in descending order, not chronologically.

s_total = df['date'].value_counts().sort_index()

print(s_total)
# 2020-01-24      1
# 2020-01-25      1
# 2020-01-30      1
# 2020-02-13      1
# 2020-02-14      2
#              ... 
# 2020-07-27    131
# 2020-07-28    266
# 2020-07-29    250
# 2020-07-30    367
# 2020-07-31    463
# Name: date, Length: 164, dtype: int64

print(type(s_total))
# <class 'pandas.core.series.Series'>

print(type(s_total.index))
# <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

Unlike the previous examples this is a Series rather than a DataFrame, but the idea is the same either way.

When a bar graph is generated with plot.bar(), the X-axis labels overlap as shown below.

s_total.plot.bar()

bar_total_ng.png

In the week-over-week example, I wrote that a line graph created with plot() formats datetimes appropriately, but here that does not happen.

s_total.plot()

line_total_ng.png

The reason plot() fails here seems to be that the datetime index is not at a regular frequency (I have not investigated this in detail, so it may be wrong).

In the week-over-week example, weekly data existed without gaps, but in this example no rows exist for dates with zero new positives, such as 2020-01-26 and 2020-01-27.

Using reindex() and pd.date_range(), add rows with the value 0 for dates on which the number of positives was 0.

s_total_re = s_total.reindex(
    index=pd.date_range(s_total.index[0], s_total.index[-1]),
    fill_value=0
)

print(s_total_re)
# 2020-01-24      1
# 2020-01-25      1
# 2020-01-26      0
# 2020-01-27      0
# 2020-01-28      0
#              ... 
# 2020-07-27    131
# 2020-07-28    266
# 2020-07-29    250
# 2020-07-30    367
# 2020-07-31    463
# Freq: D, Name: date, Length: 190, dtype: int64
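The reindex() + date_range() pattern above can be sketched on a toy Series (hypothetical values): dates missing from the index get rows with fill_value.

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(['2020-01-01', '2020-01-03']))

# Build a complete daily index and fill the missing date with 0.
full = s.reindex(pd.date_range(s.index[0], s.index[-1]), fill_value=0)

print(full.tolist())  # [1, 0, 2]
```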

Now plot() formats the dates appropriately. Incidentally, for a log scale, pass logy=True.

s_total_re.plot()

line_total_ok.png

s_total_re.plot(logy=True)

line_total_ok_log.png

Even in this case, plot.bar() does not help.

s_total_re.plot.bar()

bar_total_ng_2.png

For plot types other than line graphs, you have to handle the axes with Matplotlib yourself.

Set Formatter and Locator, and generate a graph with bar () of Matplotlib.

fig, ax = plt.subplots(figsize=(12, 4))

ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))

ax.xaxis.set_tick_params(rotation=90)

ax.bar(s_total.index, s_total)

bar_total_ok.png

Use set_yscale('log') if you want a log scale.

fig, ax = plt.subplots(figsize=(12, 4))

ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))

ax.xaxis.set_tick_params(rotation=90)

ax.set_yscale('log')

ax.bar(s_total.index, s_total)

bar_total_ok_log.png

If you want to add a moving average, use rolling ().

print(s_total.rolling(7).mean())
# 2020-01-24           NaN
# 2020-01-25           NaN
# 2020-01-30           NaN
# 2020-02-13           NaN
# 2020-02-14           NaN
#                  ...    
# 2020-07-27    252.285714
# 2020-07-28    256.428571
# 2020-07-29    258.142857
# 2020-07-30    258.285714
# 2020-07-31    287.285714
# Name: date, Length: 164, dtype: float64

fig, ax = plt.subplots(figsize=(12, 4))

ax.xaxis.set_major_locator(mpl.dates.AutoDateLocator())
ax.xaxis.set_minor_locator(mpl.dates.DayLocator())
ax.xaxis.set_major_formatter(mpl.dates.DateFormatter('%Y-%m-%d'))

ax.xaxis.set_tick_params(labelsize=12)
ax.yaxis.set_tick_params(labelsize=12)

ax.grid(linestyle='--')
ax.margins(x=0)

ax.bar(s_total.index, s_total, width=1, color='#c0e0c0', edgecolor='black')
ax.plot(s_total.index, s_total.rolling(7).mean(), color='red')

bar_total_with_ma.png

In the example above, a few extra settings are included for reference: margins(x=0) removes the left and right margins, and colors can be specified by name or by color code.
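The rolling() moving average can be sketched on a toy Series: a window of 2 yields NaN until the window is full, and min_periods (not used above) relaxes that if desired.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Plain rolling mean: NaN until the window of 2 is filled.
print(s.rolling(2).mean().tolist())                 # [nan, 1.5, 2.5, 3.5]

# min_periods=1 computes a partial-window mean for the leading points.
print(s.rolling(2, min_periods=1).mean().tolist())  # [1.0, 1.5, 2.5, 3.5]
```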

In closing

The available data is limited, but I think playing around with it yourself will deepen your understanding. Please give it a try.
