[CovsirPhy] COVID-19 Python Package for Data Analysis: Data loading

Introduction

We are creating a Python package CovsirPhy that allows you to easily download and analyze COVID-19 data (such as the number of PCR positives).

Introductory article:

The English version of the documentation is "CovsirPhy: COVID-19 analysis with phase-dependent SIRs" and the Kaggle notebook "COVID-19 data with SIR model".

**This time, I will explain how to download the actual COVID-19 data.** English edition:

1. Execution environment

CovsirPhy can be installed as follows. Please use Python 3.7 or later, or Google Colaboratory.

- Stable version: pip install covsirphy --upgrade
- Development version: pip install "git+https://github.com/lisphilar/covid19-sir.git#egg=covsirphy"

# For data display
from pprint import pprint
# CovsirPhy
import covsirphy as cs
cs.__version__
# '2.8.2'
Execution environment:
- OS: Windows Subsystem for Linux
- Python: version 3.8.5

The tables and graphs in this article were created with the data as of September 11, 2020.

2. Summary

You can download the data in the following 4 lines.

data_loader = cs.DataLoader("input")
jhu_data = data_loader.jhu()
population_data = data_loader.population()
oxcgrt_data = data_loader.oxcgrt()

The following three types of data are automatically downloaded from COVID-19 Data Hub [^1] and saved in the "input" directory (folder). The data are also cleaned and formatted automatically.

- Time-series data for each country/region on the numbers of infected/recovered/dead cases
- Population data for each country/region
- Oxford Covid-19 Government Response Tracker (OxCGRT): data quantifying the status of measures taken by each country against COVID-19

The data formatting is done on the CovsirPhy side, but the download itself depends on covid19dh, the official package of COVID-19 Data Hub. We also work with its developers [^2] to prevent errors, but if anything goes wrong, please contact us via the CovsirPhy issue page!

3. DataLoader class

A user interface for downloading and formatting the data. If you pass "input" as the first argument as shown below, each dataset will be downloaded to the "input" directory.
data_loader = cs.DataLoader("input")

The directory name can be changed. The first argument defaults to "input" and can be omitted, as in the sketch below.
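For example, a minimal sketch (the alternative directory name "downloaded" here is just an illustration):

# Use the default "input" directory (the first argument can be omitted)
data_loader = cs.DataLoader()
# Or save the downloaded files to another directory, e.g. "downloaded"
data_loader = cs.DataLoader("downloaded")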

4. DataLoader.jhu()

This method downloads the "time-series data for each country/region on the numbers of infected/recovered/dead cases". Data are downloaded only when the saved files are not up to date; otherwise the saved data are simply read and formatted.
# verbose=True: display the data source at download time
jhu_data = data_loader.jhu(verbose=True)
type(jhu_data)
# -> <class 'covsirphy.cleaning.jhu_data.JHUData'>

The method is named "jhu" because earlier versions downloaded the data directly from Johns Hopkins University.

The data source [^3] can be checked from the JHUData and DataLoader instances, as shown below.

[^3]: COVID-19 Data Hub is a secondary data source. Based on the Johns Hopkins University data, the database side performs preprocessing such as handling missing values. Thank you very much.

# COVID-19 Data Hub information -> (output omitted)
print(jhu_data.citation)
# List of data citation sources -> (output omitted)
print(data_loader.covid19dh_citation)
# View the downloaded raw data (pandas.DataFrame) -> (output omitted)
jhu_data.raw.tail()

JHUData.cleaned() returns the date, country name, region name, cumulative number of confirmed cases (PCR positives), current number of infected people, cumulative number of deaths, and cumulative number of recoveries in data frame format (pandas.DataFrame).

jhu_data.cleaned().tail()
Date Country Province Confirmed Infected Fatal Recovered
211098 2020-09-07 Colombia Vichada 14 0 0 14
211099 2020-09-08 Colombia Vichada 14 0 0 14
211100 2020-09-09 Colombia Vichada 14 0 0 14
211101 2020-09-10 Colombia Vichada 14 0 0 14
211102 2020-09-11 Colombia Vichada 14 0 0 14

Depending on the country, both country-level values and region-level values are registered, so correct per-country totals cannot be obtained with jhu_data.cleaned().groupby("Country").sum() (see the sketch below). Therefore, the method JHUData.subset(country, province) is provided to retrieve the data of a specific country or region. The country and region name columns are omitted from its output.
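As a rough illustration of the double counting, a minimal sketch (the placeholder value used for country-level rows is an assumption here):

# Sketch: for some countries the cleaned data contain both a country-level row
# (Province given as a placeholder such as "-") and per-region rows,
# so groupby("Country").sum() would count those records twice.
df = jhu_data.cleaned()
print(df.loc[df["Country"] == "Japan", "Province"].unique())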

# Select by country name only -> (output omitted)
jhu_data.subset(country="Japan")
# ISO3 codes are also accepted as country names -> (output omitted)
jhu_data.subset(country="JPN")
# Select a region name as well (keep the result for plotting below)
subset_df = jhu_data.subset(country="JPN", province="Tokyo")
subset_df.tail()
Date Confirmed Infected Fatal Recovered
172 2020-09-07 21849 2510 372 18967
173 2020-09-08 22019 2470 378 19171
174 2020-09-09 22168 2349 379 19440
175 2020-09-10 22444 2478 379 19587
176 2020-09-11 22631 2439 380 19812

Note: Since this is fourth-hand data (Tokyo Metropolitan Government, national government, domestic volunteer organization, then COVID-19 Data Hub), the figures may differ from those announced by the Tokyo Metropolitan Government.

If you want to create a time-series graph, please use the cs.line_plot() function (we are considering converting it into a class and deprecating the function form).

cs.line_plot(
    subset_df.set_index("Date").drop("Confirmed", axis=1),
    title="Japan/Tokyo: cases over time",
    filename=None,  # set a file name to save the figure to a file
    y_integer=True,  # use integer y-axis ticks instead of x10^n notation
)

(Figure: jhu_data_subset.jpg, the time-series line plot for Japan/Tokyo produced by the code above)

In addition, the method JHUData.total() returns the total values for the whole world, together with rate columns such as the fatality rate per confirmed case.

jhu_data.total().tail()
| Date | Confirmed | Infected | Fatal | Recovered | Fatal per Confirmed | Recovered per Confirmed | Fatal per (Fatal or Recovered) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2020-09-07 | 2.71499e+07 | 8.06515e+06 | 890441 | 1.81943e+07 | 0.0163986 | 0.335071 | 0.0466573 |
| 2020-09-08 | 2.73868e+07 | 8.10302e+06 | 895203 | 1.83886e+07 | 0.0163437 | 0.33572 | 0.0464225 |
| 2020-09-09 | 2.76653e+07 | 8.15167e+06 | 901058 | 1.86126e+07 | 0.016285 | 0.336388 | 0.0461758 |
| 2020-09-10 | 2.7954e+07 | 8.2298e+06 | 906678 | 1.88175e+07 | 0.0162173 | 0.33658 | 0.0459678 |
| 2020-09-11 | 2.79547e+07 | 8.22937e+06 | 906696 | 1.88187e+07 | 0.0162172 | 0.336592 | 0.045966 |
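The rate columns can also be visualized with cs.line_plot(), a minimal sketch (the column names are taken from the table above):

# Plot the world-wide rate columns returned by JHUData.total()
total_df = jhu_data.total()
rate_cols = [
    "Fatal per Confirmed",
    "Recovered per Confirmed",
    "Fatal per (Fatal or Recovered)",
]
cs.line_plot(
    total_df[rate_cols],
    title="Global: fatality and recovery rates over time",
    filename=None,  # set a file name to save the figure
)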
5. DataLoader.population()

This method acquires the "population data of each country/region" on a daily basis.
population_data = data_loader.population()
print(type(population_data))
# -> <class 'covsirphy.cleaning.population.PopulationData'>

PopulationData.cleaned() returns the ISO3 code, country, region, date, and population data. Use PopulationData.value(country, province) to get the value for a specific country/region.

# Get the formatted data in data frame format -> (output omitted)
population_data.cleaned().tail()
# Select by country name only -> int
population_data.value(country="Japan")
# ISO3 codes are also accepted as country names -> int
population_data.value(country="JPN")
# Select a region name as well -> int
population_data.value(country="JPN", province="Tokyo")

Population values can be updated with the PopulationData.update(value, country, province) method.

# Before the update -> 13942856
population_data.value(country="Japan", province="Tokyo")
# Update the value
# https://www.metro.tokyo.lg.jp/tosei/hodohappyo/press/2020/06/11/07.html
population_data.update(14_002_973, "Japan", province="Tokyo")
# After the update -> 14002973
population_data.value("Japan", province="Tokyo")
6. DataLoader.oxcgrt()

This method acquires the "Oxford Covid-19 Government Response Tracker (OxCGRT): data quantifying the status of measures taken by each country against COVID-19" on a daily basis. Please check the link for details of the data. I will introduce how to use this data for analysis in another article; I am still exploring it myself.
oxcgrt_data = data_loader.oxcgrt()
print(type(oxcgrt_data))
# -> <class 'covsirphy.cleaning.oxcgrt.OxCGRTData'>

OxCGRTData.cleaned() returns the ISO3 code, country name, date, and each index. Regional data are not included, and a region name cannot be specified in OxCGRTData.subset(country) either.

# Get the formatted data in data frame format -> (output omitted)
oxcgrt_data.cleaned().tail()
# Only a country name can be selected
oxcgrt_data.subset(country="Japan")
# ISO3 codes are also accepted as country names
oxcgrt_data.subset(country="JPN")
| | Date | School_closing | Workplace_closing | Cancel_events | Gatherings_restrictions | Transport_closing | Stay_home_restrictions | Internal_movement_restrictions | International_movement_restrictions | Information_campaigns | Testing_policy | Contact_tracing | Stringency_index |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 247 | 2020-09-07 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
| 248 | 2020-09-08 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
| 249 | 2020-09-09 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
| 250 | 2020-09-10 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
| 251 | 2020-09-11 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 3 | 2 | 2 | 1 | 30.56 |
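The stringency index can be plotted over time in the same way, a minimal sketch using the Date and Stringency_index columns shown above:

# Plot Japan's OxCGRT stringency index over time
oxcgrt_df = oxcgrt_data.subset(country="Japan")
cs.line_plot(
    oxcgrt_df.set_index("Date")[["Stringency_index"]],
    title="Japan: OxCGRT stringency index over time",
    filename=None,  # set a file name to save the figure
)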

7. Postscript

This time, I explained how to download each dataset using CovsirPhy. I did my best to make the data obtainable with short code, so please give it a try! We welcome your feedback.

Next time, I will write an article explaining the analysis methods using the actual data. In addition to usage examples, I would like to cover the technical background as much as possible. Thank you!

Thank you for your hard work!
