I tried to get and analyze the statistical data of the new coronavirus with Python: Johns Hopkins University data

Introduction

I'm posting to Qiita for the first time in a while. Recently I started analyzing statistical data on the new coronavirus pandemic (as a personal lifework, not as a job), and I have posted some articles on my blog:

-Situation of each country from the viewpoint of the lifesaving rate of the new corona: What is the policy taken by that developed country with an extremely low lifesaving rate? │ YUUKOU's experience value
-[Understanding the transition of the lifesaving rate of the new Corona: US strength, critical UK, Netherlands, China whose transition is too beautiful │ YUUKOU's experience value](https://yuukou-exp.plus/covid19-rescue-ratio-timeline-analysis-20200401/)

For example, one result of the analysis is a chart plotting the time-series transition of the lifesaving rate. (The counting criteria for infected people differ from country to country, but the data suggests that Japan's medical care is among the best in the world.)

(Figure: covid19_rescue_ratio_japan_europe_us_20200401.png)

This time, I would like to share the preparation code for analyzing the new coronavirus statistics published by Johns Hopkins University.

-Public data of Johns Hopkins University

With this code, you'll be able to generate a data frame for the new coronavirus statistics and be ready to work on your data analysis.

I hope it will be of some small help if you use it.

Data download & processing

Johns Hopkins University publishes statistical data on new coronavirus infections worldwide (in time-series form, no less!) on GitHub. --Repost: Public data from Johns Hopkins University

The overall flow is to fetch the data with `urllib` and then process it. The statistics published by Johns Hopkins University cover three things: `confirmed`, `deaths`, and `recovered`. There are also records at a granularity down to the regional level within each country. This time, we will aggregate by country and analyze.

However, there is one caveat. Even though the data is a time series, each date gets its own column, with dozens of them lined up in the column direction, so we have to convert it into an easier-to-use structure.

For example, this is the data frame for the number of confirmed infections. You can see the date-like columns lined up. (Figure: timelined_df_sample_jhuniv_20200406.png)

By transforming the time-series columns into rows and aggregating by country, we can settle into an orthodox data frame that is easy to handle.
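As an aside, pandas can perform exactly this wide-to-long conversion in one call with `melt`. Here is a minimal, self-contained sketch (my own addition, assuming the JHU CSV layout in which the first four columns are fixed and every remaining column is a date):

import pandas as pd

# Minimal sketch: wide-to-long conversion of the JHU time-series CSV with melt.
# The first four columns are fixed metadata; all remaining columns are dates.
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
df_wide = pd.read_csv(url)
df_long = df_wide.melt(
    id_vars=['Province/State', 'Country/Region', 'Lat', 'Long'],
    var_name='date',
    value_name='confirmed')

This article uses an explicit loop instead, which keeps each transformation step visible.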

This time, I implemented it in a Jupyter Notebook, so I think it will work if you paste the code from this entry as-is and execute it in order from the top.

Crawler class implementation

Define a crawler class. The name says exactly what it does. It seems likely to be reused in other notebooks, so I made it a class for the time being.

import urllib
import json
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import io
from dateutil.parser import parse
from tqdm import tqdm, tqdm_notebook

class Crawler():

  def __init__(self):
    """
    Crawler class
    """
    self._ua = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '\
      'AppleWebKit/537.36 (KHTML, like Gecko) '\
      'Chrome/55.0.2883.95 Safari/537.36 '

  def fetch(self, url):
    """
    Specify the URL and execute the HTTP request.

    :param url: URL to fetch
    :return: request result (response object)
    """
    req = urllib.request.Request(url, headers={'User-Agent': self._ua})
    return urllib.request.urlopen(req)

Various settings

Declare a crawler instance and define the URLs of the data sources.

#Crawler instance
cr = Crawler()

#Time-series data of confirmed infections
url_infection = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'

#Time-series data of deaths
url_deaths = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'

#Time-series data of recoveries
url_recover = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_recovered_global.csv'

Get each data source

Crawl the three data sources and convert each into a data frame.

url_map = {'infection': url_infection,
           'deaths': url_deaths,
           'recover': url_recover}
df_house = {}

for _k, _url in url_map.items():
    _body_csv = cr.fetch(_url)
    df_house[_k] = pd.read_csv(_body_csv)

`df_house` is a dictionary that stores the three data frames. The contents are as follows.

--Data frame of confirmed infections (Figure: notebook_confirm_df_20200406.png)

--Data frame of deaths (Figure: notebook_confirm_df_deaths_20200406.png)

--Data frame of recoveries (Figure: notebook_confirm_df_recovered_20200406.png)
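For a quick sanity check, you can peek at one of them (a small snippet of my own; 'infection' is the key defined earlier):

# Quick look at the first rows of the confirmed-infections data frame
df_house['infection'].head()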

Table structure conversion

Preparing a function to convert to date type

The time-series columns have a format like 3/27/20, which cannot be converted reliably as-is with Python's dateutil.parser.parse. It's a bit crude, but let's first write a function that converts them to the standard YYYY-mm-dd format.

def transform_date(s):
    """
    Convert a date like '3/15/20' to 'YYYY-mm-dd' format, e.g. '2020-03-15'
    """
    _chunk = str(s).split('/')
    return '20{year}-{month:02d}-{day:02d}'.format(year=_chunk[2], month=int(_chunk[0]), day=int(_chunk[1]))
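Here is a usage example, plus an alternative sketch with pandas added for reference, assuming every column shares the m/d/yy layout:

# Usage example of the helper above
transform_date('3/27/20')  # -> '2020-03-27'

# Alternative sketch: pandas with an explicit format string
pd.to_datetime('3/27/20', format='%m/%d/%y').strftime('%Y-%m-%d')  # -> '2020-03-27'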

Convert each data frame

Convert the time-series columns into rows in each of the three data frames, gathering the dates into a single column named date.

df_buffer_house = {}
for _k, _df in df_house.items():
    # Buffers for the long-format records: one row per (region, date)
    df_buffer_house[_k] = {'Province/State': [],
                           'Country/Region': [],
                           'date': [],
                           _k: []}
    # Everything after the fourth column is a date column
    _col_dates = _df.columns[4:]
    for _k_date in tqdm(_col_dates):
        for _idx, _r in _df.iterrows():
            df_buffer_house[_k]['Province/State'].append(_r['Province/State'])
            df_buffer_house[_k]['Country/Region'].append(_r['Country/Region'])
            df_buffer_house[_k]['date'].append(transform_date(_k_date))
            df_buffer_house[_k][_k].append(_r[_k_date])

When executed on Jupyter Notebook, the conversion will proceed while displaying the progress bar as shown below.

100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 12.37it/s]
100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 12.89it/s]
100%|██████████████████████████████████████████| 72/72 [00:05<00:00, 13.27it/s]

The structure of the three data frames is now much better, so all that remains is to combine them. But there is a caveat.

In the infection counts (`infection`) and death counts (`deaths`), multiple Province/State rows are recorded per country, but the recovery counts (`recover`) are sometimes recorded as a single country-level row. (Example: Canada)

Therefore, each data frame needs to be aggregated by country before combining them.

df_integrated = pd.DataFrame()
col_integrated = ['Country/Region', 'date']
df_chunk = {}
for _k, _df_dict in df_buffer_house.items():
    _df_raw = pd.DataFrame.from_dict(_df_dict)
    # Aggregate by 'Country/Region' and date
    _df_grouped_buffer = {'Country/Region':[], 'date':[] , _k:[]}
    for _idx, _grp in tqdm(_df_raw.groupby(col_integrated)):
        _df_grouped_buffer['Country/Region'].append(_idx[0])
        _df_grouped_buffer['date'].append(_idx[1])
        _df_grouped_buffer[_k].append(_grp[_k].sum())
    df_chunk[_k] = pd.DataFrame.from_dict(_df_grouped_buffer)    
    
df_integrated = df_chunk['infection'].merge(df_chunk['deaths'], on=col_integrated, how='outer')
df_integrated = df_integrated.merge(df_chunk['recover'], on=col_integrated, how='left')

Running it produces progress bars like these:

100%|██████████████████████████████████| 13032/13032 [00:08<00:00, 1621.81it/s]
100%|██████████████████████████████████| 13032/13032 [00:08<00:00, 1599.91it/s]
100%|██████████████████████████████████| 13032/13032 [00:07<00:00, 1647.02it/s]

Operation check

Let's check whether the Canada mentioned in the earlier example has been converted into proper data. (Figure: notebook_confirm_df_integrated_20200406.png)

Looks okay! There was no sign of large numbers of missing (NaN) records, and we could confirm that the numbers change in chronological order!

Analysis example using converted statistical data

I would like to introduce an example of analysis code using the statistical data of the new coronavirus obtained by this conversion.

Calculation of lifesaving rate and infection termination

Calculation of lifesaving rate

Here, the lifesaving rate is defined as the ratio of the number of patients who have been cured (Total Recovered Cases) to the number of patients whose treatment has concluded (Closed Cases, i.e., recovered plus deaths).

Rescue\ Ratio = \frac{Total\ Recovered}{Closed\ Cases}
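For instance, with 80 recovered patients and 20 deaths (hypothetical numbers), Closed Cases = 80 + 20 = 100, so the lifesaving rate is 80 / 100 = 0.8.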

Calculation of infection termination

This is a number that shows how close the infection in each country is to winding down. It is the ratio of patients whose treatment has concluded to the cumulative number of infected people.

Phase\ Position = \frac{Closed\ Cases}{Total\ Cases}

Phase Position takes a value between 0.0 and 1.0. The closer it is to 0.0, the earlier the infection phase; the closer it is to 1.0, the nearer the infection is to its final stage.
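For instance, with 1,000 cumulative infections of which 150 cases are closed (hypothetical numbers), the phase position is 150 / 1000 = 0.15, which is still an early phase.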

Calculation code example

df_grouped = df_integrated
df_grouped['date'] = pd.to_datetime(df_grouped['date'])

#Calculation of lifesaving rate
df_grouped['rescue_ratio'] = df_grouped['recover']/(df_grouped['recover'] + df_grouped['deaths'])
df_grouped['rescue_ratio'] = df_grouped['rescue_ratio'].fillna(0)

#Calculation of infection termination
#Closed cases = patients cured + patients who died
df_grouped['phase_position'] = (df_grouped['recover'] + df_grouped['deaths'])/df_grouped['infection']

Confirmation of calculation result

Let's check the calculation results using the United States as an example. The following data frame is displayed. (Figure: notebook_code_sample_result_rescue_ratio_20200406.png)
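To pull up those rows yourself, a one-liner like this should work (my own snippet; note that the JHU data labels the country simply 'US'):

# Show the most recent rows for the United States
df_grouped[df_grouped['Country/Region'] == 'US'].tail(10)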

The United States is still in an early phase of infection, and although the lifesaving rate is picking up, you can see that the situation remains severe.

Summary and introduction of analysis entries

So, I have introduced the preparation code for analyzing the statistical data of the new coronavirus. The Johns Hopkins University statistics are one of the data sources attracting the most attention in the world right now, so I hope you will try out various analytical ideas and actively share what you learn!

With that, let me conclude by introducing the new-corona analysis entries I have written.

-Situation of each country from the viewpoint of the lifesaving rate of the new corona: What is the policy taken by that developed country with an extremely low lifesaving rate? │ YUUKOU's experience value
-[Understanding the transition of the lifesaving rate of the new Corona: US strength, critical UK, Netherlands, China whose transition is too beautiful │ YUUKOU's experience value](https://yuukou-exp.plus/covid19-rescue-ratio-timeline-analysis-20200401/)
-Results of quantifying the infection phases of the new coronavirus and countries around the world: The United States is dangerous │ YUUKOU's experience value
