[PYTHON] Analyzing the age-specific severity of coronavirus

There is a lot of information about the coronavirus circulating these days, but it is hard to judge whether a given claim is correct when you don't know who posted it.

**To get truly correct information, you should analyze the primary information yourself as much as possible.** In this article, we will compare the severity rates by age group using the Positive Patient Attribute Data published by Hokkaido.

Loading the data

Use Python's Pandas for analysis. First, import Pandas.

import pandas as pd

Then load the data.

df = pd.read_csv("https://www.harp.lg.jp/opendata/dataset/1369/resource/3132/010006_hokkaido_covid19_patients.csv", encoding="shift-jis")
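The `encoding="shift-jis"` argument matters because the file is not UTF-8. The following is a minimal sketch of that point using an in-memory sample instead of the real file; the two sample rows are made up for illustration, and only the column names are taken from this article.

```python
import io
import pandas as pd

# Hypothetical two-row sample (made-up values); column names mirror this article
sample_text = "patient_Age,patient_Status\n20代,軽症\n70代,重症\n"
raw_bytes = sample_text.encode("shift-jis")

# Reading with the matching encoding recovers the Japanese labels
sample_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="shift-jis")
print(sample_df)

# Reading the same bytes as UTF-8 fails to decode.
# UnicodeDecodeError is a subclass of ValueError, so ValueError is caught here.
try:
    pd.read_csv(io.BytesIO(raw_bytes), encoding="utf-8")
    utf8_ok = True
except ValueError:
    utf8_ok = False
print("UTF-8 read succeeded:", utf8_ok)
```

If a downloaded CSV raises a decoding error, checking the encoding is usually the first thing to try.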

You can check the read data with the head method.

df.head()

table1.png

Categorization of age and severity

We want to compare severity across age groups. First, let's see how the data is currently categorized.

Let's start with the age.

df["patient_Age"].value_counts()
Undisclosed 231
20s 223
70s 219
60s 202
50s 193
40s 176
80s 163
30s 157
90s 75
Teen 33
Less than 10 16
100s 5
Under 10 years old 4
Elderly 1
Name: patient_Age, dtype: int64

Most values are decade bands ("20s", "30s", and so on), but the notation is inconsistent ("Less than 10" and "Under 10 years old") and some ages are unknown ("Elderly" and "Undisclosed").

Because the existing categories are hard to analyze as they are, we define our own categories and assign one to each row.

First, define which new category each original label maps to.

# Keys are the labels shown by value_counts() above; values are the new categories
age_dict = {
    "Less than 10": "Teens and younger",
    "Under 10 years old": "Teens and younger",
    "Teen": "Teens and younger",
    "20s": "20s",
    "30s": "30s",
    "40s": "40s",
    "50s": "50s",
    "60s": "60s",
    "70s": "70s",
    "80s": "80s",
    "90s": "90s and over",
    "100s": "90s and over",
    "Undisclosed": "unknown",
    "Elderly": "unknown"
}

Then add a new category column to the DataFrame.

df["Age category"] = [age_dict[key] for key in df["patient_Age"]]
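The list comprehension above raises a `KeyError` as soon as a label appears that is not in the dictionary, which can happen when the published file is updated. As a hedged alternative (not the approach used in this article), `Series.map` combined with `fillna` sends unlisted labels to "unknown" instead. The sample labels below are hypothetical, and "Unexpected label" stands in for any value missing from the mapping.

```python
import pandas as pd

# Hypothetical labels; "Unexpected label" is not a key of the mapping
ages = pd.Series(["20s", "Teen", "Unexpected label", "100s"], name="patient_Age")

# A shortened mapping, for illustration only
age_map = {"Teen": "Teens and younger", "20s": "20s", "100s": "90s and over"}

# Series.map returns NaN for labels missing from the dict;
# fillna then assigns those rows to "unknown"
age_category = ages.map(age_map).fillna("unknown")
print(age_category.tolist())
```

The trade-off is that silently falling back to "unknown" can hide new labels, so it is worth checking the "unknown" count after each update.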

The number of severe cases will be counted using the age categories defined here.

In the same way, check the original labels for patient status and define new categories for them.

df["patient_Status"].value_counts()
Mild conversation possible 1004
−              108
Undisclosed 102
Asymptomatic 102
Asymptomatic conversation possible 97
Mild 88
Mild, conversation possible 54
Moderate conversation possible 35
Mild / conversation possible 30
Moderate 29
Severe 13
Severe conversation not possible 9
Rest on bed, conversation possible 7
Asymptomatic, conversation possible 5
Serious injury: No conversation 3
Positive after death 2
Moderate conversation not possible 2
No symptom, conversation possible 2
Rest on the bed, conversation possible 1
Negative confirmed 1
Mild high fever 1
Under investigation 1
Degree of communication 1
Moderate / conversation possible 1
Name: patient_Status, dtype: int64

# Keys are the labels shown by value_counts() above; values are the new categories
stat_dict = {
    "Severe": "3.Severe",
    "Severe conversation not possible": "3.Severe",
    "Serious injury: No conversation": "3.Severe",
    "Moderate conversation possible": "2.Moderate",
    "Moderate": "2.Moderate",
    "Moderate conversation not possible": "2.Moderate",
    "Moderate / conversation possible": "2.Moderate",
    "Mild conversation possible": "1.Mild",
    "Mild": "1.Mild",
    "Mild, conversation possible": "1.Mild",
    "Mild / conversation possible": "1.Mild",
    "Mild high fever": "1.Mild",
    "Asymptomatic conversation possible": "0.No symptoms",
    "Asymptomatic": "0.No symptoms",
    "Asymptomatic, conversation possible": "0.No symptoms",
    "No symptom, conversation possible": "0.No symptoms",
    "−": "unknown",
    "Undisclosed": "unknown",
    "Rest on bed, conversation possible": "unknown",
    "Positive after death": "unknown",
    "Degree of communication": "unknown",
    "Negative confirmed": "unknown",
    "Under investigation": "unknown",
    "Rest on the bed, conversation possible": "unknown"
}
df["State category"] = [stat_dict[key] for key in df["patient_Status"]]
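With this many free-text status labels, it is easy to miss one in the dictionary. A small sketch of a check that lists any labels not yet covered by a mapping is shown below; the series and the shortened mapping are hypothetical stand-ins for `df["patient_Status"]` and `stat_dict`.

```python
import pandas as pd

# Hypothetical stand-ins for df["patient_Status"] and stat_dict
statuses = pd.Series(["Mild", "Severe", "Mild", "Under investigation"])
status_map = {"Mild": "1.Mild", "Severe": "3.Severe"}

# Labels present in the data but missing from the mapping keys
unmapped = sorted(set(statuses.dropna().unique()) - set(status_map))
print(unmapped)
```

An empty list means every label in the data has a category; anything else should be added to the dictionary before assigning the new column.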

This completes the assignment of age and state categories. You can check how it was actually assigned with the head method.

df.head()

table2.png

Aggregation of the number of severely ill persons by age

Now that the categories are ready, let's count the number of patients in each state category for each age category. We use pandas' crosstab function for the aggregation.

# Enable Japanese fonts in matplotlib.
# Install the package first from a shell: pip install japanize-matplotlib
import japanize_matplotlib
import seaborn as sns
sns.set(font="IPAexGothic")

pd.crosstab(df["Age category"], df["State category"]).apply(
    lambda x: x/sum(x), axis=1
).plot(
    kind="bar",
    logy=True,
    rot=45,
    figsize=(8,4),
    color=["grey", "grey", "orange", "red", "grey"]
).legend(loc="upper left")

crosstab.png

Because moderate and severe cases make up only a small share overall (less than 10%), the y-axis is shown on a logarithmic scale.
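The `apply(lambda x: x/sum(x), axis=1)` step in the plotting code converts each row of counts into shares that sum to 1 within an age category. The sketch below shows this on a made-up toy table and compares it with the `normalize="index"` option of `pd.crosstab`, which should give the same result; the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up toy data using the category names from this article
toy = pd.DataFrame({
    "Age category": ["20s", "20s", "20s", "20s", "80s", "80s"],
    "State category": ["1.Mild", "1.Mild", "1.Mild", "0.No symptoms",
                       "1.Mild", "3.Severe"],
})

# Row-wise normalization as written in the article
manual = pd.crosstab(toy["Age category"], toy["State category"]).apply(
    lambda x: x / sum(x), axis=1
)

# The same normalization using crosstab's built-in option
builtin = pd.crosstab(
    toy["Age category"], toy["State category"], normalize="index"
)

print(manual)
print(builtin)
```

Each row sums to 1, so bars for different age categories are comparable even when the categories contain very different numbers of patients.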

It is often said that coronavirus remains mild in young people and tends to become severe in elderly people, but when actually aggregated, this tendency is certainly seen.

**There are almost no moderate or severe cases up to the 30s, and the proportion of severe cases clearly rises with age from the 40s to the 80s.**
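To read this trend as numbers rather than from the chart, one option is to compute the share of moderate-or-severe cases per age category alongside the case counts. The sketch below is an illustration on made-up rows, not the article's result; excluding "unknown" rows from the denominator is an assumption of this sketch and differs from the chart above, which keeps them.

```python
import pandas as pd

# Made-up toy rows using the category names from this article
toy = pd.DataFrame({
    "Age category": ["30s", "30s", "30s", "30s", "80s", "80s", "80s", "80s"],
    "State category": ["1.Mild", "1.Mild", "1.Mild", "unknown",
                       "1.Mild", "2.Moderate", "3.Severe", "3.Severe"],
})

# Assumption of this sketch: drop rows whose state is unknown
known = toy[toy["State category"] != "unknown"]

# 1 if the case is moderate or severe, otherwise 0
is_serious = known["State category"].isin(["2.Moderate", "3.Severe"]).astype(int)

# Per age category: number of serious cases, number of known cases, and their ratio
summary = is_serious.groupby(known["Age category"]).agg(["sum", "count", "mean"])
summary.columns = ["serious cases", "known cases", "serious share"]
print(summary)
```

Showing the counts next to the share makes it easier to see when a rate rests on only a handful of patients.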

Summary

Using the attribute data for coronavirus-positive patients in Hokkaido, we confirmed that the share of severe cases rises with age.

In this way, you can obtain more accurate knowledge by **analyzing the primary data published by the national and prefectural governments yourself**.

Open data is not always correct, but why not try the method introduced here as one way to get information that is as accurate as possible, quickly?

That's all.
