[PYTHON] (Now) I analyzed the new coronavirus (COVID-19)

Introduction

On January 16, 2020, the first case in Japan of the novel coronavirus disease (COVID-19), caused by the virus SARS-CoV-2, was confirmed. Sadly, the disease has since killed many people, from ordinary citizens to celebrities. More than half a year later, the epidemic has not subsided, and a mask is a necessity when going out. In this post, I briefly analyze and summarize coronavirus infection data for Japan. I hope the analysis gives me some new insights and improves my analysis skills.

Data preparation

For this analysis, I used the CSV data published by Jag Japan Co., Ltd. Thank you very much. The link is below.

About "Map of the number of people infected with the new coronavirus"

Screenshot 2020-08-30 12.56.46.png

Environment

Try to analyze

1. Import the required libraries

COVID-19.ipynb


import collections
import matplotlib.pyplot as plt
import pandas as pd

2. Read the CSV file

COVID-19.ipynb


pd.set_option('display.max_columns', None)
df = pd.read_csv('COVID-19.csv')
df

JupyterLab truncates the display when a DataFrame has many columns, so the first line sets the option to show all columns.
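Before analyzing, it can help to check which columns contain missing values. The following is a minimal sketch that uses a small made-up DataFrame in place of COVID-19.csv. The sample rows are hypothetical; only the column names follow the ones used in this post. It also shows `pd.option_context`, which limits the display setting to one block instead of changing it globally.

```python
import pandas as pd

# Hypothetical sample rows standing in for COVID-19.csv
sample = pd.DataFrame({
    'Age': ['20', '30', None, '20'],
    'sex': ['male', 'female', 'male', None],
})

# Apply the display option only inside this block
with pd.option_context('display.max_columns', None):
    print(sample.head())

# Count missing values per column
missing = sample.isna().sum()
print(missing)
```

On the real data, the same `isna().sum()` call on `df` would show how many rows lack an age or gender.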

3. Check the age of the infected person

COVID-19.ipynb


age = df['Age'].value_counts(ascending=True)
age

Execution result


90 or more 1
90s 1
100         2
80s 7
Teen 9
70s 10
60s 12
90         14
50s 25
30s 33
40s 33
80         44
20s 49
10         66
70         69
60        128
40        167
50        179
30        203
20        310
90       1040
Unknown 1145
0-10     1335
80       2645
10       2952
70       3751
60       4531
50       7355
40       8315
30      10551
20      18009
Name: Age, dtype: int64

The notation is not unified: the same age group appears under more than one label (for example, 20 and 20s), so I will unify it with df.replace().

COVID-19.ipynb


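# Map each label variant to a single notation (keys follow the labels shown in the output above)
df = df.replace({'Age': {'0-10': 'under10', 'Teen': '10', '20s': '20', '30s': '30', '40s': '40', '50s': '50', '60s': '60', '70s': '70', '80s': '80', '90s': '90', 'Unknown': 'unknown', '90 or more': '90~'}})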
age2 = df['Age'].value_counts()
age2

Output result


20              18009
30              10551
40               8315
50               7355
60               4531
70               3751
10               2952
80               2645
under10     1335
unknown          1145
90               1040
20                359
30                236
50                204
40                200
60                140
70                 79
10                 75
80                 51
90                 15
100                 2
90~                 1
Name: Age, dtype: int64

The output is now shorter than before. (I wanted to merge the rows that show the same number into a single total and tried various things, but it didn't work, so I'll leave that as a future task.) The table is still a little hard to read, so I'll visualize it with a graph.

COVID-19.ipynb


plt.title('Age of infected person')
age2.plot.bar()

Age of infected person.png

Plotting the data makes it easier to grasp visually. The graph shows that younger age groups, such as people in their 20s, 30s, and 40s, account for more infections. The large number of infected people in their 20s stands out in particular.
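As for the future task mentioned earlier (merging rows that show the same number), one possible approach is to convert every label to a canonical string before counting. This is only a sketch under an assumption: that the duplicates come from the same value being stored in different forms (an integer and a string, a string with stray spaces, or full-width digits). The actual cause in the CSV may differ. The helper name `normalize_label` and the sample values are hypothetical.

```python
import unicodedata
import pandas as pd

def normalize_label(value):
    """Return a canonical string label for one age value."""
    if pd.isna(value):
        return 'unknown'
    # NFKC folds full-width digits into ASCII; strip() drops stray spaces
    return unicodedata.normalize('NFKC', str(value)).strip()

# Hypothetical values: the same group stored as int, str, padded str, full-width
ages = pd.Series([20, '20', ' 20', '２０', 30, '30', 'under10'])

merged = ages.map(normalize_label).value_counts()
print(merged)
```

Applied to the real data, this would be `df['Age'].map(normalize_label).value_counts()`. Note that a float such as 20.0 would become '20.0' and would need extra handling.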

4. Check the number of infected people by gender

COVID-19.ipynb


df = df.replace({'sex':{'male':'male', 'Female':'female', 'unknown':'unknown'}})
sex = df['sex'].value_counts()

#print(sex)  # Uncomment to see the exact number of infected people by gender

# Draw the bars first, then label the axes so pandas does not overwrite the labels
ax = sex.plot.bar()
ax.set_xlabel('Sex')
ax.set_ylabel('Number of people')
ax.set_title('Infected_sex')

Infected_sex.png

The graph shows that more men than women were infected. I don't think infection itself depends on gender, but the purposes and behavior patterns when going out probably differ, so with more detailed data I expect the relationship between gender and the number of infections could be clarified.
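One way to look at gender in a little more detail with the columns already available is to cross-tabulate age group against gender. The sketch below uses hypothetical rows, and only the column names follow this post. `pd.crosstab` is a standard pandas function, but whether the breakdown is meaningful depends on how complete the real data is.

```python
import pandas as pd

# Hypothetical rows; column names follow the ones used in this post
sample = pd.DataFrame({
    'Age': ['20', '20', '30', '30', '30', '40'],
    'sex': ['male', 'female', 'male', 'male', 'female', 'male'],
})

# Rows are age groups, columns are genders, cells are case counts
table = pd.crosstab(sample['Age'], sample['sex'])
print(table)

# Share of each gender within an age group (each row sums to 1)
share = pd.crosstab(sample['Age'], sample['sex'], normalize='index')
print(share.round(2))
```

On the real data, `table.plot.bar()` would draw the age groups with one bar per gender side by side.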

5. Check the increase / decrease in positive cases

COVID-19.ipynb


fixed_date = df['Fixed date']
# Count the number of confirmed cases per date
fixed_date = collections.Counter(fixed_date)
#fixed_date  # Output omitted because it is long

# A Counter keeps insertion order, so dates and counts stay aligned
date = list(fixed_date.keys())
value = list(fixed_date.values())

plt.plot(date, value)
# Show only a few date labels so the x-axis stays readable
plt.xticks([0, 70, 180], rotation=45)

plt.xlabel('date')
plt.ylabel('value')
plt.title('Changes in infected people')

plt.show()

Changes in infected people.png

The graph shows that positive cases were confirmed from January, rose sharply around April, and temporarily subsided, then rose again in July and peaked around August. Plotting the data let me confirm the second wave of the new coronavirus. Toward the end of the graph, the number of confirmed positive cases drops sharply, so I hope that trend continues.
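The same daily counts can also be built with pandas alone, which gives a true date axis and makes smoothing easy. The sketch below uses hypothetical dates and assumes the date strings can be parsed by `pd.to_datetime`. The format in the real CSV may differ and may need an explicit `format` argument.

```python
import pandas as pd

# Hypothetical confirmation dates; the real CSV's date format may differ
sample = pd.DataFrame({
    'Fixed date': ['2020-04-03', '2020-04-01', '2020-04-01',
                   '2020-04-03', '2020-04-03', '2020-04-04'],
})

dates = pd.to_datetime(sample['Fixed date'])

# Count cases per day and put the days in chronological order
daily = dates.value_counts().sort_index()

# Fill in days with no reported cases so the time axis has no gaps
daily = daily.asfreq('D', fill_value=0)

# 3-day moving average (a 7-day window is more common on real data)
smoothed = daily.rolling(window=3).mean()
print(daily)
print(smoothed)
```

Because the index holds real dates, `daily.plot()` spaces the points by calendar time, and the moving average softens day-to-day swings such as weekend reporting dips.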

6. Plot the locations where corona infection was confirmed on the map

The CSV contains X and Y coordinates, so I will plot them on a map. For this step, I referred to this article.

COVID-19.ipynb


# geopandas and descartes must be installed first (run these in a terminal, not in Python):
#   pipenv install geopandas
#   pipenv install descartes
import geopandas as gpd

# Draw the base map data
map_1 = gpd.read_file('./land-master(qiita)/japan.geojson')
map_1.plot(figsize=(10,10), edgecolor='#444', facecolor='white', linewidth = 1);

map_1.png

COVID-19.ipynb


# Overlay the X/Y coordinates from the CSV on the map
map_1.plot(figsize=(10,10), edgecolor='#444', facecolor='white', linewidth = 1);
plt.scatter(df['X'],df['Y'])
plt.show()

Screenshot 2020-08-30 15.49.01.png

Looking closely at the plotted points, they are all clustered in the upper right corner, so let's zoom in.

COVID-19.ipynb


map_1.plot(figsize=(10,10), edgecolor='#444', facecolor='white', linewidth = 1);
plt.xlim([120,150]) # X range to zoom in on (adjust as needed)
plt.ylim([30,46]) # Y range to zoom in on (adjust as needed)
plt.scatter(df['X'],df['Y'])
plt.show()

Screenshot 2020-08-30 15.52.15.png

I was able to confirm that the points were plotted properly. The map shows that the coronavirus has spread nationwide. Besides the Kanto region, which goes without saying, the Kyushu region as a whole also has many infected people. It is very scary to think that there may be a risk of infection wherever you go.
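The impression from the map can be backed up with a rough count of points inside a rectangular area. The following is a minimal sketch, assuming X is longitude and Y is latitude as the zoom range above suggests. The helper `count_in_box`, the sample points, and the bounding boxes are all hypothetical, and the boxes are only rough approximations, not official regional boundaries.

```python
import pandas as pd

def count_in_box(frame, lon_range, lat_range):
    """Count rows whose X (longitude) and Y (latitude) fall inside a box."""
    in_lon = frame['X'].between(*lon_range)
    in_lat = frame['Y'].between(*lat_range)
    return int((in_lon & in_lat).sum())

# Hypothetical points: two near Tokyo, one near Fukuoka, one near Sapporo
sample = pd.DataFrame({
    'X': [139.7, 139.8, 130.4, 141.3],
    'Y': [35.7, 35.6, 33.6, 43.1],
})

# Rough, assumed bounding boxes (not official region boundaries)
kyushu = count_in_box(sample, (129.0, 132.1), (31.0, 34.0))
kanto = count_in_box(sample, (138.4, 140.9), (34.9, 37.2))
print(kyushu, kanto)
```

A rectangle cannot follow a coastline or a prefecture border, so for exact regional totals a spatial join against the prefecture polygons would be more appropriate.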

7. Summary

Since this is my first post on Qiita, I am sure it falls short in places, but I am very happy that I enjoyed both the analysis and writing the article. It is a simple analysis, but plotting coordinates on a map was new to me, and I am glad I was able to try it. In the future, I would like to take on a deeper analysis of the coronavirus. These are difficult times because of the coronavirus, so please take good care of yourself.
