Visualize the spread of corona in each prefecture
The goal is to draw a heatmap of the cumulative number of infected people in each prefecture.
I use Python and seaborn.
Data on the number of infected people (up to 4/5): https://toyokeizai.net/sp/visual/tko/covid19/
Prefecture name data: https://gist.github.com/mugifly/d6e68a516de4a008687c
Everything is collected in this repository: https://github.com/kyasby/colona.git
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
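%matplotlib inline is a Jupyter (IPython) magic command that displays plots inside the notebook.
numpy is imported with cumsum() in mind, although the code below only calls the cumsum() method of a pandas Series, so the import is not strictly required.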
df = pd.read_csv("COVID-19.csv")
df = df[["Consultation prefecture", "Prefecture of residence", "Number of people", "Date of onset", "Fixed date"]]
df = df.rename(columns={"Consultation prefecture": "hsp", "Prefecture of residence": "house"})
df
Extract the required columns and rename them at the same time. Passing a list of column names to df[] extracts only those columns.
df.rename(columns={"old_columns_name":"new_name"},index={"old_index_name":"new_name"})
Passing dictionaries like this changes the column names and index names.
for i, row in df.iterrows():
    # NaN is a float, while real dates are read in as strings
    if type(row["Date of onset"]) == float:
        df.at[i, "Date of onset"] = row["Fixed date"]
df = df.rename(columns = {"Date of onset":"Date"})
I want the horizontal axis of the heatmap to be the date, so every row needs a date. However, as shown below, "Date of onset" contains NaN, so in that case I replace it with "Fixed date".
hsp | house | Number of people | Date of onset | Fixed date |
---|---|---|---|---|
Kanagawa Prefecture | Kanagawa Prefecture | 1 | 1/3/2020 | 1/15/2020 |
Tokyo | People's Republic of China | 1 | 1/14/2020 | 1/24/2020 |
Tokyo | People's Republic of China | 1 | 1/21/2020 | 1/25/2020 |
Osaka | Osaka | 1 | 1/20/2020 | 1/29/2020 |
unknown | People's Republic of China | 1 | 1/29/2020 | 1/30/2020 |
Chiba | People's Republic of China | 1 | NaN | 1/30/2020 |
Finally, change the column name to "Date".
Checking for NaN
type(row["Date of onset"]) == float
This check works because NaN is a float, while the dates are read in as strings. I wrote it like this, but please let me know if there is a better way to write it.
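One possible alternative to the row-by-row loop, as a minimal sketch: pandas can detect and fill the missing values in one step with Series.isna() and Series.fillna(). The small frame below is made up for illustration and only borrows the column names used above; it is not the real data.

```python
import numpy as np
import pandas as pd

# Made-up sample rows that reuse the column names from this article
sample = pd.DataFrame({
    "Date of onset": ["1/29/2020", np.nan, "1/20/2020"],
    "Fixed date": ["1/30/2020", "1/30/2020", "1/29/2020"],
})

# isna() marks missing values without checking the Python type
missing = sample["Date of onset"].isna()

# fillna() with a Series fills each NaN from the same row of "Fixed date"
sample["Date of onset"] = sample["Date of onset"].fillna(sample["Fixed date"])
sample = sample.rename(columns={"Date of onset": "Date"})
print(sample)
```

Because the fill is aligned on the row index, each missing onset date takes the confirmed date from its own row, which is the same result the loop produces.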
jpn = pd.read_csv("japan.csv", header=None)
todofuken = jpn[0]
df["hsp"].value_counts()
Checking the contents of "hsp" with value_counts() shows values that are not prefectures, such as "Haneda Airport" and "Unknown".
df["hsp"]= df["hsp"].apply(lambda x : "Other" if x not in list(todofuken) else x)
I use apply with a lambda function to partially rewrite the contents of df["hsp"]: if the value is not in the prefecture name list it becomes "Other", otherwise the prefecture name is kept as it is. Note that the conditional expression inside the lambda needs its else branch; Python raises a syntax error if else is omitted.
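The same replacement can also be written without apply. The following is a sketch only, using Series.isin() and Series.where(); the two short Series are hypothetical stand-ins for the prefecture list and the "hsp" column.

```python
import pandas as pd

# Hypothetical stand-ins for the prefecture list and the "hsp" column
todofuken = pd.Series(["Tokyo", "Osaka", "Chiba"])
hsp = pd.Series(["Tokyo", "Haneda Airport", "Osaka", "Unknown"])

# where() keeps values where the condition is True and
# replaces all other values with "Other"
cleaned = hsp.where(hsp.isin(todofuken), "Other")
print(cleaned.tolist())
```

This avoids rebuilding list(todofuken) for every row, which may matter on larger data.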
So far, df looks like this.
hsp | house | Number of people | Date | Fixed date |
---|---|---|---|---|
Kanagawa Prefecture | Kanagawa Prefecture | 1 | 1/3/2020 | 1/15/2020 |
Tokyo | People's Republic of China | 1 | 1/14/2020 | 1/24/2020 |
Tokyo | People's Republic of China | 1 | 1/21/2020 | 1/25/2020 |
Aichi prefecture | People's Republic of China | 1 | 1/23/2020 | 1/26/2020 |
Aichi prefecture | People's Republic of China | 1 | 1/22/2020 | 1/28/2020 |
Nara Prefecture | Nara Prefecture | 1 | 1/14/2020 | 1/28/2020 |
Hokkaido | People's Republic of China | 1 | 1/26/2020 | 1/28/2020 |
Osaka | Osaka | 1 | 1/20/2020 | 1/29/2020 |
pvt = df.pivot_table(index="hsp", columns="Date", values="Number of people", aggfunc="sum").fillna(0)
pvt = pvt.rename(index = dict(zip(jpn[0], jpn[2]))).rename(index={"Other":"others"})
pandas has a method called pivot_table that literally creates a pivot table. (You don't need Excel.) Combinations of prefecture and date with no cases become NaN, so I fill them with 0.
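One detail worth checking is how pivot_table aggregates rows that share the same prefecture and date. By default it takes the mean (aggfunc="mean"), so a per-day head count needs aggfunc="sum". A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows: two cases in Tokyo on the same day, one in Osaka
sample = pd.DataFrame({
    "hsp": ["Tokyo", "Tokyo", "Osaka"],
    "Date": ["1/24/2020", "1/24/2020", "1/29/2020"],
    "Number of people": [1, 1, 1],
})

# Default aggregation is the mean, so the two Tokyo rows average to 1
by_mean = sample.pivot_table(
    index="hsp", columns="Date", values="Number of people"
).fillna(0)

# aggfunc="sum" adds the rows up, giving 2 people for Tokyo on that day
by_sum = sample.pivot_table(
    index="hsp", columns="Date", values="Number of people", aggfunc="sum"
).fillna(0)

print(by_mean)
print(by_sum)
```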
Then I rename the index from kanji names such as 北海道 to Roman letters such as hokkaido. In my environment, Japanese characters in index names or column names are not displayed in the plot. It seems this can be solved by installing something, but I work around it by renaming. (Please let me know if there is a better way.)
jpn[0] contains the names of the prefectures in kanji, such as 北海道 and 青森県. jpn[2] contains the names of the prefectures in Roman letters, such as hokkaido and aomori.
Pair them with the zip function, make them into a dictionary with the dict function, and pass the dictionary to the rename function.
Also, change "Other" to "others".
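The zip, dict, and rename steps can be sketched on a tiny example. The two-row jpn frame and the one-column pvt below are hypothetical stand-ins for japan.csv and the real pivot table.

```python
import pandas as pd

# Hypothetical stand-in for japan.csv: column 0 is kanji, column 2 is Roman letters
jpn = pd.DataFrame({0: ["北海道", "青森県"], 2: ["hokkaido", "aomori"]})

# zip pairs the two columns element by element; dict turns the pairs into a mapping
mapping = dict(zip(jpn[0], jpn[2]))

# Hypothetical pivot table whose index still uses the kanji names
pvt = pd.DataFrame({"1/24/2020": [1, 2, 3]}, index=["北海道", "青森県", "Other"])

# rename() replaces only the index labels found in the dictionary
pvt = pvt.rename(index=mapping).rename(index={"Other": "others"})
print(list(pvt.index))
```

Labels that are missing from the dictionary are left unchanged, which is why "Other" needs its own rename.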
Up to this point, the data frame looks like this.
for i in range(len(pvt)):
    pvt.iloc[i] = pvt.iloc[i].cumsum()
Extract pvt one row at a time and apply cumsum() to turn the daily counts into the cumulative number of people. The cumsum() called here is the method of the pandas Series.
In this way, it has been updated to the cumulative number of people.
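A cumulative sum across columns only makes sense if the columns are in chronological order. Assuming the date labels are M/D/YYYY strings as in the tables above, they sort as text ("1/14/2020" comes before "1/3/2020"), so one option is to parse them as dates first. The sketch below uses a made-up pivot table and also replaces the loop with a single cumsum(axis=1) call.

```python
import pandas as pd

# Made-up daily counts; the column labels are date strings
pvt = pd.DataFrame(
    [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]],
    index=["tokyo", "osaka"],
    columns=["1/3/2020", "1/14/2020", "2/1/2020"],
)

# As plain text, the labels do not sort chronologically
text_order = sorted(pvt.columns)

# Parse the labels as dates and sort the columns chronologically
pvt.columns = pd.to_datetime(pvt.columns, format="%m/%d/%Y")
pvt = pvt.sort_index(axis=1)

# axis=1 accumulates along each row, that is, across the date columns
cumulative = pvt.cumsum(axis=1)
print(cumulative)
```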
plt.figure(figsize=(20,10))
sns.heatmap(pvt.iloc[:,-60:] , linewidths=0, cmap='Spectral', cbar=True, xticklabels=5)
plt.savefig("colona.png")
I decided to display only the most recent 60 days, because the earlier dates have few infected people (fortunately) and displaying them is not meaningful.
You can select a range with : (a slice). For example, 10:20 means 10 or more and less than 20, and -60: means the last 60 elements.
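As a small illustration of slicing, here is a sketch on a plain list and on a made-up one-row DataFrame with 100 columns:

```python
import pandas as pd

values = list(range(100))

# 10:20 selects positions 10 through 19 (the end is excluded)
middle = values[10:20]

# -60: selects the last 60 elements, the same as 40: for a list of 100
last_60 = values[-60:]

# With iloc, the first slice picks rows and the second picks columns
frame = pd.DataFrame([list(range(100))])
recent = frame.iloc[:, -60:]
print(recent.shape)
```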
I was able to display the heat map like this.