Introduction

This article is not about the national or local government's response to the coronavirus, nor does it aim to spread any particular propaganda. I live in Tokyo, where the number of people infected with the coronavirus is reported to be increasing every day. Recently, the daily count has been reported to exceed 200, with young people making up the majority. However, within my own circle I had not heard that only young people were being infected, so as a data scientist of sorts, I thought this should be checked properly against the published data. It also seemed like a suitable practice exercise for pandas. This time, I will investigate how the number of new coronavirus infections is changing by age group.

Data source

You can download the data here: https://catalog.data.metro.tokyo.lg.jp/dataset/t000010d0000000068/resource/c2d997db-1450-43fa-8037-ebb11ec28d4c The file is in CSV format, and although many columns are empty, I think the data is clean and easy to handle. At the time of writing, it appears to contain data up to 7/9.

Environment

This time, I created the analysis environment on Jupyter, set up with Docker. The setup is fairly rough; please understand that I am simply reusing one I built for another purpose (this analysis does not need all of it).

FROM python:3.8.2
USER root

EXPOSE 9999

ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1

RUN mkdir /code
WORKDIR /code

RUN apt-get update && apt-get -y install locales default-mysql-client && \
    localedef -f UTF-8 -i ja_JP ja_JP.UTF-8
ENV LANG ja_JP.UTF-8
ENV LANGUAGE ja_JP:ja
ENV LC_ALL ja_JP.UTF-8
ENV TZ JST-9
ENV TERM xterm
ADD ./requirements_python.txt /code
RUN pip install --upgrade pip
RUN pip install -r /code/requirements_python.txt
WORKDIR /root
RUN jupyter notebook --generate-config
RUN echo c.NotebookApp.port = 9999 >> ~/.jupyter/jupyter_notebook_config.py
RUN echo c.NotebookApp.token = \'jupyter\' >> ~/.jupyter/jupyter_notebook_config.py
CMD jupyter lab --no-browser --ip=0.0.0.0 --allow-root

`requirements_python.txt`


glob2
json5
jupyterlab
numpy
pandas
pyOpenSSL
scikit-learn
scipy
setuptools
tqdm
urllib3
matplotlib
xlrd

Implementation

The implementation is described below.

1. Imports and loading the data

First, import the required packages and load the file. This time I only use matplotlib and pandas. The publication date column is converted to the datetime type at this point.

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Load the patient data downloaded from the Tokyo open data catalog
df_patient = pd.read_csv('./data/130001_tokyo_covid19_patients.csv')
# Convert the publication date to datetime so it can be used as a time axis
df_patient['Published_date'] = pd.to_datetime(df_patient['Published_date'])
df_patient.head()
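
As a small aside, the date conversion can also be done while reading the file. The sketch below uses a tiny inline CSV with made-up rows (not the real data) and the same column names as above; `sample_csv` and `df_sample` are names introduced only for this illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; the rows are made up for illustration only.
sample_csv = io.StringIO(
    "No,Published_date,patient_Age\n"
    "1,2020-07-08,20s\n"
    "2,2020-07-08,30s\n"
    "3,2020-07-09,20s\n"
)

# parse_dates converts the listed columns to datetime while reading,
# so no separate pd.to_datetime call is needed afterwards.
df_sample = pd.read_csv(sample_csv, parse_dates=["Published_date"])
print(df_sample.dtypes)
```

Either approach should give the same result here; the explicit pd.to_datetime call is arguably easier to follow when reading the notebook top to bottom.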

2. Data retrieval and processing

The data is processed in the following two steps.

Count the number of newly infected people by age group
Take a 7-day moving average to smooth out day-of-week effects such as the number of tests performed.

2-1 Count the number of newly infected people by age group

This is easy to do with groupby. Only the necessary columns are kept at this point as well.

df_patient_day = df_patient.groupby(['Published_date','patient_Age']).count().reset_index()[['Published_date','patient_Age','No']]
df_patient_day
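
For reference, here is a minimal sketch of the same counting step on made-up toy rows, using size() instead of count(). size() counts the rows in each group directly, whereas count() counts the non-null values of each column, so the two agree as long as the `No` column has no missing values. `df_toy` and `df_toy_day` are hypothetical names for this example.

```python
import pandas as pd

# Made-up toy rows with the same column names as the article's DataFrame.
df_toy = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Published_date": pd.to_datetime(
        ["2020-07-08", "2020-07-08", "2020-07-08", "2020-07-09"]
    ),
    "patient_Age": ["20s", "20s", "30s", "20s"],
})

# size() returns the number of rows per (date, age group) pair;
# reset_index(name=...) turns the result into a DataFrame and names the count column.
df_toy_day = (
    df_toy.groupby(["Published_date", "patient_Age"])
    .size()
    .reset_index(name="No")
)
print(df_toy_day)
```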

Also, if the age labels contain Japanese, the characters are garbled in matplotlib (fixing that is a hassle because I am reusing the environment, as mentioned above), so I replace the labels as follows.

# The keys must match the age labels that appear in the CSV exactly.
# Keys containing an apostrophe are written with double quotes.
genes_dict = {
    'Under 10 years old': 'under 10',
    '10s': '10',
    '20s': '20',
    '30s': '30',
    '40s': '40',
    '50s': '50',
    '60s': '60',
    '70s': '70',
    '80s': '80',
    '90s': '90',
    '100 years and over': 'over 100',
    "'-": '-',
    'unknown': 'unknown',
}

df_patient_day['patient_Age'] = [genes_dict[x] for x in df_patient_day['patient_Age'].values.tolist()]
df_patient_day
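
The list comprehension above raises a KeyError if the data contains a label that is missing from the dictionary. As a possible alternative, the sketch below uses Series.map and then checks explicitly for unmapped labels. The dictionary `label_map` and the sample labels are hypothetical stand-ins, not the real values from the CSV.

```python
import pandas as pd

# Hypothetical subset of a label dictionary and some made-up labels.
label_map = {"20s": "20", "30s": "30", "unknown": "unknown"}
ages = pd.Series(["20s", "30s", "20s", "unknown"])

# map() looks each value up in the dictionary; labels without an entry become NaN.
mapped = ages.map(label_map)

# Fail loudly, and show the offending labels, if anything was not covered.
unmapped = ages[mapped.isna()].unique()
assert len(unmapped) == 0, f"unmapped labels: {unmapped}"

print(mapped.tolist())
```

The explicit check keeps the strictness of the original approach while reporting every unknown label at once.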

There is one problem with the result above: if no new infections were reported for a given day and age group, no row exists for that combination, which causes trouble when taking the moving average later. So here I create the direct product of the age groups above × the entire date range and merge it with the DataFrame above (please let me know if you know a better way!).

genes = [
    'under 10', '10', '20', '30', '40', '50', '60',
    '70', '80', '90', 'over 100', '-', 'unknown',
]
days = pd.date_range(start=df_patient['Published_date'].min(), end=df_patient['Published_date'].max(), freq='D')
data = [[x, y] for x in days for y in genes]

df_data = pd.DataFrame(data, columns=['Published_date', 'patient_Age'])
df_data = pd.merge(df_data, df_patient_day, on=['Published_date', 'patient_Age'], how='left').fillna(0)
df_data = df_data.rename(columns={'No':'Number of people'})
df_data
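
One possible alternative to building the direct product by hand is to pivot the counts into a wide table and reindex it over the full date range and the full list of age groups. This is only a sketch on made-up toy data; `df_toy`, `ages`, and `counts` are names introduced for the example.

```python
import pandas as pd

# Made-up toy data: no cases at all on 2020-07-02, and none for age group "40".
df_toy = pd.DataFrame({
    "Published_date": pd.to_datetime(["2020-07-01", "2020-07-01", "2020-07-03"]),
    "patient_Age": ["20", "30", "20"],
})
ages = ["20", "30", "40"]  # every age group that should appear as a column
days = pd.date_range(
    df_toy["Published_date"].min(), df_toy["Published_date"].max(), freq="D"
)

counts = (
    df_toy.groupby(["Published_date", "patient_Age"])
    .size()
    .unstack(fill_value=0)  # rows: dates, columns: age groups seen in the data
    .reindex(index=days, columns=ages, fill_value=0)  # add missing days/groups as 0
)
print(counts)
```

With this wide layout, counts.rolling(7).mean() would compute the moving average for every age group in a single call, at the cost of a different data shape from the long DataFrame used in this article.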

2-2 Take a moving average for each age group

Take a moving average for each age group. pandas makes this easy: for a 7-day window, call rolling(7), and to take the average, call rolling(7).mean(). Since the first 6 days will be NaN, drop them with dropna(). For the later steps, I build one Series per age group and store them in a dictionary. That's all there is to it!

result_diff = {}
for x in genes:
    df = df_data[df_data['patient_Age'] == x]
    df = pd.Series(df['Number of people'].values.tolist(), index=df['Published_date'].values)
    result_diff[x] = df.rolling(7).mean().dropna()
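
To make the behavior of rolling(7).mean() concrete, here is a small sketch on a made-up series of ten daily values. The names `daily`, `avg`, and `avg_clean` are introduced only for this example.

```python
import pandas as pd

# Ten made-up daily counts: 1, 2, ..., 10.
days = pd.date_range("2020-07-01", periods=10, freq="D")
daily = pd.Series(range(1, 11), index=days, dtype="float64")

# Each value is the mean of the current day and the 6 days before it.
# The first 6 entries are NaN because the 7-day window is not yet full.
avg = daily.rolling(7).mean()

# Drop the leading NaN values, as in the article.
avg_clean = avg.dropna()
print(avg_clean)
```

The first value that survives is the one for 2020-07-07, which averages the counts from 07-01 through 07-07.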

3. Visualize

Finally, plot the moving average for each age group.

fig, axe = plt.subplots()
for x in genes:
    df_diff = result_diff[x]
    axe.plot(df_diff.index, df_diff.values, label=x)
    
axe.legend()
axe.set_ylim([0,65])
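
If the chart is meant to be shared, axis labels and a legend title can help readers. The sketch below shows one way to add them, using made-up values in a hypothetical `toy_result` dictionary shaped like `result_diff`. The Agg backend is selected only so the sketch also runs outside Jupyter.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter

import matplotlib.pyplot as plt
import pandas as pd

# Made-up moving-average values for two age groups (illustration only).
days = pd.date_range("2020-07-01", periods=5, freq="D")
toy_result = {
    "20": pd.Series([1.0, 2.0, 4.0, 7.0, 9.0], index=days),
    "30": pd.Series([1.0, 1.5, 2.0, 3.0, 4.0], index=days),
}

fig, ax = plt.subplots()
for label, series in toy_result.items():
    ax.plot(series.index, series.values, label=label)

ax.set_xlabel("Published date")
ax.set_ylabel("New cases (7-day moving average)")
ax.legend(title="Age group")
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```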

Result

Finally, here is the result.

[Figure: スクリーンショット 2020-07-12 15.05.14.png]

For the age groups from the 20s to the 50s, I confirmed that the curves are ordered by age, youngest first. Even so, the rate of increase among people in their twenties is striking. The announcement by Tokyo was not a lie.

Summary

I do not intend to discuss the causes here, but as shown, you can quickly check what is being reported against public data, so why not try it yourself as practice? There is still plenty to investigate, for example by comparing these counts with the actual population distribution, and I think this is good material for practicing data processing.

[PYTHON] Verifying whether coronavirus cases are really increasing rapidly among young people