[Python] Verifying whether COVID-19 cases are really increasing rapidly among young people


Data source

You can download the data here: https://catalog.data.metro.tokyo.lg.jp/dataset/t000010d0000000068/resource/c2d997db-1450-43fa-8037-ebb11ec28d4c The file is in CSV format, and although many of its columns are empty, I think the data is clean and easy to handle. At the time of writing, it appears to contain data up to 7/9.
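Since many columns are empty, it can help to list and drop them before doing anything else. The following is a minimal sketch on a toy CSV, not the real file: the columns `note_a` and `note_b` are made up for illustration, and the other column names follow the ones used later in this article.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; note_a and note_b are hypothetical
# columns that contain no values at all.
csv_text = """No,Published_date,patient_Age,note_a,note_b
1,2020-07-01,20s,,
2,2020-07-01,30s,,
3,2020-07-02,20s,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Columns in which every value is missing carry no information.
empty_columns = df.columns[df.isna().all()].tolist()

# how="all" drops a column only when all of its values are NaN.
df_clean = df.dropna(axis=1, how="all")

print(empty_columns)
print(df_clean.columns.tolist())
```

Applied to the real file, the same two calls should show which columns can be ignored safely.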


This time, I built the analysis environment on Jupyter running in Docker. The setup is rough: I am simply reusing a Dockerfile from another project, so it installs more than this analysis actually needs.

FROM python:3.8.2
USER root



RUN mkdir /code

RUN apt-get update && apt-get -y install locales default-mysql-client && \
    localedef -f UTF-8 -i ja_JP ja_JP.UTF-8
ENV TERM xterm
ADD ./requirements_python.txt /code
RUN pip install --upgrade pip
RUN pip install -r /code/requirements_python.txt
RUN jupyter notebook --generate-config
RUN echo c.NotebookApp.port = 9999 >> ~/.jupyter/jupyter_notebook_config.py
RUN echo c.NotebookApp.token = \'jupyter\' >> ~/.jupyter/jupyter_notebook_config.py
CMD jupyter lab --no-browser --ip= --allow-root




The implementation is described below.

1. Imports and data loading

First, load the required packages and the data file. This time I use only matplotlib and pandas. The publication date column is converted to the datetime type at this point.

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

%matplotlib inline
df_patient = pd.read_csv('./data/130001_tokyo_covid19_patients.csv')
df_patient['Published_date'] = pd.to_datetime(df_patient['Published_date'])
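As an alternative to converting after loading, `pd.read_csv` can parse the date column while reading through its `parse_dates` parameter. A small sketch on toy data (the CSV text below is made up, and `df_toy` is a name used only for this example):

```python
import io
import pandas as pd

# Toy CSV with the same column names as used in this article.
csv_text = """No,Published_date,patient_Age
1,2020-07-01,20s
2,2020-07-02,30s
"""

# parse_dates converts the listed columns to datetime64 while reading.
df_toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Published_date"])

print(df_toy.dtypes)
```

Either approach gives a datetime column, which is what `pd.date_range` and the date-indexed plots later on rely on.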

2. Data retrieval and processing

The data is processed as follows.

  1. Count the number of new cases per day by age group.
  2. Take a 7-day moving average to smooth out day-of-week effects, such as differences in the number of tests performed.
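To see why a 7-day window removes day-of-week effects, here is a small sketch on synthetic data (the numbers are invented for illustration and are not real case counts). When the same weekly pattern repeats, every 7-day window contains each weekday exactly once, so the average stays flat even though the raw counts swing widely.

```python
import pandas as pd

# Synthetic counts for Monday..Sunday: low on Monday and Sunday, high midweek.
weekly_pattern = [10, 30, 30, 30, 30, 30, 8]

# Repeat the same pattern for four weeks, starting on a Monday.
dates = pd.date_range("2020-06-01", periods=28, freq="D")
raw = pd.Series(weekly_pattern * 4, index=dates)

# Each 7-day window covers every weekday once, so the weekday effect cancels.
smoothed = raw.rolling(7).mean().dropna()

print(raw.min(), raw.max())
print(smoothed.min(), smoothed.max())
```

A real trend would still show up in the smoothed series, because it changes the total of each window rather than only the distribution within it.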

2-1 Count the number of newly infected people by age group

This is straightforward with groupby. Keep only the necessary columns at this point as well.

# Count rows per (date, age group); selecting 'No' keeps only the needed column
df_patient_day = df_patient.groupby(['Published_date', 'patient_Age'])['No'].count().reset_index()
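A toy example may make the shape of this result clearer. The rows below are invented, and `toy` and `toy_counts` are names used only for this sketch. Note that the output contains only the date and age combinations that actually occur in the data.

```python
import pandas as pd

# Four invented patient records over two days.
toy = pd.DataFrame({
    "No": [1, 2, 3, 4],
    "Published_date": pd.to_datetime(
        ["2020-07-01", "2020-07-01", "2020-07-01", "2020-07-02"]
    ),
    "patient_Age": ["20", "20", "30", "20"],
})

# Same aggregation as above: count records per (date, age group).
toy_counts = (
    toy.groupby(["Published_date", "patient_Age"])["No"]
    .count()
    .reset_index()
)

print(toy_counts)
```

There is no row for age group "30" on 2020-07-02, because nobody in that group appears on that day.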

Also, if the age labels contain Japanese, the characters are garbled in the matplotlib output (fixing this is tedious because I am reusing an existing environment, as mentioned above), so replace them as follows.

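# Keys are the age labels as they appear in the CSV (Japanese in the raw data,
# shown here in English translation); values are ASCII labels for plotting.
genes_dict = {
    'Under 10 years old': 'under 10',
    '10s': '10',
    '20s': '20',
    '30s': '30',
    '40s': '40',
    '50s': '50',
    '60s': '60',
    '70s': '70',
    '80s': '80',
    '90s': '90',
    '100 years and over': 'over 100',
    "'-": '-',
    'unknown': 'unknown',
}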

df_patient_day['patient_Age'] = [genes_dict[x] for x in df_patient_day['patient_Age'].values.tolist()]

The result above has a problem: if there were no new cases for a given age group on a given day, no row exists for that combination, which causes trouble when taking the moving average later. So here I create the direct product of the age groups above × the entire date range and merge it with the DataFrame above (please let me know if you know a better way!).

# Age groups to analyze ('-' and 'unknown' are left out)
genes = [
    'under 10',
    '10',
    '20',
    '30',
    '40',
    '50',
    '60',
    '70',
    '80',
    '90',
    'over 100',
]
days = pd.date_range(start=df_patient['Published_date'].min(), end=df_patient['Published_date'].max(), freq='D')
data = [[x, y] for x in days for y in genes]

df_data = pd.DataFrame(data, columns=['Published_date', 'patient_Age'])
df_data = pd.merge(df_data, df_patient_day, on=['Published_date', 'patient_Age'], how='left').fillna(0)
df_data = df_data.rename(columns={'No':'Number of people'})
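One possible alternative to building the direct product by hand is to let pandas do it with `pd.MultiIndex.from_product` and then `reindex` the counts with `fill_value=0`. This is only a sketch on invented toy counts; `counts`, `ages`, `all_days`, and `filled` are names introduced for this example.

```python
import pandas as pd

# Invented per-day counts with gaps: nothing on 2020-07-02, and no "30" on 07-03.
counts = pd.DataFrame({
    "Published_date": pd.to_datetime(["2020-07-01", "2020-07-01", "2020-07-03"]),
    "patient_Age": ["20", "30", "20"],
    "No": [2, 1, 4],
})
ages = ["20", "30"]
all_days = pd.date_range(
    counts["Published_date"].min(), counts["Published_date"].max(), freq="D"
)

# Every (day, age group) combination, whether or not it occurs in the data.
full_index = pd.MultiIndex.from_product(
    [all_days, ages], names=["Published_date", "patient_Age"]
)

# Reindexing inserts the missing combinations and fills them with 0.
filled = (
    counts.set_index(["Published_date", "patient_Age"])["No"]
    .reindex(full_index, fill_value=0)
    .reset_index(name="Number of people")
)

print(filled)
```

Like the left merge above, this keeps only the age groups listed in `ages`, so labels such as '-' or 'unknown' are dropped.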

2-2 Take a moving average for each age group

Next, take a moving average for each age group. pandas makes this easy: rolling(7) defines a 7-day window, and rolling(7).mean() takes the average over it. The first 6 days come out as NaN, so remove them with dropna(). For the steps that follow, I build a separate Series for each age group and store them in a dictionary. That's all there is to it!

result_diff = {}
for x in genes:
    df = df_data[df_data['patient_Age'] == x]
    df = pd.Series(df['Number of people'].values.tolist(), index=df['Published_date'].values)
    result_diff[x] = df.rolling(7).mean().dropna()
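The same calculation can also be done without a loop by pivoting to a wide table (one column per age group) and applying `rolling` once. This is a sketch on invented toy data, offered as an alternative rather than what the article uses; `toy_days`, `long_df`, `wide`, and `smoothed` are names made up for this example.

```python
import pandas as pd

# Eight days of invented counts in long format: "20" grows daily, "30" is constant.
toy_days = pd.date_range("2020-07-01", periods=8, freq="D")
rows = []
for i, day in enumerate(toy_days):
    rows.append((day, "20", i))
    rows.append((day, "30", 1))
long_df = pd.DataFrame(
    rows, columns=["Published_date", "patient_Age", "Number of people"]
)

# One row per day, one column per age group.
wide = long_df.pivot(
    index="Published_date", columns="patient_Age", values="Number of people"
)

# rolling() works column by column; the first 6 rows are NaN and are dropped.
smoothed = wide.rolling(7).mean().dropna()

print(smoothed)
```

With 8 days of input, only 2 rows remain after dropna(). A wide table like this can be plotted in one call with `smoothed.plot()`.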

3. Visualize

Finally, visualize the result.

fig, axe = plt.subplots()
for x in genes:
    df_diff = result_diff[x]
    axe.plot(df_diff.index, df_diff.values, label=x)
axe.legend()  # show the age-group labels passed via label=


The result is shown below.

[Figure: Screenshot 2020-07-12 15.05.14.png, the plot produced by the code above]

For the 20s through the 50s, I was able to confirm that the curves line up in order of age, starting from the youngest. Even so, the rate at which cases are increasing among people in their twenties is striking. The announcement by Tokyo was not wrong.


I am not going to discuss the causes here, but public data like this lets you quickly check what is being reported, so why not try it yourself as practice? There is still plenty to investigate, for example by comparing these numbers with the actual population distribution, and I think this is good material for practicing data processing.
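As a starting point for the population comparison, the moving averages could be converted to cases per 100,000 people in each age group. The sketch below uses PLACEHOLDER numbers for both the case averages and the population; they are not real figures and would need to be replaced with official statistics.

```python
import pandas as pd

# 7-day average of new cases on one day, by age group.
# Toy values for illustration, NOT real data.
avg_cases = pd.Series({"20": 60.0, "30": 30.0, "40": 15.0})

# Population by age group. PLACEHOLDER values, NOT real statistics.
population = pd.Series({"20": 1_500_000, "30": 2_000_000, "40": 2_000_000})

# pandas aligns the two Series on their index labels before dividing.
per_100k = avg_cases / population * 100_000

print(per_100k)
```

Normalizing this way separates "more cases because the group is larger" from "more cases per person in the group".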
