Try scraping the data of COVID-19 in Tokyo with Python

1. Introduction

Living in Tokyo while staying home because of the coronavirus, I find myself reacting every day to the announced number of infected people. However, it is never clear to me why that number went up or down. The number of tests varies widely from day to day, and the time needed to get results also varies, so even as an amateur I suspect that the daily rise and fall in infections depends heavily on how many tests were run and when. So I wondered whether I could graph the figures in a way that is a little easier to interpret.

2. Data acquisition

Tokyo Metropolitan Government COVID-19 Information website: https://stopcovid19.metro.tokyo.lg.jp/ The COVID-19 data for Tokyo is updated here every day (note that there is a slight time lag, so the figures are about one day behind). I decided to scrape this site and use its data as the source for the graphs.

The data we want are the number of people tested and the number of positive patients. Number of people tested: https://stopcovid19.metro.tokyo.lg.jp/cards/number-of-inspection-persons/ Number of positive patients: https://stopcovid19.metro.tokyo.lg.jp/cards/number-of-confirmed-cases/

Starting from these URLs, we download the pages and parse them with BeautifulSoup. (Screenshot: Chrome developer tools, 2020-04-21) Open the URL in Chrome and display the developer tools. Following the HTML tags downwards, we reach the tag that contains the target value (the number of positive patients). That tag has the class text-end, so we extract the data using this class: download the page with requests, then use BeautifulSoup to pull out every tag that has the text-end class.

Python


# csv, datetime and timedelta are used in the later sections
import csv
from datetime import datetime, timedelta

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

kensa_url = 'https://stopcovid19.metro.tokyo.lg.jp/cards/number-of-inspection-persons/'
yousei_url = 'https://stopcovid19.metro.tokyo.lg.jp/cards/number-of-confirmed-cases/'

# Number of people tested
r = requests.get(kensa_url, timeout=10, params=None)
soup = BeautifulSoup(r.text, 'html.parser')
kensa_data = soup.select('.text-end')

# Number of positive patients
r = requests.get(yousei_url, timeout=10, params=None)
soup = BeautifulSoup(r.text, 'html.parser')
yousei_data = soup.select('.text-end')

3. Processing of acquired data

Looking at the contents of the extracted list, the first two elements are table headers. We can also see that daily values and cumulative values are stored alternately. Let's extract only the parts we need so the data is easier to work with.

[<th aria-label="Number of people to be inspected(By day)" aria-sort="none" class="text-end" role="columnheader" scope="col"><span>Number of people to be inspected(By day)</span></th>,
 <th aria-label="Number of people to be inspected(Cumulative)" aria-sort="none" class="text-end" role="columnheader" scope="col"><span>Number of people to be inspected(Cumulative)</span></th>,
 <td class="text-end">304</td>,
 <td class="text-end">8,683</td>,
 <td class="text-end">339</td>,
----- remaining elements omitted -----

Now let's store the data in lists. Using for i in range(2, len(kensa_data), 2) skips the two header elements and starts from the third element, and stepping by two picks up only the daily values (skipping the cumulative ones). We build the date list at the same time: get today's date with datetime.today() and subtract one day for each row processed, since the rows run from newest to oldest. A num_list of row indices is also created for plotting.

Python


kensa_list = []   # daily number of people tested
yousei_list = []  # daily number of positive patients
date_list = []    # date of each row
num_list = []     # simple row index used for plotting

num = 0
date = datetime.today()
date = date - timedelta(days=1)

for i in range(2, len(kensa_data), 2):
    # strip thousands separators such as "8,683" and store integers
    kensa_list.append(int(kensa_data[i].string.replace(',', '')))
    yousei_list.append(int(yousei_data[i].string.replace(',', '')))
    date_list.append(datetime.strftime(date, '%Y/%m/%d'))
    num_list.append(num)
    num += 1
    date = date - timedelta(days=1)

All the lists are in reverse chronological order (newest first), so use .reverse() to put them in chronological order.

Python


kensa_list.reverse()
yousei_list.reverse()
date_list.reverse()

4. Data storage

If you want to keep the data, you can save it to a CSV file at this point.

Python


# 'a' appends a new header row and the data every time the script is run
with open('COVID-19.csv', 'a', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'kensa', 'yousei'])
    for i in range(len(date_list)):
        writer.writerow([date_list[i], kensa_list[i], yousei_list[i]])
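
To confirm what was written, the file can be read back with the csv module (a quick check, assuming COVID-19.csv was created by the code above):

Python


import csv

# Read the saved CSV back and print the last few rows to verify the contents.
with open('COVID-19.csv') as f:
    rows = list(csv.reader(f))

for row in rows[-3:]:
    print(row)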

5. Check the data

Let's check the number of people tested and the number of positive patients.

Python


# top: number of people tested per day, bottom: number of positive patients per day
plt.subplot(2, 1, 1)
plt.plot(num_list, kensa_list, label="kensa-list")
plt.legend()

plt.subplot(2, 1, 2)
plt.plot(num_list, yousei_list, label="yousei-list")
plt.legend()
plt.show()

(Graph: daily number of people tested (top) and daily number of positive patients (bottom))

As the graph shows, the number of people tested changes drastically from day to day. At first glance there seems to be a correlation between the number of people tested and the number of positive patients, but around index 80 the number of positives barely drops even though the number of tests falls sharply, which looks unnatural. This probably comes from the fact that the positives reported on a given day do not necessarily correspond to the tests performed on that day.
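
As a rough check on that impression, the correlation between the two daily series can be computed, for example with NumPy (a minimal sketch; it assumes kensa_list and yousei_list hold the integer daily values built in section 3, and NumPy is not otherwise used in this article):

Python


import numpy as np

# Correlation coefficient between daily tests and daily positives.
corr = np.corrcoef(kensa_list, yousei_list)[0, 1]
print('correlation between daily tests and daily positives:', round(corr, 2))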

6. Making the data easier to understand

So instead, let's build a series in which the cumulative number of positives up to each day is divided by the cumulative number of tests up to the previous day. Using cumulative totals gives a ratio that does not depend on when individual test results happen to be reported. The denominator uses the total up to the previous day because a test performed on a given day does not seem to be reflected in that day's positive count. (Diagram: cumulative positives divided by cumulative tests up to the previous day)

Python


kensa_total = 0   # cumulative number of tests up to the previous day
yousei_total = 0  # cumulative number of positives up to the current day
kensa_yousei_list = []

for i in range(len(kensa_list)):
    yousei_total = yousei_total + int(yousei_list[i])

    if kensa_total == 0:
        kensa_yousei_list.append(0)
    else:
        kensa_yousei_list.append(yousei_total / kensa_total)

    # update the test total after the division so that the denominator
    # is always the cumulative count up to the previous day
    kensa_total = kensa_total + int(kensa_list[i])

The for loop keeps running totals in kensa_total and yousei_total. On each iteration it appends yousei_total divided by kensa_total to kensa_yousei_list; because kensa_total is updated after the division, the denominator is always the cumulative number of tests up to the previous day.

Python


plt.plot(num_list, kensa_yousei_list, label="Average")
plt.legend()
plt.show()

(Graph: cumulative positives divided by cumulative tests up to the previous day)

The sharp rise at the very beginning occurs because the number of tests was zero for the first stretch of the data, so please ignore that part. After that, the graph rises steadily toward the later dates: the proportion of positives among those tested is gradually increasing. From the raw daily positive counts alone I could not tell whether the positive rate was really rising, but dividing by the cumulative number of tests produces a stable graph, and it shows that the share of positive results is indeed increasing. Around April 21, when this graph was created, the number of positives had dropped a little, so the end of the graph dips slightly.

7. Summary

By accumulating the daily figures and taking their ratio, we were able to create a much easier-to-read graph. Looking at it, the curve rises fairly smoothly. Because it does not depend on the day-to-day swings in the number of tests, it also will not jump suddenly and cause needless alarm (the value can only rise after the number of tests has risen).

The code for this article is published here: https://github.com/no-B-github/COVID19_Data_Scraping

I also turned this into a web application so that the graph is updated daily. I plan to keep an eye on it while continuing to stay home because of COVID-19.

https://covid-19-tokyo.herokuapp.com/
