This article showed up in the daily trends the other day: "Since the status of each prefecture of the new coronavirus is only published in PDF, I made an API [Python]" by @tommy19970714.
Can the same thing be done without downloading the PDF?
I wrote the part that gets the PDF contents and converts them to JSON format, all without saving the PDF to disk.
I use a library called Tika, which can extract text from PDFs.
It is usually introduced as a way to extract text from local PDF files, but it can also be used for PDFs served online.
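As a rough sketch of that idea (not the code used in this article), the PDF can be read into memory and the bytes handed to Tika's `parser.from_buffer`. The helper names `looks_like_pdf` and `fetch_pdf_text` are made up for illustration, and the sketch assumes the `tika` package is installed with a Java runtime available for the Tika server it starts.

```python
import urllib.request


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic number."""
    return data.lstrip()[:5] == b"%PDF-"


def fetch_pdf_text(pdf_url: str) -> str:
    """Read a PDF into memory and return the text Tika extracts from it."""
    # Imported here so the helper above can be used without Tika installed.
    from tika import parser

    with urllib.request.urlopen(pdf_url) as response:
        data = response.read()  # bytes stay in memory, nothing is written to disk
    if not looks_like_pdf(data):
        raise ValueError(f"{pdf_url} did not return a PDF")
    parsed = parser.from_buffer(data)
    return parsed["content"] or ""
```

Checking the first bytes before calling Tika is optional, but it gives a clearer error when the server answers with an HTML error page instead of the PDF.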
First, get the URL of the PDF file with the code from the original article.
import urllib.request
from bs4 import BeautifulSoup

def extract_page_url(infomation_url):
    # Collect the links and titles of the daily report pages from the index page.
    req = urllib.request.Request(infomation_url)
    html = urllib.request.urlopen(req)
    soup = BeautifulSoup(html, "html.parser")
    topic = soup.find_all('div', attrs={'class': 'm-grid__col1'})[1]
    article_urls = [tag['href'] for tag in topic.find_all('a', href=True)]
    article_titles = [tag.text for tag in topic.find_all('a', href=True)]
    return article_urls, article_titles

target_url = "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/0000121431_00086.html"
page_urls, page_titles = extract_page_url(target_url)

def get_pdf_url(page_url):
    # Return the href of the link to the per-prefecture status PDF.
    req = urllib.request.Request(page_url)
    html = urllib.request.urlopen(req)
    soup = BeautifulSoup(html, "html.parser")
    for atag in soup.find_all('a', href=True):
        # The live page is in Japanese; the link text is shown translated here,
        # so match against the Japanese title when running this for real.
        if 'Status of test positives in each prefecture' in atag.text:
            return atag['href']

pdf_url = "https://www.mhlw.go.jp" + get_pdf_url(page_urls[0])
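The last line above builds the absolute URL by string concatenation, which fails with a `TypeError` when no matching link is found and `get_pdf_url` returns `None`. A slightly more defensive sketch uses `urllib.parse.urljoin`. The helper name `to_absolute_url` and the example path are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.mhlw.go.jp"


def to_absolute_url(href, base=BASE_URL):
    """Turn a scraped href into an absolute URL, failing loudly if it is missing."""
    if href is None:
        raise ValueError("no PDF link was found on the page")
    # urljoin leaves absolute hrefs untouched and resolves relative ones.
    return urljoin(base, href)


# Hypothetical path, only to show the behaviour.
print(to_absolute_url("/content/example.pdf"))
```

With this helper, the last line would become `pdf_url = to_absolute_url(get_pdf_url(page_urls[0]))`.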
Fetch the PDF from that URL and pass its contents to Tika:
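import requests
from tika import parser

# Fetch the PDF into memory and hand the raw bytes to Tika (no file is saved).
file_data = parser.from_buffer(requests.get(pdf_url).content)
text = file_data["content"]
print(text)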
Output:
2020/8/17 24:00
Severe
Hokkaido 1,627 36,329 136 3 1,388 103 0
Aomori 33 1,731 1 0 31 1 0
Iwate 9 2,218 6 0 3 0 0
Miyagi 184 7,146 10 1 173 1 0
Akita 43 1,554 16 0 26 0 1
Yamagata 76 3,005 0 0 76 1 1
Fukushima 106 11,990 16 1 90 0 0
Ibaraki 456 9,383 83 2 363 10 0
Tochigi 277 18,653 43 2 221 1 12
Gunma 307 10,523 101 0 178 19 9
Saitama 3,256 89,463 549 10 2,625 82 0
Chiba * 5 2,495 45,444 506 8 1,931 58 0
Tokyo * 4 17,875 260,990 3,519 27 14,015 341 0
Kanagawa 3,904 91,480 633 21 3,166 105 0
Niigata 130 8,383 8 0 121 0 1
Toyama 308 6,694 45 3 242 22 1
Ishikawa 475 4,097 134 2 312 29 0
Fukui 155 5,899 6 1 141 8 0
Yamanashi 140 8,681 22 0 117 1 0
Nagano 149 10,909 31 0 119 - 1
Gifu 516 15,311 69 2 439 8 0
Shizuoka 418 19,352 76 2 341 1 0
Aichi 3,744 39,879 1,447 13 2,248 44 5
Mie 284 7,642 103 2 180 1 0
Shiga 346 7,216 106 4 236 4 0
Kyoto 1,119 24,839 200 3 898 21 0
Osaka 6,916 105,926 1,621 70 5,179 111 5
Hyogo 1,900 36,336 295 14 1,557 48 0
Nara 403 11,572 102 3 298 3 0
Wakayama 198 7,395 22 1 170 4 2
Tottori 21 4,316 11 0 10 0 0
Shimane 132 4,211 103 0 29 0 0
Okayama 126 3,836 20 - 90 - 16
Hiroshima * 5 437 16,148 72 1 362 3 0
Yamaguchi 83 5,050 18 0 65 0 0
Tokushima 91 3,936 48 0 38 1 4
Kagawa 65 6,505 8 0 56 1 0
Ehime 110 3,431 10 0 94 6 0
Kochi 103 2,609 19 0 80 3 1
Fukuoka 3,633 35,054 1,055 21 2,537 41 0
Saga 198 3,983 67 0 133 0 2
Nagasaki 185 10,783 38 -36 3 108 G48 function (=O47=★ Same-day data ★!E49)
Kumamoto 412 9,237 98 5 247 6 61
Oita 113 9,222 32 0 80 1 0
Miyazaki 267 6,956 76 0 191 1 1
Kagoshima 328 13,337 63 2 249 7 9
Okinawa 1,656 18,502 1,123 19 523 14 0
(Other) * 3 149 - 0 - 149 - 0
Total 55,958 1,067,156 12,767 243 41,853 1,114 240
※1
※2
* 3 Others are positive on cruise ships in Nagasaki Prefecture.
※4
※5
The number of people conducting PCR tests is larger than the actual number because the number of cases is recorded for some local governments. Also, about local governments that have not been updated
Is the value of the previous day.
Created by the Ministry of Health, Labor and Welfare by subtracting the number of people who require inpatient treatment, the number of people who have been discharged or canceled, and the number of deaths from the number of people who have positive PCR tests.
The totals do not match because some local governments have not re-counted those who were readmitted after the medical treatment was canceled as positive persons.
The figures for Tokyo are quoted from the following sources: https://stopcovid19.metro.tokyo.lg.jp/
The number of cases of local governments that have published positive cases in airport quarantine as domestic cases is not included.
Status of test-positive persons in each prefecture (domestic cases excluding airport quarantine and charter flight cases)
Prefecture name Number of positive people
PCR test
Number of participants * 1
Inpatient treatment, etc.
Those who need
(people)
Discharge or cancellation of medical treatment
Number of people who became
(people)
Death (cumulative)
(people)
Checking * 2
(people)
Getting the content was really easy.
The variable names are sloppy, but the text can be turned into a dictionary like this:
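import re

# The PDF text is Japanese, so the pattern matches kanji prefecture names such as
# "東 京". The literals 合計 ("Total") and その他 ("Other") are restored from their
# machine-translated forms; adjust them if the extracted text differs.
pattern = r"(?:[一-龥](?:\s+[一-龥]|合計){1,2}|（その他）)[※\d\s]+?\n"
# Drop full-width spaces and thousands separators, and read "-" as 0.
cleaned = text.translate(str.maketrans({"\u3000": "", ",": "", "-": "0"}))

l = {}
for i in re.findall(pattern, cleaned):
    a = i.split()
    b = "".join(re.findall("[一-龥その]", i))  # prefecture name without spaces
    l[b] = {}
    l[b]["Number of positives"] = int(a[-7])
    l[b]["Number of people performing PCR tests"] = int(a[-6])
    l[b]["Those who need hospital treatment, etc."] = {"Not serious": int(a[-5]), "Severe": int(a[-4])}
    l[b]["Number of people discharged or canceled"] = int(a[-3])
    l[b]["Death (cumulative)"] = int(a[-2])
    l[b]["Checking"] = int(a[-1])
print(l["東京"])  # Tokyo; names in the output shown here are translated to English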
Output:
{'Number of positives': 17875,
'Number of people performing PCR tests': 260990,
'Those who need hospital treatment, etc.': {'Not serious': 3519, 'Severe': 27},
'Number of people discharged or canceled': 14015,
'Death (cumulative)': 341,
'Checking': 0}
The result is neatly structured.
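The loop relies on two details: the `translate` call normalizes each row, and the counts are read with negative indices so that footnote markers between the name and the numbers do not shift the columns. A minimal sketch of both on single rows, assuming the row text is Japanese as in the source PDF. `parse_row` is a hypothetical helper, and the sample rows are typed in by hand to mimic the extracted text.

```python
import re

# Same cleanup as in the loop: drop full-width spaces and commas, read "-" as 0.
CLEANUP = str.maketrans({"\u3000": "", ",": "", "-": "0"})


def parse_row(row):
    """Split one table row into a prefecture name and its seven counts."""
    name = "".join(re.findall("[一-龥]", row))  # kanji only, spaces removed
    fields = row.translate(CLEANUP).split()
    # Count from the right so markers such as "※4" after the name are skipped.
    values = [int(v) for v in fields[-7:]]
    return name, values


# Hand-written rows in the shape of the extracted text.
print(parse_row("東 京 ※4 17,875 260,990 3,519 27 14,015 341 0\n"))
# ('東京', [17875, 260990, 3519, 27, 14015, 341, 0])
print(parse_row("岡 山 126 3,836 20 - 90 - 16\n"))
# ('岡山', [126, 3836, 20, 0, 90, 0, 16])
```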
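By the way, displaying all of the data gives the following. Nagasaki is missing because that row of the extracted text contains stray spreadsheet text, so the regular expression does not match it.
Output: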
{'Hokkaido': {'Number of positives': 1627,
'Number of people performing PCR tests': 36329,
'Those who need hospital treatment, etc.': {'Not serious': 136, 'Severe': 3},
'Number of people discharged or canceled': 1388,
'Death (cumulative)': 103,
'Checking': 0},
'Aomori': {'Number of positives': 33,
'Number of people performing PCR tests': 1731,
'Those who need hospital treatment, etc.': {'Not serious': 1, 'Severe': 0},
'Number of people discharged or canceled': 31,
'Death (cumulative)': 1,
'Checking': 0},
'Iwate': {'Number of positives': 9,
'Number of people performing PCR tests': 2218,
'Those who need hospital treatment, etc.': {'Not serious': 6, 'Severe': 0},
'Number of people discharged or canceled': 3,
'Death (cumulative)': 0,
'Checking': 0},
'Miyagi': {'Number of positives': 184,
'Number of people performing PCR tests': 7146,
'Those who need hospital treatment, etc.': {'Not serious': 10, 'Severe': 1},
'Number of people discharged or canceled': 173,
'Death (cumulative)': 1,
'Checking': 0},
'Akita': {'Number of positives': 43,
'Number of people performing PCR tests': 1554,
'Those who need hospital treatment, etc.': {'Not serious': 16, 'Severe': 0},
'Number of people discharged or canceled': 26,
'Death (cumulative)': 0,
'Checking': 1},
'Yamagata': {'Number of positives': 76,
'Number of people performing PCR tests': 3005,
'Those who need hospital treatment, etc.': {'Not serious': 0, 'Severe': 0},
'Number of people discharged or canceled': 76,
'Death (cumulative)': 1,
'Checking': 1},
'Fukushima': {'Number of positives': 106,
'Number of people performing PCR tests': 11990,
'Those who need hospital treatment, etc.': {'Not serious': 16, 'Severe': 1},
'Number of people discharged or canceled': 90,
'Death (cumulative)': 0,
'Checking': 0},
'Ibaraki': {'Number of positives': 456,
'Number of people performing PCR tests': 9383,
'Those who need hospital treatment, etc.': {'Not serious': 83, 'Severe': 2},
'Number of people discharged or canceled': 363,
'Death (cumulative)': 10,
'Checking': 0},
'Tochigi': {'Number of positives': 277,
'Number of people performing PCR tests': 18653,
'Those who need hospital treatment, etc.': {'Not serious': 43, 'Severe': 2},
'Number of people discharged or canceled': 221,
'Death (cumulative)': 1,
'Checking': 12},
'Gunma': {'Number of positives': 307,
'Number of people performing PCR tests': 10523,
'Those who need hospital treatment, etc.': {'Not serious': 101, 'Severe': 0},
'Number of people discharged or canceled': 178,
'Death (cumulative)': 19,
'Checking': 9},
'Saitama': {'Number of positives': 3256,
'Number of people performing PCR tests': 89463,
'Those who need hospital treatment, etc.': {'Not serious': 549, 'Severe': 10},
'Number of people discharged or canceled': 2625,
'Death (cumulative)': 82,
'Checking': 0},
'Chiba': {'Number of positives': 2495,
'Number of people performing PCR tests': 45444,
'Those who need hospital treatment, etc.': {'Not serious': 506, 'Severe': 8},
'Number of people discharged or canceled': 1931,
'Death (cumulative)': 58,
'Checking': 0},
'Tokyo': {'Number of positives': 17875,
'Number of people performing PCR tests': 260990,
'Those who need hospital treatment, etc.': {'Not serious': 3519, 'Severe': 27},
'Number of people discharged or canceled': 14015,
'Death (cumulative)': 341,
'Checking': 0},
'Kanagawa': {'Number of positives': 3904,
'Number of people performing PCR tests': 91480,
'Those who need hospital treatment, etc.': {'Not serious': 633, 'Severe': 21},
'Number of people discharged or canceled': 3166,
'Death (cumulative)': 105,
'Checking': 0},
'Niigata': {'Number of positives': 130,
'Number of people performing PCR tests': 8383,
'Those who need hospital treatment, etc.': {'Not serious': 8, 'Severe': 0},
'Number of people discharged or canceled': 121,
'Death (cumulative)': 0,
'Checking': 1},
'Toyama': {'Number of positives': 308,
'Number of people performing PCR tests': 6694,
'Those who need hospital treatment, etc.': {'Not serious': 45, 'Severe': 3},
'Number of people discharged or canceled': 242,
'Death (cumulative)': 22,
'Checking': 1},
'Ishikawa': {'Number of positives': 475,
'Number of people performing PCR tests': 4097,
'Those who need hospital treatment, etc.': {'Not serious': 134, 'Severe': 2},
'Number of people discharged or canceled': 312,
'Death (cumulative)': 29,
'Checking': 0},
'Fukui': {'Number of positives': 155,
'Number of people performing PCR tests': 5899,
'Those who need hospital treatment, etc.': {'Not serious': 6, 'Severe': 1},
'Number of people discharged or canceled': 141,
'Death (cumulative)': 8,
'Checking': 0},
'Yamanashi': {'Number of positives': 140,
'Number of people performing PCR tests': 8681,
'Those who need hospital treatment, etc.': {'Not serious': 22, 'Severe': 0},
'Number of people discharged or canceled': 117,
'Death (cumulative)': 1,
'Checking': 0},
'Nagano': {'Number of positives': 149,
'Number of people performing PCR tests': 10909,
'Those who need hospital treatment, etc.': {'Not serious': 31, 'Severe': 0},
'Number of people discharged or canceled': 119,
'Death (cumulative)': 0,
'Checking': 1},
'Gifu': {'Number of positives': 516,
'Number of people performing PCR tests': 15311,
'Those who need hospital treatment, etc.': {'Not serious': 69, 'Severe': 2},
'Number of people discharged or canceled': 439,
'Death (cumulative)': 8,
'Checking': 0},
'Shizuoka': {'Number of positives': 418,
'Number of people performing PCR tests': 19352,
'Those who need hospital treatment, etc.': {'Not serious': 76, 'Severe': 2},
'Number of people discharged or canceled': 341,
'Death (cumulative)': 1,
'Checking': 0},
'Aichi': {'Number of positives': 3744,
'Number of people performing PCR tests': 39879,
'Those who need hospital treatment, etc.': {'Not serious': 1447, 'Severe': 13},
'Number of people discharged or canceled': 2248,
'Death (cumulative)': 44,
'Checking': 5},
'Mie': {'Number of positives': 284,
'Number of people performing PCR tests': 7642,
'Those who need hospital treatment, etc.': {'Not serious': 103, 'Severe': 2},
'Number of people discharged or canceled': 180,
'Death (cumulative)': 1,
'Checking': 0},
'Shiga': {'Number of positives': 346,
'Number of people performing PCR tests': 7216,
'Those who need hospital treatment, etc.': {'Not serious': 106, 'Severe': 4},
'Number of people discharged or canceled': 236,
'Death (cumulative)': 4,
'Checking': 0},
'Kyoto': {'Number of positives': 1119,
'Number of people performing PCR tests': 24839,
'Those who need hospital treatment, etc.': {'Not serious': 200, 'Severe': 3},
'Number of people discharged or canceled': 898,
'Death (cumulative)': 21,
'Checking': 0},
'Osaka': {'Number of positives': 6916,
'Number of people performing PCR tests': 105926,
'Those who need hospital treatment, etc.': {'Not serious': 1621, 'Severe': 70},
'Number of people discharged or canceled': 5179,
'Death (cumulative)': 111,
'Checking': 5},
'Hyogo': {'Number of positives': 1900,
'Number of people performing PCR tests': 36336,
'Those who need hospital treatment, etc.': {'Not serious': 295, 'Severe': 14},
'Number of people discharged or canceled': 1557,
'Death (cumulative)': 48,
'Checking': 0},
'Nara': {'Number of positives': 403,
'Number of people performing PCR tests': 11572,
'Those who need hospital treatment, etc.': {'Not serious': 102, 'Severe': 3},
'Number of people discharged or canceled': 298,
'Death (cumulative)': 3,
'Checking': 0},
'Wakayama': {'Number of positives': 198,
'Number of people performing PCR tests': 7395,
'Those who need hospital treatment, etc.': {'Not serious': 22, 'Severe': 1},
'Number of people discharged or canceled': 170,
'Death (cumulative)': 4,
'Checking': 2},
'Tottori': {'Number of positives': 21,
'Number of people performing PCR tests': 4316,
'Those who need hospital treatment, etc.': {'Not serious': 11, 'Severe': 0},
'Number of people discharged or canceled': 10,
'Death (cumulative)': 0,
'Checking': 0},
'Shimane': {'Number of positives': 132,
'Number of people performing PCR tests': 4211,
'Those who need hospital treatment, etc.': {'Not serious': 103, 'Severe': 0},
'Number of people discharged or canceled': 29,
'Death (cumulative)': 0,
'Checking': 0},
'Okayama': {'Number of positives': 126,
'Number of people performing PCR tests': 3836,
'Those who need hospital treatment, etc.': {'Not serious': 20, 'Severe': 0},
'Number of people discharged or canceled': 90,
'Death (cumulative)': 0,
'Checking': 16},
'Hiroshima': {'Number of positives': 437,
'Number of people performing PCR tests': 16148,
'Those who need hospital treatment, etc.': {'Not serious': 72, 'Severe': 1},
'Number of people discharged or canceled': 362,
'Death (cumulative)': 3,
'Checking': 0},
'Yamaguchi': {'Number of positives': 83,
'Number of people performing PCR tests': 5050,
'Those who need hospital treatment, etc.': {'Not serious': 18, 'Severe': 0},
'Number of people discharged or canceled': 65,
'Death (cumulative)': 0,
'Checking': 0},
'Tokushima': {'Number of positives': 91,
'Number of people performing PCR tests': 3936,
'Those who need hospital treatment, etc.': {'Not serious': 48, 'Severe': 0},
'Number of people discharged or canceled': 38,
'Death (cumulative)': 1,
'Checking': 4},
'Kagawa': {'Number of positives': 65,
'Number of people performing PCR tests': 6505,
'Those who need hospital treatment, etc.': {'Not serious': 8, 'Severe': 0},
'Number of people discharged or canceled': 56,
'Death (cumulative)': 1,
'Checking': 0},
'Ehime': {'Number of positives': 110,
'Number of people performing PCR tests': 3431,
'Those who need hospital treatment, etc.': {'Not serious': 10, 'Severe': 0},
'Number of people discharged or canceled': 94,
'Death (cumulative)': 6,
'Checking': 0},
'Kochi': {'Number of positives': 103,
'Number of people performing PCR tests': 2609,
'Those who need hospital treatment, etc.': {'Not serious': 19, 'Severe': 0},
'Number of people discharged or canceled': 80,
'Death (cumulative)': 3,
'Checking': 1},
'Fukuoka': {'Number of positives': 3633,
'Number of people performing PCR tests': 35054,
'Those who need hospital treatment, etc.': {'Not serious': 1055, 'Severe': 21},
'Number of people discharged or canceled': 2537,
'Death (cumulative)': 41,
'Checking': 0},
'Saga': {'Number of positives': 198,
'Number of people performing PCR tests': 3983,
'Those who need hospital treatment, etc.': {'Not serious': 67, 'Severe': 0},
'Number of people discharged or canceled': 133,
'Death (cumulative)': 0,
'Checking': 2},
'Kumamoto': {'Number of positives': 412,
'Number of people performing PCR tests': 9237,
'Those who need hospital treatment, etc.': {'Not serious': 98, 'Severe': 5},
'Number of people discharged or canceled': 247,
'Death (cumulative)': 6,
'Checking': 61},
'Oita': {'Number of positives': 113,
'Number of people performing PCR tests': 9222,
'Those who need hospital treatment, etc.': {'Not serious': 32, 'Severe': 0},
'Number of people discharged or canceled': 80,
'Death (cumulative)': 1,
'Checking': 0},
'Miyazaki': {'Number of positives': 267,
'Number of people performing PCR tests': 6956,
'Those who need hospital treatment, etc.': {'Not serious': 76, 'Severe': 0},
'Number of people discharged or canceled': 191,
'Death (cumulative)': 1,
'Checking': 1},
'Kagoshima': {'Number of positives': 328,
'Number of people performing PCR tests': 13337,
'Those who need hospital treatment, etc.': {'Not serious': 63, 'Severe': 2},
'Number of people discharged or canceled': 249,
'Death (cumulative)': 7,
'Checking': 9},
'Okinawa': {'Number of positives': 1656,
'Number of people performing PCR tests': 18502,
'Those who need hospital treatment, etc.': {'Not serious': 1123, 'Severe': 19},
'Number of people discharged or canceled': 523,
'Death (cumulative)': 14,
'Checking': 0},
'Other': {'Number of positives': 149,
'Number of people performing PCR tests': 0,
'Those who need hospital treatment, etc.': {'Not serious': 0, 'Severe': 0},
'Number of people discharged or canceled': 149,
'Death (cumulative)': 0,
'Checking': 0},
'total': {'Number of positives': 55958,
'Number of people performing PCR tests': 1067156,
'Those who need hospital treatment, etc.': {'Not serious': 12767, 'Severe': 243},
'Number of people discharged or canceled': 41853,
'Death (cumulative)': 1114,
'Checking': 240}}
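The output above has no entry for Nagasaki, whose row in the extracted text contains stray spreadsheet text. Since rows that do not match are skipped silently, a small sanity check can help. This is only a sketch: `missing_names` is a hypothetical helper, and the short `expected` list stands in for the full list of prefecture names.

```python
def missing_names(parsed, expected):
    """Return the expected names that have no entry in the parsed result."""
    return [name for name in expected if name not in parsed]


# Hypothetical excerpt; in practice `expected` would hold every prefecture name
# exactly as it appears in the keys of the parsed dictionary.
expected = ["Saga", "Nagasaki", "Kumamoto"]
parsed = {
    "Saga": {"Number of positives": 198},
    "Kumamoto": {"Number of positives": 412},
}

print(missing_names(parsed, expected))  # ['Nagasaki']
```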
After that, serialize the dictionary with json.dumps(l) or similar and plug it into the code from the original article.
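One detail worth knowing for this step: by default `json.dumps` escapes non-ASCII characters, so Japanese prefecture names come out as `\uXXXX` sequences. A small sketch with a made-up two-field sample shows the `ensure_ascii=False` option, which keeps them readable.

```python
import json

# Made-up sample in the same shape as the parsed dictionary.
sample = {"東京": {"Number of positives": 17875, "Death (cumulative)": 341}}

escaped = json.dumps(sample)  # non-ASCII characters become \uXXXX escapes
readable = json.dumps(sample, ensure_ascii=False, indent=2)  # kanji kept as-is

print(readable)
```

Both strings decode back to the same dictionary with `json.loads`; the option only affects how the text looks.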
I had been looking for a way to work with PDFs without downloading them, so this was a good opportunity.