Cities and wards often publish their data only as PDF. Here we take data in this awkward format, convert it to text with command-line tools, and plot it, using the population data of Gotemba City as the example. (The data is what the city administration had published as of June 8, 2017.)

Advance preparation

Python: BeautifulSoup
R: zoo
Linux: poppler, parallel, wget

The installation method is as follows.

# pip install bs4

$ R
> install.packages("zoo")

# pacman -S poppler parallel wget

Scripts created

`get_pdf_links.py`


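import urllib.request

from bs4 import BeautifulSoup

url = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"
# Send a browser-like User-Agent header with the request
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
with urllib.request.urlopen(req) as con:
    soup = BeautifulSoup(con.read(), 'html.parser')

# Collect the href of every list item whose text mentions "PDF"
links = []
for item in soup.find_all("li"):
    anchor = item.find("a")
    # Skip list items that have no link at all
    if anchor is not None and anchor.get("href") and "PDF" in item.get_text():
        links.append(anchor["href"])

# Print one link per line so the output can be piped to wget
for link in links:
    print(link)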
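
If the `href` values on the page turn out to be relative paths (an assumption; the article does not say either way), `wget` could not fetch them as printed. A minimal sketch of resolving each link against the page URL with the standard library's `urllib.parse.urljoin` follows. The helper name `absolutize` and the example paths are hypothetical.

```python
from urllib.parse import urljoin

# Page URL used in get_pdf_links.py.
BASE_URL = "http://www.city.gotemba.shizuoka.jp/gyousei/g-6/g-6-1/2475.html"


def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the page URL (hypothetical helper).

    Absolute URLs pass through unchanged; relative ones are joined to base.
    """
    return [urljoin(base, href) for href in hrefs]


if __name__ == "__main__":
    # Made-up hrefs for illustration only, not taken from the real page.
    for url in absolutize(["/material/files/example.pdf",
                           "http://example.com/already-absolute.pdf"]):
        print(url)
```

Printing the resolved URLs instead of the raw `href` values would keep the rest of the pipeline unchanged.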

`pdf/print_data.py`


import os
import re

# Collect the text files produced by pdftotext in the current directory
txt_files = [f for f in os.listdir('.') if f.endswith('.txt')]

# Skip a stray file named ".txt" if one exists
if ".txt" in txt_files:
    txt_files.remove(".txt")

data = []
for filename in txt_files:
    year = None
    month = None
    population = None
    with open(filename) as fp:
        for i, line in enumerate(fp):
            if i == 0:
                # The first line holds the date in Japanese era notation, e.g. 平成29年5月
                m = re.search(r'平成([0-9]+)年([0-9]+)月', line)
                year, month = m.group(1), m.group(2)
            elif i == 554:
                # Line index 554 (0-based) holds the population figure in the converted text
                population = line.replace(",", "").strip()
    # Heisei year N corresponds to Gregorian year N + 1988
    data.append([int(year) + 1988, int(month), int(population)])

# Sort chronologically by (year, month)
data.sort()

print("date,population")
for year, month, population in data:
    print("{}-{},{}".format(year, month, population))

`pdf/plot_data.R`


library(zoo)
data <- read.csv("data.csv", header=T)
z <- read.zoo(data, FUN = as.yearmon)
plot(z)

Script for execution

`process.sh`


#!/bin/bash

python get_pdf_links.py | parallel --gnu "wget {}"
mv *.pdf pdf
cd pdf
for file in *.pdf; do pdftotext "$file" "$file.txt"; done
# These two PDFs could not be converted to usable text, so drop their output
rm dd92f76ed99f94259ade29d559663bc1.pdf.txt
rm 7a76d9a16bcc1ce29875b76a6ef12a2e.pdf.txt
python print_data.py > data.csv
Rscript plot_data.R

The plot is written to Rplots.pdf in the pdf directory.

Output data

[Image: Screenshot from 2017-06-08 15-36-54.png, the plot produced by plot_data.R]

Caution

PDF files are good for printing and easy for people to read, but they are tedious to parse as plain text. Depending on the PDF, a scanned image may be embedded instead of text, in which case no text can be extracted at all. This is why process.sh uses rm to delete the output of the files that could not be processed. There is no workaround for these files.
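
Rather than hard-coding the names of the files to delete, one option is to check each converted file for extractable text. This is only a rough sketch, assuming that an image-only PDF produces a text file containing nothing but whitespace and form feeds; a PDF that fails for another reason, such as a different layout, would not be caught. The helper name `has_text` is hypothetical.

```python
import os
import tempfile


def has_text(path):
    """Return True if the file holds any non-whitespace character.

    str.strip() with no arguments also removes form feeds, which
    pdftotext writes between pages by default.
    """
    with open(path, encoding="utf-8", errors="replace") as fp:
        return bool(fp.read().strip())


if __name__ == "__main__":
    # Self-contained demo using temporary files instead of real PDFs.
    with tempfile.TemporaryDirectory() as d:
        empty = os.path.join(d, "scanned.pdf.txt")
        full = os.path.join(d, "normal.pdf.txt")
        with open(empty, "w", encoding="utf-8") as fp:
            fp.write("\f\n\f")  # only page breaks, no text
        with open(full, "w", encoding="utf-8") as fp:
            fp.write("平成29年5月\n12,345\n")  # made-up content
        print(has_text(empty), has_text(full))  # False True
```

This does not recover any data from image-only PDFs; it only identifies them so they can be skipped.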

Personal request

If the government wants its data to be visible, it should publish it not only as PDF but also in a plain-text format such as CSV. Looking at aggregated graphs alone does not yield new insights. Having the raw numbers makes a much wider range of analysis possible.

[PYTHON] Extract and plot the latest population data from the PDF data provided by the city