[PYTHON] I extracted the dates from Kagawa Prefecture's public comments and drew a graph

Preface

Hi. School has been closed lately because of COVID-19, which has left me with endless free days, so I've been killing time by playing with a different technology every day. To be honest, it's a lot of fun. In between making Discord bots, implementing 2048, playing with esoteric languages, and trying out morphological analysis, I came across something: the **public comments** on the **Kagawa Prefecture Net Game Addiction Countermeasures Ordinance**, currently published on Google Drive [by the Netorabo editorial department](https://nlab.itmedia.co.jp/nl/articles/2004/25/news026.html) and much talked about. When I found them, I thought:

**This looks like fun to play with.**

Because the comments are scanned pages saved as PDFs, they can't be handled as data as-is and first have to be converted to text, but even that conversion step already sounds fun. I haven't touched this kind of image processing before, so I should learn something new. On top of that, I've heard there are some unnatural biases in the data. Analyzing this is bound to be fun, so I decided to play with it.

Environment

First, to images

First, convert the PDFs to images with pdf2image. The code is lifted almost verbatim from this article. Sorry, I don't think I could write anything better...

Imaging.py


import pathlib
import pdf2image

pdf_files = pathlib.Path('PDF').glob('*.pdf')

for pdf_file in pdf_files:
    base = pdf_file.stem
    img_dir = pathlib.Path(f'image/{base}')
    img_dir.mkdir(parents=True, exist_ok=True)  # Also works when image/ is missing or the script is re-run
    images = pdf2image.convert_from_path(pdf_file, grayscale=True, dpi=200)
    for index, image in enumerate(images):
        image.save(img_dir/pathlib.Path(f'{index + 1}.png'), 'png')
    print(base)  #For checking progress

It will take some time to execute, so please wait patiently.

Once it finishes, the result looks like this (image.png). Seeing the pages lined up like this, it really feels like I've got my hands on the public comments.

From images to text

Use Tesseract OCR. I pray to the computer, fully expecting that recognition will not go well. It is important to bow as deeply as possible. An offering wouldn't hurt either. Once you feel your prayers have been heard, let's try recognizing page 14 (picked arbitrarily) of the comments in favor from January 23.

C:\Users\usr\Documents\Kagawa>tesseract .\image\Agree 0123\14.png .\Character recognition\test -l jpn
Tesseract Open Source OCR Engine v5.0.0-alpha.20200328 with Leptonica
Warning: Invalid resolution 0 dpi. Using 70 instead.
Estimating resolution as 344
Detected 201 diacritics

It looks like there are plenty of warnings, but that's surely just my imagination. Now let's compare the input and the output. The input image is 14.png, and the output text is below.

test.txt


desknefs NEO             -Page 171

Parliamentary Secretariat(glkeldprefr kagawa lg jp)

 

-------- "

―――(Omitted as blank lines continue)―――
 
Citizen: "Wagawa Prefecture Opinion / Inquiry Page"<hp- [email protected] kagawa.Idg.]p>-
destination: gikaiGpref.kagawa.Ig.jp

CC :-"

subject:Posting from the freezing / inquiry page

White time:January 23, 2020(Book) 15:16

―――(Omitted as blank lines continue)―――

[Contents of opinions and inquiries]
[Opinion box to the prefectural assembly homepage]-                                」
The prefectural assembly will continue to inform you of the status of the assembly in an easy-to-understand manner.
I will. Please let us know your opinions and impressions when you visit the parliamentary website.
I. We will refer to your opinions as valuable voices from everyone.
I will do it.
Please note$---

@We cannot accept petition by e-mail or e-mail about individual members of the Diet.

【(Residence] ---
【E-Maill . ,
[Subject] Opinions on public comments

[Opinions / impressions]

Age axis phone number

I agree with the Net Game Addiction Countermeasures Article.
I'm worried that there are children playing games and smartphones wherever I go

 

[ADDR]192. 168.7. 21

[DATE]2020/01/23 15:16: 42
[USERAGENT]Mozilla/5.0 (Windows NT 10.0: Win64: x64) AppleWeb
Kit/537.36 (KHTML, like Gecko) Chrome/70.0.3538. 102 faP/53 or 3
6 TOg9. 18362

Hmm... parts of it are pretty shaky, but the date, which is what I want to play with this time, comes through without any problem, so this will do for now.

Extract everything at once

Now it's PyOCR's turn. From each page, extract the line that matches the regular expression /^([^0-9\n]*\d){12}[^0-9\n]*$/ (a line containing exactly 12 digits). Digits seem to be recognized fairly accurately, so not many dates should be missed. The extracted dates are stored in four text files: "Agree", "Opposition", "business person", and "Recommendation". I wrote this code based on this article, in the hope that some mysterious power would work a miracle and quadruple my PC's specs. Have faith and run it.
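To see what the pattern accepts, here is a small sketch that runs it against a few sample lines. The Japanese date line is an assumed example of what an untranslated OCR line looks like (year, month, day, hour, and minute add up to 12 digits); the other two lines are taken from the OCR output above.

```python
import re

# Exactly 12 digits on one line, with any non-digit characters in between.
DATE_LINE = re.compile(r'^([^0-9\n]*\d){12}[^0-9\n]*$', re.MULTILINE)

# Assumed sample: an untranslated date line (4 + 2 + 2 + 2 + 2 = 12 digits).
# The other two lines come from the OCR output shown earlier.
sample = "\n".join([
    "日時:2020年01月23日(木) 15:16 ---",
    "[ADDR]192. 168.7. 21",        # only 9 digits
    "[DATE]2020/01/23 15:16: 42",  # 14 digits because of the seconds
])

matches = [m.group() for m in DATE_LINE.finditer(sample)]
print(matches)
```

Only the first line should be picked up: the [ADDR] line has too few digits and the [DATE] line has too many.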

OCR.py


from PIL import Image
import sys
import pyocr
import pyocr.builders
from pathlib import Path
import re
count = 0
tool = pyocr.get_available_tools()[0]
folders = list(Path("image").glob("*"))  # Get the path of every folder under image
categories = ["Agree", "Opposition", "business person", "Recommendation"]
for name in categories:  # Initialize (empty) the text files once
    open(name + ".txt", "w").close()
for fol in folders:
    # Pick the output file from the start of the folder name, e.g. "Agree 0123" -> "Agree.txt".
    # Assumes every folder name begins with one of the category names above.
    category = next(c for c in categories if fol.name.startswith(c))
    with open(category + ".txt", "a") as fil:
        for path in (Path(fol).glob("*")):
            count += 1
            text = tool.image_to_string(
                Image.open(path),
                lang="jpn",
                builder=pyocr.builders.TextBuilder(tesseract_layout=6)
            )
            match = re.search(r'^([^0-9\n]*\d){12}[^0-9\n]*$', text, re.MULTILINE)
            if match is not None:  # For documents that span several pages, a page may contain no date at all.
                match = match.group()
                fil.write(match + "\n")
            print(count) #For checking progress

By the way, no miracle happened for me: the execution time was far too long. Maybe there is a way to finish this a little faster.
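One untested idea for shortening the run is to process several pages at the same time. The sketch below is only an assumption about how that could look: `extract_date_lines` and its `ocr` parameter are hypothetical names, and `ocr` stands for any function that takes an image path and returns the recognized text (for example a thin wrapper around the `tool.image_to_string` call used above). Whether threads actually help depends on the OCR backend, so treat any speedup as something to measure.

```python
from concurrent.futures import ThreadPoolExecutor
import re

DATE_LINE = re.compile(r'^([^0-9\n]*\d){12}[^0-9\n]*$', re.MULTILINE)

def extract_date_lines(paths, ocr, workers=4):
    """Run `ocr` on every path in parallel and return the matched date lines.

    `ocr` is a hypothetical callable: it takes a path and returns the text
    recognized on that page.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = pool.map(ocr, paths)  # results come back in input order
        found = (DATE_LINE.search(text) for text in texts)
        return [m.group() for m in found if m is not None]

# Stand-in for the real OCR call so the sketch runs without Tesseract.
fake_pages = {
    "1.png": "日時:2020年01月23日(木) 11:39",
    "2.png": "no date on this page",
}
date_lines = extract_date_lines(fake_pages, fake_pages.get)
print(date_lines)
```

Passing the OCR function in as an argument keeps the parallel part separate from PyOCR, so it can be tried out with fake pages first.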

Acquisition result

As a result of running this program, for example, the contents of "Agree.txt" look like this.

Agree.txt



Date and time:January 23, 2020(wood) 11:39 ー ー
Date and time:January 23, 2020(wood) 11:49 ー ー
-Time:January 23, 2020(Book) 11:50                              .
Date and time:January 23, 2020(wood) 11:55 ---
Date and time:January 23, 2020(wood) 13:49
Date and time:January 23, 2020(Book) 15:16 ---.
.Date and time:January 23, 2020(wood) 15:31
Date and time:January 23, 2020(wood) 15:51   .---
Date and time:January 23, 2020(wood) 15:58                            .
Date and time:January 23, 2020(wood) 17:55    .                 ----
Date and time:January 23, 2020(wood) 20:23       .
Date and time:January 23, 2020(wood) 12:22
Date and time:January 23, 2020(wood) 20:31      -"・
Date and time:January 23, 2020(wood) 13:10 ---.
Date and time:January 23, 2020(wood) 16:27                            ]      」
Date and time:January 23, 2020(wood) 17:03
Date and time:January 23, 2020(wood) 18:09             ]---
Date and time:January 23, 2020(wood) 21:41
22812 050 Return presentation IO008 "1-
Date and time:January 24, 2020(Money) 08:49 ー ー
.Date and time:January 24, 2020(Money) 12:40                .
Date and time:January 24, 2020(Money) 13:28
Date and time:January 24, 2020(Money) 13:31
Date and time:January 24, 2020(Money) 13:34                    -
Date and time:January 24, 2020(Money) 13:35
.Date and time:January 24, 2020(Money) 14:01    ]-
.Date and time:January 24, 2020(Money) 15:08 ー ー.
.. Date and time: "January 24, 2020(Money) 08:49  .---
Date and time:January 24, 2020(Money) 15:33 ー ー
Date and time:January 24, 2020(Money) 15:34
Date and time:January 24, 2020(Money) 15:37 ・
Date and time:January 24, 2020(Money) 15:44 ・
Date and time:January 24, 2020(Money) 16:03            」      -       -
Date and time:January 24, 2020(Money) 16:13 ー ー
-Date and time:January 24, 2020(Money) 16:14
Date and time:January 24, 2020(Money) 16:16     -"-
Date and time:January 24, 2020(Money) 16:39    -
-.At the time of:January 24, 2020(Money) 08:50 ー ー
Date and time:January 24, 2020(Money) 16:47      -
(The following is omitted)

A few "non-dates" are mixed in, but on the whole it seems to have worked. There were only a handful of "non-dates" in the entire output, so I removed them by hand, which took no time at all.

Normalization

Left as it is, the data is full of noise, so normalize it. The simple approach is to reduce every line to "all the digits of the date joined together". Each line should then be exactly 12 characters long, so this should be enough to normalize the data.

Normalization.py


import re

for name in ["Agree","Opposition","business person","Recommendation"]:
    with open(name + ".txt") as fil:
        contents = fil.read()
    match = re.findall(r'([0-9]|\n)', contents, re.MULTILINE)
    with open(name + "_Normalization.txt","w") as fil:
        fil.write("".join(match))

Agree_Normalization.txt



202001231139
202001231149
202001231150
202001231155
202001231349
202001231516
202001231531
202001231551
202001231558
202001231755
202001232023
202001231222
202001232031
202001231310
(The following is omitted)

Looks good.
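The "non-dates" I removed by hand could probably also be filtered automatically after normalization. Here is a minimal sketch, assuming every valid line is a 12-digit YYYYMMDDHHMM string that falls inside the public comment period (1/23 to 2/6, with the end date assumed to be inclusive). `parse_stamp` is a hypothetical helper, not part of the scripts above.

```python
from datetime import datetime

# Public comment period: 1/23 to 2/6 (year 2020 in the data).
# Treating the end date as inclusive is an assumption.
PERIOD_START = datetime(2020, 1, 23, 0, 0)
PERIOD_END = datetime(2020, 2, 6, 23, 59)

def parse_stamp(stamp):
    """Return a datetime for a 12-digit YYYYMMDDHHMM string, or None if invalid."""
    if len(stamp) != 12 or not stamp.isdigit():
        return None
    try:
        parsed = datetime(int(stamp[0:4]), int(stamp[4:6]), int(stamp[6:8]),
                          int(stamp[8:10]), int(stamp[10:12]))
    except ValueError:  # e.g. a "month" of 20 or a "minute" of 81
        return None
    return parsed if PERIOD_START <= parsed <= PERIOD_END else None

# "228120500081" is what the stray 12-digit line in Agree.txt normalizes to.
stamps = ["202001231139", "228120500081", "202001240849"]
valid = [s for s in stamps if parse_stamp(s) is not None]
print(valid)
```

The stray line is rejected because its digits do not form a real date, while the two genuine timestamps pass.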

Draw a scatter plot

Finally, draw a scatter plot. The submission period for the public comments was **1/23 to 2/6** (isn't that short?), so for now let's plot the **distribution of comments in favor** over this period. I followed the best answer to this question on teratail.

Graph generation.py


import matplotlib.pyplot as plt
from matplotlib import dates as mdates
from datetime import datetime as dt
date = []
time = []
x = []
y = []
with open("Agree_Normalization.txt", "r") as fil:
    for line in fil:
        date.append(line[4:10])
        time.append(line[10:12])
for d in date:
    y.append(dt.strptime(d, "%m%d%H"))
for d in time:
    x.append(dt.strptime(d, "%M"))
ax = plt.subplot()
ax.scatter(x, y, alpha=0.1,c='red',s=40)
ax.set_xlim([dt.strptime('00', '%M'),
             dt.strptime('59', '%M')])
ax.set_ylim([dt.strptime('01/23', '%m/%d'), dt.strptime('02/06', '%m/%d')])
plt.xticks(rotation=90)
plt.savefig("Graph.png")

Here is the output graph [^1] (Graph.png). **Obviously something is going on.** As explained in the footnote, the vertical axis is marked in "month, day, and hour" and the horizontal axis in "minutes". The two clearly dark lines are probably public comments posted in such quick succession that they look continuous even at one-minute resolution. Well, that's interesting.
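The impression of comments arriving minute after minute could also be checked numerically. The following is a minimal sketch, not something I ran on the full data: `find_bursts` is a hypothetical helper that groups normalized timestamps into runs where each post follows the previous one within one minute. The sample values are a few of the January 24 timestamps from Agree.txt above.

```python
from datetime import datetime, timedelta

def find_bursts(stamps, max_gap=timedelta(minutes=1), min_size=2):
    """Group YYYYMMDDHHMM strings into runs whose neighbours are at most max_gap apart."""
    times = sorted(datetime.strptime(s, "%Y%m%d%H%M") for s in stamps)
    bursts, current = [], []
    for t in times:
        if current and t - current[-1] > max_gap:
            if len(current) >= min_size:
                bursts.append(current)
            current = []  # the gap was too large, so start a new run
        current.append(t)
    if len(current) >= min_size:
        bursts.append(current)
    return bursts

sample = ["202001241328", "202001241331", "202001241334", "202001241335",
          "202001241533", "202001241534", "202001241537"]
bursts = find_bursts(sample)
for burst in bursts:
    print(burst[0].strftime("%m/%d %H:%M"), "-", burst[-1].strftime("%H:%M"),
          f"({len(burst)} posts)")
```

On the real data, long runs found this way should correspond to the dark lines in the graph, assuming those lines really are posts in consecutive minutes.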

Finally

It was a lot of fun. I'm stopping here because I'm sleepy today, but the public comments are still available online, so I think you should play with them if you have time.

Various things that I referred to

[^1]: I was too sleepy to set the axis labels, so to explain: the x-axis represents "minutes" (0-59), and the y-axis represents "month and day" in 1-hour increments (roughly 1/23 00:00 to 2/6 23:00).
