[PYTHON] Extracting data from a PDF file by image analysis


Open data is a hot topic right now, but it is not always published as raw data such as CSV, so it can be subtly difficult to handle. Of course, publishing previously unavailable data at all deserves praise, and "I really want raw data" is a hope rather than a complaint. Publishing as PDF is therefore not a bad thing in itself, but to use the data you need to convert it into something more machine-readable than PDF. In this article I introduce the procedure I used to create CSV data from the PDF "About the congestion situation in the car during the morning rush hour" released by the Sapporo City Transportation Bureau.


The PDF file obtained from the above website looks like this: [Image: Screenshot 2020-03-18 21.38.35.png]

At first glance the data is structured, as if an Excel sheet had been exported as a PDF. The obvious route is therefore to parse the table structure directly, but that looked like it would take time and effort; for example, the Japanese text stored in the table came out garbled during parsing. So I devised the procedure shown in the image below.

[Image: Screenshot 2020-03-18 21.38.35.JPG]

Using image-editing software was too much trouble, so I drew it by hand on my iPad. In short, after looking at several of the PDFs, I found that the upper-left cell is always at the same position in both the left and right tables, and that all cells are the same size. So if you extract the RGB value (= color) at each red dot in the image and convert it to a congestion level, you get the desired data.


I implemented it here: https://github.com/Kanahiro/sapporo_subway_analyze/

The output CSV looks like this: [Image: Screenshot 2020-03-18 22.07.30.png]

You can see that red cells become 4, white cells 0, and blue cells 2. Now let's follow the steps: read the PDF → convert it to an image → get the color of the specified pixels → derive each cell's congestion level from the color → write it out as a CSV file.

Convert from PDF to image

Reference: Python PDF processing summary (combination / division, image conversion, password cancellation)

Use pdf2image; see the article above for how to use it. That article says to run pip install poppler, but poppler currently cannot be installed with pip. On Linux, install it as follows (other operating systems are omitted here):

sudo apt install poppler-utils
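
With poppler installed, the conversion step can be sketched as below. This is a minimal sketch, not the repository's code: the helper name `load_first_page`, the `converter` parameter (there only so the function can be exercised without a real PDF), and the file name in the usage comment are assumptions. `convert_from_path` and its `dpi`, `first_page`, and `last_page` arguments are part of pdf2image. Fixing `dpi` explicitly matters for this approach, because the pixel coordinates of the cells depend on the resolution at which the page is rendered.

```python
def load_first_page(pdf_path, dpi=200, converter=None):
    """Render the first page of a PDF and return it as an image object.

    converter is a hypothetical hook for testing; by default the real
    pdf2image.convert_from_path is used (requires poppler).
    """
    if converter is None:
        # Imported lazily so this code can be loaded without pdf2image.
        from pdf2image import convert_from_path
        converter = convert_from_path

    # Render only page 1 at a fixed resolution so pixel positions are stable.
    pages = converter(pdf_path, dpi=dpi, first_page=1, last_page=1)
    if not pages:
        raise ValueError(f"no pages rendered from {pdf_path!r}")
    return pages[0]


# Usage (hypothetical file name):
# image = load_first_page("congestion.pdf")
# print(image.size)  # (width, height) in pixels
```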

Get the RGB value of the specified pixel from the image

The data read by pdf2image can be converted to a NumPy array. The result is a two-dimensional array matching the pixel layout in which each element is an RGB triple, so overall it is a three-dimensional array.

import numpy as np
from pdf2image import convert_from_path

# convert_from_path is a pdf2image function that returns a list of page images
pdf_images = convert_from_path(pdffile)
img_array = np.asarray(pdf_images[0])

Sample of img_array:

[[[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]



 [[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]
  [255 255 255]]]

Since the edges of the PDF page are white, it is no surprise that the values shown are all [255 255 255], that is, white. The image is now stored in an array, pixel by pixel.

Accessing a specific pixel (data_of_pixel) looks like this:

# c and r are the column and row indices of the cell
x = START_CELL[0] + c * CELL_SIZE[0]  # x coordinate
y = START_CELL[1] + r * CELL_SIZE[1]  # y coordinate
data_of_pixel = img_array[y][x]  # row (y) first, then column (x)

Now you can get the RGB value at any specific position in the converted PDF image.
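
The coordinate arithmetic can be checked on a toy grid. The sketch below uses made-up values for `START_CELL` and `CELL_SIZE` (the real ones depend on the PDF and the rendering resolution and are not given in this article) and a nested list standing in for `img_array`; the helper name `cell_pixel` is also hypothetical. Note that the row index y comes first when indexing, then the column index x.

```python
# Hypothetical values for illustration only.
START_CELL = (2, 1)  # (x, y) of the sampling point in the upper-left cell
CELL_SIZE = (3, 2)   # (width, height) of one cell in pixels


def cell_pixel(r, c, start_cell=START_CELL, cell_size=CELL_SIZE):
    """Return the (x, y) pixel to sample for the cell at row r, column c."""
    x = start_cell[0] + c * cell_size[0]
    y = start_cell[1] + r * cell_size[1]
    return x, y


# A tiny image of 6 rows by 9 columns: white except one red sampling point.
WHITE, RED = [255, 255, 255], [237, 32, 36]
img = [[WHITE for _ in range(9)] for _ in range(6)]
x, y = cell_pixel(r=1, c=2)  # x = 2 + 2*3 = 8, y = 1 + 1*2 = 3
img[y][x] = RED

print(cell_pixel(0, 0))  # (2, 1)
print((x, y))            # (8, 3)
print(img[y][x])         # [237, 32, 36]
```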

Judging the congestion level

From the image above, you can see that congestion increases in the order white, light blue, blue, yellow, red. However, the legend at the upper right and the data area were rendered in the PDF with slightly different RGB values. Fortunately, the colors of the different levels are far apart, so I classify each cell by the size of the difference between its RGB value and each legend color.

# RGB values in the congestion legend (list index = congestion level 0 to 4)
CROWD_RGBs = [
    [255, 255, 255],
    [112, 200, 241],
    [57, 83, 164],
    [246, 235, 20],
    [237, 32, 36],
]

def rgb_to_type(rgb_list) -> int:
    # Color difference threshold
    threshold = 50
    color_array = np.asarray(rgb_list)
    for i in range(len(CROWD_RGBs)):
        crowd_rgb_array = np.asarray(CROWD_RGBs[i])
        # Sum of the absolute per-channel differences
        color_dist = abs(color_array - crowd_rgb_array)
        sum_dist = color_dist.sum()
        if sum_dist < threshold:
            return i  # congestion level, 0 to 4
    # Falls through (returns None) if no legend color is close enough

Passing a list of RGB values to this function returns the congestion level as an integer from 0 to 4. It compares the RGB value of the pixel obtained earlier with each RGB value in the congestion legend. The color difference is defined as the sum of the absolute differences of the R, G, and B values, and the first legend color whose difference is less than 50 determines the congestion level.
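
To make the distance calculation concrete, here is a sketch of the same idea in plain Python, without NumPy. The names `classify_rgb`, `LEGEND_RGBS`, and `THRESHOLD` and the sample pixel values are assumptions for illustration; the legend colors and the threshold of 50 are the ones used above. Unlike the function above, it returns None explicitly when no legend color is close enough, which makes unexpected colors easier to detect.

```python
LEGEND_RGBS = [
    (255, 255, 255),  # 0: white
    (112, 200, 241),  # 1: light blue
    (57, 83, 164),    # 2: blue
    (246, 235, 20),   # 3: yellow
    (237, 32, 36),    # 4: red
]
THRESHOLD = 50


def classify_rgb(rgb, legend=LEGEND_RGBS, threshold=THRESHOLD):
    """Return the index of the first legend color within threshold, else None."""
    for level, ref in enumerate(legend):
        # Sum of absolute per-channel differences (L1 / Manhattan distance).
        # int() guards against unsigned wraparound if rgb holds uint8 values.
        dist = sum(abs(int(a) - int(b)) for a, b in zip(rgb, ref))
        if dist < threshold:
            return level
    return None


print(classify_rgb((240, 30, 40)))    # 4 (distance to red: 3 + 2 + 4 = 9)
print(classify_rgb((110, 190, 235)))  # 1 (distance to light blue: 2 + 10 + 6 = 18)
print(classify_rgb((128, 128, 128)))  # None (no legend color within 50)
```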

Applying this to every cell and writing the results out as CSV produces the CSV data shown at the beginning.
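
The final step, collecting the level of every cell and writing a CSV, can be sketched with the standard csv module. The grid dimensions, the function names, and the `level_at` callback (which stands in for "sample the pixel of cell (r, c) and classify it") are assumptions for illustration; the repository linked above may organize this differently.

```python
import csv
import io


def build_grid(level_at, n_rows, n_cols):
    """Collect level_at(r, c) for every cell into a list of rows."""
    return [[level_at(r, c) for c in range(n_cols)] for r in range(n_rows)]


def write_csv(grid, fileobj):
    """Write one CSV line per grid row."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerows(grid)


# Stand-in for the real pixel lookup: a fixed 2x3 table of levels.
SAMPLE = [[4, 0, 2], [0, 1, 3]]
grid = build_grid(lambda r, c: SAMPLE[r][c], n_rows=2, n_cols=3)

buffer = io.StringIO()
write_csv(grid, buffer)
print(buffer.getvalue())
# 4,0,2
# 0,1,3
```

For a real file, open it with open(path, "w", newline="") and pass the file object instead of the StringIO buffer.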

In closing

This kind of processing is a familiar story in the open data world. Some people describe it as "making rice from rice cakes", so perhaps I can call myself a rice maker? Still, no rice can be made without the primary data (the rice cake), so I am simply grateful that it is published. Thank you, as always. Now that the amount of data has grown, I wonder whether the next stage will be about data quality ...
