Extract the table of image files with OneDrive & Python

I want to extract the table from the image

Sometimes you want to extract a **table in an image file** as table data.

For example, when a paper book or document has been scanned and digitized as an image file or PDF file.

PDF_original.png

The table in this file has not been OCR processed and is **just an image**, so its contents are not recognized as characters, let alone as a table.

So, of course, it cannot be handled as table data as it is. Is there no choice but to give up and type in the data by hand? No, **don't give up!**

How to extract table data from an image

In fact, even from such images (jpg, png, pdf, etc.), the table can be extracted as data with the following steps.

  Preparation. Register for an account on Microsoft OneDrive (free)
  0. Convert image files (jpg, png, etc.) to PDF files (this step is not necessary if the file is already a PDF)
  1. Save the PDF file to OneDrive, convert it to Word and apply OCR processing
  2. Save OCR processed Word as PDF
  3. Extract the table in PDF with Python

Word is used along the way, but since this is the free Office Online, it is fine if Microsoft Word is not installed on your PC.

This time, I will explain using the following PDF file. PDF_original.png

If you want to extract a table from an image file (jpg, png, etc.), first convert it to a PDF file. There are free web services that convert image files to PDF, but the simplest way is [Right-click the image file -> Print -> select "Microsoft Print to PDF" as the printer and print].
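If you would rather do this conversion in Python too, the following is a minimal sketch. It assumes the third-party Pillow library is installed (`pip install Pillow`); Pillow and the helper name `image_to_pdf` are my own choices for illustration and are not part of the procedure above.

```python
from pathlib import Path

from PIL import Image  # third-party: Pillow (an assumption, not used in the article)


def image_to_pdf(image_path, pdf_path=None):
    """Save one image (jpg, png, ...) as a one-page PDF and return the PDF path.

    If pdf_path is omitted, the PDF is written next to the image with a .pdf suffix.
    """
    image_path = Path(image_path)
    pdf_path = Path(pdf_path) if pdf_path else image_path.with_suffix(".pdf")
    with Image.open(image_path) as img:
        # Convert to RGB first: images with an alpha channel (e.g. some PNGs)
        # may be rejected by Pillow's PDF writer.
        img.convert("RGB").save(pdf_path, "PDF")
    return pdf_path
```

For example, `image_to_pdf("scan.png")` (a hypothetical file name) would write `scan.pdf` in the same folder. The resulting PDF still contains only an image, so the OCR steps that follow are still required.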

Advance preparation. Register an account on OneDrive

Register an account on Microsoft OneDrive. It is free.

[Get a Microsoft account](https://www.microsoft.com/ja-jp/office/homeuse/onedrive-guide.aspx)

1. Save the PDF file to OneDrive

Upload the target PDF file to OneDrive. onedrive_upload.png

Right-click on the file and select Open. onedrive_open.png

At this point, if you try to select around the table, you can select the characters as text. The table structure is also recognized.

Press the "Edit with Desktop App" button. Then you will be asked if you want to convert the file, so press the "Convert" button.

onedrive_edit.png

Then the conversion will take place. When the conversion is complete, a confirmation screen will appear, so press "Edit". onedrive_edit_comp.png

This will open Word in your browser. It is properly converted as table data. onedrive_word.png

Some characters may not be recognized correctly, so fix by hand whatever you can at this point. In this example, "Copy" came out as "Coby", but the conversion is almost entirely correct. The recognition accuracy is quite good!
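Misrecognitions that slip past the manual check can also be flagged later in Python. The following is a minimal sketch using the standard library's `difflib`; the helper name `suggest_corrections` and the list of expected words are hypothetical, since the full contents of the table are not given here.

```python
import difflib


def suggest_corrections(words, vocabulary, cutoff=0.7):
    """Map each word that is not in the vocabulary to its closest vocabulary entry.

    Words with no sufficiently similar entry are left out of the result.
    """
    suggestions = {}
    for word in words:
        if word in vocabulary:
            continue  # already a known word, nothing to fix
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        if matches:
            suggestions[word] = matches[0]
    return suggestions


# Hypothetical vocabulary of words expected to appear in the table
expected_words = ["Copy", "Paste", "Cut", "Undo"]
# Words as they came out of OCR
ocr_words = ["Coby", "Paste", "Cut", "Undo"]

print(suggest_corrections(ocr_words, expected_words))  # {'Coby': 'Copy'}
```

This only suggests candidates; whether to apply a suggestion is best decided by eye, because a similar-looking word is not always the right one.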

2. Save OCR processed Word as PDF

PDF files are easier to handle in Python than Word files, so convert the Word file to PDF and download it.

Select "File" in the upper left and select Save As → Download as PDF. onedrive_word_download_as_pdf.png

3. Extract the table in PDF with Python

Let's open the downloaded PDF file. Unlike the original PDF, the table is now properly recognized as a table. The font sizes are inconsistent and it does not look good, but you don't have to worry about that because the table will be extracted as a pandas DataFrame.

PDF_ocr.png

Now that you have come this far, the rest is simple: the table can be extracted with Python using the method introduced in the article "Extract the table in PDF with Python".

python


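import pandas as pd
import tabula
# display() is built in only in Jupyter; import it so the script also runs elsewhere
from IPython.display import display

# lattice=True: detect cells from the ruled lines of the table
dfs = tabula.read_pdf("PDF_ocr.pdf", lattice=True, pages="1")
# read_pdf returns a list of DataFrames, one per table found
for df in dfs:
    display(df)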

Execution result PDF_ocr_df.png
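Cells and headers extracted from a PDF sometimes contain stray line breaks or surrounding spaces. The following is a minimal sketch of tidying an extracted DataFrame before saving it as CSV. The helper name `tidy_table` and the sample data are hypothetical, standing in for one of the DataFrames in `dfs`.

```python
import pandas as pd


def tidy_table(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with whitespace normalized in headers and text cells."""

    def clean(value):
        # Collapse line breaks and repeated spaces into single spaces;
        # non-string values (numbers, NaN) are returned unchanged.
        if isinstance(value, str):
            return " ".join(value.split())
        return value

    out = df.apply(lambda column: column.map(clean))
    out.columns = [clean(name) for name in out.columns]
    # Drop rows and columns that are completely empty
    return out.dropna(how="all").dropna(axis=1, how="all")


# Hypothetical sample standing in for a DataFrame returned by tabula.read_pdf
sample = pd.DataFrame(
    {"Item\rname": ["Copy ", " Paste"], "Short\rcut": ["Ctrl+C", "Ctrl+V"]}
)
tidy = tidy_table(sample)
csv_text = tidy.to_csv(index=False)  # pass a file path instead to write a CSV file
print(csv_text)
```

In practice you would call `tidy_table(df)` for each `df` in `dfs` and pass a file name to `to_csv` to save the result.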
