[PYTHON] Convert Mobile Suica usage history PDF to pandas Data Frame format with tabula-py

"Creating a service that edits the usage history of Mobile Suica so that it can be easily used for expense settlement" Last time, I created an execution environment with Docker Click here for the finished product https://www.mobilesuica.work

Operating environment

Since it is troublesome to prepare the environment on Windows, I set up and run a container that runs Python with Docker

tabula-py tabula-py First of all, I tried this easily

test.py


import tabula
import pandas as pd
df = tabula.read_pdf('a.pdf')
print(df.head())

Unfortunately it was an error

AttributeError: 'list' object has no attribute 'head'

Apparently it's returned in an array If so, should I do this?

test.py


df = tabula.read_pdf('a.pdf')
print(df[0].head())

It was done (the result station name has changed)

Date Type Use Station Type.1 Station used.1 Balance difference
0 2 6 Repeat NaN NaN NaN\9,753   NaN
2 2 7 Enter Tokyo Exit Yokohama\9,573  -180
3 2 7 Enter Yokohama Exit Tokyo\9,393  -180

Multiple pages at once

By default, only the first page is used, so add pages ='all' and combine the two pages. The maximum usage history of Mobile Suica is 100, and if you include the deferral, there are 101, and if you convert it to PDF, it will be up to 2 pages.

test.py


df = tabula.read_pdf('a.pdf',pages='all')
for i in range(len(df)):
    print(f"{i+1}On the page{len(df[i])}There is a line")

Execution result

(app-root) bash-4.2# python3 test.py
There are 51 lines on one page
There are 50 lines on 2 pages

You can see that the first page has one more line for "deferred". Since it is difficult to handle if they are separated, we will use one DataFrmae. Use concat.

test.py


df = tabula.read_pdf('a.pdf',pages='all')
d = pd.concat(df,ignore_index=True)
print(f"{len(d)}There is a line")

Execution result

(app-root) bash-4.2# python3 test.py
There are 101 lines

** ignore_index = True **, but if you do not include this, the index will not be a serial number properly. Let's compare.

ignore_index = True None Index returns to 0 at the boundary between the 1st and 2nd pages

50 3 4 Enter Tokyo Depart Yokohama\9,443    -180
0 3 5 Cash NaN NaN NaN\9,543     100

ignore_index = True Yes It is a serial number as it is

50 3 4 Enter Tokyo Depart Yokohama\9,443    -180
51 3 5 Cash NaN NaN NaN\9,543     100

If you go this far, you can do the same with multiple files. There is also convert_into_by_batch in tabula-py, but it's a bit difficult to use because it's like specifying a directory and collecting PDF files under it. (Since I had only one usage history, I read the same file many times)

test.py


fileList = ['a.pdf','a.pdf','a.pdf']
dfList = []
for fileName in fileList:
    df = tabula.read_pdf(fileName,pages='all')
    for i in range(len(df)):
        dfList.append(df[i])
d = pd.concat(dfList,ignore_index=True)
print(f"{len(d)}There is a line")

Execution result

(app-root) bash-4.2# python3 test.py
There are 303 lines

So far, I noticed that there is a slight difference between the first page and the second page.

Charge part on the first page
24 2 21 Cash NaN NaN NaN\9,883    +100
Charge part on the second page
78 3 21 Cash NaN NaN NaN\10,317     100

There is no "+" on the second page. I can do it without any problem in terms of data processing, but it feels bad. When I looked it up, it was the difference between whether it was recognized as a number (second page) or as a character (first page).

14 2 15 Bus etc. OKK NaN NaN\9,103  -2,000

If it exceeds 3 digits, Thousand Separator (,) will be included, so it seems that all the pages are recognized as characters. Page 2 was recognized as a number because it wasn't there (all less than 3 digits). It is more convenient to take it as it is as a character, so when I looked it up with pandas, I found that ** dtype ='object' ** should be used. tabula-py is a great way to pass pandas options as is. The final result is as follows.

test.py


fileList = ['a.pdf','a.pdf','a.pdf']
dfList = []
for fileName in fileList:
    df = tabula.read_pdf(fileName,pages='all',pandas_options={'dtype':'object'})
    for i in range(len(df)):
        dfList.append(df[i])
d = pd.concat(dfList,ignore_index=True)

It's amazing that you can drop a PDF table into a DataFrame with just this.

Recommended Posts

Convert Mobile Suica usage history PDF to pandas Data Frame format with tabula-py
Convert 202003 to 2020-03 with pandas
How to convert horizontally held data to vertically held data with pandas
Convert PDF to image with ImageMagick
Convert xml format data to txt format data (yolov3)
Convert from PDF to CSV with pdfplumber
Convert Excel data to JSON with python
Use pandas to convert grid data to row-holding (?) Data
Convert FX 1-minute data to 5-minute data with Python
Convert PDF attached to email to text format
Try converting to tidy data with pandas
Convert PDF files to PNG files with GIMP
Try to aggregate doujin music data with pandas
Convert data with shape (number of data, 1) to (number of data,) with numpy.
Convert PDF to image (JPEG / PNG) with Python
Convert json format data to txt (using yolo)
Make holiday data into a data frame with pandas
Convert strings to character-by-character list format with python
Save pandas data in Excel format to data assets with Cloud Pak for Data (Watson Studio)
I tried scraping food recall information with Python to create a pandas data frame
I want to give a group_id to a pandas data frame
Convert the image in .zip to PDF with Python
How to extract non-missing value nan data with pandas
How to convert JSON file to CSV file with Python Pandas
Try to automate pdf format report creation with Python
Convert numeric variables to categorical with thresholds in pandas
How to extract non-missing value nan data with pandas
Data visualization with pandas
Data manipulation with Pandas!
Shuffle data with pandas
How to output a document in pdf format with Sphinx
Ingenuity to handle data with Pandas in a memory-saving manner
Convert garbled scanned images to PDF with Pillow and PyPDF
Convert GRIB2 format weather data that cannot be opened with pygrib to netCDF and visualize it