[PYTHON] A story about trying to automate book self-scanning a little

Overview

The program monitors a specified directory, and when you put a PDF file there, it automatically renames the file to the title of the book. Repository: book_maker

Tested OS

macOS Catalina

Requirements

  $ brew install poppler
  $ brew install tesseract
  $ brew install tesseract-lang

How to use

$ python3 src/watch.py input_path [output_path] [*extensions]

Motivation

I rushed to buy a paper cutter and a scanner because I wanted to digitize the large number of books at my parents' house. However, I had often heard that scanning your own books is tedious, so I wrote this program to automate at least part of the work.

Workflow

I assembled it in the following flow.

  1. Specify the directory to be monitored and start src/watch.py
  2. Place the PDF in the monitored directory
  3. Detect the event and get the ISBN code from the contents of the PDF file
     - How to get the ISBN code
       - Get from the barcode using a shell script
       - Get from the barcode in Python code
       - Get from the text in Python code
  4. Get book information from each API based on the ISBN
     - APIs used: Google Books APIs, openBD
  5. Correct the file name and move the PDF file to the output directory

Monitor a specific directory

I used a library called watchdog to monitor the directory continuously. The following documentation and articles were very helpful for the details of using watchdog. Thank you very much.

- watchdog official documentation
- I tried using Watchdog
- About python watchdog operation
- Command execution triggered by file update (python version)


To use watchdog, you need a Handler and an Observer. The Handler describes what to do for each event (create / delete / move / change). This time, only the on_created method, which handles the creation event, is defined. This on_created method overrides the method of the FileSystemEventHandler class in watchdog.events.

src/handler/handler.py


from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)

    def on_created(self, event):
        # Do something with the newly created file (event.src_path)
        pass

This defines a Handler class that inherits from PatternMatchingEventHandler, which supports pattern matching. With it, you can limit the types of files for which events are detected. There is also a RegexMatchingEventHandler that lets you use regular expression patterns. This time I wanted to process only PDFs, so I set patterns=['*.pdf']. I set ignore_directories=True to ignore directories, and case_sensitive=False so that both *.pdf and *.PDF are detected.
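
To make the effect of patterns and case_sensitive concrete, here is a small sketch that uses only the standard library. It is not watchdog's own implementation: the helper name matches_any is hypothetical, and it only approximates how a PatternMatchingEventHandler filters paths.

```python
import fnmatch
import os


def matches_any(path, patterns, case_sensitive=False):
    """Return True if the file name of `path` matches one of the glob patterns.

    Hypothetical helper: a rough stand-in for the filtering that
    PatternMatchingEventHandler performs, not watchdog's actual code.
    """
    name = os.path.basename(path)
    if not case_sensitive:
        # Compare in lower case so that *.pdf also matches BOOK.PDF
        name = name.lower()
        patterns = [pattern.lower() for pattern in patterns]
    # fnmatchcase never normalizes case itself, so the result is the same on every OS
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)


print(matches_any('/scans/book.PDF', ['*.pdf']))                       # True
print(matches_any('/scans/book.PDF', ['*.pdf'], case_sensitive=True))  # False
print(matches_any('/scans/cover.jpg', ['*.pdf']))                      # False
```

With case_sensitive=True, a scanner that writes BOOK.PDF would be silently skipped, which is why the handler turns case sensitivity off.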


Next, prepare the Observer, which monitors the directory and passes events to the Handler.

src/watch.py


import time
from watchdog.observers import Observer
from src.handler.handler import Handler


def watch(input_path, output_path, extensions):
    print([f'*.{extension}' for extension in extensions], flush=True)
    event_handler = Handler(input_path=input_path,
                            output_path=output_path,
                            patterns=[f'*.{extension}' for extension in extensions])
    observer = Observer()
    observer.schedule(event_handler, input_path, recursive=False)
    observer.start()
    print('--Start Observer--', flush=True)
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.unschedule_all()
        observer.stop()
        print('--End Observer--', flush=True)
    observer.join()

Pass the Handler object, the directory to monitor, and whether to monitor subdirectories recursively to the schedule() method of the created Observer object. Start monitoring with observer.start(), and keep the process alive with the while statement and time.sleep(1). When Ctrl + C is pressed, observer.unschedule_all() stops all monitoring and detaches the event handler, and observer.stop() tells the thread to stop. Finally, observer.join() waits for the thread to finish.
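
The article does not show how the command-line arguments (input_path [output_path] [*extensions]) reach watch(). One possible way to parse them is sketched below with argparse. The function name parse_args and the defaults (output_path falling back to input_path, extensions falling back to pdf) are assumptions made for illustration, not the author's code.

```python
import argparse


def parse_args(argv=None):
    """Parse `input_path [output_path] [*extensions]` (assumed behavior)."""
    parser = argparse.ArgumentParser(description='Watch a directory and rename PDFs.')
    parser.add_argument('input_path', help='directory to monitor')
    parser.add_argument('output_path', nargs='?', default=None,
                        help='destination directory (assumed to default to input_path)')
    parser.add_argument('extensions', nargs='*', default=['pdf'],
                        help='file extensions to watch, without the dot')
    args = parser.parse_args(argv)
    if args.output_path is None:
        # Assumption: with no output path, reuse the input path
        args.output_path = args.input_path
    return args


args = parse_args(['scans', 'books', 'pdf'])
print(args.input_path, args.output_path, args.extensions)
```

The parsed values could then be handed to watch(args.input_path, args.output_path, args.extensions).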

Get the ISBN code from the barcode using the shell

I referred to this blog. Thank you very much.

- I want to read the barcode image from the PDF file of a self-scanned book, get the ISBN, and link the title obtained from Amazon's API

First, try to get the ISBN code from the barcode. I used pdfinfo, pdfimages, and zbarimg to get the information from the PDF. pdfinfo gets the total number of pages in the PDF. pdfimages extracts the images on the first and last pages as JPEG, based on the page count obtained from pdfinfo. zbarimg reads the ISBN code from the JPEGs generated by pdfimages.

getISBN.sh


#!/bin/bash

# Number of pages to check in PDF
PAGE_COUNT=1
# File path
FILE_PATH="$1"

# If the file extension is not pdf
shopt -s nocasematch
if [[ ! $1 =~ .+(\.pdf)$ ]]; then
  exit 1
fi
shopt -u nocasematch

# Delete all .image* generated by pdfimages
rm -f .image*

# Get total count of PDF pages
pages=$(pdfinfo "$FILE_PATH" | grep -E "^Pages" | sed -E "s/^Pages: +//") &&
# Generate JPEG from PDF
pdfimages -j -l "$PAGE_COUNT" "$FILE_PATH" .image_h &&
pdfimages -j -f $((pages - PAGE_COUNT)) "$FILE_PATH" .image_t &&
# Grep ISBN
isbnTitle="$(zbarimg -q .image* | sort | uniq | grep -E '^EAN-13:978' | sed -E 's/^EAN-13://' | sed 's/-//g')" &&
# If the ISBN was found, echo the ISBN
[ "$isbnTitle" != "" ] &&
echo "$isbnTitle" && rm -f .image* && exit 0 ||
# Else, exit with error code
rm -f .image* && exit 1

Finally, when the ISBN code is found, the output of echo "$isbnTitle" is received as standard output on the Python side.

I also did not understand the meaning of && and || well at first, but an article about them was helpful: a && b runs b only when a succeeds, and a || b runs b only when a fails. Thank you very much.
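
For readers more comfortable with Python than with shell pipelines, the sort | uniq | grep | sed part of getISBN.sh can be sketched as a plain function. The function name is hypothetical, the sample text is made up for illustration, and the sketch assumes that zbarimg prints one TYPE:DATA pair per line.

```python
def isbn_from_zbarimg_output(output):
    """Pick the first ISBN-13 (978...) out of text shaped like zbarimg's output.

    Assumes one TYPE:DATA pair per line, e.g. EAN-13:9784873119328.
    """
    prefix = 'EAN-13:'
    # sorted(set(...)) plays the role of `sort | uniq`
    for line in sorted(set(output.splitlines())):
        line = line.strip()
        # Same filter as grep -E '^EAN-13:978'
        if line.startswith(prefix + '978'):
            # Same as the two sed calls: drop the prefix and any hyphens
            return line[len(prefix):].replace('-', '')
    return None  # the shell script exits with status 1 in this case


# Made-up sample: the same barcode read twice plus a second, non-ISBN barcode
sample = 'EAN-13:1920055028002\nEAN-13:9784873119328\nEAN-13:9784873119328\n'
print(isbn_from_zbarimg_output(sample))  # 9784873119328
```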

Use Python to get the ISBN code

Get from barcode

To get the ISBN from the barcode, I used pdf2image to convert the PDF to images and pyzbar to read the barcode.

With pdf2image, generate JPEG images of the last 2 pages, call decode() from pyzbar on those images, and return the string if it matches the regular expression pattern for an ISBN code (^978).

I used TemporaryDirectory() because I wanted the directory for the generated images to be temporary.

src/isbn_from_pdf.py


import re
import subprocess
import tempfile
from pyzbar.pyzbar import decode
from pdf2image import convert_from_path

# Number of pages to check, counted back from the last page (same value as in getISBN.sh)
PAGE_COUNT = 1


def get_isbn_from_barcode(input_path):
    # Excerpt wrapped in a function so that return is valid; the function name is illustrative
    # Ask pdfinfo for the total number of pages
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        # Render the last pages of the PDF as JPEG images
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        # Extract the ISBN from the barcodes found on those pages
        for page in last_pages:
            for data in decode(page):
                text = data[0].decode('utf-8', 'ignore')
                if re.match('978', text):
                    return text.replace('-', '')

Get from text

Another option is to extract the ISBN code from the last page of the book, which contains information such as the publisher and edition of the book.

I used pyocr to extract strings from the images. pyocr needs an OCR tool to work, so you need to install Google's tesseract.

src/isbn_from_pdf.py


import re
import sys
import pyocr
import tempfile
import subprocess
import pyocr.builders
from pdf2image import convert_from_path

# Number of pages to check, counted back from the last page (same value as in getISBN.sh)
PAGE_COUNT = 1


def get_isbn_from_text(input_path):
    # Excerpt wrapped in a function so that return is valid; the function name is illustrative
    texts = []
    # Ask pdfinfo for the total number of pages
    cmd = f'echo $(pdfinfo "{input_path}" | grep -E "^Pages" | sed -E "s/^Pages: +//")'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    total_page_count = int(result.stdout.strip())

    with tempfile.TemporaryDirectory() as temp_path:
        # Render the last pages of the PDF as JPEG images
        last_pages = convert_from_path(input_path,
                                       first_page=total_page_count - PAGE_COUNT,
                                       output_folder=temp_path,
                                       fmt='jpeg')
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print('[ERROR] No OCR tool found.', flush=True)
            sys.exit()

        # Convert each image to a string with OCR
        tool = tools[0]
        lang = 'jpn'
        for page in last_pages:
            text = tool.image_to_string(
                page,
                lang=lang,
                builder=pyocr.builders.TextBuilder(tesseract_layout=3)
            )
            texts.append(text)
        # Return the first ISBN found, with the hyphens removed
        for text in texts:
            if re.search(r'ISBN978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text):
                return re.findall(r'978-[0-4]-[0-9]{4}-[0-9]{4}-[0-9]', text).pop().replace('-', '')
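
The pattern above only matches one particular hyphenation (4-4-1 digits after the group number), and OCR can misread a digit. As a possible refinement, and not something the original program does, the sketch below accepts any hyphen or space placement and then verifies the ISBN-13 check digit. The names ISBN13_PATTERN, is_valid_isbn13, and find_isbn13 are hypothetical.

```python
import re

# 978/979 followed by ten more digits, each optionally preceded by a hyphen or a space
ISBN13_PATTERN = re.compile(r'97[89](?:[- ]?[0-9]){10}')


def is_valid_isbn13(digits):
    """Check the ISBN-13 check digit: weights 1 and 3 alternate, total % 10 == 0."""
    if not re.fullmatch(r'[0-9]{13}', digits):
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def find_isbn13(text):
    """Return the first checksum-valid ISBN-13 in text without separators, or None."""
    for match in ISBN13_PATTERN.finditer(text):
        digits = re.sub(r'[- ]', '', match.group())
        if is_valid_isbn13(digits):
            return digits
    return None


print(find_isbn13('ISBN978-4-87311-932-8 C3055'))  # 9784873119328
print(find_isbn13('ISBN978-4-87311-932-9 C3055'))  # None (check digit does not match)
```

Rejecting candidates with a wrong check digit lets a misread page fall through to the tmp folder instead of producing a lookup for the wrong book.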

Get book information from each API

To get the book information, I used Google Books APIs and openBD.

Both return JSON, but the structures differ. I wanted the code to be as common as possible, so I used a library called Box.

Box lets you access what you would normally get with dict.get('key') or dict['key'] as dict.key.another_key. You can still use dict['key'] as well.

Other features include converting camel case keys (camelCase) to snake case (snake_case), Python's naming convention, and accessing a key that contains a space, such as personal thoughts, as dict.personal_thoughts.

Below is the code to get the information from openBD.

src/bookinfo_from_isbn.py


import re
import json
import requests
from box import Box

OPENBD_API_URL = 'https://api.openbd.jp/v1/get?isbn={}'

HEADERS = {"content-type": "application/json"}

class BookInfo:
    def __init__(self, title, author):
        self.title = title
        self.author = author

    def __str__(self):
        return f'<{self.__class__.__name__}>{json.dumps(self.__dict__, indent=4, ensure_ascii=False)}'


def _format_title(title):
    # Replace full-width brackets and full-width spaces with half-width spaces
    title = re.sub('[（）\u3000]', ' ', title).rstrip()
    # Collapse runs of half-width spaces into one
    return re.sub(' +', ' ', title)


def _format_author(author):
    # Delete everything from the slash onward (half-width or full-width slash)
    return re.sub('[/／].+', '', author)


def book_info_from_openbd(isbn):
    res = requests.get(OPENBD_API_URL.format(isbn), headers=HEADERS)
    if res.status_code == 200:
        results = res.json()
        # Check for a missing entry before wrapping it in a Box
        # (assumption: openBD returns a null entry for an unknown ISBN)
        if results and results[0] is not None:
            openbd_res = Box(results[0], camel_killer_box=True, default_box=True, default_box_attr='')
            open_bd_summary = openbd_res.summary
            title = _format_title(open_bd_summary.title)
            author = _format_author(open_bd_summary.author)
            return BookInfo(title=title, author=author)
    else:
        print(f'[WARNING] openBD status code was {res.status_code}', flush=True)

The title and author of the retrieved book mix full-width and half-width characters, so I prepared a function to correct each (_format_title and _format_author). I have not actually cut and scanned any books yet, so these functions will probably need adjusting.

For Box, I set camel_killer_box=True, which converts camel case to snake case, and default_box=True with default_box_attr='' so that an empty string is returned even when a key has no value.

Correct the file name and move to the appropriate directory

First, at startup, create the folder that the PDF will be moved to after it is renamed.

src/handler/handler.py


import os
import datetime
from watchdog.events import PatternMatchingEventHandler

class Handler(PatternMatchingEventHandler):
    def __init__(self, input_path, output_path, patterns=None):
        if patterns is None:
            patterns = ['*.pdf']
        super(Handler, self).__init__(patterns=patterns,
                                      ignore_directories=True,
                                      case_sensitive=False)
        self.input_path = input_path
        # If the output_path is equal to input_path, then make a directory named with current time
        if input_path == output_path:
            self.output_path = os.path.join(self.input_path, datetime.datetime.now().strftime('%Y%m%d_%H%M%S'))
        else:
            self.output_path = output_path
        os.makedirs(self.output_path, exist_ok=True)

        # Create tmp directory inside of output directory
        self.tmp_path = os.path.join(self.output_path, 'tmp')
        os.makedirs(self.tmp_path, exist_ok=True)

When the process starts, it creates the destination folder: the specified output folder, or a folder named with the current date and time if the output path is the same as the input path. It then creates a tmp folder inside the output folder, where files are placed when an error occurs (the same PDF book already exists, the ISBN is not found, or the book information is missing).
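
The rename-and-move step itself is not shown in the article. A minimal sketch of how it might look, under the assumption that a file whose target name already exists is parked in tmp under its original name, is given below. The function name move_renamed and its parameters are hypothetical.

```python
import os
import shutil


def move_renamed(src_path, output_path, tmp_path, title):
    """Move src_path to output_path/<title>.pdf, or to tmp_path if that name is taken.

    Hypothetical helper; returns the directory the file ended up in.
    """
    # Path separators cannot appear in a file name, so replace them
    safe_title = title.replace('/', '_').replace(os.sep, '_')
    destination = os.path.join(output_path, f'{safe_title}.pdf')
    if os.path.exists(destination):
        # Same book already present: park the file in tmp without renaming it
        shutil.move(src_path, tmp_path)
        return tmp_path
    shutil.move(src_path, destination)
    return output_path
```

Checking for an existing destination first matters because shutil.move to an existing file path can overwrite it silently.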


src/handler/handler.py


    def __del__(self):
        # Delete the tmp directory, when the directory is empty
        tmp_files_len = len(os.listdir(self.tmp_path))
        if tmp_files_len == 0:
            os.rmdir(self.tmp_path)

        # Delete the output directory, when the directory is empty
        output_files_len = len(os.listdir(self.output_path))
        if output_files_len == 0:
            os.rmdir(self.output_path)

The __del__ method is written so that, when processing ends, the output folder and the tmp folder are kept if they contain files and deleted if they are empty.
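
Python does not guarantee that __del__ is called for objects that still exist when the interpreter exits, so the same cleanup could also be done by an explicit call, for example after observer.join(). The sketch below is one way to write it. The helper name remove_if_empty is hypothetical and is not part of the original code.

```python
import os


def remove_if_empty(path):
    """Remove the directory at path only when it has no entries.

    Returns True if the directory was removed, False otherwise.
    """
    try:
        os.rmdir(path)  # os.rmdir refuses to delete a non-empty directory
        return True
    except OSError:
        # Not empty, already gone, or not removable: leave it alone
        return False
```

Calling it on the tmp folder first and on the output folder second reproduces the order used in __del__, because the output folder can only become empty once tmp has been removed.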


src/handler/handler.py


import shutil
import subprocess
from src.isbn_from_pdf import get_isbn_from_pdf, NoSuchISBNException
from src.bookinfo_from_isbn import book_info_from_google, book_info_from_openbd, NoSuchBookInfoException

    def on_created(self, event):
        print('!Create Event!', flush=True)
        shell_path = os.path.join(os.path.dirname(__file__), '../../getISBN.sh')
        event_src_path = event.src_path
        # Pass the arguments as a list so that paths containing spaces are handled correctly
        result = subprocess.run([shell_path, event_src_path], capture_output=True, text=True)
        try:
            if result.returncode == 0:
                # Retrieve ISBN from shell
                isbn = result.stdout.strip()
                print(f'ISBN from Shell -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

            else:
                # Get ISBN from pdf barcode or text
                isbn = get_isbn_from_pdf(event_src_path)
                print(f'ISBN from Python -> {isbn}', flush=True)
                self._book_info_from_each_api(isbn, event_src_path)

        except (NoSuchISBNException, NoSuchBookInfoException) as e:
            print(e.args[0], flush=True)
            shutil.move(event_src_path, self.tmp_path)
            print(f'Move {os.path.basename(event_src_path)} to {self.tmp_path}', flush=True)

The on_created method implements the overall flow described in the workflow.

Running the shell script with subprocess.run() lets you receive its exit status from result.returncode and its standard output from result.stdout.

Also, a custom exception is raised when the ISBN code or the book information cannot be retrieved.
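
Neither the exception classes nor _book_info_from_each_api appear in the article, so the following is only a sketch of how the fallback between APIs and the custom exception could fit together. The exception definition, the function first_book_info, and the stub fetchers are all assumptions made for illustration. In the real program the fetchers would be book_info_from_openbd and book_info_from_google.

```python
class NoSuchBookInfoException(Exception):
    """Raised when no API returns usable book information (assumed definition)."""


def first_book_info(isbn, fetchers):
    """Call each fetcher in order and return the first result that is not None."""
    for fetch in fetchers:
        info = fetch(isbn)
        if info is not None:
            return info
    raise NoSuchBookInfoException(f'[WARNING] No book information found for ISBN {isbn}')


# Stub fetchers standing in for the real API functions
def fetch_nothing(isbn):
    return None


def fetch_stub(isbn):
    return {'title': 'Example Title', 'author': 'Example Author'}


print(first_book_info('9784873119328', [fetch_nothing, fetch_stub]))
```

Because the exception carries its message in e.args[0], the except block in on_created can print it and move the file to tmp without knowing which API failed.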

Summary

Thank you for reading this far. I struggled with where to run the command from and with variable and function names, but I managed to get it into a minimal form. At this stage only PDF is supported, but I would like to support epub as well. I also want it to work on Windows.

If you find any typos or mistakes, or know a better way to do something, please let me know. Thank you very much.
