[PYTHON] Process dotted lines as solid lines with camelot (Hough transform)


camelot handles dotted lines poorly and often fails to detect them as table borders, so I looked for a fix and found the reference article linked below.

camelot extracts table lines with OpenCV, so it seemed possible to redraw the dotted lines before detection. I detected the dotted lines with a Hough transform, overwrote them with solid lines, and it worked.
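To make the idea concrete, here is a minimal sketch of why bridging gaps turns a dotted line into a solid one. It is a simplified one-dimensional analogue of the gap tolerance that `maxLineGap` gives `cv2.HoughLinesP`, not OpenCV's actual algorithm. The function name `bridge_gaps` and the sample row are made up for illustration.

```python
def bridge_gaps(row, max_gap):
    """Fill background runs (0) of at most max_gap pixels that lie between ink pixels (1)."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, px in enumerate(row):
        if px:
            gap = i - last_ink - 1 if last_ink is not None else 0
            if 0 < gap <= max_gap:
                # The gap is short enough: treat both dashes as one line and fill it in
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


# One pixel row of a dotted line: dashes of 2 pixels separated by gaps of 2 pixels
dotted = [1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
solid = bridge_gaps(dotted, max_gap=2)
```

With a gap tolerance smaller than the spacing between dots, the row is left unchanged, which is roughly why a sufficiently large `maxLineGap` matters in the real patch.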


Using Python makes it easy to parse PDFs containing text... there was a time when I thought so too

[Process the dotted line as a solid line with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)

I will use the PDF with dotted lines from this article.


Hough transform

Line detection with OpenCV's Hough transform


Straight line extraction with Hough transform


PDF of list of member stores of Go To Eat in Chiba


Extract only horizontal straight lines by Hough transform



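import cv2
import numpy as np

import camelot

# Create the patch: a drop-in replacement for camelot's adaptive_threshold

def my_threshold(imagename, process_background=False, blocksize=15, c=-2):

    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect edges, then find long line segments; maxLineGap bridges the gaps between dots
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    lines = cv2.HoughLinesP(
        edges, rho=1, theta=np.pi / 180, threshold=80, minLineLength=3000, maxLineGap=50
    )

    # HoughLinesP returns None when no segment is found
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = line[0]
            # To keep only horizontal lines, filter with y1 == y2; for vertical lines, x1 == x2
            cv2.line(img, (x1, y1), (x2, y2), (0, 0, 0), 1)

    # Recompute the grayscale image so the drawn solid lines reach the threshold step
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    if process_background:
        threshold = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, c
        )
    else:
        # Same as camelot's default branch: threshold the inverted image
        threshold = cv2.adaptiveThreshold(
            np.invert(gray), 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, c
        )
    return img, threshold

# Replace the function that camelot's lattice parser uses
camelot.parsers.lattice.adaptive_threshold = my_threshold

tables = camelot.read_pdf("data.pdf", pages="all")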
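
The comment in the loop mentions filtering by `y1 == y2` (horizontal) or `x1 == x2` (vertical). A minimal sketch of that filter follows. The helper name `filter_segments` and the `tol` parameter are assumptions for illustration and are not part of camelot or OpenCV. It expects segments in the nested layout that `cv2.HoughLinesP` returns, where each entry is `[[x1, y1, x2, y2]]`.

```python
def filter_segments(lines, orientation="horizontal", tol=0):
    """Keep only segments that are horizontal or vertical within tol pixels."""
    if orientation not in ("horizontal", "vertical"):
        raise ValueError("orientation must be 'horizontal' or 'vertical'")
    kept = []
    if lines is None:  # cv2.HoughLinesP returns None when nothing is detected
        return kept
    for line in lines:
        x1, y1, x2, y2 = (int(v) for v in line[0])
        if orientation == "horizontal":
            aligned = abs(y1 - y2) <= tol
        else:
            aligned = abs(x1 - x2) <= tol
        if aligned:
            kept.append((x1, y1, x2, y2))
    return kept


# Hand-written sample segments: one horizontal, one vertical, one diagonal
sample = [[[0, 10, 500, 10]], [[20, 0, 20, 300]], [[0, 0, 100, 100]]]
horizontal = filter_segments(sample, "horizontal")
vertical = filter_segments(sample, "vertical")
```

Inside `my_threshold`, one could loop over `filter_segments(lines, "horizontal")` and draw only those segments. A small nonzero `tol` may help when a scan is slightly skewed.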

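Assigning to `camelot.parsers.lattice.adaptive_threshold` replaces the function for the rest of the process. If the patch should apply to only one call, a scoped replacement is one option. The sketch below uses the standard library's `unittest.mock.patch.object` on a stand-in object (`fake_lattice` is hypothetical) so that it runs without camelot installed.

```python
from types import SimpleNamespace
from unittest import mock

# Stand-in for a module that exposes a function we want to swap temporarily
fake_lattice = SimpleNamespace(adaptive_threshold=lambda path: "original")


def my_replacement(path):
    """Hypothetical replacement with the same call signature as the stand-in."""
    return "patched"


# The replacement is active only inside the with-block
with mock.patch.object(fake_lattice, "adaptive_threshold", my_replacement):
    inside = fake_lattice.adaptive_threshold("page.png")

# The original function is restored automatically afterwards
outside = fake_lattice.adaptive_threshold("page.png")
```

Assuming the module layout used in this article, the same pattern would wrap the `camelot.read_pdf(...)` call with `mock.patch.object(camelot.parsers.lattice, "adaptive_threshold", my_threshold)`.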

Before applying the patch

Because the dotted lines are not detected, the cells are merged vertically.

Screenshot_2020-11-04 Google Colaboratory(1).png

After applying the patch

Screenshot_2020-11-04 Google Colaboratory.png

