Qt-based library "Poppler" that can quickly read PDF as an image in C++ or Python

There are surprisingly few of them: PDF libraries.

I assumed there would be plenty of libraries for easily reading and converting PDFs, but to my surprise I couldn't find one that was easy to use. The Qt-based library called Poppler, however, turned out to be quite good, so I tried it.

Poppler-qt4 (C++): http://people.freedesktop.org/~aacid/docs/qt4/
python-poppler-qt4: https://pypi.python.org/pypi/python-poppler-qt4/

Poppler's documentation is also available. To find out what you can do, start by reading about the Document type.

There is also a Qt5 version, but this time I used the Qt4 version. Some of the examples are written in both C++ and Python. The Python code uses Python 3.

All the code in this article is also available on GitHub.

The good thing about Poppler

Let's use it

As a demo, I made the following.

  1. A script that saves each page of a PDF as an image (Python only)
  2. A script that extracts text from a PDF (Python only)
  3. A super simple PDF viewer (implemented in both C++ and Python; only the C++ version is shown in this article)

Save the image

Load the document with doc = Poppler.Document.load(path). Note that unless you call doc.setRenderHint(Poppler.Document.TextAntialiasing), the rendered text will look very rough. After loading, get a Page object with page = doc.page(i). Page numbers start at zero. Calling image = page.renderToImage() returns the page as a Qt [QImage](http://doc.qt.io/qt-4.8/qimage.html), which can be saved with image.save().

This one is written in Python, but I think it would be easy to port to C++ if you want to.

dump_image.py


import sys
import os.path

from PyQt4 import QtCore
from popplerqt4 import Poppler

FORMAT = 'PNG'
EXT = '.png'

def dump_image(path):
    doc = Poppler.Document.load(path)
    if doc is None:
        # load() returns None when the file cannot be opened as a PDF
        raise IOError('Could not load {0}'.format(path))
    # Without this hint the rendered text looks very rough
    doc.setRenderHint(Poppler.Document.TextAntialiasing)
    # Output files are named <basename>_<page number>.png, counting from 1
    filename_fmt = os.path.splitext(path)[0] + '_{0}' + EXT
    for i in range(doc.numPages()):
        doc.page(i).renderToImage().save(filename_fmt.format(i + 1), FORMAT)

if __name__ == '__main__':
    app = QtCore.QCoreApplication(sys.argv)
    if len(sys.argv) != 2:
        print('Usage: {0} pdf_path'.format(sys.argv[0]))
        sys.exit(1)
    dump_image(sys.argv[1])
    sys.exit(0)
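
The script above calls renderToImage() with no arguments, which renders at the default resolution of 72 dpi. When a sharper image is needed, the horizontal and vertical resolution can be passed as the first two arguments. The following is a minimal sketch under that assumption; the helper names pixels_at and render_page are made up for illustration and are not part of Poppler, and the size estimate is only approximate.

```python
def pixels_at(points, dpi):
    """Convert a length in PDF points (1/72 inch) to pixels at the given dpi."""
    return int(round(points * dpi / 72.0))


def render_page(page, dpi=150):
    """Render a page at the same resolution in both directions.

    `page` is assumed to be a Poppler.Page as returned by doc.page(i);
    renderToImage(xres, yres) takes the resolution in dpi.
    """
    return page.renderToImage(dpi, dpi)
```

For example, an A4 page is about 595 x 842 points, so pixels_at(595, 150) and pixels_at(842, 150) suggest an image of roughly 1240 x 1754 pixels at 150 dpi. In dump_image, doc.page(i).renderToImage() could be swapped for render_page(doc.page(i)).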

Let's extract the text

For text extraction, I think there are many libraries besides Poppler, but Poppler can handle it as well. page.textList() returns a list of TextBox objects, one per word, so the script simply loops over them.

dump_text.py


import sys

from PyQt4 import QtCore
from popplerqt4 import Poppler

def dump_text(path):
    doc = Poppler.Document.load(path)
    if doc is None:
        # load() returns None when the file cannot be opened as a PDF
        raise IOError('Could not load {0}'.format(path))
    for i in range(doc.numPages()):
        # Page indices start at zero; print them counting from 1
        print('\n-------- Page {0} -------'.format(i + 1))
        # Each TextBox holds one word of the page
        for txtbox in doc.page(i).textList():
            print(txtbox.text(), end=' ')

if __name__ == '__main__':
    app = QtCore.QCoreApplication(sys.argv)
    if len(sys.argv) != 2:
        print('Usage: {0} pdf_path'.format(sys.argv[0]))
        sys.exit(1)
    dump_text(sys.argv[1])
    sys.exit(0)
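
The script above prints every word separated by a space, so line breaks in the PDF are lost. Each TextBox also carries a bounding box, which might be used to put the words back into lines. The sketch below is one rough way to do that, assuming that the Python binding exposes boundingBox() as the C++ API does and that textList() returns words in reading order. The function names are hypothetical, and the grouping rule (a word belongs to the current line when its vertical center falls inside the first word's box) is a simple heuristic, not something Poppler provides.

```python
def group_into_lines(words):
    """Group (text, top, bottom) tuples, given in reading order, into lines."""
    lines = []
    current = []                    # words collected for the line being built
    line_top = line_bottom = 0.0
    for text, top, bottom in words:
        center = (top + bottom) / 2.0
        # A word whose vertical center is outside the current line starts a new one
        if current and not (line_top <= center <= line_bottom):
            lines.append(' '.join(current))
            current = []
        if not current:
            line_top, line_bottom = top, bottom
        current.append(text)
    if current:
        lines.append(' '.join(current))
    return lines


def page_lines(page):
    """Collect the words of a Poppler.Page and group them into lines."""
    words = []
    for box in page.textList():
        rect = box.boundingBox()    # QRectF in page coordinates
        words.append((box.text(), rect.top(), rect.bottom()))
    return group_into_lines(words)
```

In dump_text, the inner loop could then become something like print('\n'.join(page_lines(doc.page(i)))). Pages with multiple columns or rotated text would need a more careful rule.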

Let's make a PDF viewer (C++)

As shown above, Poppler can render a page to a QImage. It can also draw a page through QPainter. Either would work, but this time I take the QImage, convert it to a QPixmap, and display it on a QLabel widget (more precisely, on PdfWidget, which inherits from QLabel).

I didn't want to make anything too elaborate, so there is no UI for choosing a file. The file path is passed to the widget's constructor, which loads the document. The page should probably be resized to fit the widget, but I haven't done that this time.

pdfwidget.h


#ifndef PDFWIDGET_H
#define PDFWIDGET_H

#include <QLabel>
#include <QString>

// Forward declaration so that users of this header do not need poppler-qt4.h
namespace Poppler{
        class Document;
}

class PdfWidget : public QLabel{
        Q_OBJECT
public:
        PdfWidget(QString path, QWidget *parent = 0);
public slots:
        void next_page();
        void prev_page();
private:
        void load_current_page();
        int n_pages;
        int current_page;
        QString path;
        Poppler::Document *doc;
};

#endif

pdfwidget.cpp


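#include "pdfwidget.h"
#include <poppler-qt4.h>
#include <QAction>
#include <QPixmap>

PdfWidget::PdfWidget(QString path, QWidget *parent)
                : QLabel(parent), n_pages(0), current_page(0), path(path), doc(0)
{
        // load() returns a null pointer when the file cannot be opened
        doc = Poppler::Document::load(path);
        if(doc && !doc->isLocked()){
                // Without this hint the rendered text looks very rough
                doc->setRenderHint(Poppler::Document::TextAntialiasing);
                n_pages = doc->numPages();
        }
        load_current_page();

        // The right arrow key moves to the next page
        QAction *next_action = new QAction("Next Page", this);
        next_action->setShortcut(Qt::Key_Right);
        connect(next_action, SIGNAL(triggered()), this, SLOT(next_page()));
        addAction(next_action);

        // The left arrow key moves to the previous page
        QAction *prev_action = new QAction("Prev Page", this);
        prev_action->setShortcut(Qt::Key_Left);
        connect(prev_action, SIGNAL(triggered()), this, SLOT(prev_page()));
        addAction(prev_action);
}

void PdfWidget::next_page()
{
        // Wrap around to the first page after the last one
        current_page++;
        if(current_page >= n_pages) current_page = 0;
        load_current_page();
}

void PdfWidget::prev_page()
{
        // Wrap around to the last page before the first one
        if(current_page) current_page--;
        else current_page = n_pages - 1;
        load_current_page();
}

void PdfWidget::load_current_page()
{
        if(n_pages <= 0){
                setText(QString("Could not load %1").arg(path));
                return;
        }
        // page() returns a newly allocated Page that the caller owns
        Poppler::Page *page = doc->page(current_page);
        if(!page) return;
        setPixmap(QPixmap::fromImage(page->renderToImage()));
        delete page;
        setWindowTitle(QString("%1/%2 - %3").arg(current_page+1)
                                            .arg(n_pages).arg(path));
}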

For now, the right and left arrow keys move to the next and previous page.
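
The viewer does not resize the page to fit the widget. One possible approach is to compute a rendering resolution from the page size and the widget size and pass it to renderToImage(). The sketch below shows only that arithmetic in Python; fit_dpi is a hypothetical helper, and it assumes the page size is given in points (1/72 inch), as returned by the page's pageSize().

```python
def fit_dpi(page_w, page_h, widget_w, widget_h):
    """Return the dpi at which a page of page_w x page_h points fits
    inside a widget of widget_w x widget_h pixels, keeping the aspect ratio."""
    if page_w <= 0 or page_h <= 0:
        raise ValueError('page size must be positive')
    # At 72 dpi one point maps to one pixel, so scale 72 by the tighter ratio
    scale = min(widget_w / float(page_w), widget_h / float(page_h))
    return 72.0 * scale
```

In a widget, this could be called from a resize handler with the current widget size, followed by something like page.renderToImage(dpi, dpi). The same calculation carries over directly to the C++ version.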

Summary

As far as I know, there aren't many libraries, in either C++ or Python, that can convert PDF to images. Among the few that exist, I introduced Poppler as one that is easy to use on multiple platforms.

It goes well with Qt, so it's very easy to use when you want to handle PDFs in a GUI.
