Convert markdown to PDF in Python

What i did

In Python, I scraped a web page containing markdown notation, output it to markdown, and then converted the markdown to HTML and then to PDF. (Complicated) I used Beautiful Soup for scraping, but I will omit it this time. I didn't have much information about Markdown → HTML → PDF, so I summarized it.

Markdown → HTML used Markdown, HTML → PDF used pdfkit.

environment

The environment is as follows.

In case of Windows environment, it is necessary to install wkhtmltopdf before using pdfkit. Install the 64-bit version from this site. You can use the path, or you can specify the path of the executable file directly in the Python code as you do this time.

We also use Pygments to highlight the code blocks.

The folder structure is as follows.

   app
    |- file
    |   |- source
    |   |   └─ source.md ← Original markdown
    |└─ pdf ← PDF output destination
    |- app.py
    └─requirements.txt

Install each library.

requirements.txt


Markdown==3.2.1
pdfkit==0.6.1
Pygments==2.6.1
> pip install -r requirements.txt

The markdown that is the conversion source is as follows.

source.md


#title

##subtitle

Markdown text.

###list

1.Numbered list
    1.Nested numbered list
1.Numbered list

###text

this is*italic*It is a description of.
this is**Emphasis**It is a description of.
this is[Link](https://qiita.com/)It is a description of.

###Source code

- Java

\```java

class HelloWorld {

    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

\```

- Python

\```python

class HelloWorld:
    def __init__(self, id, name):
        self.id = id
        self.name = name

if __name__=='__main__':
    hello = HelloWorld(1, 'python')

\```

###Ruled line

***

---

* * *

###table

| Table Header | Table Header | Table Header |
| :-- | :--: | --: |
| Body | Body | Body |
| Left | Center | Right |


Markdown → HTML

Use the Markdown library to convert Markdown to HTML. Enable the extension codehilite and highlight the source code with Pygments.

app.py


import markdown
from pygments import highlight
from pygments.formatters import HtmlFormatter

def mark_to_html():
    #Read markdown file
    f = open('file/source/source.md', mode='r', encoding='UTF-8')
    with f:
        text = f.read()
        #Create stylesheets for highlights with Pygments
        style = HtmlFormatter(style='solarized-dark').get_style_defs('.codehilite')
        # #Markdown → HTML conversion
        md = markdown.Markdown(extensions=['extra', 'codehilite'])
        body = md.convert(text)
        #Fit to HTML format
        html = '<html lang="ja"><meta charset="utf-8"><body>'
        #Import stylesheets created with Pygments
        html += '<style>{}</style>'.format(style)
        #Add style to add border to Table tag
        html += '''<style> table,th,td { 
            border-collapse: collapse;
            border:1px solid #333; 
            } </style>'''
        html += body + '</body></html>' 
        return html

Markdown has the following extensions that you can specify as a list when creating an object.

md = markdown.Markdown(extensions=["Extensions"])

Only the ones that can be used are excerpted. For all extensions below.

https://python-markdown.github.io/extensions/#officially-supported-extensions

Expansion function
extra Convert basic markdown notation such as abbreviation elements, lists, code blocks, citations, tables to HTML
admonition A note can be output
codehilite You can add syntax highlighting defined in Pygments to code blocks. Requires Pygments.
meta You can get the meta information of the file
nl2br One line feed code<br>Convert to tag
sane_lists Supports lists with line breaks and numbered lists from specified numbers
smarty ",>Supports HTML special characters such as
toc Automatically create a table of contents from the composition of headings
wikilinks [[]]Corresponds to the link notation in

You can also use third-party extensions.

https://github.com/Python-Markdown/markdown/wiki/Third-Party-Extensions

The Pygments that I'm trying to highlight in the code specifies the style and class names to apply.

style = HtmlFormatter(style="Style name").get_style_defs("name of the class")

Applicable styles can be obtained from the Pygments command line tool.

> pygmentize -L styles

The class name will be output as <code class =" codehilite "> when codehilite is enabled in markdown, so set it to codehilite.

HTML → PDF

We use a library called pdfkit to convert from HTML to PDF. Since this library uses wkhtmltopdf internally, you need to put the path in the environment variable or specify the path of the executable file at runtime.

app.py


import pdfkit

def html_to_pdf(html:str):
    """
    html : str HTML
    """
    #Specifying the output file
    outputfile = 'file/pdf/output.pdf'
    #Specifying the path of the wkhtmltopdf executable file
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    #Perform HTML → PDF conversion
    pdfkit.from_string(html, outputfile, configuration=config)

This time, the MTML format string is converted to a PDF file, but you can also convert the website to PDF by specifying the URL. In that case, use the from_url function. Also, if you want to convert an HTML file to PDF, use from_file.

Source code

Overall source code
import markdown
import pdfkit
from pygments import highlight
from pygments.formatters import HtmlFormatter

def mark_to_html():
    #Read markdown file
    f = open('file/source/source.md', mode='r', encoding='UTF-8')
    with f:
        text = f.read()
        #Create stylesheets for highlights with Pygments
        style = HtmlFormatter(style='solarized-dark').get_style_defs('.codehilite')
        # #Markdown → HTML conversion
        md = markdown.Markdown(extensions=['extra', 'codehilite'])
        body = md.convert(text)
        #Fit to HTML format
        html = '<html lang="ja"><meta charset="utf-8"><body>'
        #Import stylesheets created with Pygments
        html += '<style>{}</style>'.format(style)
        #Add style to add border to Table tag
        html += '''<style> table,th,td { 
            border-collapse: collapse;
            border:1px solid #333; 
            } </style>'''
        html += body + '</body></html>' 
        return html

def html_to_pdf(html: str):
    #Specifying the output file
    outputfile = 'file/pdf/output.pdf'
    #Specifying the path of the wkhtmltopdf executable file
    path_wkhtmltopdf = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
    config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
    #Perform HTML → PDF conversion
    pdfkit.from_string(html, outputfile, configuration=config)

if __name__=='__main__':
    html = mark_to_html()
    html_to_pdf(html)

Conversion result

The result of executing the above function and converting it is as follows.

output.png

The output result of markdown can be converted to PDF. The code block is also highlighted. I didn't include it in the example this time, but you can output the image without any problem.

Summary

I used the Python libraries Markdown and pdfkit to convert markdown to PDF. If you create a template, you can convert the minutes to PDF immediately, which is convenient. I hope you can use it as a material for improving work efficiency.

Recommended Posts

Convert markdown to PDF in Python
Convert from Markdown to HTML in Python
Convert the image in .zip to PDF with Python
Convert psd file to png in Python
Convert absolute URLs to relative URLs in Python
Rasterize PDF in Python
Handle markdown in python
Convert files written in python etc. to pdf with syntax highlighting
Convert FBX files to ASCII <-> BINARY in Python
Convert PDF to image (JPEG / PNG) with Python
Convert PDFs to images in bulk with Python
How to convert SVG to PDF and PNG [Python]
Batch convert PSD files in directory to PDF
Convert exponential notation float to str in Python
Convert cubic mesh code to WKT in Python
I want to convert a table converted to PDF in Python back to CSV
[python] Convert date to string
To flush stdout in Python
Convert numpy int64 to python int
[Python] Convert list to Pandas [Pandas]
Login to website in Python
OCR from PDF in Python
Convert Scratch project to Python
[Python] Convert Shift_JIS to UTF-8
Convert CIDR notation in Python
Speech to speech in python [text to speech]
How to develop in Python
Convert python 3.x code to python 2.x
Post to Slack in Python
Convert timezoned date and time to Unixtime in Python2.7
How to convert / restore a string with [] in python
Convert NumPy array "ndarray" to lilt in Python [tolist ()]
Convert CIDR notation netmask to dotted decimal notation in Python
[Python] Convert PDF text to CSV page by page (2/24 postscript)
How to convert floating point numbers to binary numbers in Python
Convert callback-style asynchronous API to async / await in Python
How to add page numbers to PDF files (in Python)
Convert / return class object to JSON format in Python
[Python] Created a method to convert radix in 1 second
Convert Webpay Entity type to Dict type (recursively in Python)
[Python] How to do PCA in Python
Convert PDF to Documents by OCR
Convert A4 PDF to A3 every 2 pages
How to collect images in Python
How to use SQLite in Python
Workflow to convert formula (image) to python
In the python command python points to python3.8
Convert list to DataFrame with python
Try to calculate Trace in Python
How to convert 0.5 to 1056964608 in one shot
Python> list> Convert double list to single list
Convert from pdf to txt 2 [pyocr]
How to use Mysql in python
How to wrap C in Python
How to use ChemSpider in Python
6 ways to string objects in Python
How to use PubChem in Python
[Python] Convert natural numbers to ordinal numbers
Convert images passed to Jason Statham-like in Python to ASCII art
Convert decimal numbers to n-ary numbers [python]
Convert PDF to image with ImageMagick