Write Pandoc filters in Python

I use pandoc, a general-purpose document format conversion tool. For more information on pandoc, see the Japanese User's Guide.

Sometimes, though, you want to tweak a document slightly during conversion. For example, when converting a Markdown document to HTML, you may want to rewrite the URLs of all links at once. A regular expression could do this easily enough, but pandoc provides a filter feature for exactly this kind of job. Filters let you work on the syntax tree of the parsed document. They can be written in Haskell, like pandoc itself, but the mechanism works with any language, and Python is officially supported.

As shown below, pandoc converts the syntax tree of the parsed document to JSON, passes it to the filter on standard input, and reads the modified tree back from the filter's standard output (the figure is from the manual).

                         source format
                              ↓
                           (pandoc)
                              ↓
                      JSON-formatted AST
                              ↓
                           (filter)
                              ↓
                      JSON-formatted AST
                              ↓
                           (pandoc)
                              ↓
                        target format
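
To make the flow above concrete, here is a minimal sketch of a filter that uses only the Python standard library. The names walk, run_filter, and shout are hypothetical and are not part of pandoc or pandocfilters, and the sample AST is a hand-written fragment in the same JSON layout as the examples in this article. The streams are in-memory so the sketch runs without pandoc; a real filter would pass sys.stdin and sys.stdout instead.

```python
import io
import json


def walk(node, action):
    """Visit every {"t": ..., "c": ...} element in the AST, depth first."""
    if isinstance(node, list):
        for child in node:
            walk(child, action)
    elif isinstance(node, dict):
        if "t" in node:
            action(node)  # the action may modify the element in place
        for child in node.values():
            walk(child, action)


def run_filter(action, stdin, stdout):
    """Read a JSON AST from stdin, apply action, write the AST to stdout."""
    doc = json.load(stdin)
    walk(doc, action)
    json.dump(doc, stdout)


def shout(node):
    # Hypothetical action: upper-case the text of every Str element.
    if node["t"] == "Str":
        node["c"] = node["c"].upper()


# Hand-written sample AST (assumption: same layout as in this article).
SAMPLE = ('[{"unMeta":{}},[{"t":"Para","c":[{"t":"Str","c":"text"},'
          '{"t":"Space","c":[]},{"t":"Str","c":"text"}]}]]')

out = io.StringIO()
run_filter(shout, io.StringIO(SAMPLE), out)
result = json.loads(out.getvalue())
```

The pandocfilters package introduced next takes care of this reading, walking, and writing for you, so a filter only has to supply the action function.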

This mechanism lets you write filters that understand the structure of the document. First, install the officially provided pandocfilters package.

pip install pandocfilters

Now let's use it to write a filter that rewrites the link URLs in a document.

convertlink.py


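from pandocfilters import toJSONFilter, Link


def myfilter(key, value, format_, meta):
    # Called once per AST element: key is the element type, value its contents.
    if key == 'Link':
        # In the AST shown in this article, value is [inlines, [url, title]].
        # Pandoc 1.16 and later put an attribute triple first, so the target
        # moves to value[2] there.
        value[1][0] = "prefix/" + value[1][0]
        return Link(*value)
    # Falling through returns None, which leaves other elements unchanged.


if __name__ == "__main__":
    toJSONFilter(myfilter)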

To apply the filter, pass the --filter option when running pandoc. Note that you must write "./convertlink.py" to refer to a script in the current directory.

sample.txt


## sample document
text text text

[link](path/to/otherpage)

$ pandoc --filter=./convertlink.py -t markdown sample.txt
sample document
---------------

text text text

[link](prefix/path/to/otherpage)
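
The position of the URL inside a Link depends on the pandoc version: in the AST used in this article the value is [inlines, [url, title]], while pandoc 1.16 and later add an attribute triple in front. As a sketch, assuming only that the [url, title] pair stays the last field, the action below handles both layouts. The name prefix_link and the PREFIX constant are illustrative, not part of pandocfilters.

```python
PREFIX = "prefix/"  # illustrative value, same as in convertlink.py


def prefix_link(key, value, format_, meta):
    """Prepend PREFIX to the URL of every Link, whichever layout is in use."""
    if key == "Link":
        target = value[-1]  # assumption: [url, title] is the last field
        target[0] = PREFIX + target[0]
    # No return statement: the element is modified in place and None is returned.


# Older layout (as in this article) and newer layout (with attributes first).
old_style = [[{"t": "Str", "c": "link"}], ["path/to/otherpage", ""]]
new_style = [["", [], []], [{"t": "Str", "c": "link"}], ["path/to/otherpage", ""]]
prefix_link("Link", old_style, "markdown", {})
prefix_link("Link", new_style, "markdown", {})
```

To try it, pass prefix_link to toJSONFilter in place of myfilter. Because it edits the element in place and returns None, which pandocfilters treats as "keep this element", it does not need the Link constructor.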

Other tips

Pandoc can print the syntax tree (pandoc AST) it builds for a document. Specify json as the output format to see it as JSON, or native to see it in Haskell notation.


$ pandoc -t json sample.txt 
[{"unMeta":{}},[{"t":"Header","c":[2,["sample-document",[],[]],[{"t":"Str","c":"sample"},{"t":"Space","c":[]},{"t":"Str","c":"document"}]]},{"t":"Para","c":[{"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"}]},{"t":"Para","c":[{"t":"Link","c":[[{"t":"Str","c":"link"}],["path/to/otherpage",""]]}]}]]
$ pandoc -t native sample.txt
[Header 2 ("sample-document",[],[]) [Str "sample",Space,Str "document"]
,Para [Str "text",Space,Str "text",Space,Str "text"]
,Para [Link [Str "link"]("path/to/otherpage","")]]

Details of the format are in the Text.Pandoc.Definition documentation (http://hackage.haskell.org/package/pandoc-types).
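
When the JSON output gets long, a few lines of Python can summarize it. The sketch below counts the element types in the JSON shown above, using only the standard library. The helper name count_types is hypothetical.

```python
import json
from collections import Counter

# The JSON printed by "pandoc -t json sample.txt" above, wrapped for readability.
AST_JSON = """
[{"unMeta":{}},
 [{"t":"Header","c":[2,["sample-document",[],[]],
     [{"t":"Str","c":"sample"},{"t":"Space","c":[]},{"t":"Str","c":"document"}]]},
  {"t":"Para","c":[{"t":"Str","c":"text"},{"t":"Space","c":[]},
     {"t":"Str","c":"text"},{"t":"Space","c":[]},{"t":"Str","c":"text"}]},
  {"t":"Para","c":[{"t":"Link","c":[[{"t":"Str","c":"link"}],
     ["path/to/otherpage",""]]}]}]]
"""


def count_types(node, counts=None):
    """Count how often each element type ("t" field) occurs in the AST."""
    if counts is None:
        counts = Counter()
    if isinstance(node, list):
        for child in node:
            count_types(child, counts)
    elif isinstance(node, dict):
        if "t" in node:
            counts[node["t"]] += 1
        for child in node.values():
            count_types(child, counts)
    return counts


counts = count_types(json.loads(AST_JSON))
for name, n in sorted(counts.items()):
    print(name, n)
```

The type names it reports are the values a filter function receives as its key argument.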

Also, specifying the --filter option is equivalent to the following command pipeline, which is handy for debugging.


$ pandoc -t json sample.txt | python ./convertlink.py | pandoc -f json -t markdown
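
One thing to keep in mind while debugging: the filter's standard output carries the JSON AST back to pandoc, so diagnostic messages should go to standard error instead. A minimal sketch follows, with log_links as a hypothetical action name and assuming the [url, title] pair is the last field of a Link.

```python
import sys


def log_links(key, value, format_, meta):
    """Report each link URL on stderr without changing the document."""
    if key == "Link":
        url = value[-1][0]  # assumption: [url, title] is the last field
        # stdout is reserved for the JSON AST, so write to stderr instead.
        print("Link -> " + url, file=sys.stderr)
    # Returning None leaves every element as it is.
```

Passed to toJSONFilter and run in the pipeline above, a function like this should show its messages in the terminal while the JSON continues on to the final pandoc command.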
