I want to manipulate a Markdown document.

If you search for how to do that in Python, most of what you find converts Markdown to HTML. To work with the Markdown document itself, however, you need something AST-like. I found very little that does this directly, so I looked for a library that could serve as a base for modification.

Basic knowledge

Manipulating Markdown documents is hard.

For example, suppose you want to create a homegrown extension (a nanbuwks extension) that inserts a specific template wherever the following comment is written in the text.


[](nanbuwks:template)

Can this be done with a simple string search and replace on the Markdown text?

- No. If this line appears inside a code block, it has to be ignored.
- A code block starts and ends with "```", but an escaped "\`\`\`" must not be treated as the start or end of a code block.
- But if "\`\`\`" is inside [](), you have to ignore it.
- And [] () itself should be ignored if it is inside embedded HTML ...

In the end, you have to parse the document structure of the Markdown to determine which parts can be treated as raw text, and only then search and replace the string.
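As a rough illustration of the problem, here is a minimal sketch that compares a naive string replace with one that at least skips fenced code blocks. The function names and the `TEMPLATE` text are hypothetical, made up for this example. It only understands fences at the start of a line; escaped backticks, indented code blocks, and embedded HTML are still not handled, which is exactly why a real parser is needed.

```python
MARKER = "[](nanbuwks:template)"
FENCE = "`" * 3  # the code fence marker, spelled this way to keep the example readable


def naive_replace(text: str, template: str) -> str:
    """Replace every marker, even the ones inside code blocks (wrong)."""
    return text.replace(MARKER, template)


def fence_aware_replace(text: str, template: str) -> str:
    """Replace the marker only on lines outside fenced code blocks."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # a fence line toggles the state
            out.append(line)
        elif in_fence:
            out.append(line)  # code: leave untouched
        else:
            out.append(line.replace(MARKER, template))
    return "\n".join(out)


doc = "\n".join(["Intro", MARKER, FENCE, MARKER, FENCE])

naive = naive_replace(doc, "TEMPLATE")
aware = fence_aware_replace(doc, "TEMPLATE")
print(naive.count("TEMPLATE"))  # 2: the marker inside the code block was replaced too
print(aware.count("TEMPLATE"))  # 1: only the marker outside the code block
```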

This applies not only to homegrown extensions but also to changing image paths, changing heading levels, automatic formatting, and so on.

So how do you parse the structure of a Markdown document?

General parser

Most parsers seem to be built to convert Markdown to HTML. The flow looks like this:

Markdown → Tokenizer → Document structure data → Renderer → HTML output

I picture each stage as follows.

Tokenizer

Something like this?

- Markdown has a block level and a span level.
- Accordingly, there are block tokens and span tokens.
- First, treat the entire document as one block.
- Scan it line by line.
- Recursively check which block token matches.
- Check which span tokens match the remaining text.
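The steps above can be sketched roughly as follows. This is my own simplified picture, not how any particular library works: the function names and the dict/tuple token shapes are assumptions for illustration, and only headings, fenced code, block quotes, paragraphs, and `*emphasis*` are recognized.

```python
import re

FENCE = "`" * 3  # the code fence marker


def tokenize_spans(text):
    """Split inline text into ('emphasis', s) and ('text', s) span tokens."""
    tokens = []
    pos = 0
    for m in re.finditer(r"\*([^*]+)\*", text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        tokens.append(("emphasis", m.group(1)))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def tokenize_blocks(lines):
    """Scan line by line and return a list of block tokens."""
    blocks = []
    i = 0
    while i < len(lines):
        line = lines[i]
        heading = re.match(r"(#{1,6}) (.*)", line)
        if heading:
            blocks.append({"type": "heading", "level": len(heading.group(1)),
                           "children": tokenize_spans(heading.group(2))})
            i += 1
        elif line.startswith(FENCE):
            i += 1
            code = []
            while i < len(lines) and not lines[i].startswith(FENCE):
                code.append(lines[i])
                i += 1
            i += 1  # skip the closing fence
            blocks.append({"type": "code", "text": "\n".join(code)})
        elif line.startswith("> "):
            quoted = []
            while i < len(lines) and lines[i].startswith("> "):
                quoted.append(lines[i][2:])
                i += 1
            # A quote contains blocks itself, so tokenize it recursively.
            blocks.append({"type": "quote", "children": tokenize_blocks(quoted)})
        elif not line.strip():
            i += 1  # a blank line separates blocks
        else:
            para = []
            while i < len(lines) and lines[i].strip():
                para.append(lines[i])
                i += 1
            # Whatever is left over is inline text: hand it to the span tokenizer.
            blocks.append({"type": "paragraph",
                           "children": tokenize_spans(" ".join(para))})
    return blocks


source = "# Title\n\n> quoted *text*\n\nplain"
tokens = tokenize_blocks(source.split("\n"))
print([block["type"] for block in tokens])  # ['heading', 'quote', 'paragraph']
```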

Document structure data

The document structure found by the tokenizer is stored as internal data. This document structure data is a tree.
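To make the tree idea concrete, here is a minimal sketch of what such a structure could look like. The `Node` class and its field names are hypothetical; real libraries each define their own node types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str                      # e.g. "document", "heading", "text"
    text: str = ""                 # only leaf nodes carry text here
    children: List["Node"] = field(default_factory=list)


def dump(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    label = node.kind + (f": {node.text!r}" if node.text else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


tree = Node("document", children=[
    Node("heading", children=[Node("text", "Title")]),
    Node("paragraph", children=[
        Node("text", "Hello "),
        Node("emphasis", children=[Node("text", "world")]),
    ]),
])
print("\n".join(dump(tree)))
```

Block nodes (heading, paragraph) hold span nodes (text, emphasis) as children, which mirrors the block level and span level of the tokenizer.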

Renderer

The renderer reconstructs the document by adding tags and control text based on the document structure data. Most renderers convert to HTML at this stage.
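A renderer of this kind is essentially a recursive walk over the tree. The sketch below assumes a hypothetical tree made of plain dicts with `type`, `children`, `text`, and `level` keys; it is not the API of any of the libraries discussed here.

```python
from html import escape


def render_html(node):
    """Recursively turn a node of the tree into an HTML string."""
    kind = node["type"]
    if kind == "text":
        return escape(node["text"])  # escape &, <, > in raw text
    inner = "".join(render_html(child) for child in node.get("children", []))
    if kind == "document":
        return inner
    if kind == "heading":
        return f"<h{node['level']}>{inner}</h{node['level']}>\n"
    if kind == "paragraph":
        return f"<p>{inner}</p>\n"
    if kind == "emphasis":
        return f"<em>{inner}</em>"
    raise ValueError(f"unknown node type: {kind}")


tree = {"type": "document", "children": [
    {"type": "heading", "level": 1, "children": [
        {"type": "text", "text": "A & B"}]},
    {"type": "paragraph", "children": [
        {"type": "text", "text": "Hello "},
        {"type": "emphasis", "children": [{"type": "text", "text": "world"}]}]},
]}
output = render_html(tree)
print(output)
```

Swapping the strings in each branch for another format's control text is all it takes to target something other than HTML, which is why the renderer is the natural place to add new output formats.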

What I want

Something that can manipulate raw Markdown

If the document structure data were linked to positions in the original Markdown text, you could edit the text while referring to its structure. But I couldn't find anything that does this. I considered building it by modifying an existing library, but that looked like it would take a lot of time, so I gave up on it this time.
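To show what "linked to the original text" could mean, here is a small sketch in which each heading found is recorded together with its line index, so that the raw text can be edited in place. The function names are hypothetical, and only ATX headings and fenced code blocks are recognized; it is an illustration of the idea, not a working tool.

```python
import re

FENCE = "`" * 3  # the code fence marker


def find_headings(lines):
    """Return (line_index, level) for each ATX heading outside fenced code."""
    found = []
    in_fence = False
    for index, line in enumerate(lines):
        if line.startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence:
            m = re.match(r"(#{1,6}) ", line)
            if m:
                found.append((index, len(m.group(1))))
    return found


def shift_headings(text, delta):
    """Change heading levels in place, leaving every other line untouched."""
    lines = text.split("\n")
    for index, level in find_headings(lines):
        new_level = min(6, max(1, level + delta))  # clamp to the valid range 1..6
        lines[index] = "#" * new_level + lines[index][level:]
    return "\n".join(lines)


source = "\n".join(["# Title", FENCE, "# a comment in code", FENCE, "## Section"])
shifted = shift_headings(source, 1)
print(shifted)
```

Because the edit is applied to the original lines rather than to regenerated output, everything the structure scan did not touch stays exactly as the author wrote it.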

Something that can generate Markdown from abstract data

Markdown can be regenerated from the document structure data and context. Because it is generated from abstract data, the original document cannot be reproduced exactly. That would be acceptable, so is there such a thing? → There didn't seem to be.
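The loss of the original spelling is easy to demonstrate. In this sketch, `parse` and `render_markdown` are hypothetical helpers covering only a tiny subset of Markdown: two ways of writing a level-1 heading and three bullet markers all collapse into the same abstract node, so the regenerated text is equivalent but not identical.

```python
import re


def parse(text):
    """Parse a tiny Markdown subset into a list of (type, value) nodes."""
    nodes = []
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith("# "):
            nodes.append(("heading1", line[2:]))  # ATX heading
        elif (i + 1 < len(lines) and line.strip()
              and re.fullmatch(r"=+", lines[i + 1])):
            nodes.append(("heading1", line))      # Setext heading
            i += 1                                # skip the underline
        elif re.match(r"[-*+] ", line):
            nodes.append(("item", line[2:]))      # any bullet marker
        elif line.strip():
            nodes.append(("paragraph", line))
        i += 1
    return nodes


def render_markdown(nodes):
    """Regenerate Markdown using one canonical spelling per node type."""
    out = []
    for kind, value in nodes:
        if kind == "heading1":
            out.append("# " + value)
        elif kind == "item":
            out.append("- " + value)
        else:
            out.append(value)
    return "\n".join(out)


original = "Title\n=====\n* one\n+ two"
regenerated = render_markdown(parse(original))
print(regenerated)
print(regenerated == original)                 # False: the spelling changed
print(parse(regenerated) == parse(original))   # True: the structure is the same
```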

Something that can output abstract data

Can I get something like an AST? There seemed to be nothing that outputs one without modifying the code.

(Added 2020/1/4: When I searched again later, I found several tools that convert Markdown to JSON. I have not looked into the details, but they may have met the requirements.)

Something that could be made to generate Markdown from abstract data

It seems that Markdown output could be added by modifying existing code. If that works, would it be easy to add more features?

Investigation

I investigated with the following requirements.

- Python 3
- Preferably able to convert to multiple formats, not only HTML.
- Could be developed into raw Markdown manipulation in the future.

Python-Markdown

https://github.com/Python-Markdown/markdown

It feels like the de facto standard library, and there are many plugins. However, it does not seem to convert to anything other than HTML. For example, Markdown-LaTeX also seems to convert LaTeX notation written inline in Markdown into HTML. I gave up on it because it seems different from what I want.

commonmark.py

https://github.com/readthedocs/commonmark.py

A port of CommonMark to Python. CommonMark is an effort to standardize Markdown. The reference implementation is below.

https://github.com/commonmark/commonmark.js

Its focus is on being a reference implementation, and it does not seem suited to building applications on, so I gave up on it.

mistune

Processing seems to be fast. Version 2 came out in December 2019, but as of January 1, 2020, pip3 on Ubuntu still installs 0.8.4. I did the following to install the latest version.


$ sudo pip3 install git+https://github.com/lepture/mistune.git

I gave up because of the lack of documentation.

mistletoe

In addition to HTML, mistletoe comes with output for the following by default:

LaTeX
JIRA
Scheme ?

It seems easy to use, so I will try building on this one.

What I learned after the survey

The following look promising, but I only came across them after examining the four above and starting to build with mistletoe. I haven't checked them properly yet; I'll do that another time.

pycmark

https://github.com/tk0miya/pycmark

The description says that "because it has extensibility, flexible parsing such as GFM (GitHub Flavored Markdown) support and addition of original notation is possible."

https://www.papercall.io/speakers/tk0miya/speaker_talks/78833-markdown

marko

https://pypi.org/project/marko/#extend-marko

"Among all implementations of Python's markdown parser, it is a common issue that user can't easily extend it to add his own features."

Investigating what could be used as a Markdown parser in Python