I want to manipulate a Markdown document.
If you search for how to do that in Python, converting Markdown to HTML seems to be the common answer. But if you want to work with the Markdown document itself, you need something AST-like. I could not find much that does this out of the box, so I looked for something I could use as a base for modification.
Manipulating Markdown documents is hard.
For example, suppose you want to create a homegrown extension (a "nanbuwks extension") that inserts a specific template when you write the following comment in the text.
[](nanbuwks:template)
Is it enough to simply search and replace a character string in the Markdown text?
- No. If a code block contains this line, you have to ignore it.
- A code block starts and ends with three backticks, but if the backticks are backslash-escaped, you should ignore them.
- But if the backticks are inside []() link syntax, you have to ignore them too.
- And []() itself should be ignored if it is inside embedded HTML ...
In the end, you have to parse the document structure of the Markdown to determine whether a given span can be treated as raw text, and only then search and replace the character string.
This is not limited to homegrown extensions. The same applies when changing image paths, changing heading levels, applying automatic formatting, and so on.
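To make the code-block case concrete, here is a minimal sketch in Python. The helper name `replace_outside_fences` and the sample text are assumptions made up for illustration. It only tracks fenced code blocks whose fence lines start with three backticks, and deliberately ignores the other cases (escaped backticks, links, embedded HTML), which is exactly why a real parser is needed.

```python
FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
MARKER = "[](nanbuwks:template)"


def replace_outside_fences(text, old, new):
    """Replace `old` with `new` only on lines outside fenced code blocks."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.startswith(FENCE):
            in_fence = not in_fence  # a fence line opens or closes a block
            out.append(line)
        elif in_fence:
            out.append(line)  # inside a code block: keep verbatim
        else:
            out.append(line.replace(old, new))
    return "\n".join(out)


# Sample document: the marker appears once in prose and once in a code block.
doc = "\n".join(["Intro", MARKER, FENCE, MARKER, FENCE])

naive = doc.replace(MARKER, "TEMPLATE")
careful = replace_outside_fences(doc, MARKER, "TEMPLATE")
print(naive.count("TEMPLATE"))    # 2: the code block was modified too
print(careful.count("TEMPLATE"))  # 1: only the line outside the fence
```

Even this small fix already needs state (`in_fence`) carried across lines, and each further rule adds more state that interacts with the rest.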
... So how is the Markdown document structure parsed?
In general, Markdown seems to be output as HTML, so the flow is:
Markdown → Tokenizer → Document structure data → Renderer → Output to HTML
I imagine each stage works roughly as follows.
Tokenizer
- Markdown has a block level and a span level.
- Correspondingly, there are block tokens and span tokens.
- First, treat the entire document as one block.
- Inspect it line by line.
- Recursively check whether a block token matches.
- Check whether a span token matches the remaining text.
Document structure data
The document structure examined by the tokenizer is stored as internal data. This data has a tree structure.
Renderer
The document is reconstructed by adding tags and control text based on the document structure data. Most implementations convert to HTML at this stage.
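The pipeline described above can be sketched as a toy in Python. This is not how any of the libraries below are implemented. The functions `tokenize`, `tokenize_spans`, and `render_html` and the dictionary layout are all hypothetical, and the sketch only knows `#` headings, one-line paragraphs, and `*emphasis*`.

```python
import html
import re


def tokenize_spans(text):
    """Span level: split leftover text into emphasis spans and raw text."""
    spans = []
    for part in re.split(r"(\*[^*]+\*)", text):
        if len(part) > 2 and part.startswith("*") and part.endswith("*"):
            spans.append({"type": "emphasis", "text": part[1:-1]})
        elif part:
            spans.append({"type": "raw", "text": part})
    return spans


def tokenize(text):
    """Block level: inspect line by line and build a tree of nodes."""
    blocks = []
    for line in text.split("\n"):
        if not line.strip():
            continue  # blank lines only separate blocks in this toy
        m = re.match(r"(#{1,6}) (.*)", line)
        if m:
            blocks.append({"type": "heading", "level": len(m.group(1)),
                           "children": tokenize_spans(m.group(2))})
        else:
            # Simplification: every non-heading line is its own paragraph.
            blocks.append({"type": "paragraph",
                           "children": tokenize_spans(line)})
    return {"type": "document", "children": blocks}


def render_html(node):
    """Renderer: walk the tree and wrap each node in tags."""
    kind = node["type"]
    if kind == "raw":
        return html.escape(node["text"])
    if kind == "emphasis":
        return "<em>" + html.escape(node["text"]) + "</em>"
    inner = "".join(render_html(child) for child in node["children"])
    if kind == "heading":
        return "<h{0}>{1}</h{0}>\n".format(node["level"], inner)
    if kind == "paragraph":
        return "<p>" + inner + "</p>\n"
    return inner  # the document node has no tag of its own


tree = tokenize("# Title\n\nSome *important* text")
print(render_html(tree))
```

Note that the tree keeps only the abstract structure. The original line positions and the exact characters used for markup are already gone by the time the renderer runs.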
If the document structure data were linked to the structure of the original Markdown text, you could edit the text while referring to the document structure. But I couldn't find anything that does this. I then considered modifying an existing library to get it, but that looked like it would take a lot of time, so I gave up on this approach for now.
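As a rough illustration of what "linked to the original text" could mean, the sketch below records a line range for each block and edits the original lines in place. The names `scan_blocks` and `demote_headings` are hypothetical, and the scanner only distinguishes fenced code blocks, `#` headings, and everything else.

```python
import re

FENCE = "`" * 3  # three backticks


def scan_blocks(lines):
    """Return (kind, start, end) tuples; `end` is exclusive."""
    blocks = []
    i = 0
    while i < len(lines):
        if lines[i].startswith(FENCE):
            j = i + 1
            while j < len(lines) and not lines[j].startswith(FENCE):
                j += 1
            end = min(j + 1, len(lines))  # include the closing fence if present
            blocks.append(("code", i, end))
            i = end
        elif re.match(r"#{1,6} ", lines[i]):
            blocks.append(("heading", i, i + 1))
            i += 1
        else:
            blocks.append(("other", i, i + 1))
            i += 1
    return blocks


def demote_headings(text):
    """Add one '#' to each heading; every other line stays untouched."""
    lines = text.split("\n")
    for kind, start, _end in scan_blocks(lines):
        if kind == "heading" and not lines[start].startswith("#" * 6):
            lines[start] = "#" + lines[start]
    return "\n".join(lines)


doc = "\n".join(["# Title", "text", FENCE,
                 "# a comment, not a heading", FENCE, "## Sub"])
print(demote_headings(doc))
```

Because the edit is applied to the original lines by position, anything the scanner does not understand is passed through unchanged, including the `#` comment inside the code block.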
Alternatively, Markdown can be regenerated from the document structure data and context. However, since it is regenerated from abstract data, the original document cannot be reproduced exactly. That would be acceptable for my purposes, so isn't there something like that? → There did not seem to be.
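The loss of the original form can be seen in a small sketch. The `parse` and `to_markdown` functions here are hypothetical toys that handle only headings and bullet items. The point is that the parser throws away which marker was used, so the regenerated text is equivalent in structure but not identical.

```python
import re


def parse(text):
    """Parse headings and bullet items into abstract (kind, level, text) nodes."""
    nodes = []
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if line.strip() and re.fullmatch(r"=+", nxt):
            nodes.append(("heading", 1, line))  # underlined level-1 heading
            i += 2
            continue
        m = re.match(r"(#{1,6}) +(.*)", line)
        if m:
            nodes.append(("heading", len(m.group(1)), m.group(2)))
        elif re.match(r"[-*+] +", line):
            # The bullet character itself is discarded here.
            nodes.append(("item", 0, re.sub(r"^[-*+] +", "", line)))
        elif line.strip():
            nodes.append(("paragraph", 0, line))
        i += 1
    return nodes


def to_markdown(nodes):
    """Regenerate Markdown using one fixed style per node kind."""
    out = []
    for kind, level, text in nodes:
        if kind == "heading":
            out.append("#" * level + " " + text)
        elif kind == "item":
            out.append("- " + text)
        else:
            out.append(text)
    return "\n".join(out)


original = "Title\n=====\n* first\n+ second"
regenerated = to_markdown(parse(original))
print(regenerated)
print(regenerated == original)                # False: the markers changed
print(parse(regenerated) == parse(original))  # True: same abstract structure
```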
Can you get something like an AST? There seems to be nothing that outputs one without modifying the code.
(Added on 2020/1/4: When I searched again later, I found some tools that convert Markdown to JSON. I have not investigated the details, but they may meet the requirements.)
It seems that Markdown output could be produced by modifying existing code. If that works, would it also be easy to add more functions?
I investigated the libraries below with these requirements in mind.
Python-Markdown
https://github.com/Python-Markdown/markdown It feels like the standard library for this. There are many plugins. However, it does not seem to convert to anything other than HTML. For example, Markdown-LaTeX also seems to convert inline LaTeX notation in Markdown to HTML. I gave up on it because it seems different from what I want.
commonmark.py
https://github.com/readthedocs/commonmark.py
A Python port of CommonMark. CommonMark is an effort to define a standard for Markdown. The reference implementation is below. https://github.com/commonmark/commonmark.js
It focuses on being a reference implementation, so I gave up on it because it does not seem suited to building applications on top of it.
mistune
Processing seems to be fast. Version 2 came out in December 2019, but as of January 1, 2020, pip3 on Ubuntu installs 0.8.4. I did the following to install the latest version.
$ sudo pip3 install git+https://github.com/lepture/mistune.git
I gave up on it due to the lack of documentation.
mistletoe
By default, mistletoe has output to LaTeX in addition to HTML.
It seems easy to use, so I will try building on it.
The following are libraries that look promising, which I found after examining the four above and building something with mistletoe. I haven't checked them properly, but I will on another occasion.
pycmark
https://github.com/tk0miya/pycmark
Its description says that "because it is extensible, flexible parsing is possible, such as GFM (GitHub Flavored Markdown) support and the addition of custom notation." https://www.papercall.io/speakers/tk0miya/speaker_talks/78833-markdown
marko
https://pypi.org/project/marko/#extend-marko
Among all implementations of Python's markdown parser, it is a common issue that user can't easily extend it to add his own features.