Investigating what could be used as a Markdown parser in Python

I want to manipulate a Markdown document.

If you search for this in Python, most of what you find converts Markdown to HTML. To work on the Markdown document itself, though, you need something AST-like. I could not find much that does this out of the box, so I looked for something that could serve as a base to adapt.

Basic knowledge

Manipulating Markdown documents is hard.

For example, suppose you want to create a homegrown extension (a nanbuwks extension) that inserts a specific template wherever the following comment appears in the text.


[](nanbuwks:template)

Can you simply run a string search and replace over the Markdown text?

No.

- If this line appears inside a code block, it has to be left alone.
- A code block starts and ends with "```", but an escaped "\`\`\`" has to be ignored.
- But if the "```" is inside [](), it has to be ignored.
- And []() itself has to be ignored if it sits inside embedded HTML ...

In the end, you have to parse the document structure of the Markdown to decide which parts count as raw text, and only then search and replace.

This applies not only to homegrown extensions but also to changing image paths, changing heading levels, automatic formatting, and so on.
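The problem can be seen in a small sketch. The function names below are hypothetical, and the second function only understands fenced code blocks; indented code, inline code, escaped backticks and embedded HTML are still not handled, which is exactly why a real parser is needed.

```python
FENCE = "`" * 3  # a code fence, built this way so the example stays readable
MARKER = "[](nanbuwks:template)"


def naive_expand(text, template):
    """Plain search/replace: also rewrites markers inside code blocks."""
    return text.replace(MARKER, template)


def fence_aware_expand(text, template):
    """Replace marker lines only when they are outside fenced code blocks."""
    out = []
    in_fence = False
    for line in text.split("\n"):
        if line.startswith(FENCE):
            in_fence = not in_fence  # every fence line opens or closes a block
            out.append(line)
        elif not in_fence and line.strip() == MARKER:
            out.append(template)
        else:
            out.append(line)
    return "\n".join(out)


sample = "\n".join([
    "Intro",
    MARKER,   # should be expanded
    FENCE,
    MARKER,   # inside a code block: should be left alone
    FENCE,
])

print(naive_expand(sample, "TEMPLATE"))
print(fence_aware_expand(sample, "TEMPLATE"))
```

The naive version expands both markers, while the fence-aware version leaves the one inside the code block untouched.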

So how do you parse the structure of a Markdown document?

General parser

Most parsers seem to be built to turn Markdown into HTML. The flow is:

Markdown → Tokenizer → Document structure data → Renderer → HTML output

As I picture it, each stage works roughly as follows.

Tokenizer

Roughly like this:

- Markdown has a block level and a span level.
- Correspondingly, there are block tokens and span tokens.
- First, treat the entire document as one block.
- Examine it line by line.
- Recursively check whether a block token matches.
- Check whether a span token matches the remaining text.
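The steps above can be sketched as a toy tokenizer. This is only an illustration under simplifying assumptions: the token format (dicts and tuples) is made up, only headings, fenced code blocks, paragraphs, inline code and single-asterisk emphasis are recognised, and blocks are not nested, so the recursive step is left out.

```python
import re

FENCE = "`" * 3
SPAN_PATTERN = re.compile(r"`([^`]+)`|\*([^*]+)\*")


def tokenize_spans(text):
    """Split text into span tokens: inline code, emphasis, and raw text."""
    tokens = []
    pos = 0
    for match in SPAN_PATTERN.finditer(text):
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        if match.group(1) is not None:
            tokens.append(("code", match.group(1)))
        else:
            tokens.append(("emphasis", match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def tokenize_blocks(text):
    """Scan line by line and return a flat list of block tokens."""
    blocks = []
    paragraph = []

    def flush_paragraph():
        if paragraph:
            blocks.append({"type": "paragraph",
                           "children": tokenize_spans(" ".join(paragraph))})
            paragraph.clear()

    lines = iter(text.split("\n"))
    for line in lines:
        heading = re.match(r"(#{1,6}) +(.*)", line)
        if line.startswith(FENCE):
            flush_paragraph()
            code = []
            for code_line in lines:  # consume lines up to the closing fence
                if code_line.startswith(FENCE):
                    break
                code.append(code_line)
            blocks.append({"type": "code_block", "content": "\n".join(code)})
        elif heading:
            flush_paragraph()
            blocks.append({"type": "heading",
                           "level": len(heading.group(1)),
                           "children": tokenize_spans(heading.group(2))})
        elif line.strip() == "":
            flush_paragraph()
        else:
            paragraph.append(line.strip())
    flush_paragraph()
    return blocks


doc = "\n".join(["# Title", "", "Some *stressed* and `code`.",
                 FENCE, "# not a heading", FENCE])
for block in tokenize_blocks(doc):
    print(block)
```

Span tokens are only looked for inside headings and paragraphs, so the "# not a heading" line stays plain code content.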

Document structure data

The document structure worked out by the tokenizer is stored as internal data. This document structure data forms a tree.

Renderer

The renderer reconstructs the document by adding tags and control text based on the document structure data. Most renderers produce HTML at this stage.
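As a rough illustration of the rendering stage, here is a minimal sketch that walks a hand-written tree and emits HTML. The tree format and all names are assumptions made for this example, not the data model of any particular library.

```python
from html import escape

# A hand-written document tree in a hypothetical format.
TREE = [
    {"type": "heading", "level": 1, "children": [("text", "Title")]},
    {"type": "paragraph", "children": [("text", "Some "),
                                       ("emphasis", "words"),
                                       ("text", " & "),
                                       ("code", "x < y")]},
    {"type": "code_block", "content": "x = 1"},
]

SPAN_TAGS = {"emphasis": "em", "code": "code"}


def render_span(token):
    """Render one span token, escaping its text for HTML."""
    kind, content = token
    if kind == "text":
        return escape(content)
    tag = SPAN_TAGS[kind]
    return "<{0}>{1}</{0}>".format(tag, escape(content))


def render_block(block):
    """Render one block token by wrapping its rendered children in a tag."""
    if block["type"] == "heading":
        inner = "".join(render_span(t) for t in block["children"])
        return "<h{0}>{1}</h{0}>".format(block["level"], inner)
    if block["type"] == "paragraph":
        inner = "".join(render_span(t) for t in block["children"])
        return "<p>{}</p>".format(inner)
    if block["type"] == "code_block":
        return "<pre><code>{}</code></pre>".format(escape(block["content"]))
    raise ValueError("unknown block type: {}".format(block["type"]))


def render_html(tree):
    return "\n".join(render_block(block) for block in tree)


print(render_html(TREE))
```

Swapping `render_block` and `render_span` for functions that emit a different format is all it would take to target something other than HTML, which is why the renderer is the natural place to hook in.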

What I wanted

Something that can manipulate raw Markdown

If the document structure data were linked to positions in the original Markdown text, you could edit the text while consulting the structure. I could not find anything that does this. I then considered building it by modifying an existing library, but it looked like it would take a lot of time, so I gave up on it this time.
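A minimal sketch of what "structure linked to the original text" could look like: record the line index of each heading, then edit only those lines so everything else stays byte-for-byte identical. The names are hypothetical, and only ATX headings and fenced code blocks are recognised.

```python
import re

FENCE = "`" * 3
HEADING = re.compile(r"(#{1,6})( +.*)")


def find_headings(lines):
    """Return (line_index, level) for each ATX heading outside fenced code."""
    found = []
    in_fence = False
    for index, line in enumerate(lines):
        if line.startswith(FENCE):
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING.fullmatch(line)
            if match:
                found.append((index, len(match.group(1))))
    return found


def shift_headings(text, delta):
    """Change heading levels by delta, touching only the heading lines."""
    lines = text.split("\n")
    for index, level in find_headings(lines):
        new_level = min(6, max(1, level + delta))  # clamp to levels 1..6
        lines[index] = "#" * new_level + lines[index][level:]
    return "\n".join(lines)


doc = "\n".join([
    "# Title",
    "text with * odd   spacing",
    FENCE,
    "# comment in code",
    FENCE,
    "## Section",
])

print(shift_headings(doc, 1))
```

Because the edit is applied to the original lines by position, unusual spacing and the contents of the code block survive unchanged.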

Something that can generate Markdown from abstract data

Markdown can be regenerated from the document structure data and its context. Because it is rebuilt from abstract data, though, the original document cannot be reproduced exactly. I would have been fine with that, so I looked for such a tool, but there did not seem to be one.
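The loss of the original form can be shown with a toy round trip. This sketch assumes a made-up node format and handles only headings (ATX and setext) and single-line paragraphs; the point is just that two spellings collapse into one tree, so the renderer has to pick one.

```python
import re


def parse(text):
    """Parse text into ("heading", level, title) and ("paragraph", text) nodes."""
    nodes = []
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        line = lines[i]
        following = lines[i + 1] if i + 1 < len(lines) else ""
        atx = re.fullmatch(r"(#{1,6}) +(.*)", line)
        if atx:
            nodes.append(("heading", len(atx.group(1)), atx.group(2)))
        elif line.strip() and re.fullmatch(r"=+|-+", following):
            # Setext heading: text underlined with = (level 1) or - (level 2).
            level = 1 if following[0] == "=" else 2
            nodes.append(("heading", level, line.strip()))
            i += 1  # skip the underline
        elif line.strip():
            nodes.append(("paragraph", line.strip()))
        i += 1
    return nodes


def render_markdown(nodes):
    """Regenerate Markdown, always using ATX headings and one blank line."""
    parts = []
    for node in nodes:
        if node[0] == "heading":
            parts.append("#" * node[1] + " " + node[2])
        else:
            parts.append(node[1])
    return "\n\n".join(parts)


original = "Title\n=====\nBody   text\n\n\n## Section"
regenerated = render_markdown(parse(original))
print(regenerated)
```

The regenerated text parses to the same tree as the original, but the setext underline and the extra blank lines are gone.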

Something that can output abstract data

Can you get something like an AST? There did not seem to be anything that gives you this output without modifying its code.

(Added on 2020/1/4: when I searched again later, I found several tools that convert Markdown to JSON. I have not looked into the details, but they may meet the requirements.)

Something that could be made to generate Markdown from abstract data

It looks possible to get Markdown output by modifying existing code. If that works, would it also be easy to add further features?

Investigation

I investigated the following libraries against these requirements.

Python-Markdown

https://github.com/Python-Markdown/markdown

It feels like the de facto standard library, and there are many plugins. However, it does not seem to convert to anything other than HTML. For example, Markdown-LaTeX also appears to convert inline LaTeX notation in Markdown to HTML. I gave up on it because it seems to be different from what I want.

commonmark.py

https://github.com/readthedocs/commonmark.py

A Python port of CommonMark. CommonMark is an effort to standardize Markdown. The reference implementation is here: https://github.com/commonmark/commonmark.js

Since it focuses on being a reference implementation, it did not seem suited to being adapted, so I gave up on it.

mistune

Processing seems to be fast. Version 2 came out in December 2019, but as of January 1, 2020, pip3 on Ubuntu still installs 0.8.4. I installed the latest version as follows.


$ sudo pip3 install git+https://github.com/lepture/mistune.git

I gave up on it due to the lack of documentation.

mistletoe

Out of the box, mistletoe can render to LaTeX as well as HTML.

It seems easy to work with, so I decided to build on it.
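As a hedged starting point, the sketch below uses mistletoe's `Document` class to get the parsed token tree and lists the headings in it. It assumes mistletoe is installed (`pip install mistletoe`); the helper names `plain_text` and `outline` are hypothetical, and the attribute access is written defensively because token details can differ between mistletoe versions.

```python
try:
    # Third-party dependency: pip install mistletoe
    from mistletoe import Document
    from mistletoe.block_token import Heading, SetextHeading
except ImportError:
    Document = None  # lets this file load even without mistletoe

FENCE = "`" * 3


def plain_text(token):
    """Concatenate the raw text found under a token (hypothetical helper)."""
    children = getattr(token, "children", None)
    if children:
        return "".join(plain_text(child) for child in children)
    return getattr(token, "content", "")


def outline(markdown_text):
    """Return (level, title) for each top-level heading in the document."""
    if Document is None:
        raise RuntimeError("mistletoe is not installed")
    doc = Document(markdown_text)
    return [(block.level, plain_text(block))
            for block in doc.children
            if isinstance(block, (Heading, SetextHeading))]


SAMPLE = "\n".join([
    "# Title",
    "",
    "Some text",
    "",
    FENCE,
    "# not a heading",
    FENCE,
    "",
    "## Section *two*",
])

if Document is not None:
    print(outline(SAMPLE))
```

The "# not a heading" line inside the code fence is not reported, because the parser has already classified it as code content.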

What I found after the survey

These are libraries that look promising, which turned up after I had examined the four above and started building on mistletoe. I have not checked them properly yet, but I will when I get another opportunity.

pycmark

https://github.com/tk0miya/pycmark

Its description says that "because it is extensible, flexible parsing is possible, such as GFM (GitHub Flavored Markdown) support and adding your own notation."

https://www.papercall.io/speakers/tk0miya/speaker_talks/78833-markdown

marko

https://pypi.org/project/marko/#extend-marko

The project page says: "Among all implementations of Python's markdown parser, it is a common issue that user can't easily extend it to add his own features."
