I decided (abruptly) to learn how a Markdown parser library is implemented. Java's DOM parser is said to have a beautiful design (design patterns), but I started by looking for a Python library, since that's what I'm used to. (For reference, the Java one: https://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm)
This time I'll read the source of the following library.

Python-Markdown https://github.com/Python-Markdown

It converts Markdown -> HTML. Of course, that means it analyzes Markdown internally, so let's see what kind of design it has! If you spot anything wrong with my understanding, please point it out...!
The core functionality is collected under `Python-Markdown/markdown/markdown/`. Since writing the full path every time would be tedious, files under this directory will be referred to by filename only (like `sample.py`). Also, the source code quoted below is excerpted to just the necessary parts (with my own comments and notes added).
The bottom line: each processor that detects a particular element is run block by block, where a block is a chunk of the original text separated by "\n\n". For example:

<b>The tag is unclosed, so this is bold

Not bold</b>

If a blank line is inserted like this (i.e. "\n\n" appears), the element's effective range is cut off there. In this example, the text is first split into ["<b>The tag is unclosed, so this is bold", "Not bold</b>"], then the processors that detect each element are run in order on the first block (making it bold), and then processing moves on to the next block... that's the rough flow.
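The block split described above can be reproduced with plain `str.split` (a standalone illustration, not library code):

```python
# Splitting source text into blocks on blank lines ("\n\n"):
text = "<b>The tag is unclosed, so bold\n\nNot bold</b>"
blocks = text.split("\n\n")
print(blocks)  # ['<b>The tag is unclosed, so bold', 'Not bold</b>']
```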
At the heart of this library's user-facing interface are the Markdown class and its convert method.
core.py
```python
class Markdown:
    # Conversion to HTML happens here
    def convert(self, source):
        # source: the Markdown text
```
The convert method has a docstring describing the pipeline. Roughly: (1) the preprocessors transform the text, (2) the high-level structural elements of the preprocessed text are parsed into an ElementTree (with treeprocessors and postprocessors following after that). Honestly, the original comment may be easier to read than my summary (lol).
core.py
```python
class Markdown:
    def convert(self, source):
        # <snip>
        self.lines = source.split("\n")
        for prep in self.preprocessors:
            self.lines = prep.run(self.lines)
```
First, the source is split into lines and each preprocessor is applied in turn. The preprocessors are built by the following function.
preprocessors.py
```python
def build_preprocessors(md, **kwargs):
    """ Build the default set of preprocessors used by Markdown. """
    preprocessors = util.Registry()
    preprocessors.register(NormalizeWhitespace(md), 'normalize_whitespace', 30)
    preprocessors.register(HtmlBlockPreprocessor(md), 'html_block', 20)
    preprocessors.register(ReferencePreprocessor(md), 'reference', 10)
    return preprocessors
```
The preprocessors are registered with priorities and run in priority order: NormalizeWhitespace > HtmlBlockPreprocessor > ReferencePreprocessor. As the names suggest:

- NormalizeWhitespace: normalizes whitespace and line-ending characters (an implementation of about 10 lines)
- HtmlBlockPreprocessor: parses raw HTML blocks (250 lines...!)
- ReferencePreprocessor: finds reference-style link definitions (the `[id]: URL "Title"` format) and registers them in the Markdown.references dictionary (about 30 lines)
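To see how priority-ordered registration behaves, here is a toy registry (my own minimal stand-in, not the library's actual `util.Registry`):

```python
class ToyRegistry:
    """Minimal stand-in for markdown.util.Registry (illustration only)."""
    def __init__(self):
        self._items = []  # list of (priority, name, item)

    def register(self, item, name, priority):
        self._items.append((priority, name, item))

    def __iter__(self):
        # Iterate from highest to lowest priority, which is the
        # order the library runs its processors in.
        return (item for priority, name, item in
                sorted(self._items, key=lambda t: -t[0]))

reg = ToyRegistry()
reg.register("reference", "reference", 10)
reg.register("normalize_whitespace", "normalize_whitespace", 30)
reg.register("html_block", "html_block", 20)
print(list(reg))  # ['normalize_whitespace', 'html_block', 'reference']
```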
I'll skip the contents of HtmlBlockPreprocessor here; similar element-detection processing is waiting for us in steps 2 and 3 anyway...
(Digression)

As in ReferencePreprocessor, parsers frequently hit the situation where you must decide whether the next line is still part of the current element. When I wrote one myself (as a study exercise), I did it with an index:

```python
while lines:
    line_num += 1
    line = self.lines[line_num]
    # <processing>
    if <the content continues onto the next line>:
        line_num += 1
        line = self.lines[line_num]
```
ReferencePreprocessor, however, just uses pop:

```python
while lines:
    line = lines.pop(0)
```

Yeah, I should have done it that way... (sweat)
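Here is a small self-contained sketch (my own toy code, not the library's) of why `pop(0)` makes the "consume the next line too" case painless:

```python
def parse_definitions(lines):
    """Toy example: a definition's body may continue on the next line.
    With pop(0), consuming one more line is trivial."""
    defs = []
    rest = []
    while lines:
        line = lines.pop(0)
        if line.startswith("[def]"):
            body = line[len("[def]"):].strip()
            # If the following lines are indented, they belong to
            # this definition: just keep popping them.
            while lines and lines[0].startswith("  "):
                body += " " + lines.pop(0).strip()
            defs.append(body)
        else:
            rest.append(line)
    return defs, rest

defs, rest = parse_definitions(["[def] first part", "  continued", "plain text"])
print(defs)  # ['first part continued']
print(rest)  # ['plain text']
```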
(End of digression)
The next step, parsing the high-level elements, is this part:
core.py
```python
class Markdown:
    def convert(self, source):
        # <snip>
        # Parse the high-level elements.
        root = self.parser.parseDocument(self.lines).getroot()
```
Here, self.parser is a BlockParser built by the following function.
blockprocessors.py
```python
def build_block_parser(md, **kwargs):
    """ Build the default block parser used by Markdown. """
    parser = BlockParser(md)
    parser.blockprocessors.register(EmptyBlockProcessor(parser), 'empty', 100)
    parser.blockprocessors.register(ListIndentProcessor(parser), 'indent', 90)
    parser.blockprocessors.register(CodeBlockProcessor(parser), 'code', 80)
    parser.blockprocessors.register(HashHeaderProcessor(parser), 'hashheader', 70)
    parser.blockprocessors.register(SetextHeaderProcessor(parser), 'setextheader', 60)
    parser.blockprocessors.register(HRProcessor(parser), 'hr', 50)
    parser.blockprocessors.register(OListProcessor(parser), 'olist', 40)
    parser.blockprocessors.register(UListProcessor(parser), 'ulist', 30)
    parser.blockprocessors.register(BlockQuoteProcessor(parser), 'quote', 20)
    parser.blockprocessors.register(ParagraphProcessor(parser), 'paragraph', 10)
    return parser
```
Processors are also registered here along with their priorities.
Here is BlockParser.parseDocument():
blockparser.py
```python
class BlockParser:
    def __init__(self, md):
        self.blockprocessors = util.Registry()
        self.state = State()
        self.md = md

    # Builds the ElementTree
    def parseDocument(self, lines):
        self.root = etree.Element(self.md.doc_tag)
        self.parseChunk(self.root, '\n'.join(lines))
        return etree.ElementTree(self.root)

    def parseChunk(self, parent, text):
        self.parseBlocks(parent, text.split('\n\n'))

    def parseBlocks(self, parent, blocks):
        while blocks:
            for processor in self.blockprocessors:
                if processor.test(parent, blocks[0]):
                    if processor.run(parent, blocks) is not False:
                        break
```
In short, in this part of

core.py
```python
root = self.parser.parseDocument(self.lines).getroot()
```

each BlockProcessor is made to process the blocks in turn.
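The dispatch loop above can be sketched with toy processors (my own illustration; the real processors take `parent` and build ElementTree nodes):

```python
# Each processor declares test() and run(); the first one whose test()
# matches handles the block. run() pops the block it consumed.

class HeaderProc:
    def test(self, block):
        return block.startswith("#")
    def run(self, out, blocks):
        out.append(("h1", blocks.pop(0).lstrip("# ")))

class ParagraphProc:
    def test(self, block):
        return True  # always matches, like the low-priority ParagraphProcessor
    def run(self, out, blocks):
        out.append(("p", blocks.pop(0)))

def parse_blocks(blocks):
    processors = [HeaderProc(), ParagraphProc()]  # priority order
    out = []
    while blocks:
        for proc in processors:
            if proc.test(blocks[0]):
                proc.run(out, blocks)
                break
    return out

print(parse_blocks(["# Title", "hello world"]))
# [('h1', 'Title'), ('p', 'hello world')]
```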
For example, the processor that handles the hash header format ("# header") is defined as follows:
blockprocessors.py
```python
class HashHeaderProcessor(BlockProcessor):
    """ Process Hash Headers. """
    RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            # --- from here ---
            before = block[:m.start()]
            after = block[m.end():]
            if before:
                # Recursively process only the "before" part
                self.parser.parseBlocks(parent, [before])
            h = etree.SubElement(parent, 'h%d' % len(m.group('level')))
            h.text = m.group('header').strip()
            if after:
                # Push "after" back onto the front of blocks to process later
                blocks.insert(0, after)
            # --- this is the core ---
        else:
            logger.warn("We've got a problem header: %r" % block)
```
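We can poke at the header regex directly (the pattern is copied from the excerpt above; the surrounding code is mine):

```python
import re

# HashHeaderProcessor's regex, copied from the excerpt above.
RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

m = RE.search("## My Header ##")
print(len(m.group('level')))      # 2  -> becomes an <h2>
print(m.group('header').strip())  # 'My Header' (trailing #s are dropped)
```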
Next, let's look at the processor for quote blocks in the "> text" format. What a quote block has to think about: consecutive quoted lines are treated as a single block, and the contents of the block must themselves be parsed.
blockprocessors.py
```python
class BlockQuoteProcessor(BlockProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            before = block[:m.start()]
            # Same as in HashHeaderProcessor
            self.parser.parseBlocks(parent, [before])
            # Delete the leading ">" of each line
            block = '\n'.join(
                [self.clean(line) for line in block[m.start():].split('\n')]
            )
        # Is this the continuation of a quote block, or its beginning?
        sibling = self.lastChild(parent)
        if sibling is not None and sibling.tag == "blockquote":
            quote = sibling
        else:
            quote = etree.SubElement(parent, 'blockquote')
        self.parser.state.set('blockquote')
        # Parse the contents of the quote block, with quote as the parent
        self.parser.parseChunk(quote, block)
        self.parser.state.reset()
```
(By the way, "sibling" is the word for a brother or sister.)
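The ">" stripping step can be tried in isolation (the regex is from the excerpt; the `clean` helper here is my simplified sketch of what `BlockQuoteProcessor.clean` does):

```python
import re

# BlockQuoteProcessor's regex, copied from the excerpt above.
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')

def clean(line):
    """Strip the leading '>' marker from one line (simplified sketch)."""
    m = RE.match(line)
    return m.group(2) if m else line

block = "> first line\n> second line"
print([clean(line) for line in block.split("\n")])
# ['first line', 'second line']
```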
We've now looked at two processor classes, but where are the parsing results stored? The `etree.SubElement(parent, <tagname>)` part is the suspect. In fact, `etree` here is `xml.etree.ElementTree` from the Python standard library. `etree.SubElement(parent, <tagname>)` adds a child element under `parent`, so as processing progresses, the results accumulate in the tree rooted at `BlockParser().root`.
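A quick standard-library demonstration of how `SubElement` accumulates results under a parent:

```python
import xml.etree.ElementTree as etree

# SubElement creates a child node and attaches it to the parent,
# so repeated calls build up the document tree.
root = etree.Element('div')
h = etree.SubElement(root, 'h1')
h.text = 'Title'
p = etree.SubElement(root, 'p')
p.text = 'body text'
print(etree.tostring(root, encoding='unicode'))
# <div><h1>Title</h1><p>body text</p></div>
```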
Next, just as before, the treeprocessors are applied to this tree.
treeprocessors.py
```python
def build_treeprocessors(md, **kwargs):
    """ Build the default treeprocessors for Markdown. """
    treeprocessors = util.Registry()
    treeprocessors.register(InlineProcessor(md), 'inline', 20)
    treeprocessors.register(PrettifyTreeprocessor(md), 'prettify', 10)
    return treeprocessors
```
- `InlineProcessor`: processing for inline elements
- `PrettifyTreeprocessor`: tidying up line breaks and the like
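A toy sketch of what a treeprocessor does (my own illustration, not `PrettifyTreeprocessor`'s actual logic): walk the finished ElementTree and rewrite nodes in place.

```python
import xml.etree.ElementTree as etree

def toy_treeprocessor(root):
    # Visit every <p> in the tree and tag it with a class attribute.
    for el in root.iter('p'):
        el.set('class', 'md-paragraph')
    return root

root = etree.fromstring('<div><p>one</p><p>two</p></div>')
toy_treeprocessor(root)
print(etree.tostring(root, encoding='unicode'))
# <div><p class="md-paragraph">one</p><p class="md-paragraph">two</p></div>
```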
Huh, that's it!? Well, we've now seen the rough design of this library. If there's a part you're curious about beyond this, it's best to read it for yourself... I'm a little tired. Thank you for reading to the end...!