I decided (abruptly) to learn how a Markdown parser library is implemented. Java's DOM parser is said to have a beautiful design (design patterns), but I started by looking for a Python library, since that's what I'm used to. (For reference, the Java one: https://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm)
This time I'll read the source of the following library.

Python-Markdown https://github.com/Python-Markdown

It converts Markdown -> HTML. Of course, that means it analyzes Markdown internally, so let's see what kind of design it has! If you spot anything wrong with my understanding, please point it out...!
The core functionality is collected under `Python-Markdown/markdown/markdown/`. Since writing the full path every time would be tedious, files under this directory will be referred to by filename only (like `sample.py`). Also, the source code quoted below is excerpted to just the necessary parts (with my own comments and notes added).
The bottom line: each processor that detects a particular element is run block by block, where a block is a chunk of the original text separated by "\n\n". For example:

<b>The tag is unclosed, so this is bold

Not bold</b>

If a blank line is inserted like this (i.e. "\n\n" appears), the element's effective range is cut off there. In this example, the text is first split into ["<b>The tag is unclosed, so this is bold", "Not bold</b>"], then the processors that detect each element are run in order on the first block (making it bold), and then processing moves on to the next block... that's the rough flow.
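The block split described above can be reproduced with plain `str.split` (a standalone illustration, not library code):

```python
# Splitting source text into blocks on blank lines ("\n\n"):
text = "<b>The tag is unclosed, so bold\n\nNot bold</b>"
blocks = text.split("\n\n")
print(blocks)  # ['<b>The tag is unclosed, so bold', 'Not bold</b>']
```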
At the heart of this library's user-facing interface are the Markdown class and its convert method.
core.py
```python
class Markdown:
    # Conversion to HTML happens here
    def convert(self, source):
        # source: the Markdown text
```
The convert method has a docstring describing the pipeline. Roughly: (1) the preprocessors transform the text, (2) the high-level structural elements of the preprocessed text are parsed into an ElementTree (with treeprocessors and postprocessors following after that). Honestly, the original comment may be easier to read than my summary (lol).
core.py
```python
class Markdown:
    def convert(self, source):
        # <snip>
        self.lines = source.split("\n")
        for prep in self.preprocessors:
            self.lines = prep.run(self.lines)
```
First, the source is split into lines and each preprocessor is applied in turn. The preprocessors are built by the following function.
preprocessors.py
```python
def build_preprocessors(md, **kwargs):
    """ Build the default set of preprocessors used by Markdown. """
    preprocessors = util.Registry()
    preprocessors.register(NormalizeWhitespace(md), 'normalize_whitespace', 30)
    preprocessors.register(HtmlBlockPreprocessor(md), 'html_block', 20)
    preprocessors.register(ReferencePreprocessor(md), 'reference', 10)
    return preprocessors
```
The preprocessors are registered with priorities and run in priority order: NormalizeWhitespace > HtmlBlockPreprocessor > ReferencePreprocessor. As the names suggest:

- NormalizeWhitespace: normalizes whitespace and line-ending characters (an implementation of about 10 lines)
- HtmlBlockPreprocessor: parses raw HTML blocks (250 lines...!)
- ReferencePreprocessor: finds reference-style link definitions (the `[id]: URL "Title"` format) and registers them in the Markdown.references dictionary (about 30 lines)
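To see how priority-ordered registration behaves, here is a toy registry (my own minimal stand-in, not the library's actual `util.Registry`):

```python
class ToyRegistry:
    """Minimal stand-in for markdown.util.Registry (illustration only)."""
    def __init__(self):
        self._items = []  # list of (priority, name, item)

    def register(self, item, name, priority):
        self._items.append((priority, name, item))

    def __iter__(self):
        # Iterate from highest to lowest priority, which is the
        # order the library runs its processors in.
        return (item for priority, name, item in
                sorted(self._items, key=lambda t: -t[0]))

reg = ToyRegistry()
reg.register("reference", "reference", 10)
reg.register("normalize_whitespace", "normalize_whitespace", 30)
reg.register("html_block", "html_block", 20)
print(list(reg))  # ['normalize_whitespace', 'html_block', 'reference']
```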
I'll skip the contents of HtmlBlockPreprocessor here; similar element-detection processing is waiting for us in steps 2 and 3 anyway...
(Digression)

As in ReferencePreprocessor, parsers frequently hit the situation where you must decide whether the next line is still part of the current element. When I wrote one myself (as a study exercise), I did it with an index:

```python
while lines:
    line_num += 1
    line = self.lines[line_num]
    # <processing>
    if <the content continues onto the next line>:
        line_num += 1
        line = self.lines[line_num]
```
ReferencePreprocessor, however, just uses pop:

```python
while lines:
    line = lines.pop(0)
```

Yeah, I should have done it that way... (sweat)
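Here is a small self-contained sketch (my own toy code, not the library's) of why `pop(0)` makes the "consume the next line too" case painless:

```python
def parse_definitions(lines):
    """Toy example: a definition's body may continue on the next line.
    With pop(0), consuming one more line is trivial."""
    defs = []
    rest = []
    while lines:
        line = lines.pop(0)
        if line.startswith("[def]"):
            body = line[len("[def]"):].strip()
            # If the following lines are indented, they belong to
            # this definition: just keep popping them.
            while lines and lines[0].startswith("  "):
                body += " " + lines.pop(0).strip()
            defs.append(body)
        else:
            rest.append(line)
    return defs, rest

defs, rest = parse_definitions(["[def] first part", "  continued", "plain text"])
print(defs)  # ['first part continued']
print(rest)  # ['plain text']
```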
(End of digression)
The next step, parsing the high-level elements, is this part:
core.py
```python
class Markdown:
    def convert(self, source):
        # <snip>
        # Parse the high-level elements.
        root = self.parser.parseDocument(self.lines).getroot()
```
Here, self.parser is a BlockParser built by the following function.
blockprocessors.py
```python
def build_block_parser(md, **kwargs):
    """ Build the default block parser used by Markdown. """
    parser = BlockParser(md)
    parser.blockprocessors.register(EmptyBlockProcessor(parser), 'empty', 100)
    parser.blockprocessors.register(ListIndentProcessor(parser), 'indent', 90)
    parser.blockprocessors.register(CodeBlockProcessor(parser), 'code', 80)
    parser.blockprocessors.register(HashHeaderProcessor(parser), 'hashheader', 70)
    parser.blockprocessors.register(SetextHeaderProcessor(parser), 'setextheader', 60)
    parser.blockprocessors.register(HRProcessor(parser), 'hr', 50)
    parser.blockprocessors.register(OListProcessor(parser), 'olist', 40)
    parser.blockprocessors.register(UListProcessor(parser), 'ulist', 30)
    parser.blockprocessors.register(BlockQuoteProcessor(parser), 'quote', 20)
    parser.blockprocessors.register(ParagraphProcessor(parser), 'paragraph', 10)
    return parser
```
Processors are also registered here along with their priorities.
Here is BlockParser.parseDocument():
blockparser.py
```python
class BlockParser:
    def __init__(self, md):
        self.blockprocessors = util.Registry()
        self.state = State()
        self.md = md

    # Builds the ElementTree
    def parseDocument(self, lines):
        self.root = etree.Element(self.md.doc_tag)
        self.parseChunk(self.root, '\n'.join(lines))
        return etree.ElementTree(self.root)

    def parseChunk(self, parent, text):
        self.parseBlocks(parent, text.split('\n\n'))

    def parseBlocks(self, parent, blocks):
        while blocks:
            for processor in self.blockprocessors:
                if processor.test(parent, blocks[0]):
                    if processor.run(parent, blocks) is not False:
                        break
```
In short, in this part of

core.py
```python
root = self.parser.parseDocument(self.lines).getroot()
```

each BlockProcessor is made to process the blocks in turn.
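The dispatch loop above can be sketched with toy processors (my own illustration; the real processors take `parent` and build ElementTree nodes):

```python
# Each processor declares test() and run(); the first one whose test()
# matches handles the block. run() pops the block it consumed.

class HeaderProc:
    def test(self, block):
        return block.startswith("#")
    def run(self, out, blocks):
        out.append(("h1", blocks.pop(0).lstrip("# ")))

class ParagraphProc:
    def test(self, block):
        return True  # always matches, like the low-priority ParagraphProcessor
    def run(self, out, blocks):
        out.append(("p", blocks.pop(0)))

def parse_blocks(blocks):
    processors = [HeaderProc(), ParagraphProc()]  # priority order
    out = []
    while blocks:
        for proc in processors:
            if proc.test(blocks[0]):
                proc.run(out, blocks)
                break
    return out

print(parse_blocks(["# Title", "hello world"]))
# [('h1', 'Title'), ('p', 'hello world')]
```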
For example, the processor that handles the hash header format ("# header") is defined as follows:
blockprocessors.py
```python
class HashHeaderProcessor(BlockProcessor):
    """ Process Hash Headers. """
    RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            # --- from here ---
            before = block[:m.start()]
            after = block[m.end():]
            if before:
                # Recursively process only the "before" part
                self.parser.parseBlocks(parent, [before])
            h = etree.SubElement(parent, 'h%d' % len(m.group('level')))
            h.text = m.group('header').strip()
            if after:
                # Push "after" back onto the front of blocks to process later
                blocks.insert(0, after)
            # --- this is the core ---
        else:
            logger.warn("We've got a problem header: %r" % block)
```
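We can poke at the header regex directly (the pattern is copied from the excerpt above; the surrounding code is mine):

```python
import re

# HashHeaderProcessor's regex, copied from the excerpt above.
RE = re.compile(r'(?:^|\n)(?P<level>#{1,6})(?P<header>(?:\\.|[^\\])*?)#*(?:\n|$)')

m = RE.search("## My Header ##")
print(len(m.group('level')))      # 2  -> becomes an <h2>
print(m.group('header').strip())  # 'My Header' (trailing #s are dropped)
```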
Next, let's look at the processor for quote blocks in the "> text" format. What a quote block has to think about: consecutive quoted lines are treated as a single block, and the contents of the block must themselves be parsed.
blockprocessors.py
```python
class BlockQuoteProcessor(BlockProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')

    def test(self, parent, block):
        return bool(self.RE.search(block))

    def run(self, parent, blocks):
        block = blocks.pop(0)
        m = self.RE.search(block)
        if m:
            before = block[:m.start()]
            # Same as in HashHeaderProcessor
            self.parser.parseBlocks(parent, [before])
            # Delete the leading ">" of each line
            block = '\n'.join(
                [self.clean(line) for line in block[m.start():].split('\n')]
            )
        # Is this the continuation of a quote block, or its beginning?
        sibling = self.lastChild(parent)
        if sibling is not None and sibling.tag == "blockquote":
            quote = sibling
        else:
            quote = etree.SubElement(parent, 'blockquote')
        self.parser.state.set('blockquote')
        # Parse the contents of the quote block, with quote as the parent
        self.parser.parseChunk(quote, block)
        self.parser.state.reset()
```
(By the way, "sibling" is the word for a brother or sister.)
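The ">" stripping step can be tried in isolation (the regex is from the excerpt; the `clean` helper here is my simplified sketch of what `BlockQuoteProcessor.clean` does):

```python
import re

# BlockQuoteProcessor's regex, copied from the excerpt above.
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')

def clean(line):
    """Strip the leading '>' marker from one line (simplified sketch)."""
    m = RE.match(line)
    return m.group(2) if m else line

block = "> first line\n> second line"
print([clean(line) for line in block.split("\n")])
# ['first line', 'second line']
```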
We've now looked at two processor classes, but where are the parsing results stored? The `etree.SubElement(parent, <tagname>)` part is the suspect. In fact, `etree` here is `xml.etree.ElementTree` from the Python standard library. `etree.SubElement(parent, <tagname>)` adds a child element under `parent`, so as processing progresses, the results accumulate in the tree rooted at `BlockParser().root`.
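A quick standard-library demonstration of how `SubElement` accumulates results under a parent:

```python
import xml.etree.ElementTree as etree

# SubElement creates a child node and attaches it to the parent,
# so repeated calls build up the document tree.
root = etree.Element('div')
h = etree.SubElement(root, 'h1')
h.text = 'Title'
p = etree.SubElement(root, 'p')
p.text = 'body text'
print(etree.tostring(root, encoding='unicode'))
# <div><h1>Title</h1><p>body text</p></div>
```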
Next, just as before, the treeprocessors are applied to this tree.
treeprocessors.py
```python
def build_treeprocessors(md, **kwargs):
    """ Build the default treeprocessors for Markdown. """
    treeprocessors = util.Registry()
    treeprocessors.register(InlineProcessor(md), 'inline', 20)
    treeprocessors.register(PrettifyTreeprocessor(md), 'prettify', 10)
    return treeprocessors
```
- `InlineProcessor`: processing for inline elements
- `PrettifyTreeprocessor`: tidying up line breaks and the like
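A toy sketch of what a treeprocessor does (my own illustration, not `PrettifyTreeprocessor`'s actual logic): walk the finished ElementTree and rewrite nodes in place.

```python
import xml.etree.ElementTree as etree

def toy_treeprocessor(root):
    # Visit every <p> in the tree and tag it with a class attribute.
    for el in root.iter('p'):
        el.set('class', 'md-paragraph')
    return root

root = etree.fromstring('<div><p>one</p><p>two</p></div>')
toy_treeprocessor(root)
print(etree.tostring(root, encoding='unicode'))
# <div><p class="md-paragraph">one</p><p class="md-paragraph">two</p></div>
```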
Huh, that's it!? Well, we've now seen the rough design of this library. If there's a part you're curious about beyond this, it's best to read it for yourself... I'm a little tired. Thank you for reading to the end...!