Speed comparison of Python XML parsing

Python's ElementTree XML API provides several parsing methods. This article compares them with a focus on speed.

The environment used for the measurement is as follows.

Overview

Python has an ElementTree XML API for working with XML.

It offers four main parsing methods:

  1. fromstring
  2. XMLParser
  3. XMLPullParser
  4. iterparse

These are chosen depending on requirements such as memory efficiency and avoiding blocking. Here I ignore those considerations and focus only on the processing time for huge XML.
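
To see what the four APIs look like before measuring anything, here is a minimal sketch on a tiny XML string (not taken from this article's scripts); the same patterns are applied to the dump data later.

import io
import xml.etree.ElementTree as ET

xml = "<root><item>a</item><item>b</item></root>"

# 1. fromstring: build a full tree, then query it
root = ET.fromstring(xml)
print([e.text for e in root.iter("item")])       # ['a', 'b']

# 2. XMLParser: push type; events are delivered to a target object
class Collector:
    def __init__(self): self.texts = []
    def start(self, tag, attrib): pass
    def data(self, data): self.texts.append(data)
    def end(self, tag): pass
    def close(self): return self.texts

parser = ET.XMLParser(target=Collector())
parser.feed(xml)
print(parser.close())                            # ['a', 'b']

# 3. XMLPullParser: pull type; consume events in a loop
pull = ET.XMLPullParser()
pull.feed(xml)
print([el.text for ev, el in pull.read_events() if el.tag == "item"])

# 4. iterparse: pull type, reading from a file-like object
print([el.text for ev, el in ET.iterparse(io.StringIO(xml)) if el.tag == "item"])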

Target

The target is a dump file of the English Wiktionary. The dump data is distributed compressed with bzip2 and is used here as-is, without decompressing the whole file. (It is about 6 GB when expanded.)

The dump file has the following tag structure.

<mediawiki>
    <siteinfo> ⋯ </siteinfo>
    <page> ⋯ </page>
    <page> ⋯ </page>
           ⋮
    <page> ⋯ </page>
</mediawiki>

The whole file is so large that it is compressed as multiple bzip2 streams (multi-stream), so that individual parts can be extracted and processed on their own.

Stream layout: | siteinfo | page × 100 | page × 100 | ⋯

This time, the streams are fetched and processed one by one until the whole file has been read.

The second and subsequent streams each contain 100 <page> tags:

    <page> ⋯ </page>
    <page> ⋯ </page>
           ⋮
    <page> ⋯ </page>

Each <page> tag has the following structure.

  <page>
    <title>title</title>
    <ns>Namespace</ns>
    <id>ID</id>
    <revision>
      <id>Revision ID</id>
      <parentid>Revision ID before update</parentid>
      <timestamp>Update date and time</timestamp>
      <contributor>
        <username>Editor</username>
        <id>Editor ID</id>
      </contributor>
      <comment>Comments at the time of editing</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="length" xml:space="preserve">Article content</text>
      <sha1>Hash value</sha1>
    </revision>
  </page>

Of these, the contents of the <text> tag are extracted and the number of lines is counted. This is part of a survey conducted in the following articles.

The scripts used in this article are contained in the following repositories:

A getpages function that extracts <id> and <text> from each <page> is implemented with each of the parsing methods.

The part that counts the number of lines using getpages is shared by all versions.

Common part


# imports shared by all the snippets in this article
import bz2
import io
import xml.etree.ElementTree as ET

# target: path to the bzip2-compressed multi-stream dump file
# slen:   byte lengths of the individual bzip2 streams (precomputed elsewhere)
lines = 0
with open(target, "rb") as f:
    f.seek(slen[0])                      # skip the first stream (siteinfo)
    for length in slen[1:-1]:            # read each subsequent stream as one chunk
        for id, text in getpages(f.read(length)):
            for line in text():
                lines += 1
print(f"lines: {lines:,}")
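
slen (the list of per-stream byte lengths) is assumed to be precomputed. As one possible way to obtain it, the sketch below splits a multi-stream bzip2 file into per-stream byte lengths using bz2.BZ2Decompressor and its unused_data attribute; this is an illustration only, not the code used for the measurements.

import bz2

def stream_lengths(path, chunk_size=1 << 20):
    """Return the byte length of each bzip2 stream in a multi-stream file."""
    lengths, cur = [], 0
    dec = bz2.BZ2Decompressor()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            while data:
                dec.decompress(data)
                if dec.eof:
                    # end of one stream: count only the bytes it consumed
                    used = len(data) - len(dec.unused_data)
                    lengths.append(cur + used)
                    cur, data = 0, dec.unused_data
                    dec = bz2.BZ2Decompressor()   # restart for the next stream
                else:
                    cur += len(data)
                    data = b""                    # need more input
    return lengths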

fromstring

fromstring builds a tree from an XML string passed to it. You then work with the tree (searching for tags, retrieving their contents, and so on).

Code                                     Processing time
python/research/countlines-text-xml.py   5m50.826s

def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    pages = ET.fromstring(f"<pages>{xml}</pages>")   # wrap in a temporary root tag
    for page in pages:
        if int(page.find("ns").text) == 0:           # ns 0 = normal article page
            id = int(page.find("id").text)
            with io.StringIO(page.find("revision/text").text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text

Since XML requires a single root tag, the content is temporarily wrapped in <pages>. Removing this wrapper causes a parse error.
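
For illustration, a fragment with multiple top-level elements fails to parse until it is wrapped (a tiny standalone example, not from the article's scripts):

import xml.etree.ElementTree as ET

fragment = "<page>a</page><page>b</page>"
try:
    ET.fromstring(fragment)              # no single root tag
except ET.ParseError as e:
    print("ParseError:", e)              # e.g. "junk after document element"

wrapped = ET.fromstring(f"<pages>{fragment}</pages>")
print(len(wrapped))                      # 2 <page> elements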

<ns> represents the page type. Since 0 means a normal article page, processing is narrowed down to those.

There are several <id> tags. page.find("id") matches only the <id> directly under <page>. Because <text> is inside <revision>, it is specified as page.find("revision/text"). This style of path specification is a (limited) form of XPath.
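
A small example of the difference between find("id") and the path form (again just an illustration):

import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<page><id>10</id><revision><id>99</id><text>body</text></revision></page>")
print(page.find("id").text)              # 10: only the <id> directly under <page>
print(page.find("revision/id").text)     # 99: descends into <revision>
print(page.find("revision/text").text)   # body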

XMLParser

XMLParser is the parser that fromstring uses internally to build the tree. Using it directly should be fast because you can skip building the tree and just parse.

It is a push-type parser, essentially the same approach as SAX. You prepare a class with methods that handle events such as the start and end of tags.

Code                                          Processing time
python/research/countlines-text-xmlparser.py  6m46.163s

class XMLTarget:
    def __init__(self):
        self._ns   = 0
        self._id   = 0
        self._data = []      # text chunks of the tag currently being read
        self.pages = []      # collected (page ID, list of lines) tuples

    def start(self, tag, attrib):
        self._data = []

    def data(self, data):
        self._data.append(data)

    def end(self, tag):
        if tag == "ns":
            self._ns = int(self._data[0])
            self._id = 0                      # reset so the next <id> is the page ID
        elif self._id == 0 and tag == "id":
            self._id = int(self._data[0])
        elif self._ns == 0 and tag == "text":
            # reassemble the text chunks into whole lines
            text = []
            cur = ""
            for d in self._data:
                if d == "\n":
                    text.append(cur)
                    cur = ""
                else:
                    cur += d
            if cur: text.append(cur)
            self.pages.append((self._id, text))
        self._data = []

def getpages(bz2data):
    target = XMLTarget()
    parser = ET.XMLParser(target=target)
    parser.feed("<pages>")    # temporary root tag
    parser.feed(bz2.decompress(bz2data).decode("utf-8"))
    # wrap each line list in a callable so the shared counting loop
    # (for line in text()) works the same way as with the other versions
    return [(id, (lambda lines=lines: iter(lines))) for id, lines in target.pages]

data receives the text content of a tag. Because the content is delivered in chunks, split at long runs of text or at line breaks, the chunks are reassembled into lines.
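
The chunking can be observed with a tiny target that just prints what it receives; the exact boundaries are up to the underlying expat parser, but with this data the text arrived split at line breaks (illustration only, not from the article's scripts):

import xml.etree.ElementTree as ET

class ShowChunks:
    def start(self, tag, attrib): pass
    def end(self, tag): pass
    def data(self, data):
        print(repr(data))        # each call receives one chunk of character data
    def close(self): pass

p = ET.XMLParser(target=ShowChunks())
p.feed("<t>line one\nline two\n</t>")
p.close()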

Class variables

The documentation shows sample code similar to the following.

>>> class MaxDepth:                     # The target object of the parser
...     maxDepth = 0
...     depth = 0
...     def start(self, tag, attrib):   # Called for each opening tag.
...         self.depth += 1
...         if self.depth > self.maxDepth:
...             self.maxDepth = self.depth
...     def end(self, tag):             # Called for each closing tag.
...         self.depth -= 1
...     def data(self, data):
...         pass            # We do not need to do anything with data.
...     def close(self):    # Called when all data has been parsed.
...         return self.maxDepth

The variables maxDepth and depth defined directly in the class body are class variables and are shared by all instances. In this code, updates such as += and = create instance variables that shadow the class variables, so the class variables themselves are never updated and stay at 0.

But the story is different if the class variable is a list and is updated with a method such as append: the shared class-level list is mutated directly. I got tripped up by this and at first could not get my code to behave as expected.
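
A minimal reproduction of the pitfall (not from the article's scripts): an int class variable is shadowed by += on the instance, while a list class variable is shared and mutated by append.

class Counter:
    count = 0                 # class variable (int)
    items = []                # class variable (list): shared by all instances!

    def add(self, x):
        self.count += 1       # binds a new instance variable; Counter.count stays 0
        self.items.append(x)  # mutates the one shared list

a, b = Counter(), Counter()
a.add("x")
b.add("y")
print(Counter.count, a.count, b.count)   # 0 1 1
print(a.items is b.items, a.items)       # True ['x', 'y']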

XMLPullParser

It is a pull-type parser. In Java, the pull-style counterpart of SAX is called StAX.

Unlike the push type, there is no need to prepare a class; the events can be handled with ordinary loops and conditionals, which keeps the code simple.

[Reference] Comparison between XmlReader and SAX Reader | Microsoft Docs

Code                                        Processing time
python/research/countlines-text-xmlpull.py  6m4.553s

def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    parser = ET.XMLPullParser()
    parser.feed("<pages>")
    parser.feed(xml)
    ns, id = 0, 0
    for ev, el in parser.read_events():
        if el.tag == "ns":
            ns = int(el.text)
            id = 0
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text

Since there are several <id> tags, id is reset to 0 when <ns> is seen and checked against 0, so that only the first <id> after <ns> (the page ID) is picked up.

iterparse

iterparse is also a pull-type parser. The documentation explains the difference from XMLPullParser as follows.

XMLPullParser is so flexible that it may be inconvenient for users who simply want to use it. If your application does not have to block when reading XML data but wants the ability to parse incrementally, see iterparse(). This is useful if you are reading a large XML document and do not want to hold it all in memory.

iterparse() returns an iterator, and the parse advances each time you call next() on it. The structure of the code is almost the same as with XMLPullParser.

Code                                        Processing time
python/research/countlines-text-xmliter.py  6m29.298s

def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    ns, id = 0, 0
    for ev, el in ET.iterparse(io.StringIO(f"<pages>{xml}</pages>")):
        if el.tag == "ns":
            ns = int(el.text)
            id = 0
        elif id == 0 and el.tag == "id":
            id = int(el.text)
        elif ns == 0 and el.tag == "text":
            with io.StringIO(el.text) as t:
                def text():
                    while (line := t.readline()):
                        yield line
                yield id, text

Summary

Of all the methods I tried, the tree-building fromstring was the fastest. However, the differences are not large, so unless you are handling data of this scale (about 6 GB), the choice should not matter much.

Method         Code                                          Processing time
fromstring     python/research/countlines-text-xml.py        5m50.826s
XMLParser      python/research/countlines-text-xmlparser.py  6m46.163s
XMLPullParser  python/research/countlines-text-xmlpull.py    6m04.553s
iterparse      python/research/countlines-text-xmliter.py    6m29.298s

For reference, in the .NET Framework the tree-building XmlDocument also uses the pull-type XmlReader internally, and it is faster to use XmlReader directly without building a tree.

Method       Code                                              Processing time  Remarks
XmlDocument  fsharp/research/countlines-text-split-doc.fsx    6m21.588s        builds a tree
XmlReader    fsharp/research/countlines-text-split-reader.fsx 3m17.916s        no tree
