How to save memory when reading huge XML of several GB or more in Python

Introduction

JSON has long been the mainstream format for exchanging machine-readable data, but some data is still distributed as XML (for example, data published by long-established institutions). In natural language processing, too, the parser CaboCha has an option (-f 3) that outputs its analysis as XML, and some people use it because the result is easier to process than the so-called lattice format.

I was in the latter situation. I had written the parse results of a large corpus to XML, but when I tried to process the resulting 8 GB file on a machine with 64 GB of memory, memory filled up and the process stalled partway through, without even raising an error. That came as a bit of a surprise, because I had gone out of my way to fit the machine with 64 GB.

The XML in question is a flat list: many <item> elements hang directly under a single <root> element. This layout is apparently also called a record format.

<root>
    <item>...</item>
    <item>...</item>
    ...
    <item>...</item>
</root>

Each item can be processed independently of the others, so it is enough to look at them one at a time. As many of you know, an iterator (generator) is memory-friendly when data of this kind is huge. The library that handles XML does, of course, provide a way to read an XML file as an iterator, but using it properly takes a small trick.

XML in the Python standard library

The easiest way to work with XML in Python is the standard xml.etree.ElementTree module. The well-known BeautifulSoup is another option, but it is specialized for HTML, and it mis-parses parts of the XML I want to handle [^1]. After getting burned by that, I settled on the standard library. This article describes what to watch out for when parsing XML iteratively with the standard library's xml package.

Normal usage (everything is loaded into memory)

This is the ordinary usage, without an iterator.

import xml.etree.ElementTree as ET

tree = ET.parse('path/to/xml')

for item in tree.iterfind('item'):
    # do something on item
    pass

Here .iterfind() iterates over the <item> elements in the XML tree. But just before that, ET.parse() has already read the entire file, much like file.readlines(), so it consumes a lot of memory.
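
As a small illustration of the point above, the following sketch parses a tiny in-memory document instead of a multi-gigabyte file. The SAMPLE data and the variable names are hypothetical and exist only for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a huge file.
SAMPLE = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(5)) + b"</root>"

tree = ET.parse(io.BytesIO(SAMPLE))  # the whole document is parsed here
root = tree.getroot()

# Every <item> already exists as an Element before the loop below starts.
items_in_memory = len(root)

# iterfind() is lazy, but it only walks a tree that is already fully built.
texts = [item.text for item in tree.iterfind('item')]
```

With a real file, the cost is paid at the ET.parse() line, so making the loop lazy does not help.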

Iterating (but still eating memory)

This is how to read the file while iterating.

import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml')

for event, elem in context:
    if elem.tag == 'item':
        # do something on item
        pass

Changing ET.parse() to ET.iterparse() reads the XML at the given path as an iterator. The file is read tag by tag, and by default context yields an (event, elem) pair only when a closing tag is reached: event == 'end', and elem is the element that has just been closed.
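
To make the default behavior concrete, here is a minimal sketch that lists the pairs produced for a tiny in-memory document. SAMPLE and events_seen are names made up for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-item document.
SAMPLE = b"<root><item>a</item><item>b</item></root>"

# With no events argument, only 'end' events are reported.
events_seen = [(event, elem.tag) for event, elem in ET.iterparse(io.BytesIO(SAMPLE))]
# events_seen == [('end', 'item'), ('end', 'item'), ('end', 'root')]
```

Note that the root element is reported last, because its closing tag is the last one in the file.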

Now memory is saved! Or so you might think, but that is a big mistake. In fact, even if # do something on item is just pass, this uses as much memory as the "normal usage" above.

**Even though you are iterating, context keeps every element it has read so far.**

Hidden inside the iterator is a reference called context.root. I did not know about it, because it was not even mentioned in the official documentation. Perhaps some people are glad that, unlike with an ordinary generator, the elements can be accessed again later. And I can imagine that some such mechanism is needed to read and hold the nested structure of XML.
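
The retention can be observed without touching context.root directly. The sketch below, which uses a hypothetical in-memory SAMPLE, relies only on the fact that the last 'end' event is the one for the root element.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: 1000 small items under one root.
SAMPLE = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"

last = None
for event, elem in ET.iterparse(io.BytesIO(SAMPLE)):
    last = elem  # do nothing with the items, just remember the latest element

# The final 'end' event belongs to <root>, and it still holds every child.
retained = len(last)
```

Here retained equals the number of items, which shows that nothing was released while iterating.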

Iterating (without eating memory)

So what should you do? The tip was on the official page from long ago, from the days when this was a separate library named ElementTree, before it was incorporated into the standard library in Python 2.5. I only came to Python with version 3, so I had no idea.

import xml.etree.ElementTree as ET

context = ET.iterparse('path/to/xml', events=('start', 'end'))

_, root = next(context)  # advance one step to get the root element

for event, elem in context:
    if event == 'end' and elem.tag == 'item':
        # do something on item
        root.clear()  # empty root once the item has been processed

ET.iterparse() accepts the keyword argument events, and if you include 'start' in it, you are also notified of opening tags. The first opening tag is <root>, so save that element in a variable. The value discarded into _ at that point is the string 'start'.
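
As a quick check of the event order described above, this sketch records what iterparse reports for a tiny hypothetical document when both 'start' and 'end' are requested.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical one-item document.
SAMPLE = b"<root><item>a</item></root>"

context = ET.iterparse(io.BytesIO(SAMPLE), events=('start', 'end'))

first_event, root = next(context)  # the very first pair is the start of <root>
remaining = [(event, elem.tag) for event, elem in context]
# remaining == [('start', 'item'), ('end', 'item'), ('end', 'root')]
```

On a 'start' event only the tag and attributes are reliable. The text and children may not have been read yet, which is why the item itself is processed on 'end'.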

Once you have taken root [^2], you can call .clear() on it after each item to release the processed elements from memory. Happy ending.
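
One way to package the pattern is a small generator. This is only a sketch: iter_items is a hypothetical helper, not part of the standard library, and it assumes the items are direct children of the root as in the example file.

```python
import io
import xml.etree.ElementTree as ET


def iter_items(source, tag='item'):
    """Yield each completed <tag> element, then release it from the root.

    Hypothetical helper; assumes the elements sit directly under the root.
    """
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # runs when the consumer asks for the next item


# Usage with hypothetical in-memory data; a file path works the same way.
SAMPLE = b"<root>" + b"".join(b"<item>%d</item>" % i for i in range(1000)) + b"</root>"
total = sum(int(item.text) for item in iter_items(io.BytesIO(SAMPLE)))
```

Because root.clear() runs only after the caller has finished with the yielded element, each item should be fully processed inside the loop body rather than stored for later.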


[^1]: If a tag that HTML reserves as a self-closing element, such as <link />, is used in XML, any text inside it is discarded. There was probably a workaround, but as I recall it did not work.

[^2]: It sounds like rooting an Android phone back in the day, which is wonderful.
