[Python] Extract text data from XML data of 10GB or more.

The target is the XML dump of the Japanese-language Wikipedia.

I decided to try natural language processing again after a long time, so I downloaded the Japanese-language dump from [Wikipedia: Database download](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89). Rather than asking crawlers to stay away, Wikipedia lets you download dumps. Great... but the download is a single XML file, and of course the file weighs in at over 12GB once decompressed.

$ ll
-rwxrwxrwx 1 k k 12927699165 Apr 12 17:17 xml_jawiki-20200401-pages-articles-multistream.xml*

Running head on it...

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="ja">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <dbname>jawiki</dbname>
    <base>https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8</base>
    <generator>MediaWiki 1.35.0-wmf.25</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">media</namespace>
      <namespace key="-1" case="first-letter">special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Note</namespace>
      <namespace key="2" case="first-letter">user</namespace>
      ...
      <namespace key="2302" case="case-sensitive">Gadget definition</namespace>
      <namespace key="2303" case="case-sensitive">Gadget definition talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>Wikipedia:Upload log April 2004</title>

A little further in, the text data we are after appears. For example:

    <title>Information engineering</title>
    <ns>0</ns>
    <id>63</id>
    <revision>
      <id>76256715</id>
      <parentid>73769903</parentid>
      <timestamp>2020-02-18T20:19:50Z</timestamp>
      <contributor>
        <username>Fuda Juban-dori</username>
        <id>1352763</id>
      </contributor>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="5880" xml:space="preserve">'''Information engineering''' (johokogaku) is the field of [[engineering]] concerned with [[information]]. In terms of nuance, whereas the term [[information science]] centers roughly on the range covered by the word "science", information engineering puts its weight on the "engineering" side; in substance, though, the two rarely differ much (for example, with university faculty and department names, the differences between individual universities matter more than the differences between the names). In Japan, the term owes much to having been a convenient label when [[computer science]] and information-related departments were set up inside faculties of engineering, purely to match the word "engineering" in the faculty name (see the article [[Department of Information Engineering]]).

In English, [[:en:Information engineering|information engineering]] is one of the methods within [[software engineering]] and does not correspond to the Japanese "information engineering". A similar term is [[informatics]].

==Overview==
Here we quote from the department introductions of several universities (not from specialist commentary in the research literature). Information engineering is the academic field concerned with the engineering use of [[information]]&lt;ref&gt;[http://www.ics.keio.ac.jp/dept/concept.html.ja Aims of the Department of Information Engineering] [[Keio University]] Department of Information Engineering&lt;/ref&gt;. It can be called a comprehensive field of engineering covering the generation of information ([[data mining]], [[computer graphics]], etc.), its transmission ([[computer network]]s, etc.), its collection ([[computer vision]], [[search engine]]s, etc.), its storage ([[database]]s, [[data compression]], etc.), and its processing ([[computer engineering]], [[computer science]], [[software engineering]])&lt;ref&gt;[http://www.osakac.ac.jp/dept/p/zyuken/gakka2.html What is information engineering?] [[Osaka Electro-Communication University]] Department of Information Engineering&lt;/ref&gt;. There is also a view of information engineering as the discipline that captures the principles and laws governing physical phenomena, as well as social and economic activity, from the viewpoint of information, and creates methods of automation by converting them into design procedures on a computer; this corresponds to [[computing]] in English&lt;ref&gt;[http://www.ise.shibaura-it.ac.jp/main02.html Curriculum] [[Shibaura Institute of Technology]] Department of Information Engineering&lt;/ref&gt;. In any case, the explanations above are excerpts from university department introductions.

As a learned society dealing with computer science, [[information science]], and information engineering, the [[Association for Computing Machinery|ACM]] was founded early on in the United States; its name translates literally as something like "computing machinery society". By the time the international body, the [[International Federation for Information Processing]], was inaugurated in 1960, the recognition that computers are machines that process information, not merely (numerical) calculation, had spread widely, and Hiroshi Wada, who was also involved in founding the Japanese society&lt;ref&gt;http://museum.ipsj.or.jp/pioneer/h-wada.html&lt;/ref&gt;, gave it the name "[[Information Processing Society of Japan]]", which is how the term [[information processing]] came into use. The [[Institute of Electronics, Information and Communication Engineers]] likewise uses "information" as a word describing this field.

The [[Institution of Professional Engineers, Japan]] has an "Information Engineering" subcommittee; the secondary examination it administers is divided by technical discipline, and the discipline covering computer software is called "Information Engineering"&lt;ref&gt;http://www.engineer.or.jp/c_categories/index02022.html&lt;/ref&gt; ([[Professional Engineer, Information Engineering]]).

*As a department name, the Department of Information Engineering first appeared in 1970, at [[Kyoto University]] ([[Faculty of Engineering]]) and [[Osaka University]] ([[Faculty of Engineering Science]]). In the same year, [[Tokyo Institute of Technology]] established a Department of Information Engineering, [[University of Electro-Communications]] and [[University of Yamanashi]] did likewise, and [[Kanazawa Institute of Technology]] established a Department of Information Processing Engineering.
*As a faculty name, the first was the Faculty of Information Engineering of Kyushu Institute of Technology, established in 1986. Besides the two departments of Intelligent Information Engineering and Electronic Information Engineering, which began accepting students in 1987, the Departments of Control Systems Engineering, Mechanical Systems Engineering, and Biochemical Systems Engineering (currently the Departments of System Creation Information Engineering, Mechanical Information Engineering, and Bioinformatics) make all five departments specialists in information engineering. There is also a [[Faculty of Information Science]] specializing in [[information science]] and a [[Faculty of Informatics]] specializing in [[informatics]]; the first such faculty was established at [[Osaka Institute of Technology]] in 1996.&lt;ref&gt;https://www.oit.ac.jp/is/&lt;/ref&gt;
* &lt;!--Information engineering is often translated as Computer Science.--&gt;&lt;!--←? Computer science is computer science--&gt;The English name of a "[[Department of Information Engineering]]" is often Computer Science. This is because the term Computer Science is overwhelmingly easier to understand in English-speaking countries and is treated as an academic credential, as in a "CS degree", in the IT industry. As of 2007 the ratio using "Information Engineering" is about 8 out of 33. One example that does use the term is the Information Engineering Division of [[Cambridge University]]&lt;ref&gt;[http://www.eng.cam.ac.uk/research/div-f/ CUED - Division F: Information Engineering] Cambridge University&lt;/ref&gt;.
*In graduate schools there are names such as [[Graduate School of Informatics]].
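One wrinkle worth knowing before parsing this with ElementTree: every tag is reported with the document's namespace prefixed in braces, e.g. `{http://www.mediawiki.org/xml/export-0.10/}title`. A tiny helper (my own sketch, not part of the original code) strips that prefix so tags can be compared by their local name:

```python
def localname(tag: str) -> str:
    """Strip the {namespace} prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

print(localname("{http://www.mediawiki.org/xml/export-0.10/}title"))  # -> title
print(localname("title"))  # tags without a namespace pass through unchanged
```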

Ids and contributor information are included too, but what I want first are the title and text. Natural language processing tends to make sense only once the body is longer than a certain length, so building a list of titles and body lengths seems like a good start. Now, to handle huge XML in Python... there was a predecessor: "How to save memory when reading huge XML of several GB or more in Python".

Oh right, the important point is to save memory by keeping nothing in memory beyond the element currently being parsed. Once the working set overflows RAM, processing slows to a crawl, so it also pays to record the processing speed every fixed number of records. Here is the code I arrived at after some trial and error.

Code

```python
import time
import xml.etree.ElementTree as ET  # standard-library ElementTree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace prefixed to every tag

body = "jawiki-20200401-pages-articles"
path = f"../xml_{body}-multistream.xml"
path_w = f"../short_{body}-title.csv"
context = ET.iterparse(path, events=('start', 'end'))

# Skip unnecessary nodes such as the root
_, r0 = next(context)
_, r1 = next(context)
_, r2 = next(context)
print(r2.text)

count = 0
title = ""
start = time.time()
prev = start
with open(path_w, mode='w', encoding="utf-8") as fw:
    for event, elem in context:
        if event == "start" and elem.text:
            txt = elem.text
            tag = elem.tag.replace(NS, "")
            if tag == "title":
                title = txt
            elif tag == "text":
                if len(txt) > 1000:  # keep only bodies longer than 1000 characters
                    x = f"{title},{len(txt)}"
                    fw.write(x + "\n")
                    count += 1
                    if count % 1000 == 0:  # output the processing time every 1000 records
                        now = time.time()
                        print(count, f"elapsed_time = {now - prev}", x)
                        prev = now

        elem.clear()  # release the parsed elem's contents from memory ... ★

print(count)
```

The crux is the line marked ★. Without this clear() call, processing on my PC slowed down sharply from around the 40,000th record.
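The effect of clear() can be seen in a minimal, self-contained sketch (a toy example of mine with synthetic XML, not the Wikipedia dump): each element is released as soon as it has been read, so parsed elements do not pile up no matter how many stream past.

```python
import io
import xml.etree.ElementTree as ET

# Synthetic stand-in for a huge dump: many <page> elements in one stream.
xml = "<root>" + "".join(f"<page><title>t{i}</title></page>" for i in range(1000)) + "</root>"

titles = 0
for event, elem in ET.iterparse(io.BytesIO(xml.encode()), events=("end",)):
    if elem.tag == "title":
        titles += 1
    elem.clear()  # drop text/attributes/children so elements don't accumulate

print(titles)  # 1000
```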

Processing result

636000 elapsed_time = 0.635291576385498 Wikipedia:Request for deletion/Wikipedia:Request for deletion/log/November 18, 2019,2692
637000 elapsed_time = 0.6260190010070801 Wikipedia:Request for deletion/Taku Inoue(Ambiguity avoidance),1112
638000 elapsed_time = 0.6080121994018555 Static web page,7108
639000 elapsed_time = 0.6243636608123779 Bambina(Entertainment agency),2590
640000 elapsed_time = 0.6858398914337158 Yui Wakui,1106
641000 elapsed_time = 0.632981538772583 Tsukinokicho(Ikeda City),6688
642000 elapsed_time = 0.5550098419189453 Simon Sluga,2868
643000 elapsed_time = 1.671999454498291 My Sweet Maiden/Welcome To Our Diabolic Paradise,1955
644000 elapsed_time = 0.6001503467559814 West Koriyama Industrial Park No. 2,1699
645000 elapsed_time = 0.6131837368011475 Template:UEFA U-17 European Championship 2009 Spain National Team,1002
646000 elapsed_time = 0.6151120662689209 Gengoya,1496
647000 elapsed_time = 0.6119420528411865 Tsushima City Higashi Elementary School,2119
647933

Process finished with exit code 0

Even past the 600,000 mark, each batch of 1000 records still takes only around a second or less. Nice. The finished CSV comes out in "title, body length" order.

EU (Ambiguity avoidance),1480
Organism,5850
Geography,2217
Children's culture,3866
Everyday life,1764
Information engineering,2650
Context-free language,2367
Regular language,1830
Natural language,1396
Gouda cheese,1142
theology,1767
Thailand,1792
Newspaper studies,2344
Pharmacy,2338
Nematic LCD,1209
Wikipedia:FAQ,1645
Wikipedia:Keep calm even if the discussion heats up,2269
musician,4380
Mailing list,3687
Wikipedia:FAQ edit,4163
Record,2004
Wikipedia:Image provision request,2714
Wikipedia:Database download,3331
Japanese cartoonist,5433
List of game titles,2730
Shin Takahashi,5675
Buichi Terasawa,3951
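One caveat when reading this CSV back: titles like "Request for deletion/log/November 18, 2019" themselves contain commas, so splitting on the first comma breaks. Splitting once from the right keeps the length intact. A sketch with made-up lines, assuming the "title,length" layout above:

```python
lines = [
    "Wikipedia:Request for deletion/log/November 18, 2019,2692",
    "Information engineering,2650",
]

rows = []
for line in lines:
    # Split from the right: the length field never contains a comma, titles may.
    title, length = line.rsplit(",", 1)
    rows.append((title, int(length)))

print(rows[0])  # ('Wikipedia:Request for deletion/log/November 18, 2019', 2692)
```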

For comparison, I also extracted the titles whose bodies are 1000 characters or less.

1321000 elapsed_time = 0.3331446647644043 Category:JA Gifu Welfare Federation,165
1322000 elapsed_time = 0.19999384880065918 Category:Taiwanese women,165
1323000 elapsed_time = 0.22401905059814453 Togenin,11
1324000 elapsed_time = 0.26406049728393555 Campanian floor,14
1325000 elapsed_time = 0.2306044101715088 Hirugahara River,934
1326000 elapsed_time = 0.25202035903930664 Strombosiaceae,27
1327000 elapsed_time = 0.1840074062347412 Yumekure,23
1328000 elapsed_time = 0.21496915817260742 Wikipedia Library,28
1329000 elapsed_time = 0.22100353240966797 Rate floor,14
1330000 elapsed_time = 0.19901561737060547 Category:Lebanese multi-sport event athlete,95
1331000 elapsed_time = 0.20599126815795898 Spring Yokomachi,12
1332000 elapsed_time = 0.24999594688415527 Ambroise Marie Francoine Joseph Pariso de Beauvois,42
1333000 elapsed_time = 0.20502090454101562 Category:20th Century Northern Irish Actress,273
1333333

Process finished with exit code 0

... Titles containing "Category" evidently serve a different purpose.
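Those rows are easy to set aside later; for instance (a sketch of mine, assuming the same title/length pairs as above):

```python
rows = [
    ("Category:Taiwanese women", 165),
    ("Hirugahara River", 934),
    ("Category:JA Gifu Welfare Federation", 165),
]

# Keep only ordinary article titles; "Category:" pages serve a different purpose.
articles = [(t, n) for t, n in rows if not t.startswith("Category:")]
print(articles)  # [('Hirugahara River', 934)]
```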

All told, I was able to take a crack at XML data holding some 2 million titles in a matter of tens of minutes. Grateful for the dumps.
