[PYTHON] Extract a page from a Wikipedia dump

Wikipedia provides dumps of all of its pages. The amount of data is huge, but an index is provided so that the dump can be handled while still compressed. Let's actually retrieve some data.

Preparation

A description of the dump data is below.

The file sizes are huge, so be careful not to open the decompressed XML in a normal editor or browser.

The data for the Japanese version of Wikipedia is below.

From the May 1, 2020 edition, the latest available at the time of writing, the following two files are used.

  1. jawiki-20200501-pages-articles-multistream.xml.bz2 3.0 GB
  2. jawiki-20200501-pages-articles-multistream-index.txt.bz2 23.9 MB

The first file is the XML body data. It is already 3.0 GB compressed, so it would be enormous if decompressed, but we will not decompress it: the format is designed so that it can be handled while still compressed.

The second file, the index, should be decompressed. It is about 107 MB when expanded.
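For example, it can be decompressed with bzip2 while keeping the original file (assuming the bzip2 command is available):

$ bzip2 -dk jawiki-20200501-pages-articles-multistream-index.txt.bz2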

Specification

The structure of the XML tags in the dump is examined in the following article.

The overall structure is as follows. Each article is stored in a single page tag.

<mediawiki>
    <siteinfo> ⋯ </siteinfo>
    <page> ⋯ </page>
    <page> ⋯ </page>
           ⋮
    <page> ⋯ </page>
</mediawiki>

The bz2 file is not simply the entire XML compressed in one go; it is made up of blocks of 100 pages each. Any block can be located and decompressed on its own. This structure is called **multi-stream**.

(block layout: siteinfo | page × 100 | page × 100 | ⋯)
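To see this concretely, the very first stream (the one holding the header and siteinfo) can be decompressed on its own. This is a minimal sketch, assuming that the first offset listed in the index (690, shown below) is exactly where the header stream ends:

import bz2

with open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb") as f:
    header = f.read(690)  # bytes [0, 690): assumed to be one complete bz2 stream
# The <mediawiki> opening tag and <siteinfo> are expected at the start of this stream
print(bz2.decompress(header).decode("utf-8")[:200])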

Each row of the index has the following structure, where the offset is the byte position in the bz2 file at which the block containing the page starts:

offset:id:title

Check the actual data.

$ head -n 5 jawiki-20200501-pages-articles-multistream-index.txt
690:1:Wikipedia:Upload log April 2004
690:2:Wikipedia:Delete record/Past log December 2002
690:5:Ampersand
690:6:Wikipedia:Sandbox
690:10:language
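Note that titles can themselves contain ":" (for example Wikipedia:Sandbox above), so when parsing these lines only the first two colons should be treated as separators. A minimal sketch (the helper name is illustrative):

def parse_index_line(line):
    # Split on the first two colons only; the title may contain ":" itself
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title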

To know the length of a block starting at 690, you need to know where the next block starts.

$ head -n 101 jawiki-20200501-pages-articles-multistream-index.txt | tail -n 2
690:217:List of musicians(group)
814164:219:List of song titles
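So the block starting at offset 690 is 814164 - 690 = 813474 bytes long.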

Since each line corresponds to one page, counting the lines gives the total number of pages. There are about 2.5 million.

$ wc -l jawiki-20200501-pages-articles-multistream-index.txt
2495246 jawiki-20200501-pages-articles-multistream-index.txt

Extraction

Let's actually extract a specific article. The target is "Qiita".

Information acquisition

Search for "Qiita".

$ grep Qiita jawiki-20200501-pages-articles-multistream-index.txt
2919984762:3691277:Qiita
3081398799:3921935:Template:Qiita tag
3081398799:3921945:Template:Qiita tag/doc

Ignoring the Template pages, we target the first result, id = 3691277.

Each block normally holds 100 pages, but there are exceptions that throw the alignment off, so check the start position of the next block manually.

2919984762:3691305:Category:Gabon's Bilateral Relations
2920110520:3691306:Category:Japan-Cameroon relations
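This lookup can also be automated. Below is a minimal sketch (the function name is illustrative) that scans the index for an exact title match and returns that block's offset together with the offset of the next block; the very last block has no successor, so None is returned and the rest of the file should be read in that case:

def find_block(index_path, target_title):
    # First pass: find the offset of the block containing the target title
    target_offset = None
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, page_id, title = line.rstrip("\n").split(":", 2)
            if title == target_title:
                target_offset = int(offset)
                break
    if target_offset is None:
        return None
    # Second pass: the next block starts at the smallest offset larger than target_offset
    next_offset = None
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset = int(line.split(":", 1)[0])
            if offset > target_offset and (next_offset is None or offset < next_offset):
                next_offset = offset
    return target_offset, next_offset

print(find_block("jawiki-20200501-pages-articles-multistream-index.txt", "Qiita"))
# Expected from the values above: (2919984762, 2920110520)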

You have all the information you need.

id       block offset  next block offset
3691277  2919984762    2920110520

Python

Start Python.

$ python
Python 3.8.2 (default, Apr  8 2020, 14:31:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Open the compressed file.

>>> f = open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb")

Specify the offset to retrieve the block containing the Qiita item.

>>> f.seek(2919984762)
2919984762
>>> block = f.read(2920110520 - 2919984762)

Decompress the block and decode it to get a string.

>>> import bz2
>>> data = bz2.decompress(block)
>>> xml = data.decode(encoding="utf-8")

Check the contents. The string contains 100 page tags.

>>> print(xml)
  <page>
    <title>Category:Mayor of Eniwa</title>
    <ns>14</ns>
    <id>3691165</id>
(Omitted)

It is unwieldy as plain text, so parse it as XML. A single root element is required for parsing, so wrap the string in one.

>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring("<root>" + xml + "</root>")

Check the contents. There are 100 page tags under the root.

>>> len(root)
100
>>> [child.tag for child in root]
['page', 'page',(Omitted), 'page']

Get the page by specifying its id. The argument to find uses XPath notation.

>>> page = root.find("page/[id='3691277']")
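find also accepts a predicate on a child element's text, so the same page can be looked up by title instead of id:

>>> root.find("page[title='Qiita']").find("id").text
'3691277'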

And check the contents.

>>> page.find("title").text
'Qiita'
>>> page.find("revision/text").text[:50]
'{{Infobox Website\n|Site name=Qiita\n|logo=\n|screenshot=\n|Skull'

Save as a file.

>>> tree = ET.ElementTree(page)
>>> tree.write("Qiita.xml", encoding="utf-8")

You will get a file that looks like this:

Qiita.xml


<page>
    <title>Qiita</title>
    <ns>0</ns>
    <id>3691277</id>
    <revision>
      <id>77245770</id>
      <parentid>75514051</parentid>
      <timestamp>2020-04-26T12:21:10Z</timestamp>
      <contributor>
        <username>Linuxmetel</username>
        <id>1613984</id>
      </contributor>
      <comment>Added explanation of Qiita controversy and LGTM stock</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="4507" xml:space="preserve">{{Infobox Website
|Site name=Qiita
(Omitted)
[[Category:Japanese website]]</text>
      <sha1>mtwuh9z42c7j6ku1irgizeww271k4dc</sha1>
    </revision>
  </page>
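Putting the interactive steps above together, the whole extraction can be wrapped in a single function. This is a minimal sketch (the function name and signature are illustrative); it takes the offsets found in the index and writes the page to a file:

import bz2
import xml.etree.ElementTree as ET

def extract_page(bz2_path, offset, next_offset, page_id, out_path):
    # Read and decompress only the block that contains the target page
    with open(bz2_path, "rb") as f:
        f.seek(offset)
        block = f.read(next_offset - offset)
    xml = bz2.decompress(block).decode("utf-8")
    # The block has no single root element, so wrap it before parsing
    root = ET.fromstring("<root>" + xml + "</root>")
    page = root.find(f"page/[id='{page_id}']")
    ET.ElementTree(page).write(out_path, encoding="utf-8")
    return page

extract_page("jawiki-20200501-pages-articles-multistream.xml.bz2",
             2919984762, 2920110520, 3691277, "Qiita.xml")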

Script

I have put the whole flow together in a script.

The index is stored in SQLite and looked up from there.

SQLite

The script converts the index to TSV and generates SQL for ingestion (a rough sketch of the same idea is shown a little further below).

python conv_index.py jawiki-20200501-pages-articles-multistream-index.txt

Three files will be generated.

Import into SQLite.

sqlite3 jawiki.db ".read jawiki-20200501-pages-articles-multistream-index.sql"

You are now ready.
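The author's conv_index.py is not reproduced here, but as a rough illustration of the same idea, the sketch below parses the index and loads it straight into an SQLite table using Python's sqlite3 module instead of going through TSV and generated SQL. The database, table, and column names are illustrative, not necessarily the schema the scripts actually use:

import sqlite3

def iter_index(index_path):
    # Yield (id, title, offset) from each index line; titles may contain ":" themselves
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, page_id, title = line.rstrip("\n").split(":", 2)
            yield int(page_id), title, int(offset)

con = sqlite3.connect("jawiki_sketch.db")  # hypothetical file name
con.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, offset INTEGER)")
con.executemany("INSERT INTO pages VALUES (?, ?, ?)",
                iter_index("jawiki-20200501-pages-articles-multistream-index.txt"))
con.execute("CREATE INDEX pages_title ON pages(title)")
con.commit()
con.close()

The end of a block can then be found with a query such as SELECT MIN(offset) FROM pages WHERE offset > ?, mirroring the manual lookup done earlier.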

How to use

The DB contains only the index, so the xml.bz2 file must be in the same directory. Do not rename the xml.bz2 file, since its name is recorded in the DB.

Specify the DB and an article title, and the result is displayed. By default only the contents of the text tag are output; with -x, everything inside the page tag is output.

python mediawiki.py jawiki.db Qiita
python mediawiki.py -x jawiki.db Qiita

You can output to a file.

python mediawiki.py -o Qiita.txt jawiki.db Qiita
python mediawiki.py -o Qiita.xml -x jawiki.db Qiita

mediawiki.py is designed to be used as a library as well.

import mediawiki
db = mediawiki.DB("jawiki.db")
print(db["Qiita"].text)

Related articles

Articles about multi-stream and the bz2 module.

References

I referred to the Wikipedia index specifications.

I referred to the documentation for the ElementTree XML API.

I investigated how to use SQLite when processing example sentence data.
