[PYTHON] How to use WikiExtractor.py

Sometimes you need Wikipedia's raw text, for example as a training sample for natural language processing. There is an open-source tool called wikiextractor that extracts article bodies from the dump data Wikipedia provides, so here are my notes on how to use it.

How to Use

First, clone https://github.com/attardi/wikiextractor and copy WikiExtractor.py to your working directory.

Basic

python WikiExtractor.py <path_to_the_wikipedia_dump_file>

It seems you can pass a compressed dump such as jawiki-latest-pages-articles.xml.bz2 directly, without decompressing it first.
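As a rough illustration of why decompressing first is unnecessary, Python's standard bz2 module can stream a compressed file line by line. The sketch below only peeks at the first few lines of a dump. The function name peek_dump is made up for this note, and WikiExtractor's own internals may work differently.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a .bz2 file, decompressing only what is read."""
    # "rt" opens the compressed stream as text, so lines come out as str.
    with bz2.open(path, mode="rt", encoding="utf-8") as stream:
        return [line.rstrip("\n") for line in islice(stream, n_lines)]
```

Calling peek_dump("jawiki-latest-pages-articles.xml.bz2") is a quick way to confirm that the download is a readable XML dump before starting a long extraction run.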

Wikipedia's XML dump file can be downloaded as follows.

curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2
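If you would rather fetch the dump from a script, the same download can be sketched with only the standard library. The helper name download and the 1 MiB chunk size are my own choices, not part of wikiextractor. The URL is the one used in the curl command above.

```python
import shutil
import urllib.request

DUMP_URL = "https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2"


def download(url, dest, chunk_size=1024 * 1024):
    """Stream url to dest in chunks so the dump never has to fit in memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return dest
```

Running download(DUMP_URL, "jawiki-latest-pages-articles.xml.bz2") should leave the same file that curl produces. There is no retry or resume logic here, so for a file this large curl may still be the more practical choice.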

If you want a dump from a specific date, fetch it from here: [Wikipedia: Database download](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)

While the extractor runs, the processing status of about 4 million titles is written to the log output. Processing takes anywhere from tens of minutes to several hours.

Output format

A directory called text is created, and files are created with the following structure.

/text
├─/AA
│ ├─wiki_00
│ ├─wiki_01
│ :
│ └─wiki_99
├─/AB
│ ├─wiki_00
│ :
│ └─wiki_99
:
├─/AZ
│ ├─wiki_00
│ :
├─/BA
│ ├─wiki_00
: :

Wikipedia articles are stored in order starting from `AA`, with the directories named alphabetically (`AA`, `AB`, and so on).
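To walk the output in that same order from Python, a minimal sketch is to glob one directory level down and sort the paths. The name list_output_files is hypothetical, and the default root "text" assumes the default output directory described above.

```python
from pathlib import Path


def list_output_files(root="text"):
    """Return every wiki_XX file under root, sorted by directory and then file name."""
    # Path objects sort component by component, so AA/wiki_99 comes before AB/wiki_00.
    return sorted(p for p in Path(root).glob("*/wiki_*") if p.is_file())
```

Each returned path can then be opened and processed one file at a time, which keeps memory use low compared with loading the whole corpus at once.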

Each wiki_XX file is plain text in the following format. The output can also be written as JSON (described later).

    <doc id="" revid="" url="" title="">
    ... (Article Text)
    </doc>
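A file in this format can be split back into articles with a couple of regular expressions. This is only a sketch: parse_docs is a name I made up, and it assumes that attribute values never contain a double quote or ">" and that article text never contains a literal "</doc>".

```python
import re

# One <doc ...> header, then everything up to the next </doc>.
DOC_RE = re.compile(r"<doc\b([^>]*)>\n?(.*?)</doc>", re.DOTALL)
# name="value" pairs inside the header.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(raw):
    """Yield one dict per <doc> block: its header attributes plus the body text."""
    for header, body in DOC_RE.findall(raw):
        doc = dict(ATTR_RE.findall(header))
        doc["text"] = body.strip()
        yield doc
```

For example, list(parse_docs(open("text/AA/wiki_00", encoding="utf-8").read())) gives a list of dicts with the keys id, revid, url, title and text. If those assumptions worry you, the --json option described below avoids the parsing step entirely.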

Recommended options

python WikiExtractor.py <input_path> --processes <process_num> -o <output_path> --json -b <n[KMG]>

--processes: Sets the number of worker processes, for example --processes 8. Specify the number of CPU cores to process the dump in parallel.

--json: Writes each article as one JSON object per line.

{"id": "", "revid": "", "url":"", "title": "", "text": "..."}

-b: By default, each wiki_XX file is split at 1MB. You can raise the limit by specifying, for example, -b 1G. If you want all output in a single file, specify a size larger than the total output, such as -b 5G.
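With --json, each line of a wiki_XX file should be a standalone JSON object with the keys shown above, so it can be read with the standard json module. A minimal sketch (read_json_docs is a hypothetical helper name):

```python
import json


def read_json_docs(path):
    """Yield one dict per non-empty line of a file written with --json."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)
```

Because this is a generator, a loop such as for doc in read_json_docs("text/AA/wiki_00") only holds one article in memory at a time, which matters if you used -b to produce one very large file.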

Outputting to a single file

When using it in natural language processing, it may be convenient to have one text file.

cat text/*/* > jawiki.txt
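On a system without cat, or when the glob expands to too many arguments for the shell, roughly the same result can be had from Python. This sketch assumes the output directory layout described above. The name concatenate is my own, and the destination file must live outside the output directory.

```python
import shutil
from pathlib import Path


def concatenate(root, dest):
    """Append every file one level below root to dest, in sorted path order."""
    files = sorted(p for p in Path(root).glob("*/*") if p.is_file())
    with open(dest, "wb") as out:
        for path in files:
            # Copy raw bytes so no decoding or re-encoding is needed.
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(files)
```

Calling concatenate("text", "jawiki.txt") writes the combined file and returns the number of files it merged.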

References

https://github.com/attardi/wikiextractor
http://ankaji92.hatenablog.com/entry/2016/11/27/212507