[PYTHON] How to use WikiExtractor.py

Sometimes you need Wikipedia's raw text, for example as a training sample for natural language processing. wikiextractor is an open-source tool that extracts article bodies from the dump data Wikipedia provides, so here are my notes on how to use it.

How to Use

First, clone https://github.com/attardi/wikiextractor and copy WikiExtractor.py to your working directory.

Basic

python WikiExtractor.py <path_to_the_wikipedia_dump_file>

You can pass a compressed dump such as jawiki-latest-pages-articles.xml.bz2 directly, without decompressing it first.
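If you want to check a dump before running the extractor, Python's standard bz2 module can also read it as a stream without unpacking it to disk. This is a minimal sketch; `peek_dump` is a hypothetical helper name and is not part of wikiextractor.

```python
import bz2
from itertools import islice


def peek_dump(path, n_lines=5):
    """Return the first n_lines of a .bz2 dump without unpacking it to disk."""
    # In text mode, bz2.open decompresses lazily as lines are read,
    # so only the beginning of the file is actually processed.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]
```

Calling `peek_dump("jawiki-latest-pages-articles.xml.bz2")` should return the opening lines of the XML, which is a quick way to confirm the download is not corrupted.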

Wikipedia's XML dump file

curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2

If you want a dump from a specific date rather than the latest one, get it from here: [Wikipedia: Database download](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)

Progress for about 4 million titles is logged to the console as they are processed. Processing takes anywhere from tens of minutes to several hours.

Output format

A directory called text is created, and files are created with the following structure.

/text
├─/AA
│ ├─wiki_00
│ ├─wiki_01
│ :
│ └─wiki_99
├─/AB
│ ├─wiki_00
│ :
│ └─wiki_99
:
├─/AZ
│ ├─wiki_00
│ :
├─/BA
│ ├─wiki_00
: :

Wikipedia articles are listed in alphabetical order (the Japanese Wikipedia in its own alphabetical order), starting from `AA`.

Each wiki_XX file is plain text in the following format. JSON output is also available (described later).

    <doc id="" revid="" url="" title="">
    ... (Article Text)
    </doc>
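To work with this format from Python, the `<doc>` blocks can be split with a regular expression. The sketch below is only an illustration: `parse_docs` is a hypothetical helper, and it assumes attribute values contain no `"` or `>` characters and that article text never contains a literal `</doc>`.

```python
import re

# One <doc ...> ... </doc> block; the attribute string and the body are captured.
DOC_RE = re.compile(r"<doc\s+([^>]*)>\n?(.*?)</doc>", re.DOTALL)
# A single name="value" attribute pair.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(text):
    """Yield one dict per <doc> block, holding its attributes and body text."""
    for attrs, body in DOC_RE.findall(text):
        record = dict(ATTR_RE.findall(attrs))
        record["text"] = body.strip()
        yield record
```

Each wiki_XX file is about 1MB by default, so reading one file into memory and passing its contents to `parse_docs` is usually fine.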

Recommended options

python WikiExtractor.py <input_path> --processes <process_num> -o <output_path> --json -b <n[KMG]>

--processes: The number of worker processes, for example --processes 8. Set it to your number of CPU cores to process in parallel.

--json: Writes the output as JSON, one article per line.

{"id": "", "revid": "", "url":"", "title": "", "text": "..."}
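Since each line is a standalone JSON object, a file written with --json can be read line by line. A minimal sketch, assuming the keys shown above; `read_json_lines` is a hypothetical helper name.

```python
import json


def read_json_lines(path):
    """Yield one dict per non-empty line of a wiki_XX file written with --json."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

For example, `[doc["title"] for doc in read_json_lines("text/AA/wiki_00")]` collects the titles in one output file.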

-b: By default, each wiki_XX file is split at 1MB. You can raise the limit with something like -b 1G. If you want everything in a single file, specify a size larger than the total output, such as -b 5G.
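If you run the extractor from a script, the recommended options can be assembled in Python. This is only a sketch using the flags described above; `build_command` is a hypothetical helper, and it assumes WikiExtractor.py sits in the current working directory.

```python
import os
import sys


def build_command(input_path, output_path, bytes_per_file="1M", processes=None):
    """Build the argument list for the recommended WikiExtractor.py invocation."""
    if processes is None:
        # Fall back to one process if the core count cannot be determined.
        processes = os.cpu_count() or 1
    return [
        sys.executable, "WikiExtractor.py", str(input_path),
        "--processes", str(processes),
        "-o", str(output_path),
        "--json",
        "-b", bytes_per_file,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids shell quoting problems with paths that contain spaces.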

Combining the output into one file

When using it in natural language processing, it may be convenient to have one text file.

cat text/*/* > jawiki.txt
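On systems without cat, the same concatenation can be done in Python. A minimal sketch, assuming the text/XX/wiki_NN layout described earlier; `concat_output` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def concat_output(text_dir, dest):
    """Concatenate every file under text_dir/*/ into dest; return the file count."""
    # Sorting keeps the order AA/wiki_00, AA/wiki_01, ..., AB/wiki_00, ...
    files = sorted(p for p in Path(text_dir).glob("*/*") if p.is_file())
    with open(dest, "wb") as out:
        for path in files:
            with open(path, "rb") as src:
                # Copy in chunks so large files are never fully held in memory.
                shutil.copyfileobj(src, out)
    return len(files)
```

Calling `concat_output("text", "jawiki.txt")` gives the same result as the cat command above. Write the destination outside the text directory so it is not picked up as an input.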
