There are times when you need the raw text of Wikipedia, for example as sample data for natural language processing. An open-source tool called wikiextractor extracts article bodies from the dump data that Wikipedia provides, so here are my notes on how to use it.
How to Use
First, clone https://github.com/attardi/wikiextractor and copy WikiExtractor.py to your working directory.
python WikiExtractor.py <path_to_the_wikipedia_dump_file>
It seems you can pass a file such as jawiki-latest-pages-articles.xml.bz2 directly, without decompressing the dump first.
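If you want to look inside the compressed dump yourself before running the extractor, Python's standard bz2 module can read it as a stream. The following is a minimal sketch and is not taken from WikiExtractor; `peek_dump` is a hypothetical helper name.

```python
import bz2


def peek_dump(path, max_lines=5):
    """Return the first few lines of a .bz2 dump without unpacking it to disk."""
    lines = []
    # In text mode, bz2.open decompresses on the fly while the file is read.
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            lines.append(line.rstrip("\n"))
            if len(lines) >= max_lines:
                break
    return lines
```

For example, `peek_dump("jawiki-latest-pages-articles.xml.bz2")` would return the opening lines of the XML, which is a quick way to confirm that the download is intact.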
Wikipedia's XML dump file can be downloaded as follows.
curl https://dumps.wikimedia.org/jawiki/latest/jawiki-latest-pages-articles.xml.bz2 -o jawiki-latest-pages-articles.xml.bz2
If you want a dump from a specific date, get it from here: [Wikipedia:データベースダウンロード (Database download)](https://ja.wikipedia.org/wiki/Wikipedia:%E3%83%87%E3%83%BC%E3%82%BF%E3%83%99%E3%83%BC%E3%82%B9%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89)
When you run the extractor, progress for roughly 4 million titles is written to the log output. Processing takes anywhere from tens of minutes to several hours.
A directory called text is created, containing files in the following structure.
/text
├─/AA
│ ├─wiki_00
│ ├─wiki_01
│ :
│ └─wiki_99
├─/AB
│ ├─wiki_00
│ :
│ └─wiki_99
:
├─/AZ
│ ├─wiki_00
│ :
├─/BA
│ ├─wiki_00
: :
Wikipedia articles are listed in alphabetical order (for the Japanese wiki, in Japanese syllabary order), starting from `AA`.
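The naming pattern in the listing above can be sketched as a small function that maps the n-th output file to its path. This is inferred from the listing only, assuming 100 files per directory and two-letter directory names that advance like AA, AB, ..., AZ, BA; `output_path` is a hypothetical helper, and the tool's actual implementation may differ.

```python
def output_path(file_index, root="text", files_per_dir=100):
    """Map a 0-based output file index to a path such as text/AB/wiki_07."""
    # Assumption: each directory holds wiki_00 .. wiki_99 before the next one starts.
    dir_index, file_no = divmod(file_index, files_per_dir)
    # Assumption: directory names count in base 26 using the letters A-Z.
    first = chr(ord("A") + (dir_index // 26) % 26)
    second = chr(ord("A") + dir_index % 26)
    return f"{root}/{first}{second}/wiki_{file_no:02d}"
```

Under these assumptions, index 0 maps to text/AA/wiki_00 and index 100 maps to text/AB/wiki_00.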
Each wiki_XX file is plain text in the following format. It can also be output in JSON format (described later).
<doc id="" revid="" url="" title="">
... (Article Text)
</doc>
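To use the articles in a program, the doc blocks above have to be split back into individual records. The following is a minimal sketch that relies only on the format shown above; `parse_docs` is a hypothetical helper, and it assumes that attribute values contain neither double quotes nor `>`.

```python
import re

# One <doc ...> header line, then the article text, up to the closing </doc>.
DOC_RE = re.compile(r"<doc\s+([^>]*)>\n(.*?)</doc>", re.DOTALL)
# key="value" pairs inside the header.
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_docs(text):
    """Split the contents of one wiki_XX file into a list of article dicts."""
    docs = []
    for header, body in DOC_RE.findall(text):
        doc = dict(ATTR_RE.findall(header))  # id, revid, url, title
        doc["text"] = body.strip()
        docs.append(doc)
    return docs
```

Reading a file with `open("text/AA/wiki_00", encoding="utf-8").read()` and passing the result to `parse_docs` would give one dict per article.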
python WikiExtractor.py <input_path> --processes <process_num> -o <output_path> --json -b <n[KMG]>
- --processes: Specify the number of CPU cores, for example --processes 8, to run the extraction in multiple processes.
- --json: Writes the output as JSON, one article per line.
{"id": "", "revid": "", "url":"", "title": "", "text": "..."}
- -b: By default, each wiki_XX file is split at 1 MB. You can raise the limit by specifying, for example, -b 1G. If you want all of the output in a single file, specify a size larger than the total output, such as -b 5G.
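When --json is used, each line of a wiki_XX file can be loaded on its own. The following is a minimal sketch that assumes only the one-object-per-line format shown above; `iter_articles` is a hypothetical helper name.

```python
import json


def iter_articles(path):
    """Yield one dict per article from a file written with --json."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)
```

Because this is a generator, a large file is read one article at a time, for example `for article in iter_articles("text/AA/wiki_00"): print(article["title"])`.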
When using it in natural language processing, it may be convenient to have one text file.
cat text/*/* > jawiki.txt
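If cat is not available (for example on Windows), the same concatenation can be sketched in Python. `concat_outputs` is a hypothetical helper; it assumes the text/XX/wiki_NN layout shown earlier and that the destination file is outside the output directory.

```python
import shutil
from pathlib import Path


def concat_outputs(root, dest):
    """Concatenate every root/*/wiki_* file into dest, in sorted path order."""
    files = sorted(Path(root).glob("*/wiki_*"))
    with open(dest, "wb") as out:
        for path in files:
            with open(path, "rb") as src:
                # Copy in chunks so large files are not loaded into memory at once.
                shutil.copyfileobj(src, out)
    return len(files)
```

Calling `concat_outputs("text", "jawiki.txt")` would produce the same kind of single file as the cat command above and return the number of files that were merged.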