[PYTHON] Extract specific languages from Wiktionary

The full Wiktionary dump is too large to work with when investigating a particular language, so I wrote a script that extracts the specified language as a preprocessing step.

This is a series of articles.

  1. Search for efficient Wiktionary processing
  2. Compare Wiktionary processing speed between F# and Python
  3. Get the language code of Wiktionary
  4. Extract a specific language from Wiktionary ← This article
  5. Investigate English Irregular Verbs in Wiktionary

The scripts for this article are available in the following repositories.

Overview

It is wasteful to process the entire text just to investigate a particular language, so we extract the target language as a preprocessing step.

The result is a plain text file that can be opened in an editor, so it is easy to work with. Information can be extracted with ordinary text processing, without the special speed-up techniques used in the earlier articles.

Preparation

Use the dump file of the English edition of Wiktionary.

The dump data is distributed compressed with bzip2. I use the May 1, 2020 edition, the latest available at the time of writing, without decompressing it beforehand. (Decompressed, it would be about 6 GB.)

Place the downloaded xml.bz2 file somewhere. Any location works, but here I create a dedicated folder under my home directory.

The script records the length of each stream as it decompresses the data, and in parallel collects page information (which serves as an index) and language headings from the decompressed text.
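The "multistream" dump is a concatenation of independent bz2 streams, and each stream's offset and length can be recorded as it is decompressed. Here is a minimal sketch of that idea using `bz2.BZ2Decompressor`, which stops at the end of one stream and keeps the remaining bytes in `unused_data`. The two in-memory streams are stand-ins; the real dump is read from the xml.bz2 file.

```python
import bz2

# Build a small two-stream blob in memory. This mimics (as an assumption)
# how the multistream dump concatenates independent bz2 streams.
blob = bz2.compress(b"stream one\n") + bz2.compress(b"stream two\n")

offset = 0
streams = []  # (offset, length, decompressed data)
while offset < len(blob):
    # A fresh decompressor stops at the end of the current stream.
    dec = bz2.BZ2Decompressor()
    data = dec.decompress(blob[offset:])
    # Bytes not consumed belong to the next stream.
    length = len(blob) - offset - len(dec.unused_data)
    streams.append((offset, length, data))
    offset += length
```

Recording `(offset, length)` pairs this way is what later allows a single stream to be decompressed directly, without rereading the whole file.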

Execution result


$ python db-make.py ~/share/wiktionary/enwiktionary-20200501-pages-articles-multistream.xml.bz2
934,033,103 / 934,033,103 | 68,608
checking redirects...
reading language codes...
checking language names...
writing DB files...

Eight files will be generated.

File name          Contents
db-settings.tsv    Settings (file name)
db-streams.tsv     Stream information (ID, offset, length)
db-namespaces.tsv  Namespaces (ns tags)
db-pages.tsv       Page information (ID, stream, namespace, title, redirect)
db-idlang.tsv      Languages (IDs) contained in each page (ID)
db-langname.tsv    Mapping of language ID to language name (including aliases)
db-langcode.tsv    Mapping of language code to language name
db-templates.tsv   Embedded templates

Use the prepared SQL file to load the data into SQLite. The SQL is simple, so reading it is a quick way to see the table structure.

$ sqlite3 enwiktionary.db ".read db.sql"
importing 'db-settings.tsv'...
importing 'db-streams.tsv'...
importing 'db-namespaces.tsv'...
importing 'db-pages.tsv'...
importing 'db-idlang.tsv'...
importing 'db-langname.tsv'...
importing 'db-langcode.tsv'...
importing 'db-templates.tsv'...

The preparation is complete.

Once the data has been loaded, the generated db-*.tsv files are no longer needed, but if you plan to inspect the data with commands like grep as well as SQLite, it is a good idea to keep them.
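The same kind of TSV loading can also be done from Python with the standard `sqlite3` and `csv` modules. This is only a sketch with a hypothetical `streams` table and made-up values; the actual schema lives in db.sql.

```python
import csv
import io
import sqlite3

# A stand-in for db-streams.tsv with hypothetical values (ID, offset, length).
tsv = "1\t0\t1024\n2\t1024\t2048\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE streams (id INTEGER, offset INTEGER, length INTEGER)")
# csv.reader with a tab delimiter yields one row per TSV line;
# SQLite's type affinity converts the numeric strings to integers.
conn.executemany(
    "INSERT INTO streams VALUES (?, ?, ?)",
    csv.reader(io.StringIO(tsv), delimiter="\t"),
)
count = conn.execute("SELECT COUNT(*) FROM streams").fetchone()[0]
```

For real files, replacing `io.StringIO(tsv)` with `open("db-streams.tsv")` is enough, since `executemany` accepts any iterable of rows.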

Parallelization and generator

This section introduces the approach to parallelization.

db-make.py (excerpt)


    with concurrent.futures.ProcessPoolExecutor() as executor:
        for pgs, idl in executor.map(getlangs, f(getstreams(target))):

`f` and `getstreams` are generators. `executor.map` parallelizes `getlangs` and behaves like a generator from the main process's point of view.

`getstreams` is a step that cannot be parallelized. Data filtered and yielded by `f` is passed to `getlangs`. `f` is more than just a filter: it also displays progress information and handles the data that is not passed on to `getlangs`.

Language name

The language names can be found in the generated db-langname.tsv.

I have prepared SQL that ranks the languages in descending order of the number of recorded words.

$ sqlite3 enwiktionary.db ".read rank.sql" > rank.tsv
$ head -n 10 rank.tsv
1       English 928987
2       Latin   805426
3       Spanish 668035
4       Italian 559757
5       Russian 394340
6       French  358570
7       Portuguese      282596
8       German  272451
9       Chinese 192619
10      Finnish 176100

Language extraction

I have prepared a script that takes a language name and extracts that language into a file named <language name>.txt.

In the extracted text, page boundaries are marked with comments. The title corresponds to the headword.

<!-- <title>title</title> -->

Extract English as an example.

$ time python collect-lang.py enwiktionary.db English
reading positions... 928,987 / 928,987
optimizing... 49,835 -> 6,575
reading streams... 6,575 / 6,575
English: 928,988

Check the number of lines and file size.

$ wc -l English.txt
14461960 English.txt
$ wc --bytes English.txt
452471057 English.txt

English has the largest number of recorded words, but even so, the extracted file is only about 430 MB and can be opened in an editor.

Multiple language names can be specified at once.

$ python collect-lang.py enwiktionary.db Arabic Estonian Hebrew Hittite Ido Interlingua Interlingue Novial "Old English" "Old High German" "Old Saxon" Phoenician Vietnamese Volapük Yiddish
reading positions... 143,926 / 143,926
optimizing... 25,073 -> 10,386
reading streams... 10,386 / 10,386
Arabic: 50,380
Estonian: 8,756
Hebrew: 9,845
Hittite: 392
Ido: 19,978
Interlingua: 3,271
Interlingue: 638
Novial: 666
Old English: 10,608
Old High German: 1,434
Old Saxon: 1,999
Phoenician: 129
Vietnamese: 25,588
Volapük: 3,918
Yiddish: 6,324

Separately stored languages

Newly added constructed languages and reconstructed proto-languages cannot be extracted with the previous script, because Wiktionary stores them differently.

In these languages, each word has its own dedicated page.

Use the following script to find out which languages are stored this way.

$ python search-title.py enwiktionary.db
reading `pages`... 6,860,637 / 6,860,637

search-title.tsv is produced. The word part is dropped from each page title, and the remaining prefixes are sorted in descending order of frequency.

$ grep Appendix search-title.tsv | head -n 5
3492    Appendix:Lojban/
3049    Appendix:Proto-Germanic/
2147    Appendix:Klingon/
1851    Appendix:Quenya/
888     Appendix:Proto-Slavic/
$ grep Reconstruction search-title.tsv | head -n 5
5096    Reconstruction:Proto-Germanic/
3009    Reconstruction:Proto-Slavic/
1841    Reconstruction:Proto-West Germanic/
1724    Reconstruction:Proto-Indo-European/
1451    Reconstruction:Proto-Samic/
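The prefix counting behind search-title.tsv can be sketched with `collections.Counter`. The titles here are made up for illustration; the real ones come from the `pages` table.

```python
from collections import Counter

# Made-up page titles; real titles come from the `pages` table.
titles = [
    "Appendix:Lojban/mi",
    "Appendix:Lojban/do",
    "Reconstruction:Proto-Germanic/*hundą",
    "cat",
]

# Drop the word (the last path component) and count the remaining prefixes.
counts = Counter(t.rsplit("/", 1)[0] + "/" for t in titles if "/" in t)
ranking = counts.most_common()  # prefixes in descending order of frequency
```

Plain entries like "cat" have no prefix and are skipped, which is why only Appendix: and Reconstruction: pages show up in the output above.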

I also prepared a script that selects pages whose titles match a regular expression and extracts them.

Here are some usage examples. The output file name must be specified.

Proto-Indo-European


$ python collect-title.py enwiktionary.db PIE.txt "^Reconstruction:Proto-Indo-European/"
reading `pages`... 6,860,557 / 6,860,557
sorting...
writing `pages`... 1,726 / 1,726

Toki Pona (constructed language)


$ python collect-title.py enwiktionary.db Toki_Pona.txt "^Appendix:Toki Pona/"
reading `pages`... 6,860,637 / 6,860,637
sorting...
writing `pages`... 130 / 130

Since the script simply matches regular expressions, it can also extract every page whose title contains a particular language name.

Novial (constructed language)


$ python collect-title.py enwiktionary.db Novial2.txt Novial
reading `pages`... 6,860,557 / 6,860,557
sorting...
writing `pages`... 148 / 148

Script template

I have prepared a template as a reference for writing your own scripts. It reads all the data while displaying progress.

$ python db-template.py enwiktionary.db
reading `settings`... 1 / 1
reading `streams`... 68,609 / 68,609
reading `namespaces`... 46 / 46
reading `pages`... 6,860,557 / 6,860,557
reading `idlang`... 6,916,807 / 6,916,807
reading `langname`... 3,978 / 3,978
reading `langcode`... 8,146 / 8,146
reading `templates`... 32,880 / 32,880
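The single-line progress counters seen throughout these logs can be produced with thousands-separator formatting and a carriage return. This is a small sketch of that display idiom, not the template's actual code.

```python
import sys

def progress_line(label, done, total):
    # Format counters with thousands separators, as in the logs above.
    return f"reading `{label}`... {done:,} / {total:,}"

def show_progress(label, done, total):
    # "\r" moves the cursor back to the start of the line, so repeated
    # calls overwrite the same console line instead of scrolling.
    sys.stdout.write("\r" + progress_line(label, done, total))
    sys.stdout.flush()
```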
