The full Wiktionary data is too large to work with when investigating a single language, so I created a script that extracts a specified language as a preprocessing step.
This article is part of a series. The scripts for this article are available in the following repository.
https://github.com/7shi/wiktionary-tools
It is wasteful to process the entire dump every time you want to investigate a particular language, so extract that language as a preprocessing step.
The extracted result is a plain text file that can be opened in an editor, so it is easy to handle. Information can be extracted with ordinary text processing, without the special speed-up techniques that were needed before.
Use the dump file of the English Wiktionary.
The dump data is distributed compressed with bzip2. I will use the May 1, 2020 edition, which was available at the time of writing, as-is in its compressed form (it would be about 6 GB when decompressed).
Other date versions can also be used.
https://dumps.wikimedia.org/enwiktionary/
enwiktionary-20200501-pages-articles-multistream.xml.bz2 890.8 MB
An index file is provided alongside the dump, but it is not needed here because the script rebuilds it independently.
You need to put the downloaded xml.bz2 somewhere. Any location will do; here I create a dedicated folder in my home directory.
The script measures the length of each stream while decompressing the data, and in parallel it collects page information (equivalent to the index) and language headings from the decompressed data.
https://github.com/7shi/wiktionary-tools/blob/master/python/tools/db-make.py
It consolidates the fastest methods from the scripts developed in the first article.
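As a rough sketch of the stream handling just described (this is not the actual db-make.py, only the underlying idea): a multistream bzip2 file is a concatenation of independent streams, and `bz2.BZ2Decompressor` stops at the end of each one, exposing the leftover bytes in `.unused_data`, which gives each stream's offset and length.

```python
import bz2

def streams(path):
    # A minimal sketch, not the actual db-make.py: split a multistream
    # bzip2 file into its component streams. BZ2Decompressor stops at the
    # end of each stream; .unused_data holds whatever follows it.
    with open(path, "rb") as f:
        rest = f.read()  # sketch only; a real script should read incrementally
    offset = 0
    while rest:
        decomp = bz2.BZ2Decompressor()
        data = decomp.decompress(rest)
        length = len(rest) - len(decomp.unused_data)
        yield offset, length, data
        offset += length
        rest = decomp.unused_data

for offset, length, data in streams("enwiktionary-20200501-pages-articles-multistream.xml.bz2"):
    print(offset, length, len(data))
```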
Execution result
$ python db-make.py ~/share/wiktionary/enwiktionary-20200501-pages-articles-multistream.xml.bz2
934,033,103 / 934,033,103 | 68,608
checking redirects...
reading language codes...
checking language names...
writing DB files...
Eight files will be generated.
File name | Contents |
---|---|
db-settings.tsv | Settings (file name) |
db-streams.tsv | Stream information (ID, offset, length) |
db-namespaces.tsv | Namespaces (ns tags) |
db-pages.tsv | Page information (ID, stream, namespace, title, redirect) |
db-idlang.tsv | Language IDs contained in each page (by page ID) |
db-langname.tsv | Mapping of language IDs to language names (including aliases) |
db-langcode.tsv | Mapping of language codes to language names |
db-templates.tsv | Embedded templates |
Use the provided SQL to import the data into SQLite. The SQL is simple, so reading it is a quick way to grasp the table structure.
$ sqlite3 enwiktionary.db ".read db.sql"
importing 'db-settings.tsv'...
importing 'db-streams.tsv'...
importing 'db-namespaces.tsv'...
importing 'db-pages.tsv'...
importing 'db-idlang.tsv'...
importing 'db-langname.tsv'...
importing 'db-langcode.tsv'...
importing 'db-templates.tsv'...
The preparation is complete.
Once the data has been imported, the generated db-*.tsv files are no longer needed, but if you want to examine them with commands like grep as well as with SQLite, it is a good idea to keep them.
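As an aside, the import can also be driven from Python instead of the sqlite3 shell. A minimal sketch for one table, assuming a hypothetical `streams` schema (the real column definitions are in the repository's db.sql):

```python
import csv, sqlite3

# A minimal sketch, not the repository's db.sql: import one TSV file into
# SQLite. The column layout of `streams` here is an assumption.
con = sqlite3.connect("enwiktionary.db")
con.execute("CREATE TABLE IF NOT EXISTS streams (id INTEGER, offset INTEGER, length INTEGER)")
with open("db-streams.tsv", newline="", encoding="utf-8") as f:
    con.executemany("INSERT INTO streams VALUES (?, ?, ?)",
                    csv.reader(f, delimiter="\t"))
con.commit()
```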
Here is the idea behind the parallelization.
db-make.py (excerpt)
```python
with concurrent.futures.ProcessPoolExecutor() as executor:
    for pgs, idl in executor.map(getlangs, f(getstreams(target))):
```
`f` and `getstreams` are generators. `executor.map` runs `getlangs` in parallel while behaving like a generator from the main process's point of view. `getstreams` is a step that cannot be parallelized. The data that `f` filters and `yield`s is passed on to `getlangs`. `f` is more than just a filter: it also displays progress information and handles the data that is not passed to `getlangs`.
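To make this pattern concrete, here is a self-contained toy sketch with stand-ins for `getstreams`, `f`, and `getlangs` (the real implementations are in db-make.py):

```python
import concurrent.futures

def getstreams(target):
    # Serial producer; this step cannot be parallelized.
    for i in range(target):
        yield i

def f(streams):
    # Filter that also displays progress and holds some data back.
    for i, s in enumerate(streams, 1):
        print(f"\r{i}", end="")
        if s % 2 == 0:
            yield s

def getlangs(s):
    # CPU-bound work; executed in the worker processes.
    return s, s * s

if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = list(executor.map(getlangs, f(getstreams(10))))
    print()
    print(results)  # [(0, 0), (2, 4), (4, 16), (6, 36), (8, 64)]
```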
The language names can be found in the generated db-langname.tsv.
I have prepared SQL that ranks the languages in descending order of the number of recorded words.
$ sqlite3 enwiktionary.db ".read rank.sql" > rank.tsv
$ head -n 10 rank.tsv
1 English 928987
2 Latin 805426
3 Spanish 668035
4 Italian 559757
5 Russian 394340
6 French 358570
7 Portuguese 282596
8 German 272451
9 Chinese 192619
10 Finnish 176100
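The actual query is in rank.sql. Purely as an illustration, an equivalent run from Python might look like the following; note that the column names `idlang.lang`, `langname.id`, and `langname.name` are assumptions for this sketch, not the repository's real schema.

```python
import sqlite3

# Illustrative only: the real query is in rank.sql, and these column
# names are assumptions about the schema created by db.sql.
con = sqlite3.connect("enwiktionary.db")
query = """
SELECT langname.name, COUNT(*) AS words
FROM idlang JOIN langname ON idlang.lang = langname.id
GROUP BY idlang.lang
ORDER BY words DESC
"""
for rank, (name, words) in enumerate(con.execute(query), 1):
    print(rank, name, words, sep="\t")
```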
I have prepared a script that takes a language name and extracts the corresponding entries into a file named `<language name>.txt`.
https://github.com/7shi/wiktionary-tools/blob/master/python/tools/collect-lang.py
For processing speed, it reads contiguous streams in a single pass and parallelizes the decompression, which makes the script a bit complicated.
The extracted text has page breaks as comments. The title corresponds to the headword.
<!-- <title>title</title> -->
Extract English as an example.
$ time python collect-lang.py enwiktionary.db English
reading positions... 928,987 / 928,987
optimizing... 49,835 -> 6,575
reading streams... 6,575 / 6,575
English: 928,988
Check the number of lines and file size.
$ wc -l English.txt
14461960 English.txt
$ wc --bytes English.txt
452471057 English.txt
English has the largest number of recorded words, but even so, the extracted file is only about 430 MB and can be opened in an editor.
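Thanks to the page-break comments, the extracted file is also easy to process programmatically. A minimal sketch that splits it back into (headword, text) pairs:

```python
import re

# A minimal sketch: split an extracted file such as English.txt into
# (headword, text) pairs using the page-break comments described above.
def pages(path):
    title, lines = None, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            m = re.match(r"<!-- <title>(.*)</title> -->", line)
            if m:
                if title is not None:
                    yield title, "".join(lines)
                title, lines = m.group(1), []
            else:
                lines.append(line)
    if title is not None:
        yield title, "".join(lines)

for title, text in pages("English.txt"):
    print(title, len(text))
```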
It is possible to specify multiple language names.
$ python collect-lang.py enwiktionary.db Arabic Estonian Hebrew Hittite Ido Interlingua Interlingue Novial "Old English" "Old High German" "Old Saxon" Phoenician Vietnamese Volapük Yiddish
reading positions... 143,926 / 143,926
optimizing... 25,073 -> 10,386
reading streams... 10,386 / 10,386
Arabic: 50,380
Estonian: 8,756
Hebrew: 9,845
Hittite: 392
Ido: 19,978
Interlingua: 3,271
Interlingue: 638
Novial: 666
Old English: 10,608
Old High German: 1,434
Old Saxon: 1,999
Phoenician: 129
Vietnamese: 25,588
Volapük: 3,918
Yiddish: 6,324
Recently added constructed languages and reconstructed proto-languages cannot be extracted with the previous script, because Wiktionary stores them differently: these languages have a separate page for each word.
The following script can be used to find out which languages are stored this way.
$ python search-title.py enwiktionary.db
reading `pages`... 6,860,637 / 6,860,637
search-title.tsv is output. The word part is dropped from each page title, and the resulting prefixes are listed in descending order of frequency.
$ grep Appendix search-title.tsv | head -n 5
3492 Appendix:Lojban/
3049 Appendix:Proto-Germanic/
2147 Appendix:Klingon/
1851 Appendix:Quenya/
888 Appendix:Proto-Slavic/
$ grep Reconstruction search-title.tsv | head -n 5
5096 Reconstruction:Proto-Germanic/
3009 Reconstruction:Proto-Slavic/
1841 Reconstruction:Proto-West Germanic/
1724 Reconstruction:Proto-Indo-European/
1451 Reconstruction:Proto-Samic/
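The idea behind this output can be sketched in a few lines: strip the word part after the last "/" from each title and count the remaining prefixes. The `titles` list here is a stand-in for the titles that search-title.py reads from the database.

```python
import collections, re

# Stand-in data; search-title.py reads the real titles from the `pages` table.
titles = ["Appendix:Lojban/jan", "Appendix:Lojban/mi",
          "Reconstruction:Proto-Germanic/*wulfaz"]
counts = collections.Counter(re.sub(r"[^/]*$", "", t) for t in titles if "/" in t)
for prefix, n in counts.most_common():
    print(n, prefix, sep="\t")
```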
I have prepared a script that extracts the pages whose titles match a specified regular expression.
https://github.com/7shi/wiktionary-tools/blob/master/python/tools/collect-title.py
It does not pay particular attention to processing speed, on the assumption that not so many pages will be processed.
Here are some usage examples. The output file name must be specified.
Proto-Indo-European
$ python collect-title.py enwiktionary.db PIE.txt "^Reconstruction:Proto-Indo-European/"
reading `pages`... 6,860,557 / 6,860,557
Sorting...
writing `pages`... 1,726 / 1,726
Toki Pona (artificial language)
$ python collect-title.py enwiktionary.db Toki_Pona.txt "^Appendix:Toki Pona/"
reading `pages`... 6,860,637 / 6,860,637
Sorting...
writing `pages`... 130 / 130
Since the script simply matches titles against a regular expression, it can also extract every page whose title contains a particular language name.
Novial (artificial language)
$ python collect-title.py enwiktionary.db Novial2.txt Novial
reading `pages`... 6,860,557 / 6,860,557
Sorting...
writing `pages`... 148 / 148
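The patterns are ordinary Python regular expressions, and the unanchored Novial example above behaves like a substring search. A quick illustration of the difference between anchored and unanchored patterns:

```python
import re

# Anchored vs. unanchored title patterns (toy titles for illustration).
titles = ["Appendix:Novial/amiko", "Novial", "Wiktionary:About Novial"]
print([t for t in titles if re.search("Novial", t)])             # all three match
print([t for t in titles if re.search("^Appendix:Novial/", t)])  # only the first
```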
I have prepared a template to use as a reference when writing your own scripts. It reads all the data while displaying progress.
$ python db-template.py enwiktionary.db
reading `settings`... 1 / 1
reading `streams`... 68,609 / 68,609
reading `namespaces`... 46 / 46
reading `pages`... 6,860,557 / 6,860,557
reading `idlang`... 6,916,807 / 6,916,807
reading `langname`... 3,978 / 3,978
reading `langcode`... 8,146 / 8,146
reading `templates`... 32,880 / 32,880
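As a rough idea of what such a template does, here is a minimal sketch that reads every row of each table while displaying progress. Only the table names are taken from the output above; the column layouts are whatever db.sql defines.

```python
import sqlite3

# A minimal sketch, not the actual db-template.py: read all rows of each
# table while showing progress. Table names are taken from the output above.
con = sqlite3.connect("enwiktionary.db")
tables = ["settings", "streams", "namespaces", "pages",
          "idlang", "langname", "langcode", "templates"]
for table in tables:
    total, = con.execute(f"SELECT COUNT(*) FROM `{table}`").fetchone()
    for i, row in enumerate(con.execute(f"SELECT * FROM `{table}`"), 1):
        print(f"\rreading `{table}`... {i:,} / {total:,}", end="")
    print()
```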