We can now search the full text of Wiktionary in about a minute. As an example of extracting meaningful information from it, we will investigate irregular verbs in English.
This is a series of articles.
The scripts for this article are posted in the following repositories.
Since the full dump is too large to grasp all at once, we narrow it down roughly, step by step, as we process it.
First, read the full text and collect the verb information.
Then analyze that information and remove the regular verbs in several steps.
Let's take see as an example and look at how the inflected forms of English verbs are stored.
Verb
see (third-person singular simple present sees, present participle seeing, simple past saw or (dialectal) seen or (dialectal) seent or (dialectal) seed, past participle seen or (dialectal) seent or (dialectal) seed)
Since this is written as running text and includes extra information such as dialectal forms, it seems difficult to parse programmatically.
If you check the wiki source, however, it is in a format that is easier to handle.
====Verb====
{{en-verb|sees|seeing|saw|past2=seen|past2_qual=dialectal|past3=seent|past3_qual=dialectal|past4=seed|past4_qual=dialectal|seen|past_ptc2=seent|past_ptc2_qual=dialectal|past_ptc3=seed|past_ptc3_qual=dialectal}}
It seems that the additional named parameters such as past2= can be ignored. The following information remains:
{{en-verb|sees|seeing|saw|seen}}
The base form can be obtained from the title of the page.
I checked the headings, which are surrounded by equals signs.
There were no cases where the numbers of equals signs on the left and right did not match, but some headings contain extra spaces.
[Example] Spaces are shown as underscores.
== Japanese ==_
(trailing space)
====_Translations_====
(spaces between the equals signs and the title)
Such patterns are only a small fraction of the whole, so the spaces are probably not intentional; but since they do occur, we need to handle them.
The heading level also does not seem to be completely consistent; Verb sometimes appears at a different level:
===Verb===
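As a rough way to cope with both quirks, headings can be matched with a single regex that tolerates extra spaces and any number of equals signs. This is just a sketch of the idea; the actual script below uses separate patterns for the language and part-of-speech levels and strips the captured title.

```python
import re

# Sketch: tolerate spaces around the title and any heading level.
heading = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")

for line in ["== Japanese == ", "====  Translations  ====", "===Verb==="]:
    m = heading.match(line)
    print(len(m[1]), m[2])   # 2 Japanese / 4 Translations / 3 Verb
```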
The processing is implemented in Python.
Extract the title, id, and text of each page:
```python
import bz2, io, re                       # imports used by this and the following snippets
import xml.etree.ElementTree as ET

def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    # A multistream chunk contains bare <page> elements, so wrap them in a root.
    pages = ET.fromstring(f"<pages>{xml}</pages>")
    for page in pages:
        if int(page.find("ns").text) == 0:
            title = page.find("title").text
            id = int(page.find("id").text)
            with io.StringIO(page.find("revision/text").text) as text:
                yield id, title, text
```
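As a small usage sketch, assuming the byte offset and length of one bz2 stream are already known (in practice they would come from the multistream index file that accompanies the dump; the function name here is my own):

```python
def list_stream(dumpfile, pos, length):
    # pos and length identify one bz2 stream inside the multistream dump.
    with open(dumpfile, "rb") as f:
        f.seek(pos)
        data = f.read(length)
    for id, title, text in getpages(data):
        print(id, title)
```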
Split the text into sections by a heading pattern:
```python
def splittext(pattern, text):
    line = next(text, "")
    while line:
        m = pattern.match(line)
        line = next(text, "")
        if m:
            def g():
                nonlocal line
                while line and not pattern.match(line):
                    yield line
                    line = next(text, "")
            yield m[1].strip(), g()
```
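To see what splittext yields, here is a small usage sketch with made-up sample text. Each yielded body is a generator that shares the current line with the outer loop, so it should be consumed (or skipped) before moving on to the next section.

```python
import io, re

sample = io.StringIO(
    "==English==\n"
    "===Verb===\n"
    "{{en-verb|sees|seeing|saw|seen}}\n"
    "==Japanese==\n"
)
pattern = re.compile("==([^=].*)==")
for heading, body in splittext(pattern, sample):
    print(heading, list(body))
# English ['===Verb===\n', '{{en-verb|sees|seeing|saw|seen}}\n']
# Japanese []
```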
Collect the {{en-verb ...}} lines under the ====Verb==== heading in the ==English== section. Headwords that contain spaces or hyphens are excluded as idioms or compound words.
```python
def en_verb(args):
    target, pos, length = args
    with open(target, "rb") as f:
        f.seek(pos)
        bz2data = f.read(length)
    pattern1 = re.compile("==([^=].*)==")
    pattern2 = re.compile("===+([^=].*?)===")
    result = []
    for id, title, text in getpages(bz2data):
        if " " in title or "-" in title: continue
        for lang, text2 in splittext(pattern1, text):
            if lang != "English": continue
            for subsub, text3 in splittext(pattern2, text2):
                if subsub != "Verb": continue
                for line in text3:
                    if line.startswith("{{en-verb"):
                        result.append((id, title, line.strip()))
                        break
    return result
```
Add parallelism and file output to this. I put the whole script below.
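For reference, the parallel part could look roughly like the sketch below; how the (target, pos, length) chunks are built from the multistream index is omitted here, and the details differ from the actual en-verb.py.

```python
import sys
from multiprocessing import Pool

def run(chunks):
    # chunks: a list of (target, pos, length) tuples built from the index file.
    with Pool() as pool, open("en-verb.tsv", "w") as out:
        for i, rows in enumerate(pool.imap(en_verb, chunks), 1):
            for id, title, line in rows:
                out.write(f"{id}\t{title}\t{line}\n")
            print(f"{i:,} / {len(chunks):,}", end="\r", file=sys.stderr)
```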
Run the script and check the number of lines of output data.
Execution result
$ time python en-verb.py
6,861 / 6,861
real 1m50.159s
user 13m53.906s
sys 0m17.438s
$ wc -l en-verb.tsv
31334 en-verb.tsv
Next, let's format the acquired data so that it is easier to process.
Looking at the data, some entries contain links and comments.
$ grep "\[" en-verb.tsv | head -n 2
4109 U-turn {{en-verb|head=[[U]]-[[turn]]|U-turns|U-turning|U-turned}}
5661 read {{en-verb|reads|reading|[[:en:#Etymology_2|read]]|[[:en:#Etymology_2|read]]|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
$ grep '<!--' en-verb.tsv | head -n 2
5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}<!--Sang or sung for preterite, according to AHD.-->
6959 hide {{en-verb|hides|hiding|hided}}<!--not hid, hidden!-->
Remove the link brackets and the comments. For a link such as [[:en:#Etymology_2|read]], only the last of the parts separated by | is kept (in this example, read).
en-verb-2.py
```python
import re, sys

pattern1 = re.compile(r"\[\[(.+?)\]\]")
pattern2 = re.compile(r"<!--(.*?)-->")

for line in sys.stdin:
    # Replace each [[...]] link with the last of its |-separated parts.
    while (m := pattern1.search(line)):
        data = m[1].split("|")
        line = line[:m.start()] + data[-1] + line[m.end():]
    # Strip HTML comments.
    sys.stdout.write(pattern2.sub("", line))
```
Execution result
$ python en-verb-2.py < en-verb.tsv > en-verb-2.tsv
$ diff -U 0 en-verb.tsv en-verb-2.tsv | head -n 8
--- en-verb.tsv 2020-06-12 21:22:01.943343500 +0900
+++ en-verb-2.tsv 2020-06-12 21:22:54.506793900 +0900
@@ -747 +747 @@
-5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}<!--Sang or sung for preterite, according to AHD.-->
+5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}
@@ -763 +763 @@
-5661 read {{en-verb|reads|reading|[[:en:#Etymology_2|read]]|[[:en:#Etymology_2|read]]|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
+5661 read {{en-verb|reads|reading|read|read|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
Additional information such as past_ptc2=readen for read is needed if you are looking at dialects or older material, but this time we remove it. Anything that follows the closing }} is also removed.
en-verb-3.py
```python
import re, sys

pattern1 = re.compile(r"{{(.*?)}}")
pattern2 = re.compile(r"[a-z0-9_]+?=")

for line in sys.stdin:
    if (m := pattern1.search(line)):
        data = [d for d in m[1].split("|") if not pattern2.match(d)]
        line = line[:m.start()] + "{{" + "|".join(data) + "}}\n"
    sys.stdout.write(line)
```
Execution result
$ python en-verb-3.py < en-verb-2.tsv > en-verb-3.tsv
$ diff -U 0 en-verb-2.tsv en-verb-3.tsv | head -n 8
--- en-verb-2.tsv 2020-06-12 21:22:54.506793900 +0900
+++ en-verb-3.tsv 2020-06-12 21:25:39.197882200 +0900
@@ -14 +14 @@
-71 crow {{en-verb|crows|crowing|crowed|past2=crew|past2_qual=UK|crowed|past_ptc2=crown|past_ptc2_qual=archaic}}
+71 crow {{en-verb|crows|crowing|crowed|crowed}}
@@ -19 +19 @@
-114 may {{en-verb|may|-|might|-|past_ptc2=mought|past_ptc2_qual=obsolete}}
+114 may {{en-verb|may|-|might|-}}
Since the purpose this time is to investigate irregular verbs, we will exclude the regular ones.
Let's check how regular verbs are written, using the data generated so far.
2157 open {{en-verb}}
46912 like {{en-verb|lik}}
58007 wish {{en-verb|es}}
60426 hone {{en-verb|hon|es}}
34295 chop {{en-verb|chop|p|ing}}
39760 compel {{en-verb|compel|l|ed}}
This seems to mean the following. A hyphen is a delimiter between the stem and the ending.
| Word | 3rd person singular | Present participle | Past / past participle | Interpretation |
|---|---|---|---|---|
| open | open-s | open-ing | open-ed | no inflection information |
| like | like-s | lik-ing | lik-ed | stem lik for the last two forms |
| wish | wish-es | wish-ing | wish-ed | -es for the 3rd person singular |
| hone | hon-es | hon-ing | hon-ed | stem hon plus -es for the 3rd person singular |
| chop | chop-s | chop-p-ing | chop-p-ed | doubled p, third argument ing |
| compel | compel-s | compel-l-ing | compel-l-ed | doubled l, third argument ed |
Like and hone follow the same pattern, and both notations give the same result. The hyphen positions differ, but they have no linguistic significance; they exist only for the template that generates the inflected forms.
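To make this interpretation concrete, here is a sketch that expands the shorthand into full forms. It covers only the patterns in the table above and reflects my reading of the notation, not the actual Wiktionary template code.

```python
def expand(word, args):
    # Expand {{en-verb|...}} shorthand into (3rd sing., -ing form, -ed form).
    # Only the regular-verb patterns from the table above are covered.
    if not args:                                    # {{en-verb}}        -> open
        return word + "s", word + "ing", word + "ed"
    if len(args) == 1 and args[0] == "es":          # {{en-verb|es}}     -> wish
        return word + "es", word + "ing", word + "ed"
    if len(args) == 1:                              # {{en-verb|lik}}    -> like
        return word + "s", args[0] + "ing", args[0] + "ed"
    if len(args) == 2 and args[1] == "es":          # {{en-verb|hon|es}} -> hone
        return args[0] + "es", args[0] + "ing", args[0] + "ed"
    if len(args) == 3 and args[2] in ("ing", "ed"): # {{en-verb|chop|p|ing}}
        stem = args[0] + args[1]
        return word + "s", stem + "ing", stem + "ed"
    return tuple(args)  # otherwise the arguments are the explicit full forms

print(expand("chop", ["chop", "p", "ing"]))  # ('chops', 'chopping', 'chopped')
```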
Remove entries with no arguments, with at most two arguments, or whose third argument is ing or ed, since these are regular. Also strip the leading {{en-verb| and the trailing }}, and separate the remaining fields with tabs.
en-verb-4.py
```python
import re, sys

pattern = re.compile(r"{{en-verb\|(.*?)\|*}}")

for line in sys.stdin:
    id, verb, forms = line.split("\t")
    if (m := pattern.match(forms)):
        forms = m[1].split("|")
        if len(forms) > 2 and forms[2] != "ing" and forms[2] != "ed":
            forms = "\t".join(forms)
            print(f"{id}\t{verb}\t{forms}")
```
Execution result
$ python en-verb-4.py < en-verb-3.tsv > en-verb-4.tsv
$ head -n 5 en-verb-4.tsv
71 crow crows crowing crowed crowed
112 march marches marching marched
114 may may - might -
167 swop swops swopping swopped
180 deal deals dealing dealt
$ wc -l en-verb-4.tsv
5296 en-verb-4.tsv
This narrowed things down to some extent, but many entries remain.
Looking at the data, the notation for specifying the stem is not unified, and regular verbs are still mixed in.
71 crow crows crowing crowed crowed
167 swop swops swopping swopped
Ignoring the third-person singular, the present participle and the other extra information, we exclude as regular verbs the entries that meet the conditions implemented in the script below.
There are rules about which final consonants are doubled, but we do not need them here, so I will not go into them.
Words with apostrophes are judged with the apostrophes removed.
127005 F F's|F'ing|F'ed
Entries whose past tense and past participle are both undefined (-) are also removed.
en-verb-5.py
```python
import sys

for line in sys.stdin:
    id, *forms = line.strip().split("\t")
    # If the past participle column is missing, it equals the past form.
    if len(forms) == 4: forms.append(forms[3])
    verb, _, _, past, pp = [f.replace("'", "").replace("-", "") for f in forms]
    if past == pp:
        if not past: continue
        if past.endswith("ed"):
            if verb.endswith("e") and verb + "d" == past: continue
            if verb.endswith("y") and verb[:-1] + "ied" == past: continue
            if verb + "ed" == past: continue
            if verb + verb[-1] + "ed" == past: continue
            if verb[-1] == "c" and verb + "ked" == past: continue
    forms = "\t".join(forms)
    print(f"{id}\t{forms}")
```
Execution result
$ python en-verb-5.py < en-verb-4.tsv > en-verb-5.tsv
$ wc -l en-verb-5.tsv
1461 en-verb-5.tsv
Looking at the data, the same word sometimes appears more than once.
5438 think thinks thinking thought thought
5438 think thinks thinking thought thought
In addition, there are compound words that follow the same inflection pattern as their base verb.
5664 draw draws drawing drew drawn
7404 overdraw overdraws overdrawing overdrew overdrawn
7761 withdraw withdraws withdrawing withdrew withdrawn
Merge identical words into one, and where a compound shares its pattern with a shorter word, keep only the shorter one.
en-verb-6.py
```python
import sys

verbs = {}
for line in sys.stdin:
    id, verb, *forms = line.strip().split("\t")
    if verb in verbs: continue
    verbs[verb] = (id, forms)

for v1, (id, forms) in verbs.items():
    contains = False
    for v2, (_, f2) in verbs.items():
        if v1 != v2 and v1.endswith(v2):
            c = True
            for a, b in zip(forms, f2):
                if not a.endswith(b):
                    c = False
                    break
            if c:
                contains = True
                break
    if not contains:
        forms = "\t".join(forms)
        print(f"{id}\t{v1}\t{forms}")
```
Execution result
$ python en-verb-6.py < en-verb-5.tsv > en-verb-6.tsv
$ wc -l en-verb-6.tsv
378 en-verb-6.tsv
This narrowed it down considerably; from here it is realistic to examine the remaining entries individually.
I have posted the resulting file. Some issues remain, but fixing them would require modifying the original data.
By applying this method, you can investigate languages other than English. Of course, Wiktionary is not the only source of information, but it is a good place to start because you can gather the information as soon as you have written a program.
When learning a new language, this may be useful for creating self-study materials.
If I try something, I will add it here.