We can now search the full text of Wiktionary in about a minute. As an example of extracting meaningful information from it, we will investigate irregular verbs in English.
This is a series of articles.
The scripts for this article are posted in the following repositories.
Since the full dump is too large to grasp all at once, we narrow it down roughly, step by step, as we process it.
First, read the full text and collect the verb information.
Then analyze that information and remove the regular verbs in several steps.
Let's take see as an example and look at how the inflected forms of English verbs are stored.
Verb
see (third-person singular simple present sees, present participle seeing, simple past saw or (dialectal) seen or (dialectal) seent or (dialectal) seed, past participle seen or (dialectal) seent or (dialectal) seed)
Since this is written as running text and includes extra information such as dialectal forms, it seems difficult to parse programmatically.
If you check the wiki source, however, it is in a format that is easier to handle.
====Verb====
{{en-verb|sees|seeing|saw|past2=seen|past2_qual=dialectal|past3=seent|past3_qual=dialectal|past4=seed|past4_qual=dialectal|seen|past_ptc2=seent|past_ptc2_qual=dialectal|past_ptc3=seed|past_ptc3_qual=dialectal}}
It seems that the additional named parameters such as past2= can be ignored. The following information remains:
{{en-verb|sees|seeing|saw|seen}}
The base form can be obtained from the title of the page.
I checked the headings, which are surrounded by equals signs.
There were no cases where the numbers of equals signs on the left and right did not match, but some headings contain extra spaces.
[Example] Spaces are shown as underscores.
== Japanese ==_
(trailing space)
====_Translations_====
(spaces between the equals signs and the title)
Such patterns are only a small fraction of the whole, so the spaces are probably not intentional; but since they do occur, we need to handle them.
The heading level also does not seem to be completely consistent; Verb sometimes appears at a different level:
===Verb===
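As a rough way to cope with both quirks, headings can be matched with a single regex that tolerates extra spaces and any number of equals signs. This is just a sketch of the idea; the actual script below uses separate patterns for the language and part-of-speech levels and strips the captured title.

```python
import re

# Sketch: tolerate spaces around the title and any heading level.
heading = re.compile(r"^(=+)\s*(.*?)\s*(=+)\s*$")

for line in ["== Japanese == ", "====  Translations  ====", "===Verb==="]:
    m = heading.match(line)
    print(len(m[1]), m[2])   # 2 Japanese / 4 Translations / 3 Verb
```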
The processing is implemented in Python.
Extract the title, id, and text of each page:
```python
import bz2, io, re                       # imports used by this and the following snippets
import xml.etree.ElementTree as ET

def getpages(bz2data):
    xml = bz2.decompress(bz2data).decode("utf-8")
    # A multistream chunk contains bare <page> elements, so wrap them in a root.
    pages = ET.fromstring(f"<pages>{xml}</pages>")
    for page in pages:
        if int(page.find("ns").text) == 0:
            title = page.find("title").text
            id = int(page.find("id").text)
            with io.StringIO(page.find("revision/text").text) as text:
                yield id, title, text
```
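As a small usage sketch, assuming the byte offset and length of one bz2 stream are already known (in practice they would come from the multistream index file that accompanies the dump; the function name here is my own):

```python
def list_stream(dumpfile, pos, length):
    # pos and length identify one bz2 stream inside the multistream dump.
    with open(dumpfile, "rb") as f:
        f.seek(pos)
        data = f.read(length)
    for id, title, text in getpages(data):
        print(id, title)
```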
Split the text into sections by a heading pattern:
```python
def splittext(pattern, text):
    line = next(text, "")
    while line:
        m = pattern.match(line)
        line = next(text, "")
        if m:
            def g():
                nonlocal line
                while line and not pattern.match(line):
                    yield line
                    line = next(text, "")
            yield m[1].strip(), g()
```
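To see what splittext yields, here is a small usage sketch with made-up sample text. Each yielded body is a generator that shares the current line with the outer loop, so it should be consumed (or skipped) before moving on to the next section.

```python
import io, re

sample = io.StringIO(
    "==English==\n"
    "===Verb===\n"
    "{{en-verb|sees|seeing|saw|seen}}\n"
    "==Japanese==\n"
)
pattern = re.compile("==([^=].*)==")
for heading, body in splittext(pattern, sample):
    print(heading, list(body))
# English ['===Verb===\n', '{{en-verb|sees|seeing|saw|seen}}\n']
# Japanese []
```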
Collect the {{en-verb ...}} lines under the ====Verb==== heading in the ==English== section. Headwords that contain spaces or hyphens are excluded as idioms or compound words.
```python
def en_verb(args):
    target, pos, length = args
    with open(target, "rb") as f:
        f.seek(pos)
        bz2data = f.read(length)
    pattern1 = re.compile("==([^=].*)==")
    pattern2 = re.compile("===+([^=].*?)===")
    result = []
    for id, title, text in getpages(bz2data):
        if " " in title or "-" in title: continue
        for lang, text2 in splittext(pattern1, text):
            if lang != "English": continue
            for subsub, text3 in splittext(pattern2, text2):
                if subsub != "Verb": continue
                for line in text3:
                    if line.startswith("{{en-verb"):
                        result.append((id, title, line.strip()))
                        break
    return result
```
Add parallelism and file output to this. I put the whole script below.
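For reference, the parallel part could look roughly like the sketch below; how the (target, pos, length) chunks are built from the multistream index is omitted here, and the details differ from the actual en-verb.py.

```python
import sys
from multiprocessing import Pool

def run(chunks):
    # chunks: a list of (target, pos, length) tuples built from the index file.
    with Pool() as pool, open("en-verb.tsv", "w") as out:
        for i, rows in enumerate(pool.imap(en_verb, chunks), 1):
            for id, title, line in rows:
                out.write(f"{id}\t{title}\t{line}\n")
            print(f"{i:,} / {len(chunks):,}", end="\r", file=sys.stderr)
```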
Run the script and check the number of lines of output data.
Execution result
$ time python en-verb.py
6,861 / 6,861
real 1m50.159s
user 13m53.906s
sys 0m17.438s
$ wc -l en-verb.tsv
31334 en-verb.tsv
Next, let's format the acquired data so that it is easier to process.
Looking at the data, some entries contain links and comments.
$ grep "\[" en-verb.tsv | head -n 2
4109 U-turn {{en-verb|head=[[U]]-[[turn]]|U-turns|U-turning|U-turned}}
5661 read {{en-verb|reads|reading|[[:en:#Etymology_2|read]]|[[:en:#Etymology_2|read]]|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
$ grep '<!--' en-verb.tsv | head -n 2
5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}<!--Sang or sung for preterite, according to AHD.-->
6959 hide {{en-verb|hides|hiding|hided}}<!--not hid, hidden!-->
Remove the link brackets and the comments. For a link such as [[:en:#Etymology_2|read]], only the last of the parts separated by | is kept (in this example, read).
en-verb-2.py
```python
import re, sys

pattern1 = re.compile(r"\[\[(.+?)\]\]")
pattern2 = re.compile(r"<!--(.*?)-->")

for line in sys.stdin:
    # Replace each [[...]] link with the last of its |-separated parts.
    while (m := pattern1.search(line)):
        data = m[1].split("|")
        line = line[:m.start()] + data[-1] + line[m.end():]
    # Strip HTML comments.
    sys.stdout.write(pattern2.sub("", line))
```
Execution result
$ python en-verb-2.py < en-verb.tsv > en-verb-2.tsv
$ diff -U 0 en-verb.tsv en-verb-2.tsv | head -n 8
--- en-verb.tsv 2020-06-12 21:22:01.943343500 +0900
+++ en-verb-2.tsv 2020-06-12 21:22:54.506793900 +0900
@@ -747 +747 @@
-5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}<!--Sang or sung for preterite, according to AHD.-->
+5443 sing {{en-verb|sings|singing|sang|sung|past_ptc2=sungen|past_ptc2_qual=archaic}}
@@ -763 +763 @@
-5661 read {{en-verb|reads|reading|[[:en:#Etymology_2|read]]|[[:en:#Etymology_2|read]]|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
+5661 read {{en-verb|reads|reading|read|read|past_ptc2=readen|past_ptc2_qual=archaic, dialectal}}
Additional information such as past_ptc2=readen for read is needed if you are looking at dialects or older material, but this time we remove it. Anything that follows the closing }} is also removed.
en-verb-3.py
```python
import re, sys

pattern1 = re.compile(r"{{(.*?)}}")
pattern2 = re.compile(r"[a-z0-9_]+?=")

for line in sys.stdin:
    if (m := pattern1.search(line)):
        data = [d for d in m[1].split("|") if not pattern2.match(d)]
        line = line[:m.start()] + "{{" + "|".join(data) + "}}\n"
    sys.stdout.write(line)
```
Execution result
$ python en-verb-3.py < en-verb-2.tsv > en-verb-3.tsv
$ diff -U 0 en-verb-2.tsv en-verb-3.tsv | head -n 8
--- en-verb-2.tsv 2020-06-12 21:22:54.506793900 +0900
+++ en-verb-3.tsv 2020-06-12 21:25:39.197882200 +0900
@@ -14 +14 @@
-71 crow {{en-verb|crows|crowing|crowed|past2=crew|past2_qual=UK|crowed|past_ptc2=crown|past_ptc2_qual=archaic}}
+71 crow {{en-verb|crows|crowing|crowed|crowed}}
@@ -19 +19 @@
-114 may {{en-verb|may|-|might|-|past_ptc2=mought|past_ptc2_qual=obsolete}}
+114 may {{en-verb|may|-|might|-}}
Since the purpose this time is to investigate irregular verbs, we will exclude the regular ones.
Let's check how regular verbs are written, using the data generated so far.
2157 open {{en-verb}}
46912 like {{en-verb|lik}}
58007 wish {{en-verb|es}}
60426 hone {{en-verb|hon|es}}
34295 chop {{en-verb|chop|p|ing}}
39760 compel {{en-verb|compel|l|ed}}
This seems to mean the following. A hyphen is a delimiter between the stem and the ending.
| Word | 3rd person singular | Present participle | Past / past participle | Interpretation |
|---|---|---|---|---|
| open | open-s | open-ing | open-ed | no inflection information |
| like | like-s | lik-ing | lik-ed | stem lik for the last two forms |
| wish | wish-es | wish-ing | wish-ed | -es for the 3rd person singular |
| hone | hon-es | hon-ing | hon-ed | stem hon plus -es for the 3rd person singular |
| chop | chop-s | chop-p-ing | chop-p-ed | doubled p, third argument ing |
| compel | compel-s | compel-l-ing | compel-l-ed | doubled l, third argument ed |
Like and hone follow the same pattern, and both notations give the same result. The hyphen positions differ, but they have no linguistic significance; they exist only for the template that generates the inflected forms.
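To make this interpretation concrete, here is a sketch that expands the shorthand into full forms. It covers only the patterns in the table above and reflects my reading of the notation, not the actual Wiktionary template code.

```python
def expand(word, args):
    # Expand {{en-verb|...}} shorthand into (3rd sing., -ing form, -ed form).
    # Only the regular-verb patterns from the table above are covered.
    if not args:                                    # {{en-verb}}        -> open
        return word + "s", word + "ing", word + "ed"
    if len(args) == 1 and args[0] == "es":          # {{en-verb|es}}     -> wish
        return word + "es", word + "ing", word + "ed"
    if len(args) == 1:                              # {{en-verb|lik}}    -> like
        return word + "s", args[0] + "ing", args[0] + "ed"
    if len(args) == 2 and args[1] == "es":          # {{en-verb|hon|es}} -> hone
        return args[0] + "es", args[0] + "ing", args[0] + "ed"
    if len(args) == 3 and args[2] in ("ing", "ed"): # {{en-verb|chop|p|ing}}
        stem = args[0] + args[1]
        return word + "s", stem + "ing", stem + "ed"
    return tuple(args)  # otherwise the arguments are the explicit full forms

print(expand("chop", ["chop", "p", "ing"]))  # ('chops', 'chopping', 'chopped')
```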
Remove entries with no arguments, with at most two arguments, or whose third argument is ing or ed, since these are regular. Also strip the leading {{en-verb| and the trailing }}, and separate the remaining fields with tabs.
en-verb-4.py
```python
import re, sys

pattern = re.compile(r"{{en-verb\|(.*?)\|*}}")

for line in sys.stdin:
    id, verb, forms = line.split("\t")
    if (m := pattern.match(forms)):
        forms = m[1].split("|")
        if len(forms) > 2 and forms[2] != "ing" and forms[2] != "ed":
            forms = "\t".join(forms)
            print(f"{id}\t{verb}\t{forms}")
```
Execution result
$ python en-verb-4.py < en-verb-3.tsv > en-verb-4.tsv
$ head -n 5 en-verb-4.tsv
71 crow crows crowing crowed crowed
112 march marches marching marched
114 may may - might -
167 swop swops swopping swopped
180 deal deals dealing dealt
$ wc -l en-verb-4.tsv
5296 en-verb-4.tsv
This narrowed things down to some extent, but many entries remain.
Looking at the data, the notation for specifying the stem is not unified, and regular verbs are still mixed in.
71 crow crows crowing crowed crowed
167 swop swops swopping swopped
Ignoring the third-person singular, the present participle and the other extra information, we exclude as regular verbs the entries that meet the conditions implemented in the script below.
There are rules about which final consonants are doubled, but we do not need them here, so I will not go into them.
Words with apostrophes are judged with the apostrophes removed.
127005 F F's|F'ing|F'ed
Entries whose past tense and past participle are both undefined (-) are also removed.
en-verb-5.py
```python
import sys

for line in sys.stdin:
    id, *forms = line.strip().split("\t")
    # If the past participle column is missing, it equals the past form.
    if len(forms) == 4: forms.append(forms[3])
    verb, _, _, past, pp = [f.replace("'", "").replace("-", "") for f in forms]
    if past == pp:
        if not past: continue
        if past.endswith("ed"):
            if verb.endswith("e") and verb + "d" == past: continue
            if verb.endswith("y") and verb[:-1] + "ied" == past: continue
            if verb + "ed" == past: continue
            if verb + verb[-1] + "ed" == past: continue
            if verb[-1] == "c" and verb + "ked" == past: continue
    forms = "\t".join(forms)
    print(f"{id}\t{forms}")
```
Execution result
$ python en-verb-5.py < en-verb-4.tsv > en-verb-5.tsv
$ wc -l en-verb-5.tsv
1461 en-verb-5.tsv
Looking at the data, the same word sometimes appears more than once.
5438 think thinks thinking thought thought
5438 think thinks thinking thought thought
In addition, there are compound words that follow the same inflection pattern as their base verb.
5664 draw draws drawing drew drawn
7404 overdraw overdraws overdrawing overdrew overdrawn
7761 withdraw withdraws withdrawing withdrew withdrawn
Merge identical words into one, and where a compound shares its pattern with a shorter word, keep only the shorter one.
en-verb-6.py
```python
import sys

verbs = {}
for line in sys.stdin:
    id, verb, *forms = line.strip().split("\t")
    if verb in verbs: continue
    verbs[verb] = (id, forms)

for v1, (id, forms) in verbs.items():
    contains = False
    for v2, (_, f2) in verbs.items():
        if v1 != v2 and v1.endswith(v2):
            c = True
            for a, b in zip(forms, f2):
                if not a.endswith(b):
                    c = False
                    break
            if c:
                contains = True
                break
    if not contains:
        forms = "\t".join(forms)
        print(f"{id}\t{v1}\t{forms}")
```
Execution result
$ python en-verb-6.py < en-verb-5.tsv > en-verb-6.tsv
$ wc -l en-verb-6.tsv
378 en-verb-6.tsv
This narrowed it down considerably; from here it is realistic to examine the remaining entries individually.
I have posted the resulting file. Some issues remain, but fixing them would require modifying the original data.
By applying this method, you can investigate languages other than English. Of course, Wiktionary is not the only source of information, but it is a good place to start because you can gather the information as soon as you have written a program.
When learning a new language, this may be useful for creating self-study materials.
If I try something, I will add it here.