I tried the Language Processing 100 Knock 2020. You can find links to the other chapters here, and the source code here.
Read the JSON file of Wikipedia articles and display the text of the article about the "UK". Problems 21-29 operate on the article text extracted here.
020.py
import pandas as pd
path = "jawiki-country.json.gz"
df = pd.read_json(path, lines=True)
print(df.query("title == 'イギリス'")["text"].values[0])
# -> {{redirect|UK}}
# {{redirect|英国|春秋時代の諸侯国|英 (春秋)}}
# {{Otheruses|ヨーロッパの国|長崎県・熊本県の郷土料理|いぎりす}}
Since the given file is in JSON Lines format, we set `lines=True`. You can pick out the rows that meet a condition with `query()`; you could also write `df[df["title"] == "イギリス"]` (a toy comparison of the two follows below).
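For reference, here is a minimal sketch, using made-up toy data rather than the actual dump, showing that `query()` and boolean indexing select the same rows:

```python
import pandas as pd

# Toy stand-in for the dump; the real file is read with
# pd.read_json(path, lines=True) as above.
df = pd.DataFrame([
    {"title": "イギリス", "text": "{{redirect|UK}} ..."},
    {"title": "エジプト", "text": "..."},
])

a = df.query("title == 'イギリス'")["text"].values[0]
b = df[df["title"] == "イギリス"]["text"].values[0]
assert a == b  # both select the same article text
```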
Extract the lines that declare the category names in the article.
021.py
import input_json as js
article = js.input()
print(list(filter(lambda x: "[[Category" in x, article)))
# -> ['[[Category:イギリス|*]]', '[[Category:イギリス連邦加盟国]]', '[[Category:英連邦王国|*]]'...
The text extracted in No. 20 can be loaded with `import input_json as js` and then `js.input()`. The `filter` function returns an iterator, right? I have the impression that the `list(filter())` wrapping doesn't read well; I wonder if there is another way to implement this (one alternative is sketched below).
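As one possible alternative, and only a sketch with made-up sample lines, a list comprehension does the same filtering in a single expression:

```python
article = ["[[Category:イギリス|*]]", "== 国名 ==", "[[Category:英連邦王国|*]]"]

# Equivalent to list(filter(lambda x: "[[Category" in x, article))
categories = [line for line in article if "[[Category" in line]
print(categories)  # -> ['[[Category:イギリス|*]]', '[[Category:英連邦王国|*]]']
```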
Extract the article category names (by name, not line by line).
022.py
import input_json as js
import re
article = js.input()
matches = [re.match(r"\[\[Category:([^\|,\]]+)", item) for item in article]
print([m.groups()[0] for m in matches if m is not None])
# -> ['イギリス', 'イギリス連邦加盟国', '英連邦王国'...
I noticed that the theme of this chapter is regular expressions, so I wrote the solution with one. It extracts the string enclosed between `[[Category:` and a `|` or `]`. It's like cryptography, and it's fun.
`groups()` gives you the contents of the `()` groups in the regular expression as a tuple, so I collected those into a list (a small demo follows below).
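To make the `groups()` behavior concrete, a small standalone check (the sample string is made up):

```python
import re

m = re.match(r"\[\[Category:([^\|,\]]+)", "[[Category:イギリス|*]]")
print(m.groups())     # -> ('イギリス',) — a tuple with one element per () group
print(m.groups()[0])  # -> イギリス
```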
Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").
023.py
import input_json as js
import re
article = js.input()
matches = [re.search("(==+)([^=]*)", item) for item in article]
print([{m.group(2): f'Level {m.group(1).count("=") - 1}'} for m in matches if m is not None])
# -> [{'国名': 'Level 1'}, {'歴史': 'Level 1'}...
I captured the `==+` and `[^=]*` parts of the regular expression with `search()` and `group()`. The section level corresponds to the number of `=` signs minus one (so `==` is level 1), so I used `count()` and put the results together in a `dict` (checked on sample headings below).
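A quick check of the level arithmetic on made-up sample headings:

```python
import re

for line in ["== 国名 ==", "=== 歴史 ==="]:
    m = re.search("(==+)([^=]*)", line)
    # "==" marks level 1 and "===" level 2, hence the minus one
    print(m.group(2).strip(), m.group(1).count("=") - 1)
# -> 国名 1
# -> 歴史 2
```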
Extract all media files referenced from the article.
024.py
import input_json as js
import re
article = js.input()
matches = [re.search(r"^\[\[File:([^|]*)", item) for item in article]
print([m.group(1) for m in matches if m is not None])
# -> ['Descriptio Prime Tabulae Europae.jpg', "Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg"...
If you look at the markup quick reference, the parts starting with `[[File:` appear to be media files. My search also turned up lines that merely contain `[[File:` somewhere mid-line, but I excluded those because they seemed unlikely to be actual file references (the `^` anchor handles this; see the sketch below).
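A minimal sketch of how the `^` anchor excludes mid-line mentions (sample lines are made up):

```python
import re

lines = [
    "[[File:Flag of the United Kingdom.svg|thumb|国旗]]",  # starts with [[File: -> keep
    "See the image [[File:Example.jpg]] in the text.",     # mid-line mention -> skip
]
for line in lines:
    m = re.search(r"^\[\[File:([^|]*)", line)
    if m is not None:
        print(m.group(1))
# -> Flag of the United Kingdom.svg
```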
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
025.py
import pandas as pd
import re
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
ans = {}
for item in template[0].replace(" ", "").split("\n|"):
    kv = re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL).split("=")
    ans[kv[0]] = kv[1]
print(ans)
# -> {'略名': 'イギリス', '日本語国名': 'グレートブリテン及び北アイルランド連合王国',...
The part that starts with `{{基礎情報 国` and ends with `\n}}` is matched with a positive lookbehind and a positive lookahead, and the result is split on `\n|`.
You need the `re.DOTALL` flag because the match has to span `\n` characters (a toy demonstration follows below). This problem took quite a while ...
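Here is a toy demonstration of the lookbehind/lookahead idea; the template text is made up, not the real article:

```python
import re

text = "{{基礎情報 国\n|略名 = イギリス\n|公用語 = 英語\n}}"
body = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", text, flags=re.DOTALL)
print(body[0].split("\n|"))  # -> ['略名 = イギリス', '公用語 = 英語']
```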
In addition to the processing of No. 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (reference: markup quick reference).
026.py
import pandas as pd
import re
def remove_quote(a: list):
    ans = {}
    for i in a:
        i = re.sub("'+", "", i, flags=re.DOTALL)
        i = re.sub("<br/>", "", i, flags=re.DOTALL).split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL) for item in template[0].replace(" ", "").split("\n|")]
print(remove_quote(items))
# -> ...'標語': '{{lang|fr|[[Dieuetmondroit]]}}（[[フランス語]]:[[Dieuetmondroit|神と我が権利]]）',...
Removed the `'` and `<br/>` from the output of No. 25.
It turns out you can annotate the type of a function argument, so I tried using that (a note on this follows below).
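For what it's worth, those annotations are hints only; Python does not enforce them at runtime. A minimal sketch:

```python
def remove_quote(a: list) -> dict:
    # The ": list" annotation documents intent; passing another type is not an error.
    return {item.split("=")[0]: item.split("=")[1] for item in a}

print(remove_quote(["略名=イギリス"]))  # -> {'略名': 'イギリス'}
```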
In addition to the processing of No. 26, remove MediaWiki's internal link markup from the template values and convert them to plain text (reference: markup quick reference).
027.py
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = i.split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].replace(" ", "").split("\n|")]
print(remove_markup(items))
# -> ...'標語': '{{lang|fr|Dieuetmondroit}}（フランス語:神と我が権利）'...
Links of the form `[[A]]` output `A`, and links of the form `[[A|...|B]]` output `B` (a small demo of the two patterns follows below).
This time I also precompiled the regular expressions before using them. I'm gradually losing track of my regular expressions ..
Also, something I wondered about while writing the answer above: if anyone knows how to fix the syntax highlighting going strange when a regular expression is embedded, please let me know.
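To see the two link patterns in action on made-up strings:

```python
import re

p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")  # [[A|...|B]] -> B
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")              # [[A]] -> A

print(p_emphasis1.sub(r"\1", "[[イギリスの国章|国章]]"))  # -> 国章
print(p_emphasis2.sub(r"\1", "[[フランス語]]"))          # -> フランス語
```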
In addition to the processing of No. 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
028.py
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_template1 = re.compile(r"\{\{\d\}\}")
p_template2 = re.compile(r"\{\{.*\|([^\|]*?)\}\}")
p_refname = re.compile("<ref name.*")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_template1.sub("", i)
        i = p_template2.sub(r"\1", i)
        i = p_refname.sub("", i)
        i = re.sub("（(国章)）", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].replace(" ", "").split("\n|")]
print(remove_markup(items))
# -> ...'標語': 'Dieuetmondroit（フランス語:神と我が権利）'...
I removed the markup piece by piece. But this way of writing doesn't seem to apply well to articles other than the UK. (For example, in the Singapore article, the flag-width parameter could not be extracted.)
Get the URL of the national flag image using the contents of the template. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)
029.py
import requests
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br />")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_template1 = re.compile(r"\{\{\d\}\}")
p_template2 = re.compile(r"\{\{.*\|([^\|]*?)\}\}")
p_refname = re.compile("<ref name.*")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_template1.sub("", i)
        i = p_template2.sub(r"\1", i)
        i = p_refname.sub("", i)
        i = re.sub("（(国章)）", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        ans[i[0].strip()] = i[1].strip()
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].split("\n|")]
page = remove_markup(items)
print(page["国旗画像"])
url = 'https://en.wikipedia.org/w/api.php'
PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:" + page["国旗画像"]
}
response = requests.get(url, params=PARAMS)
data = response.json()
for k, v in data["query"]["pages"].items():
    print(f"{v['imageinfo'][0]['url']}")
# -> https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg
Actually, I wanted to just `import` the code from No. 28, but I changed the markup-removal part slightly, so I posted all the lines again.
I am sending a GET request with the `requests` module (a defensive sketch of extracting the URL follows below). The imageinfo page of the MediaWiki API documentation was very helpful.
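As a hedged sketch (the helper name is mine, not from the original code), here is a more defensive way to pull the first URL out of the response, assuming the `query → pages → imageinfo → url` shape used above:

```python
def first_image_url(data: dict):
    """Return the first imageinfo URL in a MediaWiki API response, or None."""
    for page in data.get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo")
        if info:
            return info[0].get("url")
    return None

# print(first_image_url(data))  # same result as the loop above
```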