I tried the Language Processing 100 Knock 2020. You can find links to the other chapters here, and the source code here.
Read the JSON file of Wikipedia articles and display the text of the article about the "UK". Problems 21-29 operate on the article text extracted here.
020.py
import pandas as pd
path = "jawiki-country.json.gz"
df = pd.read_json(path, lines=True)
print(df.query("title == 'イギリス'")["text"].values[0])
# -> {{redirect|UK}}
# {{redirect|英国|春秋時代の諸侯国|英 (春秋)}}
# {{Otheruses|ヨーロッパの国|長崎県・熊本県の郷土料理|いぎりす}}
Since the given file is in JSON Lines format, we set `lines=True`. You can pick out the rows that meet a condition with `query()`; you could also write `df[df["title"] == "イギリス"]` (a toy comparison of the two follows below).
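For reference, here is a minimal sketch, using made-up toy data rather than the actual dump, showing that `query()` and boolean indexing select the same rows:

```python
import pandas as pd

# Toy stand-in for the dump; the real file is read with
# pd.read_json(path, lines=True) as above.
df = pd.DataFrame([
    {"title": "イギリス", "text": "{{redirect|UK}} ..."},
    {"title": "エジプト", "text": "..."},
])

a = df.query("title == 'イギリス'")["text"].values[0]
b = df[df["title"] == "イギリス"]["text"].values[0]
assert a == b  # both select the same article text
```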
Extract the lines that declare the category names in the article.
021.py
import input_json as js
article = js.input()
print(list(filter(lambda x: "[[Category" in x, article)))
# -> ['[[Category:イギリス|*]]', '[[Category:イギリス連邦加盟国]]', '[[Category:英連邦王国|*]]'...
The text extracted in No. 20 can be loaded with `import input_json as js` and then `js.input()`. The `filter` function returns an iterator, right? I have the impression that the `list(filter())` wrapping doesn't read well; I wonder if there is another way to implement this (one alternative is sketched below).
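As one possible alternative, and only a sketch with made-up sample lines, a list comprehension does the same filtering in a single expression:

```python
article = ["[[Category:イギリス|*]]", "== 国名 ==", "[[Category:英連邦王国|*]]"]

# Equivalent to list(filter(lambda x: "[[Category" in x, article))
categories = [line for line in article if "[[Category" in line]
print(categories)  # -> ['[[Category:イギリス|*]]', '[[Category:英連邦王国|*]]']
```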
Extract the article category names (by name, not line by line).
022.py
import input_json as js
import re
article = js.input()
matches = [re.match(r"\[\[Category:([^\|,\]]+)", item) for item in article]
print([m.groups()[0] for m in matches if m is not None])
# -> ['イギリス', 'イギリス連邦加盟国', '英連邦王国'...
I noticed that the theme of this chapter is regular expressions, so I wrote the solution with one. It extracts the string enclosed between `[[Category:` and a `|` or `]`. It's like cryptography, and it's fun.
`groups()` gives you the contents of the `()` groups in the regular expression as a tuple, so I collected those into a list (a small demo follows below).
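To make the `groups()` behavior concrete, a small standalone check (the sample string is made up):

```python
import re

m = re.match(r"\[\[Category:([^\|,\]]+)", "[[Category:イギリス|*]]")
print(m.groups())     # -> ('イギリス',) — a tuple with one element per () group
print(m.groups()[0])  # -> イギリス
```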
Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").
023.py
import input_json as js
import re
article = js.input()
matches = [re.search("(==+)([^=]*)", item) for item in article]
print([{m.group(2): f'Level {m.group(1).count("=") - 1}'} for m in matches if m is not None])
# -> [{'国名': 'Level 1'}, {'歴史': 'Level 1'}...
I captured the `==+` and `[^=]*` parts of the regular expression with `search()` and `group()`. The section level corresponds to the number of `=` signs minus one (so `==` is level 1), so I used `count()` and put the results together in a `dict` (checked on sample headings below).
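A quick check of the level arithmetic on made-up sample headings:

```python
import re

for line in ["== 国名 ==", "=== 歴史 ==="]:
    m = re.search("(==+)([^=]*)", line)
    # "==" marks level 1 and "===" level 2, hence the minus one
    print(m.group(2).strip(), m.group(1).count("=") - 1)
# -> 国名 1
# -> 歴史 2
```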
Extract all media files referenced from the article.
024.py
import input_json as js
import re
article = js.input()
matches = [re.search(r"^\[\[File:([^|]*)", item) for item in article]
print([m.group(1) for m in matches if m is not None])
# -> ['Descriptio Prime Tabulae Europae.jpg', "Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg"...
If you look at the markup quick reference, the parts starting with `[[File:` appear to be media files. My search also turned up lines that merely contain `[[File:` somewhere mid-line, but I excluded those because they seemed unlikely to be actual file references (the `^` anchor handles this; see the sketch below).
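A minimal sketch of how the `^` anchor excludes mid-line mentions (sample lines are made up):

```python
import re

lines = [
    "[[File:Flag of the United Kingdom.svg|thumb|国旗]]",  # starts with [[File: -> keep
    "See the image [[File:Example.jpg]] in the text.",     # mid-line mention -> skip
]
for line in lines:
    m = re.search(r"^\[\[File:([^|]*)", line)
    if m is not None:
        print(m.group(1))
# -> Flag of the United Kingdom.svg
```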
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
025.py
import pandas as pd
import re
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
ans = {}
for item in template[0].replace(" ", "").split("\n|"):
    kv = re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL).split("=")
    ans[kv[0]] = kv[1]
print(ans)
# -> {'略名': 'イギリス', '日本語国名': 'グレートブリテン及び北アイルランド連合王国',...
The part that starts with `{{基礎情報 国` and ends with `\n}}` is matched with a positive lookbehind and a positive lookahead, and the result is split on `\n|`.
You need the `re.DOTALL` flag because the match has to span `\n` characters (a toy demonstration follows below). This problem took quite a while ...
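Here is a toy demonstration of the lookbehind/lookahead idea; the template text is made up, not the real article:

```python
import re

text = "{{基礎情報 国\n|略名 = イギリス\n|公用語 = 英語\n}}"
body = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", text, flags=re.DOTALL)
print(body[0].split("\n|"))  # -> ['略名 = イギリス', '公用語 = 英語']
```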
In addition to the processing of No. 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to plain text (reference: markup quick reference).
026.py
import pandas as pd
import re
def remove_quote(a: list):
    ans = {}
    for i in a:
        i = re.sub("'+", "", i, flags=re.DOTALL)
        i = re.sub("<br/>", "", i, flags=re.DOTALL).split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [re.sub(r"<ref>.*</ref>", "", item, flags=re.DOTALL) for item in template[0].replace(" ", "").split("\n|")]
print(remove_quote(items))
# -> ...'標語': '{{lang|fr|[[Dieuetmondroit]]}}（[[フランス語]]:[[Dieuetmondroit|神と我が権利]]）',...
Removed the `'` and `<br/>` from the output of No. 25.
It turns out you can annotate the type of a function argument, so I tried using that (a note on this follows below).
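For what it's worth, those annotations are hints only; Python does not enforce them at runtime. A minimal sketch:

```python
def remove_quote(a: list) -> dict:
    # The ": list" annotation documents intent; passing another type is not an error.
    return {item.split("=")[0]: item.split("=")[1] for item in a}

print(remove_quote(["略名=イギリス"]))  # -> {'略名': 'イギリス'}
```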
In addition to the processing of No. 26, remove MediaWiki's internal link markup from the template values and convert them to plain text (reference: markup quick reference).
027.py
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = i.split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].replace(" ", "").split("\n|")]
print(remove_markup(items))
# -> ...'標語': '{{lang|fr|Dieuetmondroit}}（フランス語:神と我が権利）'...
Links of the form `[[A]]` output `A`, and links of the form `[[A|...|B]]` output `B` (a small demo of the two patterns follows below).
This time I also precompiled the regular expressions before using them. I'm gradually losing track of my regular expressions ..
Also, something I wondered about while writing the answer above: if anyone knows how to fix the syntax highlighting going strange when a regular expression is embedded, please let me know.
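To see the two link patterns in action on made-up strings:

```python
import re

p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")  # [[A|...|B]] -> B
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")              # [[A]] -> A

print(p_emphasis1.sub(r"\1", "[[イギリスの国章|国章]]"))  # -> 国章
print(p_emphasis2.sub(r"\1", "[[フランス語]]"))          # -> フランス語
```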
In addition to the processing of No. 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
028.py
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br/>")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_template1 = re.compile(r"\{\{\d\}\}")
p_template2 = re.compile(r"\{\{.*\|([^\|]*?)\}\}")
p_refname = re.compile("<ref name.*")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_template1.sub("", i)
        i = p_template2.sub(r"\1", i)
        i = p_refname.sub("", i)
        i = re.sub("（(国章)）", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        ans[i[0]] = i[1]
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].replace(" ", "").split("\n|")]
print(remove_markup(items))
# -> ...'標語': 'Dieuetmondroit（フランス語:神と我が権利）'...
I removed the markup piece by piece. But this way of writing doesn't seem to apply well to articles other than the UK. (For example, in the Singapore article, the flag-width parameter could not be extracted.)
Get the URL of the national flag image using the contents of the template. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)
029.py
import requests
import pandas as pd
import re
p_quote = re.compile("'+")
p_br = re.compile("<br />")
p_ref = re.compile(r"<ref>.*</ref>", re.DOTALL)
p_emphasis1 = re.compile(r"\[\[[^\]]*\|([^\|]*?)\]\]")
p_emphasis2 = re.compile(r"\[\[(.+?)\]\]")
p_template1 = re.compile(r"\{\{\d\}\}")
p_template2 = re.compile(r"\{\{.*\|([^\|]*?)\}\}")
p_refname = re.compile("<ref name.*")
def remove_markup(a: list):
    ans = {}
    for i in a:
        i = p_quote.sub("", i)
        i = p_br.sub("", i)
        i = p_emphasis1.sub(r"\1", i)
        if p_emphasis2.search(i):
            i = i.replace("[", "").replace("]", "")
        i = p_template1.sub("", i)
        i = p_template2.sub(r"\1", i)
        i = p_refname.sub("", i)
        i = re.sub("（(国章)）", r"\1", i)
        i = re.sub(r"\}\}File.*", "", i).split("=")
        ans[i[0].strip()] = i[1].strip()
    return ans
df = pd.read_json("jawiki-country.json.gz", lines=True)
article = df.query("title == 'イギリス'")["text"].values[0]
template = re.findall(r"(?<=基礎情報 国\n\|).*(?=\n\}\})", article, flags=re.DOTALL)
items = [p_ref.sub("", item) for item in template[0].split("\n|")]
page = remove_markup(items)
print(page["国旗画像"])
url = 'https://en.wikipedia.org/w/api.php'
PARAMS = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",
    "titles": "File:" + page["国旗画像"]
}
response = requests.get(url, params=PARAMS)
data = response.json()
for k, v in data["query"]["pages"].items():
    print(f"{v['imageinfo'][0]['url']}")
# -> https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg
Actually, I wanted to just `import` the code from No. 28, but I changed the markup-removal part slightly, so I posted all the lines again.
I am sending a GET request with the `requests` module (a defensive sketch of extracting the URL follows below). The imageinfo page of the MediaWiki API documentation was very helpful.
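As a hedged sketch (the helper name is mine, not from the original code), here is a more defensive way to pull the first URL out of the response, assuming the `query → pages → imageinfo → url` shape used above:

```python
def first_image_url(data: dict):
    """Return the first imageinfo URL in a MediaWiki API response, or None."""
    for page in data.get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo")
        if info:
            return info[0].get("url")
    return None

# print(first_image_url(data))  # same result as the loop above
```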