100 Language Processing Knock 2020 was released the other day. I have only been working in natural language processing for a year and there is a lot I don't know in detail, but I will solve all the problems and publish my solutions to improve my technical skills.
Everything is run in Jupyter Notebook, and I may bend the restrictions in the problem statements where convenient. The source code is also on GitHub.
Chapter 2 is here.
The environment is Python 3.8.2 and Ubuntu 18.04.
There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.
・ One article is stored per line in JSON format
・ In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
・ The entire file is compressed with gzip
Create a program that performs the following processing.
Please download the required dataset from here.
Place the downloaded file under data.
Read the JSON file of the Wikipedia articles and display the article text about "UK". Problems 21-29 are run on the article text extracted here.
Import the modules for decompressing gzip and parsing JSON.
code
import gzip
import json
Read the gzip file line by line and convert each line to a dictionary with json.loads().
code
data = []
# 'rt' opens the gzip stream in text mode; the dump is UTF-8 encoded
with gzip.open('data/jawiki-country.json.gz', 'rt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        data.append(json.loads(line))
Find the element of data whose title is "UK" and store its body in text.
code
for df in data:
    # In the Japanese dump the UK article is titled 'イギリス' (shown here as 'England')
    if df['title'] == 'England':
        text = df['text']
        break
The contents are like this
{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
|National flag image= Flag of the United Kingdom.svg
|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
|National emblem link=([[British coat of arms|National emblem]])
|Motto= {{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Die
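The loading and lookup steps above can also be combined so that only the wanted article is kept in memory. The following is a minimal sketch, not the code used in this article: find_article and the tiny sample file are hypothetical names introduced for illustration, so that the block runs without the real dataset.

```python
import gzip
import json
import os
import tempfile

def find_article(path, title):
    """Return the 'text' of the first article whose 'title' matches, or None."""
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            if article['title'] == title:
                return article['text']
    return None

# Build a tiny stand-in file (made-up records) in the same one-JSON-per-line format.
sample = [
    {'title': 'Japan', 'text': 'Article about Japan'},
    {'title': 'UK', 'text': 'Article about the UK'},
]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.json.gz')
with gzip.open(sample_path, 'wt', encoding='utf-8') as f:
    for article in sample:
        f.write(json.dumps(article) + '\n')

print(find_article(sample_path, 'UK'))
```

Stopping at the first match avoids parsing the rest of the file, at the cost of re-reading it if another article is needed later.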
Extract the line that declares the category name in the article.
code
import re
code
lines = text.splitlines()
for line in lines:
    if re.search(r'\[\[Category:.*\]\]', line):
        print(line)
The category lines are found with a regular expression: every line that matches a pattern such as [[Category:Hogehoge]] is printed.
output
[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]
Extract the article category names (by name, not line by line).
code
for line in lines:
    lst = re.findall(r'\[\[Category:(.*)\]\]', line)
    for category in lst:
        print(category)
The captured substrings are collected in lst and printed one by one, which yields the category names. Note that a sort key after | (as in England|*) is still included.
output
England|*
Commonwealth of Nations
Commonwealth Kingdom|*
G8 member countries
European Union member states|Former
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
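If the sort key after | should be dropped as well, the capture group can be limited to the part before it. This is only a sketch under that assumption; CATEGORY_RE and the sample lines are names and data made up for illustration.

```python
import re

# Capture the category name only; an optional "|sort key" part is matched but not captured.
CATEGORY_RE = re.compile(r'\[\[Category:([^|\]]+)(?:\|[^\]]*)?\]\]')

sample_lines = [
    '[[Category:England|*]]',
    '[[Category:Island country]]',
    'An ordinary line of body text',
]
names = [name for line in sample_lines for name in CATEGORY_RE.findall(line)]
print(names)
```

With one capture group, re.findall returns just the captured names, so no further slicing is needed.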
Display the section name and its level contained in the article (for example, 1 if "== section name ==").
code
for line in lines:
    if re.search(r'^==.*==$', line):
        level = len(re.match(r'^=*', line).group()) - 1
        title = re.sub(r'[=\s]', '', line)
        print(level, title)
Extract lines that match patterns such as == section ==, === section ===, and so on. The level of a section is the number of leading = characters minus one.
output
1 Country name
1 history
1 Geography
2 major cities
2 Climate
1 Politics
2 Head of state
2 law
2 Domestic affairs
2 Local administrative divisions
2 Diplomacy / Military
1 economy
2 Mining
2 Agriculture
2 trade
2 real estate
2 Energy policy
2 currencies
2 companies
3 Communication
1 Transportation
2 road
2 Railroad
2 Shipping
2 aviation
1 Science and technology
1 people
2 languages
2 religion
2 Marriage
2 emigration
2 Education
2 Medical
1 culture
2 Food culture
2 Literature
2 Philosophy
2 music
3 popular music
2 movies
2 comedy
2 National flower
2 World Heritage
2 public holidays
2 sports
3 soccer
3 cricket
3 Horse racing
3 motor sports
3 baseball
3 curling
3 Cycling
1 footnote
1 Related items
1 External link
You can also use a backreference, as follows. I haven't measured it properly, but I think it is faster. Which version is more readable depends on the reader, so use whichever you find easier to understand.
code
for line in lines:
    if x := re.match(r'^(==+)(.*)\1$', line):
        print(len(x[1])-1, x[2].strip())
Here I tried the walrus operator (:=), which is available from Python 3.8.
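One practical difference between the two versions is how the title is cleaned up: re.sub(r'[=\s]', '', line) removes every whitespace character, including spaces between words, while strip() only trims the ends. The sketch below illustrates this on made-up sample lines; section_level is a hypothetical helper, not part of the original solution.

```python
import re

def section_level(line):
    """Return (level, title) for a section heading line, or None otherwise."""
    m = re.match(r'^(==+)(.*)\1$', line)
    if m is None:
        return None
    # "==" is level 1, "===" is level 2, and so on.
    return len(m.group(1)) - 1, m.group(2).strip()

print(section_level('==Country name=='))
print(section_level('=== Popular music ==='))
print(section_level('An ordinary line'))

# The first version's cleanup also deletes the space inside the title.
print(re.sub(r'[=\s]', '', '== Country name =='))
```

For Japanese section names, which normally contain no spaces, both cleanups give the same result.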
Since the sections form a "structure", I also implemented something like the tree command.
code
def get_sections():
    return [
        (
            len(re.match(r'^=*', line).group()) - 1,
            re.sub(r'[=\s]', '', line)
        )
        for line in lines
        if re.search(r'^==.*==$', line)
    ]
First, extract the sections from lines as a list of (level, section name) pairs.
code
class Section(list):
    def __init__(self, title):
        self.title = title
        super().__init__()
    def last(self):
        return self[-1]
    def add(self, level, title):
        # level 1 becomes a direct child; deeper levels go to the latest child
        if level == 1:
            self.append(Section(title))
        else:
            self[-1].add(level-1, title)
    def tree_lines(self, head):
        lines = []
        last = len(self) - 1
        for i, x in enumerate(self):
            line = head
            line += '└' if i == last else '├'
            line += x.title
            lines.append(line)
            lines += (x.tree_lines(head + (' ' if i == last else '│')))
        return lines
    def __repr__(self):
        return '\n'.join(self.tree_lines(''))
Create a class whose objects hold sections recursively. Each section object inherits from list, so a level 1 section can hold its level 2 sections inside itself.
code
root = Section('root')
for level, title in get_sections():
    root.add(level, title)
root
The list of sections obtained by get_sections() is inserted recursively from the root section using the add method. The __repr__ method then converts the tree to a string recursively.
output
├ Country name
├ History
├ Geography
│ ├ Major cities
│ └ Climate
├ Politics
│ ├ Head of state
│ ├ Law
│ ├ Internal affairs
│ ├ Local administrative division
│ └ Diplomacy / Military
├ Economy
│ ├ Mining
│ ├ Agriculture
│ ├ Trade
│ ├ Real estate
│ ├ Energy policy
│ ├ Currency
│ └ Company
│ └ Communication
├ Transportation
│ ├ Road
│ ├ Railway
│ ├ Shipping
│ └ Aviation
├ Science and technology
├ People
│ ├ Language
│ ├ Religion
│ ├ Marriage
│ ├ Emigration
│ ├ Education
│ └ Medical
├ Culture
│ ├ Food culture
│ ├ Literature
│ ├ Philosophy
│ ├ Music
│ │ └ Popular music
│ ├ Movie
│ ├ Comedy
│ ├ National flower
│ ├ World Heritage Site
│ ├ Holidays
│ └ Sports
│ ├ Soccer
│ ├ Cricket
│ ├ Horse racing
│ ├ Motor sports
│ ├ Baseball
│ ├ Curling
│ └ Bicycle competition
├ Footnote
├ Related items
└ External link
Under Sports no vertical rule is drawn, because Sports is the last level 2 section under its parent. This is implemented by changing the rule character drawn before the title only for the last child section.
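The same drawing rule can also be written without a class, working directly on the list of (level, title) pairs. This is a rough sketch rather than a replacement for the class above: render_tree is a hypothetical name, the sample sections are made up, and it assumes levels never skip (a level 3 section always follows a level 2 one).

```python
def render_tree(sections):
    """Turn [(level, title), ...] into tree-style lines with rule characters."""
    lines = []
    # ancestors_last[d] is True when the ancestor at depth d + 1 is a last child.
    ancestors_last = []
    for i, (level, title) in enumerate(sections):
        # A node is the last sibling unless a later entry returns to the
        # same level before any shallower level appears.
        is_last = True
        for next_level, _ in sections[i + 1:]:
            if next_level < level:
                break
            if next_level == level:
                is_last = False
                break
        del ancestors_last[level - 1:]
        prefix = ''.join(' ' if last else '│' for last in ancestors_last)
        lines.append(prefix + ('└' if is_last else '├') + title)
        ancestors_last.append(is_last)
    return lines

sample_sections = [(1, 'Geography'), (2, 'Cities'), (2, 'Climate'), (1, 'Politics')]
print('\n'.join(render_tree(sample_sections)))
```

The forward scan makes this quadratic in the number of sections, which is harmless for a single article but is the price of not building the tree first.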
Extract all the media files referenced from the article.
code
for line in lines:
    lst = re.findall(r'\[\[File:([^|\]]*)', line)
    for x in lst:
        print(x)
For every match of [[File:somehow]], the "somehow" part is extracted.
output
Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Birmingham Skyline from Edgbaston Cricket Ground crop.jpg
Leeds CBD at night.jpg
Glasgow and the Clyde from the air (geograph 4665720).jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
Airbus A380-841 G-XLEB British Airways (10424102995).jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg
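To see what the pattern does and does not pick up, here is a small check on a made-up line. The sample string is an assumption for illustration; the pattern is the one used above. File names given as bare template values, such as the flag image in the basic information template, do not start with [[File: and are therefore not captured, which is why the flag is missing from the list.

```python
import re

FILE_RE = re.compile(r'\[\[File:([^|\]]*)')

# One link with display options and one nested inside a template.
sample = ('[[File:Uk topo en.jpg|thumb|200px|Topographic map]] and '
          '{{center|[[File:God Save the Queen.ogg]]}}')
found = FILE_RE.findall(sample)
print(found)

# A bare file name in a template field is not matched by this pattern.
bare = FILE_RE.findall('|National flag image= Flag of the United Kingdom.svg')
print(bare)
```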
Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.
code
for i, line in enumerate(lines):
    if line.startswith('{{Basic information'):
        start = i
    elif line.startswith('}}'):
        end = i
        break
Strictly speaking, nested templates go beyond what regular languages can express. A proper MediaWiki markup parser would be preferable, but here I simply determine the range of lines that make up the basic information template.
code
templete = [
    re.findall(r'\|([^=]*)=(.*)', line)
    for line in lines[start+1 : end]
]
templete = [x[0] for x in templete if x]
dct = {
    key.strip() : value.strip()
    for key, value in templete
}
dct
Store the contents in the dictionary.
output
{'Abbreviated name': 'England',
'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
'National flag image': 'Flag of the United Kingdom.svg',
'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
'National emblem link': '([[British coat of arms|National emblem]])',
'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Dieu et mon droit|God and my rights]])',
'National anthem': "[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />''God save the queen''<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}",
'Map image': 'Europe-UK.svg',
'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
'Official terminology': '[[English]]',
'capital': '[[London]](infact)',
'Largest city': 'London',
'Head of state title': '[[British monarch|Queen]]',
'Name of head of state': '[[Elizabeth II]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'Prime Minister's name': '[[Boris Johnson]]',
'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
'Area ranking': '76',
'Area size': '1 E11',
'Area value': '244,820',
'Water area ratio': '1.3%',
'Demographic year': '2018',
'Population ranking': '22',
'Population size': '1 E7',
'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
'Population density value': '271',
'GDP statistics year yuan': '2012',
'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
'GDP Statistics Year MER': '2012',
'GDP ranking MER': '6',
'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
'GDP statistical year': '2012',
'GDP ranking': '6',
'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
'Founding form': 'Founding of the country',
'Established form 1': '[[Kingdom of England]]/[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707合同法]]Until)',
'Date of establishment 1': '927/843',
'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
'Date of establishment 2': '1707{{0}}May{{0}}1 day',
'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />([[Joint law(1800)|1800合同法]])',
'Date of establishment 3': '1801{{0}}January{{0}}1 day',
'Established form 4': "Current country name "'''United Kingdom of Great Britain and Northern Ireland'''"change to",
'Date of establishment 4': '1927{{0}}April 12',
'currency': '[[Sterling pound|UK pounds]](£)',
'Currency code': 'GBP',
'Time zone': '±0',
'Daylight saving time': '+1',
'ISO 3166-1': 'GB / GBR',
'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
'International call number': '44',
'Note': '<references/>'}
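In the output above, the official country name is cut off after the first line, because each line is parsed on its own and the continuation lines of the <ref> note do not start with |. One way to keep such values whole is to append continuation lines to the previous field. The following is only a sketch under that assumption: parse_template_fields is a hypothetical helper, the sample lines are shortened made-up data, and continuation lines that themselves begin with | would still be misread as new fields.

```python
import re

def parse_template_fields(template_lines):
    """Collect '|key = value' fields, joining lines that continue a value."""
    fields = {}
    key = None
    for line in template_lines:
        m = re.match(r'\|([^=]+?)\s*=\s*(.*)', line)
        if m:
            key = m.group(1).strip()
            fields[key] = m.group(2).strip()
        elif key is not None:
            # Not a new field: treat the line as a continuation of the last value.
            fields[key] += '\n' + line
    return fields

sample_lines = [
    '|Official name = {{lang|en|UK}}<ref>Other names:<br />',
    '*{{lang|cy|Teyrnas}}',
    '</ref>',
    '|Capital = [[London]]',
]
parsed = parse_template_fields(sample_lines)
print(parsed)
```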
In addition to the processing in 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, strong emphasis) from the template values and convert them to text.
code
dct2 = {
    key : re.sub(r"''+", '', value)
    for key, value in dct.items()
}
dct2
result
{'Abbreviated name': 'England',
'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
'National flag image': 'Flag of the United Kingdom.svg',
'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
'National emblem link': '([[British coat of arms|National emblem]])',
'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Dieu et mon droit|God and my rights]])',
'National anthem': '[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />God save the queen<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}',
'Map image': 'Europe-UK.svg',
'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
'Official terminology': '[[English]]',
'capital': '[[London]](infact)',
'Largest city': 'London',
'Head of state title': '[[British monarch|Queen]]',
'Name of head of state': '[[Elizabeth II]]',
'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
'Prime Minister's name': '[[Boris Johnson]]',
'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
'Area ranking': '76',
'Area size': '1 E11',
'Area value': '244,820',
'Water area ratio': '1.3%',
'Demographic year': '2018',
'Population ranking': '22',
'Population size': '1 E7',
'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
'Population density value': '271',
'GDP statistics year yuan': '2012',
'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
'GDP Statistics Year MER': '2012',
'GDP ranking MER': '6',
'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
'GDP statistical year': '2012',
'GDP ranking': '6',
'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
'Founding form': 'Founding of the country',
'Established form 1': '[[Kingdom of England]]/[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707合同法]]Until)',
'Date of establishment 1': '927/843',
'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
'Date of establishment 2': '1707{{0}}May{{0}}1 day',
'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />([[Joint law(1800)|1800合同法]])',
'Date of establishment 3': '1801{{0}}January{{0}}1 day',
'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
'Date of establishment 4': '1927{{0}}April 12',
'currency': '[[Sterling pound|UK pounds]](£)',
'Currency code': 'GBP',
'Time zone': '±0',
'Daylight saving time': '+1',
'ISO 3166-1': 'GB / GBR',
'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
'International call number': '44',
'Note': '<references/>'}
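MediaWiki writes emphasis as runs of apostrophes: two for italics, three for bold, five for bold italics. The pattern above removes any run of two or more. As a small sketch (remove_emphasis and the sample strings are made up for illustration), the run length can be bounded explicitly, and a single apostrophe inside a word is left alone either way:

```python
import re

def remove_emphasis(value):
    # '' = italic, ''' = bold, ''''' = bold italic in MediaWiki markup
    return re.sub(r"'{2,5}", '', value)

print(remove_emphasis("''God save the queen''"))
print(remove_emphasis("the '''United Kingdom''' is Britain's official name"))
```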
In addition to the processing in 26, remove MediaWiki's internal link markup from the template values and convert them to text.
code
def remove_link(x):
    # [[target|option|label]] -> label
    x = re.sub(r'\[\[[^\|\]]+\|[^{}\|\]]+\|([^\]]+)\]\]', r'\1', x)
    # [[target|label]] -> label
    x = re.sub(r'\[\[[^\|\]]+\|([^\]]+)\]\]', r'\1', x)
    # [[target]] -> target
    x = re.sub(r'\[\[([^\]]+)\]\]', r'\1', x)
    return x
dct3 = {
    key : remove_link(value)
    for key, value in dct2.items()
}
dct3
output
{'Abbreviated name': 'England',
'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
'National flag image': 'Flag of the United Kingdom.svg',
'National emblem image': 'British coat of arms',
'National emblem link': '(National emblem)',
'Motto': '{{lang|fr|Dieu et mon droit}}<br />(French:God and my rights)',
'National anthem': '{{lang|en|God Save the Queen}}{{en icon}}<br />God save the queen<br />{{center|File:United States Navy Band - God Save the Queen.ogg}}',
'Map image': 'Europe-UK.svg',
'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
'Official terminology': 'English',
'capital': 'London (virtually)',
'Largest city': 'London',
'Head of state title': 'Queen',
'Name of head of state': 'Elizabeth II',
'Prime Minister's title': 'Prime Minister',
'Prime Minister's name': 'Boris Johnson',
'Other heads of state title 1': 'Aristocratic House Chairman',
'Names of other heads of state 1': 'Norman Fowler',
'Other heads of state title 2': 'Chairman of the House of Commons',
'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
'Other heads of state title 3': 'Chief Justice of Japan',
'Other heads of state name 3': 'Brenda Hale',
'Area ranking': '76',
'Area size': '1 E11',
'Area value': '244,820',
'Water area ratio': '1.3%',
'Demographic year': '2018',
'Population ranking': '22',
'Population size': '1 E7',
'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
'Population density value': '271',
'GDP statistics year yuan': '2012',
'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
'GDP Statistics Year MER': '2012',
'GDP ranking MER': '6',
'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
'GDP statistical year': '2012',
'GDP ranking': '6',
'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
'Founding form': 'Founding of the country',
'Established form 1': 'Kingdom of England / Kingdom of Scotland<br />(Both countries until the 1707 Act)',
'Date of establishment 1': '927/843',
'Established form 2': 'Kingdom of Great Britain established<br />(1707 Act)',
'Date of establishment 2': '1707{{0}}May{{0}}1 day',
'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
'Date of establishment 3': '1801{{0}}January{{0}}1 day',
'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
'Date of establishment 4': '1927{{0}}April 12',
'currency': 'UK pounds(£)',
'Currency code': 'GBP',
'Time zone': '±0',
'Daylight saving time': '+1',
'ISO 3166-1': 'GB / GBR',
'ccTLD': '.uk / .gb<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
'International call number': '44',
'Note': '<references/>'}
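The three substitutions all keep the last |-separated part of a link, so they can be folded into one pattern. This is a sketch of that idea, not the code that produced the output above: strip_links is a hypothetical name, and links whose label itself contains ]] or a nested template with | are not handled.

```python
import re

# Skip any number of "something|" segments, then keep the final segment as the label.
LINK_RE = re.compile(r'\[\[(?:[^|\]]*\|)*([^|\]]+)\]\]')

def strip_links(value):
    return LINK_RE.sub(r'\1', value)

print(strip_links('[[London]]'))
print(strip_links('[[British monarch|Queen]]'))
print(strip_links('[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]'))
```

As with the original function, a bare [[File:name]] link keeps its File: prefix, because the whole target is the only segment.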
In addition to the processing in 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.
I also removed unnecessary parts other than the MediaWiki markup.
code
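def remove_markups(x):
    # {{name|arg|text}} -> text (greedy, so it spans to the last such template)
    x = re.sub(r'{{.*\|.*\|([^}]*)}}', r'\1', x)
    # paired tags such as <ref ...>...</ref> are removed with their content
    x = re.sub(r'<([^>]*)( .*|)>.*</\1>', '', x)
    # self-closing tags such as <br /> or <ref name="..." />
    x = re.sub(r'<[^>]*?/>', '', x)
    x = re.sub(r'\{\{0\}\}', '', x)
    return x
dct4 = {
    key : remove_markups(value)
    for key, value in dct3.items()
}
dct4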
output
{'Abbreviated name': 'England',
'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
'Official country name': 'United Kingdom of Great Britain and Northern Ireland<ref>Official country name other than English:',
'National flag image': 'Flag of the United Kingdom.svg',
'National emblem image': 'British coat of arms',
'National emblem link': '(National emblem)',
'Motto': 'Dieu et mon droit (French:God and my rights)',
'National anthem': 'File:United States Navy Band - God Save the Queen.ogg',
'Map image': 'Europe-UK.svg',
'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
'Official terminology': 'English',
'capital': 'London (virtually)',
'Largest city': 'London',
'Head of state title': 'Queen',
'Name of head of state': 'Elizabeth II',
'Prime Minister's title': 'Prime Minister',
'Prime Minister's name': 'Boris Johnson',
'Other heads of state title 1': 'Aristocratic House Chairman',
'Names of other heads of state 1': 'Norman Fowler',
'Other heads of state title 2': 'Chairman of the House of Commons',
'Other heads of state name 2': 'Lindsay Hoyle',
'Other heads of state title 3': 'Chief Justice of Japan',
'Other heads of state name 3': 'Brenda Hale',
'Area ranking': '76',
'Area size': '1 E11',
'Area value': '244,820',
'Water area ratio': '1.3%',
'Demographic year': '2018',
'Population ranking': '22',
'Population size': '1 E7',
'Population value': '66,435,600',
'Population density value': '271',
'GDP statistics year yuan': '2012',
'GDP value source': '1,547.8 billion',
'GDP Statistics Year MER': '2012',
'GDP ranking MER': '6',
'GDP value MER': '2,433.7 billion',
'GDP statistical year': '2012',
'GDP ranking': '6',
'GDP value': '2,316.2 billion',
'GDP/Man': '36,727',
'Founding form': 'Founding of the country',
'Established form 1': 'Kingdom of England / Kingdom of Scotland (both countries until the 1707 Act)',
'Date of establishment 1': '927/843',
'Established form 2': 'Great Britain Kingdom established (1707 Acts of Union)',
'Date of establishment 2': 'May 1, 1707',
'Established form 3': 'United Kingdom of Great Britain and Ireland established (1800 Acts of Union 1800)',
'Date of establishment 3': 'January 1, 1801',
'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
'Date of establishment 4': 'April 12, 1927',
'currency': 'UK pounds(£)',
'Currency code': 'GBP',
'Time zone': '±0',
'Daylight saving time': '+1',
'ISO 3166-1': 'GB / GBR',
'ccTLD': '.uk / .gb',
'International call number': '44',
'Note': ''}
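Because the first pattern in remove_markups is greedy, a value with several templates collapses to its last one, which is why the national anthem ends up as only the file name. A narrower, per-markup approach is sketched below. It is an illustration under my own assumptions, not the code that produced the output above: strip_markup is a hypothetical name and it only covers {{lang}}, <ref> and <br> markup.

```python
import re

def strip_markup(value):
    # {{lang|xx|text}} -> text
    value = re.sub(r'\{\{lang\|[^|}]+\|([^{}]*)\}\}', r'\1', value)
    # Self-closing references such as <ref name="..." />
    value = re.sub(r'<ref\b[^>]*/>', '', value)
    # Paired references, removed together with their content
    value = re.sub(r'<ref\b[^>]*>.*?</ref>', '', value, flags=re.DOTALL)
    # Line breaks become a single space
    value = re.sub(r'<br\s*/?>', ' ', value)
    return value.strip()

print(strip_markup('{{lang|fr|Dieu et mon droit}}<br />(French)'))
print(strip_markup('2,433.7 billion<ref name="imf-statistics-gdp" />'))
print(strip_markup('66,435,600<ref>{{Cite web|url=https://x|title=T}}</ref>'))
```

Handling each kind of markup with its own pattern is more verbose, but a mistake in one rule is less likely to swallow unrelated text.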
Use the contents of the template to get the URL of the national flag image.
code
import requests
Call the API using requests, following the sample code at the bottom of the page linked here.
code
filename = dct4['National flag image']
session = requests.Session()
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action' : 'query',
    'format' : 'json',
    'prop' : 'imageinfo',
    'titles' : 'File:' + filename,
    'iiprop' : 'url',
}
r = session.get(url=url, params=params)
data = r.json()
pages = data['query']['pages']
flag_url = pages[list(pages)[0]]['imageinfo'][0]['url']
flag_url
output
'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg'
The URL points to the flag image.
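The response handling can be separated from the network call so that it is easy to check offline and does not fail with a KeyError when the file is missing. The following is a sketch under that assumption: build_imageinfo_params and extract_image_url are hypothetical helpers, and fake_response is a made-up dictionary shaped like the response parsed above, not real API output.

```python
def build_imageinfo_params(filename):
    """Query parameters for the imageinfo request used above."""
    return {
        'action': 'query',
        'format': 'json',
        'prop': 'imageinfo',
        'titles': 'File:' + filename,
        'iiprop': 'url',
    }

def extract_image_url(response_json):
    """Return the first image URL found in the response, or None."""
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        for info in page.get('imageinfo', []):
            if 'url' in info:
                return info['url']
    return None

# Made-up response with the same nesting as data['query']['pages'] above.
fake_response = {
    'query': {
        'pages': {
            '-1': {
                'title': 'File:Example.svg',
                'imageinfo': [{'url': 'https://example.org/Example.svg'}],
            }
        }
    }
}
print(extract_image_url(fake_response))
```

In the notebook, extract_image_url(r.json()) could take the place of the manual indexing into pages.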
Language processing 100 knocks 2020 Chapter 4: Morphological analysis