The other day, 100 Language Processing Knock 2020 was released. I have been working in natural language processing for only a year and there is a lot I don't know in detail, but I will solve all the problems and publish my solutions in order to improve my technical skills.

Everything is run in a Jupyter notebook, and I may loosen the restrictions in the problem statements where convenient. The source code is also on GitHub.

Chapter 2 is here.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

・ One article is stored per line in JSON format
・ In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
・ The entire file is compressed with gzip

Create a program that performs the following processing.

Please download the required dataset from here.

The downloaded file shall be placed under data.

20. Read JSON data

Read the JSON file of the Wikipedia article and display the article text about "UK". In problems 21-29, execute on the article text extracted here.

Load the modules for decompressing gzip and parsing JSON.

`code`


import gzip
import json

Read the gzip file line by line and convert each line to a dictionary with json.loads().

`code`


data = []
with gzip.open('data/jawiki-country.json.gz', 'rt') as f:
    for line in f:
        line = line.strip()
        data.append(json.loads(line))

Find the element in data whose title is "UK" and store its article body in text.

`code`


for df in data:
    if df['title'] == 'England':
        text = df['text']
        break
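
As an alternative to loading every article into a list first, the lookup can be done while streaming the file. The following is only a minimal sketch under my own assumptions: the function name `find_article_text` is hypothetical, and the file is assumed to be UTF-8 encoded JSON Lines as described in the problem statement.

```python
import gzip
import json


def find_article_text(path, title):
    """Return the 'text' of the first article whose 'title' matches, else None."""
    # 'rt' opens the gzip stream in text mode; the encoding is an assumption.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines defensively
            article = json.loads(line)
            if article.get('title') == title:
                return article.get('text')
    return None
```

Returning None when nothing matches avoids the situation where text is left undefined because the loop finished without finding the title.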

The beginning of the text looks like this.

{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}（[[Scottish Gaelic]]）
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}（[[Welsh]]）
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}（[[Irish]]）
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}（[[Cornish]]）
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}（[[Scots]]）
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
|National flag image= Flag of the United Kingdom.svg
|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
|National emblem link=（[[British coat of arms|National emblem]]）
|Motto= {{lang|fr|[[Dieu et mon droit]]}}<br />（[[French]]:[[Die

21. Extract rows containing category names

Extract the line that declares the category name in the article.

`code`


import re

`code`


lines = text.splitlines()
for line in lines:
    if re.search(r'\[\[Category:.*\]\]', line):
        print(line)

Use a regular expression to extract every line that matches a pattern such as [[Category:Hogehoge]].

`output`


[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]

22. Extraction of category name

Extract the article category names (by name, not line by line).

`code`


for line in lines:
    lst = re.findall(r'\[\[Category:(.*)\]\]', line)
    for category in lst:
        print(category)

I put the captured substrings in lst and print each of them. This extracts the category names, although a sort key such as |* stays attached when one is present.

`output`


England|*
Commonwealth of Nations
Commonwealth Kingdom|*
G8 member countries
European Union member states|Former
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
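
The output above still contains sort keys such as `|*`. If only the bare category name is wanted, the pattern can stop at the first `|`. This is a sketch rather than the solution used in this article; `CATEGORY_RE` and `category_names` are names I made up for illustration.

```python
import re

# Capture the name up to the first "|" or "]"; an optional "|sort key" is skipped.
CATEGORY_RE = re.compile(r'\[\[Category:([^|\]]+)(?:\|[^\]]*)?\]\]')


def category_names(wikitext):
    """Return category names without their sort keys, in order of appearance."""
    return [m.group(1).strip() for m in CATEGORY_RE.finditer(wikitext)]
```

Because the pattern is applied with finditer to the whole text, there is no need to split the article into lines first.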

23. Section structure

Display the section name and its level contained in the article (for example, 1 if "== section name ==").

`code`


for line in lines:
    if re.search(r'^==.*==$', line):
        level = len(re.match(r'^=*', line).group()) - 1
        title = re.sub(r'[=\s]', '', line)
        print(level, title)

Extract lines that match patterns such as == section ==, === section ===, and so on. The level of a section is the number of = on one side minus one.

`output`


1 Country name
1 history
1 Geography
2 major cities
2 Climate
1 Politics
2 Head of state
2 law
2 Domestic affairs
2 Local administrative divisions
2 Diplomacy / Military
1 economy
2 Mining
2 Agriculture
2 trade
2 real estate
2 Energy policy
2 currencies
2 companies
3 Communication
1 Transportation
2 road
2 Railroad
2 Shipping
2 aviation
1 Science and technology
1 people
2 languages
2 religion
2 Marriage
2 emigration
2 Education
2 Medical
1 culture
2 Food culture
2 Literature
2 Philosophy
2 music
3 popular music
2 movies
2 comedy
2 National flower
2 World Heritage
2 public holidays
2 sports
3 soccer
3 cricket
3 Horse racing
3 motor sports
3 baseball
3 curling
3 Cycling
1 footnote
1 Related items
1 External link

You can also use a backreference, as follows. I haven't measured it properly, but I think it's faster. Which version is more readable depends on the person, so use whichever is easier for you to understand.

`code`


for line in lines:
    if x := re.match(r'^(==+)(.*)\1$', line):
        print(len(x[1])-1, x[2].strip())

I tried using the walrus operator (:=) here, which requires Python 3.8 or later.
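
The same backreference idea can also be applied to the whole article text at once with re.MULTILINE, instead of looping over lines. The sketch below is an assumption-laden variant, not the code used in this article: `HEADING_RE` and `section_levels` are hypothetical names, and headings are assumed to occupy a whole line.

```python
import re

# "^" and "$" match at line boundaries because of re.MULTILINE.
# \1 requires the closing run of "=" to equal the opening run.
HEADING_RE = re.compile(r'^(={2,})[ \t]*(.+?)[ \t]*\1[ \t]*$', re.MULTILINE)


def section_levels(wikitext):
    """Return (level, title) pairs, where "== x ==" is level 1."""
    return [(len(m.group(1)) - 1, m.group(2)) for m in HEADING_RE.finditer(wikitext)]
```

Unlike removing every whitespace character from the line, this keeps spaces inside a title and only trims the padding next to the = signs.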

Since the problem asks for the section "structure", I also implemented something like the tree command.

`code`


def get_sections():
    return [
        (
            len(re.match(r'^=*', line).group()) - 1,
            re.sub(r'[=\s]', '', line)
        )
        for line in lines
        if re.search(r'^==.*==$', line)
    ]

First, collect the section headings from lines as a list of (level, section name) pairs.

`code`


class Section(list):
    def __init__(self, title):
        self.title = title
        super().__init__()
        
    def last(self):
        return self[-1]
    
    def add(self, level, title):
        if level == 1:
            self.append(Section(title))
        else:
            self[-1].add(level-1, title)
            
    def tree_lines(self, head):
        lines = []
        last = len(self) - 1
        for i, x in enumerate(self):
            line = head
            line += '└' if i == last else '├'
            line += x.title
            lines.append(line)
            lines += (x.tree_lines(head + ('　' if i == last else '│')))
        return lines
    
    def __repr__(self):
        return '\n'.join(self.tree_lines(''))

Create a class that holds sections recursively. Section inherits from list, so a level 1 section can hold its level 2 sections as its own elements.

`code`


root = Section('root')
for level, title in get_sections():
    root.add(level, title)
root

The sections obtained by get_sections() are inserted recursively from the root section using the `add` method. The `__repr__` method converts the tree to a string recursively through tree_lines.

`output`


├ Country name
├ History
├ Geography
│ ├ Major cities
│ └ Climate
├ Politics
│ ├ Head of state
│ ├ Law
│ ├ Internal affairs
│ ├ Local administrative division
│ └ Diplomacy / Military
├ Economy
│ ├ Mining
│ ├ Agriculture
│ ├ Trade
│ ├ Real estate
│ ├ Energy policy
│ ├ Currency
│ └ Company
│ 　 └ Communication
├ Transportation
│ ├ Road
│ ├ Railway
│ ├ Shipping
│ └ Aviation
├ Science and technology
├ People
│ ├ Language
│ ├ Religion
│ ├ Marriage
│ ├ Emigration
│ ├ Education
│ └ Medical
├ Culture
│ ├ Food culture
│ ├ Literature
│ ├ Philosophy
│ ├ Music
│ │ └ Popular music
│ ├ Movie
│ ├ Comedy
│ ├ National flower
│ ├ World Heritage Site
│ ├ Holidays
│ └ Sports
│ 　 ├ Soccer
│ 　 ├ Cricket
│ 　 ├ Horse racing
│ 　 ├ Motor sports
│ 　 ├ Baseball
│ 　 ├ Curling
│ 　 └ Bicycle competition
├ Footnote
├ Related items
└ External link

Under Sports, no vertical rule is drawn in the child column because Sports is the last child of its parent section. This is implemented by changing the rule characters printed before the title only for the last child section.
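
The `add` method above relies on `self[-1]`, so it assumes every level n heading is preceded by a level n-1 heading; otherwise it would raise an IndexError on an empty list. A stack-based builder does not need that assumption. The following is a rough sketch with hypothetical names (`build_tree`, `render`) and ASCII spaces for indentation, not a drop-in replacement for the class.

```python
def build_tree(sections):
    """Nest (level, title) pairs into [title, children] nodes using a stack."""
    root = ['root', []]
    stack = [(0, root)]  # (level, node) for each currently open ancestor
    for level, title in sections:
        # Close every open section that is not a strict ancestor of this one.
        while stack[-1][0] >= level:
            stack.pop()
        node = [title, []]
        stack[-1][1][1].append(node)
        stack.append((level, node))
    return root


def render(node, head=''):
    """Render the children of node as tree-style lines."""
    lines = []
    children = node[1]
    for i, child in enumerate(children):
        is_last = i == len(children) - 1
        lines.append(head + ('└ ' if is_last else '├ ') + child[0])
        # Below a last child, the column is left blank instead of drawing '│'.
        lines += render(child, head + (' ' * 3 if is_last else '│' + ' ' * 2))
    return lines
```

A usage example would be `print('\n'.join(render(build_tree(get_sections()))))`, assuming get_sections() from above is available.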

24. Extracting file references

Extract all the media files referenced from the article.

`code`


for line in lines:
    lst = re.findall(r'\[\[File:([^|\]]*)', line)
    for x in lst:
        print(x)

The filename part of patterns such as [[File:filename|...]] is extracted.

`output`


Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Birmingham Skyline from Edgbaston Cricket Ground crop.jpg
Leeds CBD at night.jpg
Glasgow and the Clyde from the air (geograph 4665720).jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
Airbus A380-841 G-XLEB British Airways (10424102995).jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

`code`


for i, line in enumerate(lines):
    if line.startswith('{{Basic information'):
        start = i
    elif line.startswith('}}'):
        end = i
        break

Strictly speaking, nested templates are beyond what regular languages can express, so it would be preferable to use a proper MediaWiki parser. Here I simply determine the line range of the basic information template.

`code`


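# Collect (field name, value) pairs from lines of the form "|name = value".
template = [
    re.findall(r'\|([^=]*)=(.*)', line)
    for line in lines[start+1 : end]
]
# Lines where the pattern does not match yield [] and are skipped.
template = [x[0] for x in template if x]
dct = {
    key.strip() : value.strip()
    for key, value in template
}
dct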

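Store the contents in a dictionary. Note that a value spanning multiple lines, such as the official country name, keeps only its first line.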

`output`


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
 'National emblem link': '（[[British coat of arms|National emblem]]）',
 'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />（[[French]]:[[Dieu et mon droit|God and my rights]]）',
 'National anthem': "[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />''God save the queen''<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}",
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': '[[English]]',
 'capital': '[[London]](infact)',
 'Largest city': 'London',
 'Head of state title': '[[British monarch|Queen]]',
 'Name of head of state': '[[Elizabeth II]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'Prime Minister's name': '[[Boris Johnson]]',
 'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
 'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
 'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
 'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': '[[Kingdom of England]]／[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707合同法]]Until)',
 'Date of establishment 1': '927/843',
 'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />（[[Joint law(1800)|1800合同法]]）',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': "Current country name "'''United Kingdom of Great Britain and Northern Ireland'''"change to",
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': '[[Sterling pound|UK pounds]](£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}
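
In the output above, the value of the official country name is cut off after its first line because the fields are parsed line by line. One way to keep multi-line values is to treat a field as running until the next line that starts with `|`. This is only a sketch: `parse_template_fields` is a hypothetical helper, and it assumes that no nested template inside a value begins a line with `|`.

```python
import re

# A field starts at a line beginning with "|" and runs until the next such
# line or the end of the body. DOTALL lets the value span several lines.
FIELD_RE = re.compile(
    r'^\|([^=\n]+?)\s*=\s*(.*?)(?=^\||\Z)',
    re.MULTILINE | re.DOTALL,
)


def parse_template_fields(body):
    """Map field names to values for the inner text of one template."""
    return {
        m.group(1).strip(): m.group(2).strip()
        for m in FIELD_RE.finditer(body)
    }
```

With the variables from this article, the body could be obtained as `'\n'.join(lines[start+1 : end])`.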

26. Removal of highlighted markup

In addition to the processing in 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, strong emphasis) from the template values and convert them to text.

`code`


dct2 = {
    key : re.sub(r"''+", '', value)
    for key, value in dct.items()
}
dct2

`result`


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
 'National emblem link': '（[[British coat of arms|National emblem]]）',
 'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />（[[French]]:[[Dieu et mon droit|God and my rights]]）',
 'National anthem': '[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />God save the queen<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': '[[English]]',
 'capital': '[[London]](infact)',
 'Largest city': 'London',
 'Head of state title': '[[British monarch|Queen]]',
 'Name of head of state': '[[Elizabeth II]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'Prime Minister's name': '[[Boris Johnson]]',
 'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
 'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
 'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
 'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': '[[Kingdom of England]]／[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707合同法]]Until)',
 'Date of establishment 1': '927/843',
 'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />（[[Joint law(1800)|1800合同法]]）',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': '[[Sterling pound|UK pounds]](£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}
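
MediaWiki writes emphasis as runs of two, three, or five apostrophes, so the pattern can be limited to that range. The snippet below is a small sketch with a hypothetical function name, `strip_emphasis`, showing that a single apostrophe inside a word is left untouched.

```python
import re


def strip_emphasis(value):
    """Remove MediaWiki emphasis quotes ('' to ''''') from a string."""
    # Runs of 2 to 5 apostrophes are markup; a lone apostrophe is kept.
    return re.sub(r"'{2,5}", '', value)
```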

27. Removal of internal links

In addition to the processing in 26, remove MediaWiki's internal link markup from the template values and convert them to text.

`code`


def remove_link(x):
    x = re.sub(r'\[\[[^\|\]]+\|[^{}\|\]]+\|([^\]]+)\]\]', r'\1', x)
    x = re.sub(r'\[\[[^\|\]]+\|([^\]]+)\]\]', r'\1', x)
    x = re.sub(r'\[\[([^\]]+)\]\]', r'\1', x)
    return x

dct3 = {
    key : remove_link(value)
    for key, value in dct2.items()
}
dct3

`output`


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': 'British coat of arms',
 'National emblem link': '(National emblem)',
 'Motto': '{{lang|fr|Dieu et mon droit}}<br />(French:God and my rights)',
 'National anthem': '{{lang|en|God Save the Queen}}{{en icon}}<br />God save the queen<br />{{center|File:United States Navy Band - God Save the Queen.ogg}}',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': 'English',
 'capital': 'London (virtually)',
 'Largest city': 'London',
 'Head of state title': 'Queen',
 'Name of head of state': 'Elizabeth II',
 'Prime Minister's title': 'Prime Minister',
 'Prime Minister's name': 'Boris Johnson',
 'Other heads of state title 1': 'Aristocratic House Chairman',
 'Names of other heads of state 1': 'Norman Fowler',
 'Other heads of state title 2': 'Chairman of the House of Commons',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': 'Chief Justice of Japan',
 'Other heads of state name 3': 'Brenda Hale',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': 'Kingdom of England / Kingdom of Scotland<br />(Both countries until the 1707 Act)',
 'Date of establishment 1': '927/843',
 'Established form 2': 'Kingdom of Great Britain established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': 'UK pounds(£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '.uk / .gb<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}

28. Removal of MediaWiki markup

In addition to the processing in 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

I also removed unnecessary parts other than the MediaWiki markup.

`code`


def remove_markups(x):
    x = re.sub(r'{{.*\|.*\|([^}]*)}}', r'\1', x)
    x = re.sub(r'<([^>]*)( .*|)>.*</\1>', '', x)
    x = re.sub(r'<[^>]*?/>', '', x)
    x = re.sub(r'\{\{0\}\}', '', x)
    return x

dct4 = {
    key : remove_markups(value)
    for key, value in dct3.items()
}
dct4

`output`


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': 'United Kingdom of Great Britain and Northern Ireland<ref>Official country name other than English:',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': 'British coat of arms',
 'National emblem link': '(National emblem)',
 'Motto': 'Dieu et mon droit (French:God and my rights)',
 'National anthem': 'File:United States Navy Band - God Save the Queen.ogg',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': 'English',
 'capital': 'London (virtually)',
 'Largest city': 'London',
 'Head of state title': 'Queen',
 'Name of head of state': 'Elizabeth II',
 'Prime Minister's title': 'Prime Minister',
 'Prime Minister's name': 'Boris Johnson',
 'Other heads of state title 1': 'Aristocratic House Chairman',
 'Names of other heads of state 1': 'Norman Fowler',
 'Other heads of state title 2': 'Chairman of the House of Commons',
 'Other heads of state name 2': 'Lindsay Hoyle',
 'Other heads of state title 3': 'Chief Justice of Japan',
 'Other heads of state name 3': 'Brenda Hale',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion',
 'GDP/Man': '36,727',
 'Founding form': 'Founding of the country',
 'Established form 1': 'Kingdom of England / Kingdom of Scotland (both countries until the 1707 Act)',
 'Date of establishment 1': '927/843',
 'Established form 2': 'Great Britain Kingdom established (1707 Acts of Union)',
 'Date of establishment 2': 'May 1, 1707',
 'Established form 3': 'United Kingdom of Great Britain and Ireland established (1800 Acts of Union 1800)',
 'Date of establishment 3': 'January 1, 1801',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': 'April 12, 1927',
 'currency': 'UK pounds(£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '.uk / .gb',
 'International call number': '44',
 'Note': ''}

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image.

`code`


import requests

Call the MediaWiki API using requests. I referred to the code at the bottom of here.

`code`


filename = dct4['National flag image']

session = requests.Session()
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action' : 'query',
    'format' : 'json',
    'prop' : 'imageinfo',
    'titles' : 'File:' + filename,
    'iiprop' : 'url',
}
r = session.get(url=url, params=params)
data = r.json()
pages = data['query']['pages']
flag_url = pages[list(pages)[0]]['imageinfo'][0]['url']
flag_url

`output`


'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg'
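
The lookup `pages[list(pages)[0]]['imageinfo'][0]['url']` raises an exception when the file is not found or the response has an unexpected shape. A more defensive extraction could look like the sketch below. The helper name `extract_image_url` is hypothetical, and the response layout is assumed to be the same `query` / `pages` / `imageinfo` nesting used in the code above.

```python
def extract_image_url(response_json):
    """Return the first image URL found in an imageinfo response, else None."""
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        info = page.get('imageinfo')
        if info:
            # Each imageinfo entry is a dict; 'url' is present when iiprop=url.
            return info[0].get('url')
    return None
```

With the code above, this would be called as `extract_image_url(r.json())`.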

The URL points to the image below. <img src="https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg" width="300">

Next is Chapter 4

Language processing 100 knocks 2020 Chapter 4: Morphological analysis
