[PYTHON] 100 Language Processing Knock 2020 Chapter 3: Regular Expressions

100 Language Processing Knock 2020 was released the other day. I have only been working in natural language processing for a year and there is a lot I don't know in detail, but I will solve all the problems and publish my solutions in order to improve my technical skills.

Everything is run in Jupyter Notebook, and I may bend the restrictions in the problem statements when convenient. The source code is also on GitHub.

Chapter 2 is here.

The environment is Python 3.8.2 and Ubuntu 18.04.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

・ One article is stored per line in JSON format
・ In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
・ The entire file is compressed with gzip

Create a program that performs the following processing.

Please download the required dataset from here.

Place the downloaded file in the data directory.

20. Read JSON data

Read the JSON file of Wikipedia articles and display the body of the article about "UK". Problems 21-29 are run on the article body extracted here.

Import the modules for reading gzip files and parsing JSON.

code


import gzip
import json

Read the gzip file line by line and convert each line to a dictionary with json.loads().

code


data = []
with gzip.open('data/jawiki-country.json.gz', 'rt') as f:
    for line in f:
        line = line.strip()
        data.append(json.loads(line))

Find the element whose title is "UK" in data and store its body in text. (In the dataset itself the title is the Japanese イギリス, which appears here as 'England'.)

code


for df in data:
    if df['title'] == 'England':
        text = df['text']
        break
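Since only one article is needed, the file can also be scanned lazily and stopped at the first match instead of loading every line into a list. The following is a minimal sketch of that idea. The helper name find_article is hypothetical, and passing encoding='utf-8' explicitly is an assumption about the file rather than something stated above.

```python
import gzip
import json


def find_article(path, title):
    """Return the 'text' of the first article whose 'title' matches, else None."""
    # 'rt' opens the gzip stream in text mode; the encoding is an assumption.
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            article = json.loads(line)
            if article['title'] == title:
                return article['text']
    # No article with that title: return None instead of leaving a name unset.
    return None
```

Calling find_article('data/jawiki-country.json.gz', title) with the title used above should give the same text, and it returns None when the title is not found.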

The beginning of the text looks like this:

{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
|National flag image= Flag of the United Kingdom.svg
|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
|National emblem link=([[British coat of arms|National emblem]])
|Motto= {{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Die

21. Extract rows containing category names

Extract the lines that declare category names in the article.

code


import re

code


lines = text.splitlines()
for line in lines:
    if re.search(r'\[\[Category:.*\]\]', line):
        print(line)

Use a regular expression to extract every line that matches a pattern such as [[Category:Hogehoge]].

output


[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]
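The same lines can probably be collected without splitting the text first, by letting ^ and $ match at line boundaries. This is only a sketch on a made-up sample string, not the real article.

```python
import re

# A small stand-in for the article body (not the real data).
sample = (
    "Some article text.\n"
    "[[Category:Island country]]\n"
    "[[Category:England|*]]\n"
    "See also [[London]].\n"
)

# With re.MULTILINE, ^ and $ match at every line boundary, and '.' does not
# cross newlines, so each match is one whole line containing a category link.
category_lines = re.findall(r'^.*\[\[Category:.*\]\].*$', sample, flags=re.MULTILINE)
print(category_lines)
```

Whether this is clearer than the loop over splitlines() is a matter of taste; both select the same lines on this sample.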

22. Extraction of category name

Extract the article category names (by name, not line by line).

code


for line in lines:
    lst = re.findall(r'\[\[Category:(.*)\]\]', line)
    for category in lst:
        print(category)

re.findall puts the captured substrings in lst, and all of them are printed. This extracts the category names. Because (.*) is greedy, a sort key such as |* stays attached to the name.

output


England|*
Commonwealth of Nations
Commonwealth Kingdom|*
G8 member countries
European Union member states|Former
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801
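If the sort key after | is not wanted, one option is to stop the capture at the first | or ]. The sketch below assumes that a category link has the form [[Category:name]] or [[Category:name|sort key]]; the sample lines are taken from the output of problem 21.

```python
import re

sample_lines = [
    '[[Category:England|*]]',
    '[[Category:Commonwealth of Nations]]',
    '[[Category:European Union member states|Former]]',
]

# Capture up to the first '|' or ']' so the optional sort key is left out.
pattern = re.compile(r'\[\[Category:([^|\]]+)(?:\|[^\]]*)?\]\]')

names = [name for line in sample_lines for name in pattern.findall(line)]
print(names)
```

The non-capturing group (?:\|[^\]]*)? consumes the sort key when it is present, so findall returns only the name.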

23. Section structure

Display the section name and its level contained in the article (for example, 1 if "== section name ==").

code


for line in lines:
    if re.search(r'^==.*==$', line):
        level = len(re.match(r'^=*', line).group()) - 1
        title = re.sub(r'[=\s]', '', line)
        print(level, title)

Extract the lines that match patterns such as == section ==, === section ===, and so on. The level of a section is the number of leading = minus 1.

output


1 Country name
1 history
1 Geography
2 major cities
2 Climate
1 Politics
2 Head of state
2 law
2 Domestic affairs
2 Local administrative divisions
2 Diplomacy / Military
1 economy
2 Mining
2 Agriculture
2 trade
2 real estate
2 Energy policy
2 currencies
2 companies
3 Communication
1 Transportation
2 road
2 Railroad
2 Shipping
2 aviation
1 Science and technology
1 people
2 languages
2 religion
2 Marriage
2 emigration
2 Education
2 Medical
1 culture
2 Food culture
2 Literature
2 Philosophy
2 music
3 popular music
2 movies
2 comedy
2 National flower
2 World Heritage
2 public holidays
2 sports
3 soccer
3 cricket
3 Horse racing
3 motor sports
3 baseball
3 curling
3 Cycling
1 footnote
1 Related items
1 External link

You can also use a backreference, as follows. I haven't measured it properly, but I think it is faster. Which version is more readable depends on the person, so I think you can use whichever you find easier to understand.

code


for line in lines:
    if x := re.match(r'^(==+)(.*)\1$', line):
        print(len(x[1])-1, x[2].strip())

Here I also tried the walrus operator (:=), which is available from Python 3.8.
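To check that the two versions agree before comparing their speed, both can be wrapped in functions and run on the same input. This is a sketch on made-up heading lines; the function names are hypothetical.

```python
import re

sample_lines = ['==Geography==', '===Climate===', 'plain text', '== History ==']


def sections_two_step(lines):
    """Find heading lines, then count the leading '=' and strip the markup."""
    result = []
    for line in lines:
        if re.search(r'^==.*==$', line):
            level = len(re.match(r'^=*', line).group()) - 1
            result.append((level, re.sub(r'[=\s]', '', line)))
    return result


def sections_backref(lines):
    """Use a backreference to require the same run of '=' on both sides."""
    result = []
    for line in lines:
        m = re.match(r'^(==+)(.*)\1$', line)
        if m:
            result.append((len(m.group(1)) - 1, m.group(2).strip()))
    return result


print(sections_two_step(sample_lines))
print(sections_backref(sample_lines))
```

Note that the first version deletes every whitespace character in the title, while the second only strips the ends, so titles containing spaces would differ. For a rough speed comparison, timeit.timeit(lambda: sections_backref(lines), number=1000) and the same call for the other function is one way to measure it.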

Since the problem is about the section "structure", I also implemented something like the tree command.

code


def get_sections():
    return [
        (
            len(re.match(r'^=*', line).group()) - 1,
            re.sub(r'[=\s]', '', line)
        )
        for line in lines
        if re.search(r'^==.*==$', line)
    ]

First, extract the sections from lines as a list of (level, section name) pairs.

code


class Section(list):
    def __init__(self, title):
        self.title = title
        super().__init__()
        
    def last(self):
        return self[-1]
    
    def add(self, level, title):
        if level == 1:
            self.append(Section(title))
        else:
            self[-1].add(level-1, title)
            
    def tree_lines(self, head):
        lines = []
        last = len(self) - 1
        for i, x in enumerate(self):
            line = head
            line += '└' if i == last else '├'
            line += x.title
            lines.append(line)
            lines += (x.tree_lines(head + (' ' if i == last else '│')))
        return lines
    
    def __repr__(self):
        return '\n'.join(self.tree_lines(''))

Create a class whose objects hold sections recursively. A Section inherits from list, so a level 1 section can hold its level 2 sections inside itself, and so on for deeper levels.

code


root = Section('root')
for level, title in get_sections():
    root.add(level, title)
root

The list of sections obtained by get_sections() is inserted recursively, starting from the root section, using the add method. The __repr__ method converts the tree to a string recursively.

output


├ Country name
├ History
├ Geography
│ ├ Major cities
│ └ Climate
├ Politics
│ ├ Head of state
│ ├ Law
│ ├ Internal affairs
│ ├ Local administrative division
│ └ Diplomacy / Military
├ Economy
│ ├ Mining
│ ├ Agriculture
│ ├ Trade
│ ├ Real estate
│ ├ Energy policy
│ ├ Currency
│ └ Company
│   └ Communication
├ Transportation
│ ├ Road
│ ├ Railway
│ ├ Shipping
│ └ Aviation
├ Science and technology
├ People
│ ├ Language
│ ├ Religion
│ ├ Marriage
│ ├ Emigration
│ ├ Education
│ └ Medical
├ Culture
│ ├ Food culture
│ ├ Literature
│ ├ Philosophy
│ ├ Music
│ │ └ Popular music
│ ├ Movie
│ ├ Comedy
│ ├ National flower
│ ├ World Heritage Site
│ ├ Holidays
│ └ Sports
│   ├ Soccer
│   ├ Cricket
│   ├ Horse racing
│   ├ Motor sports
│   ├ Baseball
│   ├ Curling
│   └ Bicycle competition
├ Footnote
├ Related items
└ External link

Under Sports, no vertical rule is drawn in the level 2 column because Sports is the last child of its parent section. This is implemented by changing the rule that is drawn before the title, and passed down to the children, only for the last of the child sections.
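The add method above relies on the levels never skipping (a level 3 heading directly under a level 1 heading would index an empty list). As an alternative sketch, not the class used above, the same tree can be built with an explicit stack that tolerates such input. The function names and the dictionary layout are hypothetical, and levels are assumed to be 1 or greater.

```python
def build_tree(sections):
    """Nest (level, title) pairs into {'title': ..., 'children': [...]} nodes."""
    root = {'title': 'root', 'children': []}
    stack = [(0, root)]  # path of (level, node) from the root to the latest node
    for level, title in sections:
        node = {'title': title, 'children': []}
        # Pop until the top of the stack is shallower: that node is the parent.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]['children'].append(node)
        stack.append((level, node))
    return root


def render(node, head=''):
    """Return tree-style lines; the last child gets '└' and a blank column."""
    lines = []
    children = node['children']
    for i, child in enumerate(children):
        is_last = i == len(children) - 1
        lines.append(head + ('└ ' if is_last else '├ ') + child['title'])
        lines += render(child, head + ('  ' if is_last else '│ '))
    return lines


sample = [(1, 'A'), (2, 'B'), (2, 'C'), (1, 'D'), (2, 'E'), (3, 'F')]
print('\n'.join(render(build_tree(sample))))
```

As with the class above, the column under a last child is left blank, which is why nothing is drawn beneath D in this sample.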

24. Extracting file references

Extract all the media files referenced from the article.

code


for line in lines:
    lst = re.findall(r'\[\[File:([^|\]]*)', line)
    for x in lst:
        print(x)

This extracts the "something" part of links of the form [[File:something|...]].

output


Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Birmingham Skyline from Edgbaston Cricket Ground crop.jpg
Leeds CBD at night.jpg
Glasgow and the Clyde from the air (geograph 4665720).jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
Airbus A380-841 G-XLEB British Airways (10424102995).jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg
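The key part of the pattern is the character class [^|\]]*, which stops the capture at the first | or ]. A small check on a made-up string (not the article itself) shows the effect:

```python
import re

sample = '[[File:Royal Coat of Arms.svg|85px|coat of arms]] text [[File:UKpop.svg]]'

# [^|\]]* stops at the first '|' (start of the options) or ']' (end of the
# link), so only the file name itself is captured.
files = re.findall(r'\[\[File:([^|\]]*)', sample)
print(files)
```

Both a link with options and a bare link yield just the file name.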

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

code


for i, line in enumerate(lines):
    if line.startswith('{{Basic information'):
        start = i
    elif line.startswith('}}'):
        end = i
        break

Strictly speaking, nested templates are beyond what regular languages can describe. I think it would be preferable to use a parser for the markup or something similar, but here I simply determine the range of lines that make up the basic information template.
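One way to handle the nesting without a full parser is to count {{ and }} pairs. The following is a rough sketch under the assumption that braces in the text only ever appear as balanced {{ ... }} template delimiters; the function name extract_template is hypothetical.

```python
def extract_template(text, prefix):
    """Return the template that starts with `prefix`, matched by brace depth.

    Returns None if the prefix is absent or the template is never closed.
    """
    start = text.find(prefix)
    if start == -1:
        return None
    depth = 0
    i = start
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == '{{':
            depth += 1      # entering a (possibly nested) template
            i += 2
        elif pair == '}}':
            depth -= 1      # leaving a template
            i += 2
            if depth == 0:
                return text[start:i]
        else:
            i += 1
    return None


sample = 'x {{Basic information Country\n|a={{lang|en|A}}\n|b=B\n}} tail {{other}}'
print(extract_template(sample, '{{Basic information'))
```

Unlike the line-based search, this does not depend on the closing }} being at the start of a line.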

code


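# Collect (field name, value) pairs from lines of the form "|name = value".
template = [
    re.findall(r'\|([^=]*)=(.*)', line)
    for line in lines[start+1 : end]
]
# Drop lines without a field and unwrap the single match per line.
template = [x[0] for x in template if x]
dct = {
    key.strip() : value.strip()
    for key, value in template
}
dct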

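Store the field names and values in a dictionary. Because each line is processed separately, a value that spans several lines, such as the official country name, keeps only its first line.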

output


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
 'National emblem link': '([[British coat of arms|National emblem]])',
 'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Dieu et mon droit|God and my rights]])',
 'National anthem': "[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />''God save the queen''<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}",
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': '[[English]]',
 'capital': '[[London]](infact)',
 'Largest city': 'London',
 'Head of state title': '[[British monarch|Queen]]',
 'Name of head of state': '[[Elizabeth II]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'Prime Minister's name': '[[Boris Johnson]]',
 'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
 'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
 'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
 'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': '[[Kingdom of England]]/[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707 Acts of Union]]Until)',
 'Date of establishment 1': '927/843',
 'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />([[Joint law(1800)|1800 Acts of Union]])',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': "Current country name "'''United Kingdom of Great Britain and Northern Ireland'''"change to",
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': '[[Sterling pound|UK pounds]](£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}
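In the output above, the value of 'Official country name' stops at the end of its first line. If whole multi-line values are wanted, one possible approach is to match on the full template text and let a field run until the next line that begins with |. This is a sketch on a shortened, made-up template; it assumes that no nested template has a line of its own starting with | or }}.

```python
import re

# A shortened, made-up template body with one value that spans two lines.
body = (
    '{{Basic information Country\n'
    '|Abbreviated name=England\n'
    '|Official country name= {{lang|en|UK}}<ref>note:<br />\n'
    '*{{lang|gd|X}}</ref>\n'
    '|Motto=M\n'
    '}}'
)

# A field starts at a line beginning with '|' and runs until the next such
# line, a line beginning with '}}', or the end of the string.
field = re.compile(
    r'^\|([^=\n]+?)[ \t]*=[ \t]*(.*?)(?=\n\||\n\}\}|\Z)',
    flags=re.MULTILINE | re.DOTALL,
)
fields = {key: value for key, value in field.findall(body)}
print(fields['Official country name'])
```

re.DOTALL lets the value cross newlines, and the lookahead keeps the next field out of the match.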

26. Removal of emphasis markup

When performing the processing in 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, strong emphasis) from the template values and convert them to text.

code


dct2 = {
    key : re.sub(r"''+", '', value)
    for key, value in dct.items()
}
dct2
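To see the substitution in isolation, here is a small sketch. It bounds the run at two to five apostrophes, on the assumption that the three emphasis forms are written with 2, 3, and 5 apostrophes; the code above simply removes any run of two or more. The function name is hypothetical.

```python
import re


def remove_emphasis(value):
    # Runs of two to five apostrophes are treated as emphasis markers;
    # a single apostrophe (as in "Queen's") is left alone.
    return re.sub(r"'{2,5}", '', value)


print(remove_emphasis("'''United Kingdom'''"))
print(remove_emphasis("''God save the queen''"))
print(remove_emphasis("Queen's title"))
```

A single apostrophe inside ordinary text is not touched by either pattern.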

output


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
 'National emblem link': '([[British coat of arms|National emblem]])',
 'Motto': '{{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Dieu et mon droit|God and my rights]])',
 'National anthem': '[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />God save the queen<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': '[[English]]',
 'capital': '[[London]](infact)',
 'Largest city': 'London',
 'Head of state title': '[[British monarch|Queen]]',
 'Name of head of state': '[[Elizabeth II]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'Prime Minister's name': '[[Boris Johnson]]',
 'Other heads of state title 1': '[[House of Peers(England)|Aristocratic House Chairman]]',
 'Names of other heads of state 1': '[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]',
 'Other heads of state title 2': '[[House of Commons(England)|Chairman of the House of Commons]]',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': '[[United Kingdom Supreme Court|Chief Justice of Japan]]',
 'Other heads of state name 3': '[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': '[[Kingdom of England]]/[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707 Acts of Union]]Until)',
 'Date of establishment 1': '927/843',
 'Established form 2': '[[Kingdom of Great Britain]]Established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': '[[United Kingdom of Great Britain and Ireland]]Established<br />([[Joint law(1800)|1800 Acts of Union]])',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': '[[Sterling pound|UK pounds]](£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}

27. Removal of internal links

In addition to the processing in 26, remove MediaWiki's internal link markup from the template values and convert them to text.

code


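def remove_link(x):
    # [[a|b|c]] (e.g. a file link with one option) -> c
    x = re.sub(r'\[\[[^\|\]]+\|[^{}\|\]]+\|([^\]]+)\]\]', r'\1', x)
    # [[article|label]] -> label
    x = re.sub(r'\[\[[^\|\]]+\|([^\]]+)\]\]', r'\1', x)
    # [[article]] -> article
    x = re.sub(r'\[\[([^\]]+)\]\]', r'\1', x)
    return x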

dct3 = {
    key : remove_link(value)
    for key, value in dct2.items()
}
dct3
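The three substitutions are easier to follow on individual values. The snippet below repeats the function so that it runs on its own and applies it to a few values taken from the dictionary above.

```python
import re


def remove_link(x):
    x = re.sub(r'\[\[[^\|\]]+\|[^{}\|\]]+\|([^\]]+)\]\]', r'\1', x)  # [[a|b|c]] -> c
    x = re.sub(r'\[\[[^\|\]]+\|([^\]]+)\]\]', r'\1', x)              # [[a|b]] -> b
    x = re.sub(r'\[\[([^\]]+)\]\]', r'\1', x)                        # [[a]] -> a
    return x


samples = [
    '[[London]]',
    '[[British monarch|Queen]]',
    '[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]',
    '[[Kingdom of England]]/[[Kingdom of scotland]]',
]
unlinked = [remove_link(s) for s in samples]
for before, after in zip(samples, unlinked):
    print(before, '->', after)
```

The order matters: the three-field pattern has to run before the two-field one, otherwise a file link would keep its option and caption together.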

output


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': 'British coat of arms',
 'National emblem link': '(National emblem)',
 'Motto': '{{lang|fr|Dieu et mon droit}}<br />(French:God and my rights)',
 'National anthem': '{{lang|en|God Save the Queen}}{{en icon}}<br />God save the queen<br />{{center|File:United States Navy Band - God Save the Queen.ogg}}',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': 'English',
 'capital': 'London (virtually)',
 'Largest city': 'London',
 'Head of state title': 'Queen',
 'Name of head of state': 'Elizabeth II',
 'Prime Minister's title': 'Prime Minister',
 'Prime Minister's name': 'Boris Johnson',
 'Other heads of state title 1': 'Aristocratic House Chairman',
 'Names of other heads of state 1': 'Norman Fowler',
 'Other heads of state title 2': 'Chairman of the House of Commons',
 'Other heads of state name 2': '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
 'Other heads of state title 3': 'Chief Justice of Japan',
 'Other heads of state name 3': 'Brenda Hale',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion<ref name="imf-statistics-gdp">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion<ref name="imf-statistics-gdp" />',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion<ref name="imf-statistics-gdp" />',
 'GDP/Man': '36,727<ref name="imf-statistics-gdp" />',
 'Founding form': 'Founding of the country',
 'Established form 1': 'Kingdom of England / Kingdom of Scotland<br />(Both countries until the 1707 Act)',
 'Date of establishment 1': '927/843',
 'Established form 2': 'Kingdom of Great Britain established<br />(1707 Act)',
 'Date of establishment 2': '1707{{0}}May{{0}}1 day',
 'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
 'Date of establishment 3': '1801{{0}}January{{0}}1 day',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': '1927{{0}}April 12',
 'currency': 'UK pounds(£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '.uk / .gb<ref>Use is.Overwhelmingly small number compared to uk.</ref>',
 'International call number': '44',
 'Note': '<references/>'}

28. Removal of MediaWiki markup

In addition to the processing in 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

I also removed some unnecessary parts that are not MediaWiki markup.

code


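def remove_markups(x):
    # {{name|arg|...|last}} -> last (templates with at least two |)
    x = re.sub(r'{{.*\|.*\|([^}]*)}}', r'\1', x)
    # <tag ...>...</tag> (e.g. <ref>...</ref>) -> removed with its content
    x = re.sub(r'<([^>]*)( .*|)>.*</\1>', '', x)
    # Self-closing tags such as <br /> and <ref name="..." /> -> removed
    x = re.sub(r'<[^>]*?/>', '', x)
    # {{0}} -> removed
    x = re.sub(r'\{\{0\}\}', '', x)
    return x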

dct4 = {
    key : remove_markups(value)
    for key, value in dct3.items()
}
dct4
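Running the function on a few individual values shows what each substitution does, and also one limitation. The snippet repeats the function so that it runs on its own, with the braces of the first pattern escaped (which should not change what it matches). The last sample is made up to illustrate the greedy .*.

```python
import re


def remove_markups(x):
    x = re.sub(r'\{\{.*\|.*\|([^}]*)\}\}', r'\1', x)  # template -> last field
    x = re.sub(r'<([^>]*)( .*|)>.*</\1>', '', x)       # paired tag and content
    x = re.sub(r'<[^>]*?/>', '', x)                    # self-closing tag
    x = re.sub(r'\{\{0\}\}', '', x)                    # {{0}}
    return x


samples = [
    '{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}',
    '2,433.7 billion<ref name="imf-statistics-gdp" />',
    '66,435,600<ref>{{Cite web|url=u|title=t}}</ref>',
    '{{lang|en|A}} and {{lang|fr|B}}',  # made-up value with two templates
]
cleaned = [remove_markups(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

In the last sample the greedy .* spans from the first {{ to the last template, so only B survives. The same effect appears to be why the 'National anthem' value is reduced to the file name in the output.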

output


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': 'United Kingdom of Great Britain and Northern Ireland<ref>Official country name other than English:',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': 'British coat of arms',
 'National emblem link': '(National emblem)',
 'Motto': 'Dieu et mon droit (French:God and my rights)',
 'National anthem': 'File:United States Navy Band - God Save the Queen.ogg',
 'Map image': 'Europe-UK.svg',
 'Position image': 'United Kingdom (+overseas territories) in the World (+Antarctica claims).svg',
 'Official terminology': 'English',
 'capital': 'London (virtually)',
 'Largest city': 'London',
 'Head of state title': 'Queen',
 'Name of head of state': 'Elizabeth II',
 'Prime Minister's title': 'Prime Minister',
 'Prime Minister's name': 'Boris Johnson',
 'Other heads of state title 1': 'Aristocratic House Chairman',
 'Names of other heads of state 1': 'Norman Fowler',
 'Other heads of state title 2': 'Chairman of the House of Commons',
 'Other heads of state name 2': 'Lindsay Hoyle',
 'Other heads of state title 3': 'Chief Justice of Japan',
 'Other heads of state name 3': 'Brenda Hale',
 'Area ranking': '76',
 'Area size': '1 E11',
 'Area value': '244,820',
 'Water area ratio': '1.3%',
 'Demographic year': '2018',
 'Population ranking': '22',
 'Population size': '1 E7',
 'Population value': '66,435,600',
 'Population density value': '271',
 'GDP statistics year yuan': '2012',
 'GDP value source': '1,547.8 billion',
 'GDP Statistics Year MER': '2012',
 'GDP ranking MER': '6',
 'GDP value MER': '2,433.7 billion',
 'GDP statistical year': '2012',
 'GDP ranking': '6',
 'GDP value': '2,316.2 billion',
 'GDP/Man': '36,727',
 'Founding form': 'Founding of the country',
 'Established form 1': 'Kingdom of England / Kingdom of Scotland (both countries until the 1707 Act)',
 'Date of establishment 1': '927/843',
 'Established form 2': 'Great Britain Kingdom established (1707 Acts of Union)',
 'Date of establishment 2': 'May 1, 1707',
 'Established form 3': 'United Kingdom of Great Britain and Ireland established (1800 Acts of Union 1800)',
 'Date of establishment 3': 'January 1, 1801',
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 'Date of establishment 4': 'April 12, 1927',
 'currency': 'UK pounds(£)',
 'Currency code': 'GBP',
 'Time zone': '±0',
 'Daylight saving time': '+1',
 'ISO 3166-1': 'GB / GBR',
 'ccTLD': '.uk / .gb',
 'International call number': '44',
 'Note': ''}

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image.

code


import requests

Call the MediaWiki API using requests. I based this on the code at the bottom of the page here.

code


filename = dct4['National flag image']

session = requests.Session()
url = 'https://en.wikipedia.org/w/api.php'
params = {
    'action' : 'query',
    'format' : 'json',
    'prop' : 'imageinfo',
    'titles' : 'File:' + filename,
    'iiprop' : 'url',
}
r = session.get(url=url, params=params)
# Use a new name so the article list in data (from problem 20) is not overwritten.
result = r.json()
pages = result['query']['pages']
# pages is keyed by page ID; take the first (only) entry.
flag_url = pages[list(pages)[0]]['imageinfo'][0]['url']
flag_url

output


'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg'
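The indexing into the response assumes that the file exists and that the first page carries an 'imageinfo' entry. A slightly more defensive way to read the same structure might look like the sketch below. It works on a hand-written dictionary shaped like the response indexed above, not on a live API call; the function name and the page ID '12345' are made up for illustration.

```python
def first_image_url(response_json):
    """Return the first image URL in an imageinfo query result, or None."""
    pages = response_json.get('query', {}).get('pages', {})
    for page in pages.values():
        infos = page.get('imageinfo')
        if infos:
            return infos[0].get('url')
    # No page carried image information (assumed to mean the file was not found).
    return None


# Hand-written stand-in with the same nesting as the response used above.
canned = {
    'query': {
        'pages': {
            '12345': {  # placeholder page ID
                'title': 'File:Flag of the United Kingdom.svg',
                'imageinfo': [
                    {'url': 'https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg'}
                ],
            }
        }
    }
}
print(first_image_url(canned))
```

With the real response, first_image_url(r.json()) would replace the chained indexing and give None instead of a KeyError when nothing is found.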

The URL points to the national flag image.

Next is Chapter 4

Language processing 100 knocks 2020 Chapter 4: Morphological analysis
