100 Language Processing Knocks 2020: Chapter 3 (Regular Expressions)

(Caution): this article is still rough; I prioritized moving forward over polish.

The 2020 edition of the 100 Language Processing Knocks has been released, so I am taking this opportunity to work through it. Note that each package is imported only once, because this is a Markdown export of a Jupyter notebook published on GitHub.

The first chapter is here

Iterative practice on regular expressions, which I had always looked up as needed, was a great exercise, and it feels like future analysis will go more smoothly. I am still working through to the end, so there may be mistakes; I would appreciate any corrections. I plan to tidy up the article once everything is solved.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

The content seems to be the same as the 2015 edition. The wiki markup is summarized at [Help:早見表](https://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8).

20. Read JSON data

Read the JSON file of Wikipedia articles and display the article text about the "UK". In problems 21-29, work on the article text extracted here.

import json

def return_article(fname, article_title):
    with open(fname, 'rt') as data_file:
        for line in data_file:
            data_json = json.loads(line)
            if data_json['title'] == article_title:
                return data_json['text']

file_path = '../data/jawiki-country.json'
uk_article = return_article(file_path, 'England')

print(uk_article)
    {{redirect|UK}}
    {{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
    {{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
    {{Basic information Country
    |Abbreviated name=England
    |Japanese country name=United Kingdom of Great Britain and Northern Ireland

    <<The following is omitted>>
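The distributed file is actually gzip-compressed (`jawiki-country.json.gz`); the snippet above assumes it has been decompressed beforehand. As a sketch, the compressed file can also be read directly with the standard library's `gzip` module, assuming the same one-JSON-object-per-line layout (the function name and path are my own):

```python
import gzip
import json

def return_article_gz(fname, article_title):
    # 'rt' mode decompresses transparently and yields text lines
    with gzip.open(fname, 'rt', encoding='utf-8') as data_file:
        for line in data_file:
            data_json = json.loads(line)
            if data_json['title'] == article_title:
                return data_json['text']

# e.g. return_article_gz('../data/jawiki-country.json.gz', 'England')
```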

21. Extract rows containing category names

Extract the line that declares the category name in the article.

Looking at the output above, the categories are written as follows.

[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]

The format is [[Category:category name|sort key]]. To match a special character literally, without invoking its special meaning, you need to escape it with a backslash, so I wrote the pattern as r'^\[+Category\:.+\]+$'. By combining re.MULTILINE with findall, you can search all lines without an explicit for loop over line breaks.
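A toy illustration of what `re.MULTILINE` changes, on a made-up two-line string: without the flag, `^` and `$` anchor only at the start and end of the whole string, so the per-line pattern finds nothing.

```python
import re

text = '[[Category:A]]\n[[Category:B]]'
# Without MULTILINE the pattern would have to span the whole string (and . does not match \n)
print(re.findall(r'^\[\[Category:.+\]\]$', text))                # []
# With MULTILINE, ^ and $ also match at each line boundary
print(re.findall(r'^\[\[Category:.+\]\]$', text, re.MULTILINE))  # ['[[Category:A]]', '[[Category:B]]']
```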

import re

def extract_category_row(wiki_text):
    p = re.compile(r'^\[+Category\:.+\]+$', re.MULTILINE)
    return p.findall(wiki_text)

category_rows = extract_category_row(uk_article)
for line in category_rows:
    print(line)
    [[Category:England|*]]
    [[Category:Commonwealth of Nations]]
    [[Category:Commonwealth Kingdom|*]]
    [[Category:G8 member countries]]
    [[Category:European Union member states|Former]]
    [[Category:Maritime nation]]
    [[Category:Existing sovereign country]]
    [[Category:Island country]]
    [[Category:A nation / territory established in 1801]]

22. Extraction of category name

Extract the article category names (by name, not line by line).

Only the part enclosed in parentheses is captured and returned. \w matches Unicode word characters.

def extract_category_name(wiki_text):
    p = re.compile(r'^\[+Category\:(\w+).+$', re.MULTILINE)
    return p.findall(wiki_text)

category_name = extract_category_name(uk_article)
for line in category_name:
    print(line)
England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Existing sovereign country
Island country
A nation established in 1801

23. Section structure

Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").

Headings like == History == appear to be the sections in this markup.

def extract_section(wiki_text):
    result = {}
    p = re.compile(r'^(={2,})(\w+)\1$', re.MULTILINE)
    section_content =  p.findall(wiki_text)
    for item in section_content:
        result[item[1]] = len(item[0])
    return result

section_dict = extract_section(uk_article)

for k,v in section_dict.items():
    print('level:',v, k)
level: 2 Country name
level: 2 history
level: 2 Geography
level: 3 major cities
level: 3 Climate
level: 2 politics
level: 3 Head of state
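The `\1` in the pattern is a backreference: it requires the closing run of `=` to be exactly the same as the opening one, so unbalanced headings are rejected. A quick check on made-up headings:

```python
import re

p = re.compile(r'^(={2,})(\w+)\1$', re.MULTILINE)
sample = '==History==\n===Climate===\n==Mismatch==='
for marks, name in p.findall(sample):
    print(name, len(marks))
# History 2
# Climate 3
# ('Mismatch' is skipped because its = runs differ in length)
```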

24. Extracting file references

Extract all media files referenced from the article.

[[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]] appears to be the markup for file references.

def extract_file(wiki_text):
    p = re.compile(r'\[\[File\:(.+?)\|')
    file_name = p.findall(wiki_text)
    return file_name

file_reference = extract_file(uk_article)
print(file_reference)
    ['Royal Coat of Arms of the United Kingdom.svg', 'Descriptio Prime Tabulae Europae.jpg', "Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg ", 'London.bankofengland.arp.jpg', 'Battle of Waterloo 1815.PNG', 'Uk topo en.jpg', 'BenNevis2005.jpg', 'Population density UK 2011 census.png', '2019 Greenwich Peninsula & Canary Wharf.jpg', 'Birmingham Skyline from Edgbaston Cricket Ground crop.jpg', 'Leeds CBD at night.jpg', 'Glasgow and the Clyde from the air (geograph 4665720).jpg', 'Palace of Westminster, London - Feb 2007.jpg', 'Scotland Parliament Holyrood.jpg', 'Donald Trump and Theresa May (33998675310) (cropped).jpg', 'Soldiers Trooping the Colour, 16th June 2007.jpg', 'City of London skyline from London City Hall - Oct 2008.jpg', 'Oil platform in the North SeaPros.jpg', 'Eurostar at St Pancras Jan 2008.jpg', 'Heathrow Terminal 5C Iwelumo-1.jpg', 'Airbus A380-841 G-XLEB British Airways (10424102995).jpg', 'UKpop.svg', 'Anglospeak.svg', "Royal Aberdeen Children's Hospital.jpg ", 'CHANDOS3.jpg', 'The Fabs.JPG', 'Wembley Stadium, illuminated.jpg']

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

The basic information is as follows.

{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br />
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>
|National flag image= Flag of the United Kingdom.svg
|National emblem image= [[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]

<<Omission>>

}}

A lookahead assertion was used to delimit each field by the | that starts the next line (quoted from the Python documentation):

(?=...) Matches if ... matches next, but doesn't consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it's followed by 'Asimov'.
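A minimal demonstration on a made-up string: the lookahead selects the match but consumes none of the text it peeks at.

```python
import re

# Only the 'Isaac' followed by ' Asimov' matches; ' Asimov' itself is not consumed
print(re.findall(r'\w+(?= Asimov)', 'Isaac Asimov and Isaac Newton'))  # ['Isaac']
m = re.search(r'Isaac(?= Asimov)', 'Isaac Asimov')
print(m.end())  # 5 -- the match ends before the looked-ahead text
```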

For this problem I referred to the "Amateur Language Processing" series. Writing exactly the same thing would be boring, so I solved it without first extracting the {{Basic information Country ...}} block. However, skipping that step makes the pattern less applicable to other articles, so it is probably better to extract the block properly first. Concretely, when extracting directly, only the |style=* lines got in the way, so I excluded them with a negative lookahead.

def extract_basic_info(wiki_text):
    result = {}
    p = re.compile(r'^\|(?!style)(\w+?)\s*\=\s*(.+?)(?:(?=\n\|))', re.MULTILINE)
    basics = p.findall(wiki_text)
    for item in basics:
        result[item[0]] = item[1]
    return result

basic_info = extract_basic_info(uk_article)
print(json.dumps(basic_info, sort_keys=True, indent=4, ensure_ascii=False))
    {
        "GDP value": "2,316.2 billion<ref name=\"imf-statistics-gdp\" />",
        "GDP value MER": "2,433.7 billion<ref name=\"imf-statistics-gdp\" />",
        "GDP value source": "1,547.8 billion<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>",
        "GDP statistical year": "2012",
        "GDP Statistics Year MER": "2012",
        "GDP statistics year yuan": "2012",
        "GDP ranking": "6",

        <<Omission>>

    }

26. Removal of highlighted markup

In addition to the processing of problem 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

Emphasis markup is text surrounded by two or more apostrophes, as in ''emphasis''. Since this builds on the previous processing, I defined a new function that duplicates it with the removal step added.

def extract_basic_removed_reinforce(wiki_text):
    result = {}
    ps = re.compile(r'\'{2,}') #Added part
    p = re.compile(r'^\|(?!style)(\w+?)\s*\=\s*(.+?)(?:(?=\n\|))', re.MULTILINE)
    removed_text = ps.sub('', wiki_text) #Added part
    basics = p.findall(removed_text)
    for item in basics:
        result[item[0]] = item[1]
    return result

basic_info = extract_basic_removed_reinforce(uk_article)
print(json.dumps(basic_info, sort_keys=True, indent=4, ensure_ascii=False))
    {
        "GDP value": "2,316.2 billion<ref name=\"imf-statistics-gdp\" />",
        "GDP value MER": "2,433.7 billion<ref name=\"imf-statistics-gdp\" />",
        "GDP value source": "1,547.8 billion<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>",
        "GDP statistical year": "2012",
        "GDP Statistics Year MER": "2012",
        "GDP statistics year yuan": "2012",
        "GDP ranking": "6",
        "GDP ranking MER": "6",
        "ccTLD": "[[.uk]] / [[.gb]]<ref>Use is.Overwhelmingly small number compared to uk.</ref>",
        "Population value": "66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>",
        "Population size": "1 E7",
        "Population density value": "271",
        "Demographic year": "2018",
        "Population ranking": "22",
        "Names of other heads of state 1": "[[:en:Norman Fowler, Baron Fowler|Norman Fowler]]",
        "Other heads of state name 2": "{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}",
        "Other heads of state name 3": "[[:en:Brenda Hale, Baroness Hale of Richmond|Brenda Hale]]",
        "Other heads of state title 1": "[[House of Peers(England)|Aristocratic House Chairman]]",
        "Other heads of state title 2": "[[House of Commons(England)|Chairman of the House of Commons]]",
        "Other heads of state title 3": "[[United Kingdom Supreme Court|Chief Justice of Japan]]",
        "Position image": "United Kingdom (+overseas territories) in the World (+Antarctica claims).svg",
        "Name of head of state": "[[Elizabeth II]]",
        "Head of state title": "[[British monarch|Queen]]",
        "Official terminology": "[[English]]",
        "National flag image": "Flag of the United Kingdom.svg",
        "National anthem": "[[Her Majesty the Queen|{{lang|en|God Save the Queen}}]]{{en icon}}<br />God save the queen<br />{{center|[[File:United States Navy Band - God Save the Queen.ogg]]}}",
        "National emblem link": "([[British coat of arms|National emblem]])",
        "National emblem image": "[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]",
        "International call number": "44",
        "Map image": "Europe-UK.svg",
        "Daylight saving time": "+1",
        "Founding form": "Founding of the country",
        "Japanese country name": "United Kingdom of Great Britain and Northern Ireland",
        "Time zone": "±0",
        "Largest city": "London",
        "Motto": "{{lang|fr|[[Dieu et mon droit]]}}<br />([[French]]:[[Dieu et mon droit|God and my rights]])",
        "Water area ratio": "1.3%",
        "Abbreviated name": "England",
        "Date of establishment 1": "927/843",
        "Date of establishment 2": "1707{{0}}May{{0}}1 day",
        "Date of establishment 3": "1801{{0}}January{{0}}1 day",
        "Date of establishment 4": "1927{{0}}April 12",
        "Established form 1": "[[Kingdom of England]]/[[Kingdom of scotland]]<br />(Both countries[[Joint law(1707)|1707合同法]]Until)",
        "Established form 2": "[[Kingdom of Great Britain]]Established<br />(1707 Act)",
        "Established form 3": "[[United Kingdom of Great Britain and Ireland]]Established<br />([[Joint law(1800)|1800合同法]])",
        "Established form 4": "Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"",
        "currency": "[[Sterling pound|UK pounds]](£)",
        "Currency code": "GBP",
        "Area value": "244,820",
        "Area size": "1 E11",
        "Area ranking": "76",
        "Prime Minister's name": "[[Boris Johnson]]",
        "Prime Minister's title": "[[British Prime Minister|Prime Minister]]",
        "capital": "[[London]](infact)"
    }


27. Removal of internal links

In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to text.

The link is

[[Article title]]
[[Article title|Display character]]
[[Article title#Section name|Display character]]


There are three types. Other wiki markup likely to get caught up in this processing includes category declarations, file references, and redirects. I could not find any redirects in the article, so I will only consider removing the category and file references.
Up to this point everything has been done in one function, but each step, such as extracting the basic information, should probably be defined as its own function.

[[Category:help|Hiyo Hayami]]
[[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]
#REDIRECT [[Article title]]
#REDIRECT [[Article title#Section name]]

(?!...) matches only when what follows does not match ... (negative lookahead). (?:...) makes a group non-capturing.
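A toy check of both constructs on the link forms listed above, using a small pattern of my own for illustration that skips Category and File links:

```python
import re

# (?!...) negative lookahead: skip links that start with Category: or File:
# (?:...) non-capturing group: strips an optional 'title|' prefix without capturing it
p = re.compile(r'\[\[(?!Category:|File:)(?:[^|]*?\|)?([^|]*?)\]\]')
samples = [
    '[[Article title]]',
    '[[Article title#Section name|Display character]]',
    '[[Category:help|Hiyo Hayami]]',
]
for s in samples:
    print(p.sub(r'\1', s))
# Article title
# Display character
# [[Category:help|Hiyo Hayami]]   (left untouched)
```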

# Link-removal function
def remove_links(text):
    p = re.compile(r'\[\[(?!Category:|File:)(?:[^|]*?\|)?([^|]*?)\]\]')
    return p.sub(r'\1', text)

def extract_basic_not_link_reinforce(wiki_text):
    result = {}
    ps = re.compile(r'\'{2,}') 
    p = re.compile(r'^\|(?!style)(\w+?)\s*\=\s*(.+?)(?:(?=\n\|))', re.MULTILINE) 
    removed_text = remove_links(ps.sub('', wiki_text)) #Changed part
    basics = p.findall(removed_text)
    for item in basics:
        result[item[0]] = item[1]
    return result

basic_info = extract_basic_not_link_reinforce(uk_article)
print(json.dumps(basic_info, sort_keys=True, indent=4, ensure_ascii=False))

    {
        "GDP value": "2.3162 trillion<ref name=\"imf-statistics-gdp\" />",
        "GDP value MER": "2.4337 trillion<ref name=\"imf-statistics-gdp\" />",
        "GDP value source": "1.5478 trillion<ref name=\"imf-statistics-gdp\">[http://www.imf.org/external/pubs/ft/weo/2012/02/weodata/weorept.aspx?pr.x=70&pr.y=13&sy=2010&ey=2012&scsm=1&ssd=1&sort=country&ds=.&br=1&c=112&s=NGDP%2CNGDPD%2CPPPGDP%2CPPPPC&grp=0&a=IMF>Data and Statistics>World Economic Outlook Databases>By Countrise>United Kingdom]</ref>",
        "GDP statistical year": "2012",
        "GDP Statistics Year MER": "2012",
        "GDP statistics year yuan": "2012",
        "GDP ranking": "6",
        "GDP ranking MER": "6",
        "ccTLD": ".uk / .gb<ref>Use is overwhelmingly small compared to .uk.</ref>",
        "Population value": "66,435,600<ref>{{Cite web|url=https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates|title=Population estimates - Office for National Statistics|accessdate=2019-06-26|date=2019-06-26}}</ref>",
        "Population size": "1 E7",
        "Population density value": "271",
        "Demographic year": "2018",
        "Population ranking": "22",
        "Names of other heads of state 1": "Norman Fowler",
        "Other heads of state name 2": "{{Temporary link|Lindsay Foil|en|Lindsay Hoyle}}",
        "Other heads of state name 3": "Brenda Hale",
        "Other heads of state title 1": "Aristocratic House Chairman",
        "Other heads of state title 2": "Chairman of the House of Commons",

        <<Omission>>
    }

28. MediaWiki markup removal

In addition to the 27 processes, remove MediaWiki markup from the template values as much as possible and format the basic country information.

Let me define everything again, starting from the extraction of the basic information. The {{Cite web}} and {{center}} templates are not removed yet and still need fixing; I will come back to this when I reach Chapter 10.

# Extraction of basic information
def extract_basic(text):
    p = re.compile(r'^\|(?!style)(\w+?)\s*\=\s*(.+?)(?:(?=\n\|))', re.MULTILINE)
    basics = p.findall(text)
    return basics

# Removal functions
def remove_emphasis(text):
    p = re.compile(r'\'{2,}')
    return p.sub(r'', text)

def remove_links(text):
    p = re.compile(r'\[\[(?:[^|]*?\|)*?([^|]*?)\]\]')
    return p.sub(r'\1', text)

def remove_tags(text):
    p = re.compile(r'<[^>]*?>')
    return p.sub(r'', text)

def remove_lang(text):
    p = re.compile(r'\{\{lang(?:[^|]*?\|)*?([^|]*?)\}\}')
    return p.sub(r'\1', text)

def remove_ex_link(text):
    p = re.compile(r'\[http:\/\/(?:[^\s]*?)\s([^]]*?)\]')
    return p.sub(r'\1', text)


def main():
    basic_dict = {}
    basic_list = extract_basic(uk_article)
    for target in basic_list:
        explanation = remove_emphasis(target[1])
        explanation = remove_links(explanation)
        explanation = remove_tags(explanation)
        explanation = remove_lang(explanation)
        explanation = remove_ex_link(explanation)
        basic_dict[target[0]] = explanation
    print(json.dumps(basic_dict, sort_keys=True, indent=4, ensure_ascii=False))
        
if __name__ == '__main__':
    main()
    {
        "GDP value": "2.3162 trillion",
        "GDP value MER": "2.4337 trillion",
        "GDP Source": "1.5478 trillion and Statistics> World Economic Outlook Databases> By Countrise> United Kingdom",
        "GDP Statistics Year": "2012",
        "GDP Statistics Year MER": "2012",
        "GDP Statistics Year Yuan": "2012",
        "GDP ranking": "6",
        "GDP ranking MER": "6",

        << Omitted below >>
    }

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call the imageinfo endpoint of the MediaWiki API to convert the file reference to a URL.)

In problem 28, "National flag image": "Flag of the United Kingdom.svg" was obtained.

import requests

def extract_basic_dict(article):
    basic_dict = {}
    basic_list = extract_basic(article)
    for target in basic_list:
        explanation = remove_emphasis(target[1])
        explanation = remove_links(explanation)
        explanation = remove_tags(explanation)
        explanation = remove_lang(explanation)
        explanation = remove_ex_link(explanation)
        basic_dict[target[0]] = explanation
    return basic_dict

basic_dict = extract_basic_dict(uk_article)
fname_flag = basic_dict['National flag image']

def obtain_url(basic_dict, title):
    fname_flag = basic_dict[title].replace(' ', '_')
    url = 'https://en.wikipedia.org/w/api.php?' \
        + 'action=query' \
        + '&titles=File:' + fname_flag \
        + '&prop=imageinfo' \
        + '&iiprop=url' \
        + '&format=json'
    data = requests.get(url)
    return re.search(r'"url":"(.+?)"', data.text).group(1)


def main():
    basic_dict = extract_basic_dict(uk_article)
    query_url = obtain_url(basic_dict, 'National flag image')
    print(query_url)
    
if __name__ == '__main__':
    main()
    https://upload.wikimedia.org/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg
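Incidentally, the query string can also be assembled with the standard library's `urllib.parse.urlencode` instead of manual concatenation, which handles escaping automatically; a sketch (the helper name is my own):

```python
from urllib.parse import urlencode

def build_imageinfo_url(file_name):
    # Same MediaWiki API query as above, built from a parameter dict
    params = {
        'action': 'query',
        'titles': 'File:' + file_name.replace(' ', '_'),
        'prop': 'imageinfo',
        'iiprop': 'url',
        'format': 'json',
    }
    return 'https://en.wikipedia.org/w/api.php?' + urlencode(params)

print(build_imageinfo_url('Flag of the United Kingdom.svg'))
```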
