[PYTHON] 100 Language Processing Knock 2020 Chapter 3

Introduction

The 100 Language Processing Knock 2020 has been released, so I will try it right away.

In Chapter 3, we extract and format the necessary information from Wikipedia articles using regular expressions.

Wikipedia markup is documented in Help: Quick reference table - Wikipedia, and the API in API: Image information - MediaWiki. However, since the markup documentation is incomplete, it is necessary to identify the patterns by looking at the data itself or at the [Wikipedia page](https://ja.wikipedia.org/wiki/%E3%82%A4%E3%82%AE%E3%83%AA%E3%82%B9).

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format: one article per line, stored as a JSON object whose "title" key holds the article title and whose "text" key holds the article body, with the whole file compressed with gzip.

20. Read JSON data

Read the JSON file of Wikipedia articles and display the text of the article about the "UK". For problems 21-29, run them on the article text extracted here.

code


import gzip
import json

with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]
    print(eng_data['text'])

Output result (part)


{{redirect|UK}}
{{redirect|United Kingdom|The princes of the Spring and Autumn period|English(Spring and autumn)}}
{{Otheruses|European country|Local cuisine of Nagasaki and Kumamoto prefectures|England}}
{{Basic information Country
|Abbreviated name=England
︙

Note that jawiki-country.json is JSON Lines.
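
Since the file is JSON Lines, it can also be streamed line by line, stopping at the first hit instead of reading everything with readlines(). A minimal sketch, reusing the file name and the translated title from the code above:


import gzip
import json

def load_article(path, title):
    #Stream the JSON Lines file: one JSON object per line
    with gzip.open(path, mode='rt') as f:
        for line in f:
            article = json.loads(line)
            if article['title'] == title:
                return article
    return None

eng_data = load_article('jawiki-country.json.gz', 'England')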

21. Extract rows containing category names

Extract the lines that declare category names in the article.

code


import gzip
import json
import regex as re

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
texts = eng_data['text'].split('\n')

#Extract rows containing categories
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))
print('\n'.join(cat_rows))

Output result


[[Category:England|*]]
[[Category:Commonwealth of Nations]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states|Former]]
[[Category:Maritime nation]]
[[Category:Existing sovereign country]]
[[Category:Island country]]
[[Category:A nation / territory established in 1801]]

In Help: Quick reference table - Wikipedia the documented pattern is "[[Category:Help|HiyoHayami]]", but in practice there is also a lowercase "[[category:Help|HiyoHayami]]" pattern (though, as it happens, the UK article uses only the capitalized form).
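
Instead of enumerating both spellings in the pattern, the same lines can be matched case-insensitively. A minimal sketch with the standard re module, reusing texts from the code above:


import re

#Case-insensitive alternative to enumerating '[[category:' and '[[Category:'
cat_pattern = re.compile(r'\[\[category:', re.IGNORECASE)
cat_rows = [row for row in texts if cat_pattern.search(row)]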

22. Extraction of category name

Extract the article category names (by name, not line by line).

code


import gzip
import json
import regex as re

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
texts = eng_data['text'].split('\n')

#Extract rows containing categories
cat_rows = list(filter(lambda e: re.search(r'\[\[category:|\[\[Category:', e), texts))

#Extract only category names from rows that contain categories
cat_rows = list(map(lambda e: re.search(r'(?<=(\[\[category:|\[\[Category:)).+?(?=(\||\]))', e).group(), cat_rows))
print('\n'.join(cat_rows))

Output result


England
Commonwealth of Nations
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Existing sovereign country
Island country
A nation / territory established in 1801

The standard regular-expression library is re, but the variable-width look-behind used here raises a "look-behind requires fixed-width pattern" error, so I use the third-party regex module instead.
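
For reference, the look-behind can be avoided entirely by capturing the category name in a group, which keeps the code on the standard re module. A minimal sketch, reusing cat_rows from the code above:


import re

#Capture the name in a group instead of a variable-width look-behind
pattern = re.compile(r'\[\[[Cc]ategory:(.+?)(?:\||\])')
cat_names = [pattern.search(row).group(1) for row in cat_rows]
print('\n'.join(cat_names))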

23. Section structure

Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").

code


import gzip
import json
import re

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
texts = eng_data['text'].split('\n')

#Extract lines containing sections
sec_rows = list(filter(lambda e: re.search('==.+==', e), texts))

#Calculate the level from the number of '='
sec_rows_num = list(map(lambda e: e + ':' + str(int(e.count('=') / 2 - 1)), sec_rows))

#Remove '=' and whitespace
sections = list(map(lambda e: e.replace('=', '').replace(' ', ''), sec_rows_num))
print('\n'.join(sections))

Output result (part)


Country name:1
history:1
Geography:1
Major cities:2
climate:2
︙
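
The level calculation above counts every = in the line, which would miscount a heading whose text itself contains =. A stricter sketch that anchors on the = runs at both ends (reusing texts from above; the standard re module suffices here):


import re

#Match the '=' runs at both ends and derive the level from their length
heading = re.compile(r'^(={2,})\s*(.+?)\s*\1$')
for line in texts:
    m = heading.match(line)
    if m:
        print(m.group(2) + ':' + str(len(m.group(1)) - 1))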

24. Extracting file references

Extract all the media files referenced in the article.

code


import gzip
import json
import regex as re

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
texts = eng_data['text'].split('\n')

#Extract the line containing the file
file_rows = list(filter(lambda e: re.search(r'\[\[File:|\[\[file:', e), texts))

#Extract only the filename from the line containing the file
file_rows = list(map(lambda e: re.search(r'(?<=(\[\[File:|\[\[file:)).+?(?=(\||\]))', e).group(), file_rows))
print('\n'.join(file_rows))

Output result


Royal Coat of Arms of the United Kingdom.svg
United States Navy Band - God Save the Queen.ogg
Descriptio Prime Tabulae Europae.jpg
Lenepveu, Jeanne d'Arc au siège d'Orléans.jpg
London.bankofengland.arp.jpg
Battle of Waterloo 1815.PNG
Uk topo en.jpg
BenNevis2005.jpg
Population density UK 2011 census.png
2019 Greenwich Peninsula & Canary Wharf.jpg
Leeds CBD at night.jpg
Palace of Westminster, London - Feb 2007.jpg
Scotland Parliament Holyrood.jpg
Donald Trump and Theresa May (33998675310) (cropped).jpg
Soldiers Trooping the Colour, 16th June 2007.jpg
City of London skyline from London City Hall - Oct 2008.jpg
Oil platform in the North SeaPros.jpg
Eurostar at St Pancras Jan 2008.jpg
Heathrow Terminal 5C Iwelumo-1.jpg
UKpop.svg
Anglospeak.svg
Royal Aberdeen Children's Hospital.jpg
CHANDOS3.jpg
The Fabs.JPG
Wembley Stadium, illuminated.jpg

In Help: Quick reference table - Wikipedia the documented pattern is "[[File:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]", but in practice there is also a lowercase "[[file:Wikipedia-logo-v2-ja.png|thumb|Explanatory text]]" pattern.
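
Note also that the filter-then-single-search approach above keeps only the first match per line. A sketch that scans the full text in one pass, folding both spellings into one pattern (reusing eng_data from above):


import re

#findall over the whole text catches several file references on one line
files = re.findall(r'\[\[[Ff]ile:(.+?)(?:\||\])', eng_data['text'])
print('\n'.join(files))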

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

code


import gzip
import json
import regex as re
import pprint

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
text = eng_data['text']

#Extract basic information
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

#Split by line breaks and delete the unnecessary parts
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

#Change to dictionary type
basic_dict = {}
for basic in basic_ary:
    key, *values = basic.split('=')
    key = key.replace(' ', '').replace('|', '')
    basic_dict[key] = ''.join(values).strip()
pprint.pprint(basic_dict)

Output result (part)


{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
 'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
 'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
 'Prime Minister's name': '[[Boris Johnson]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'capital': '[[London]](in fact)'}
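
Incidentally, the output shows <ref name"imf-statistics-gdp" /> without the = because splitting on every = and rejoining with an empty string drops = characters inside the value. Splitting only on the first = preserves them; a minimal sketch, reusing basic_ary from above:


#Split only on the first '=' so '=' inside the value survives
basic_dict = {}
for basic in basic_ary:
    key, _, value = basic.partition('=')
    basic_dict[key.replace(' ', '').replace('|', '')] = value.strip()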

26. Removal of emphasis markup

In addition to the processing of problem 25, remove MediaWiki's emphasis markup (weak emphasis, emphasis, and strong emphasis, all of them) from the template values and convert them to plain text (reference: markup quick reference table).

code


import gzip
import json
import regex as re
import pprint

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
text = eng_data['text']

#Extract basic information
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

#Split by line breaks and delete the unnecessary parts
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

#Change to dictionary type
basic_dict = {}
for basic in basic_ary:
    #Divided into keys and values
    key, *values = basic.split('=')
    #Shape the key
    key = key.replace(' ', '').replace('|', '')
    #Join because the values are listed
    value = ''.join(values).strip()
    #Removal of emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    basic_dict[key] = value
pprint.pprint(basic_dict)

Output result (part)


{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
 'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
 'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
︙
 'Prime Minister's name': '[[Boris Johnson]]',
 'Prime Minister's title': '[[British Prime Minister|Prime Minister]]',
 'capital': '[[London]](in fact)'}

Here, "distinction from others (italics)", "emphasis (bold)", and "italics and emphasis" in Help: Quick reference table - Wikipedia are all treated as the "emphasis markup" to be removed.
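
The three chained replace calls can also be collapsed into a single substitution, since each kind of emphasis markup is a run of two, three, or five apostrophes. A minimal sketch, operating on the same value variable as the loop above:


import re

#Remove runs of 2-5 apostrophes (weak emphasis, emphasis, strong emphasis)
value = re.sub(r"'{2,5}", '', value)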

27. Removal of internal links

In addition to the processing of problem 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).

code


import gzip
import json
import regex as re
import pprint

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
text = eng_data['text']

#Extract basic information
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

#Split by line breaks and delete the unnecessary parts
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

#Change to dictionary type
basic_dict = {}
for basic in basic_ary:
    #Divided into keys and values
    key, *values = basic.split('=')
    #Shape the key
    key = key.replace(' ', '').replace('|', '')
    #Join because the values are listed
    value = ''.join(values).strip()
    #Removal of emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    #Get template ({{...}}) strings
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    #Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    #Get internal link ([[...]]) strings
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    #Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    basic_dict[key] = value
pprint.pprint(basic_dict)

Output result (part)


{'GDP/Man': '36,727<ref name"imf-statistics-gdp" />',
 'GDP value': '2,316.2 billion<ref name"imf-statistics-gdp" />',
 'GDP value MER': '2,433.7 billion<ref name"imf-statistics-gdp" />',
︙
 'Established form 3': 'United Kingdom of Great Britain and Ireland established<br />(1800 Joint Law)',
︙
 'Prime Minister's name': 'Boris Johnson',
 'Prime Minister's title': 'Prime Minister',
 'capital': 'London (virtually)'}

In Help: Quick reference table - Wikipedia the documented pattern is "[[Article title|Display text]]", but in practice there also seem to be "{{Article title|Display text}}" patterns.
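
The findall-then-sub loop can also be written as a single re.sub with a replacement function that keeps only the text after the last |. A minimal sketch with the standard re module, operating on the same value variable:


import re

#Keep the display text (after the last '|') of links and templates
def keep_label(match):
    return match.group(1).split('|')[-1]

value = re.sub(r'\[\[(.+?)\]\]', keep_label, value)
value = re.sub(r'{{(.+?)}}', keep_label, value)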

28. Removal of MediaWiki markup

In addition to the processing of problem 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

code


import gzip
import json
import regex as re
import pprint

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
text = eng_data['text']

#Extract basic information
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

#Split by line breaks and delete the unnecessary parts
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

#Change to dictionary type
basic_dict = {}
for basic in basic_ary:
    #Divided into keys and values
    key, *values = basic.split('=')
    #Shape the key
    key = key.replace(' ', '').replace('|', '')
    #Join because the values are listed
    value = ''.join(values).strip()
    #Removal of emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    #Get template ({{...}}) strings
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    #Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    #Get internal link ([[...]]) strings
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    #Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    #Tag removal
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value
pprint.pprint(basic_dict)

Output result (part)


{'GDP/Man': '36,727',
 'GDP value': '2,316.2 billion',
 'GDP value MER': '2,433.7 billion',
︙
 'Prime Minister's name': 'Boris Johnson',
 'Prime Minister's title': 'Prime Minister',
 'capital': 'London (virtually)'}
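
The three tag-removal steps can likewise be merged into one alternation. A minimal sketch, operating on the same value variable as above:


import re

#Remove self-closing <ref ... />, paired <ref>...</ref> and <br /> in one pass
value = re.sub(r'<ref[^>]*?/>|<ref.*?</ref>|<br\s*/?>', '', value)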

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API to convert the file reference to a URL.)

code


import gzip
import json
import regex as re
import requests

eng_data = {}
with gzip.open('jawiki-country.json.gz', mode='rt') as f:
    jsons = []
    #Since the original data is JSON Lines ≠ JSON, read line by line
    lines = f.readlines() 
    for line in lines:
        jsons.append(json.loads(line))
    #Extract the England article
    eng = list(filter(lambda e: e['title'] == 'England', jsons))
    eng_data = eng[0]

#Extract text
text = eng_data['text']

#Extract basic information
basic_text = re.search(r'{{Basic information[\s\S]+?}}\n\n', text).group().replace('\n*', '*')

#Split by line breaks and delete the unnecessary parts
basic_ary = basic_text.split('\n')
del basic_ary[0]
del basic_ary[-3:]

#Change to dictionary type
basic_dict = {}
for basic in basic_ary:
    #Divided into keys and values
    key, *values = basic.split('=')
    #Shape the key
    key = key.replace(' ', '').replace('|', '')
    #Join because the values are listed
    value = ''.join(values).strip()
    #Removal of emphasis markup
    value = value.replace("'''''", '').replace("'''", '').replace("''", '')
    #Get template ({{...}}) strings
    targets = re.findall(r'((?<=({{)).+?(?=(}})))', value)
    #Replace each template with its display text
    if targets:
        for target in targets:
            value = re.sub(r'{{.+?}}', target[0].split('|')[-1], value, count=1)
    #Get internal link ([[...]]) strings
    targets = re.findall(r'((?<=(\[\[)).+?(?=(\]\])))', value)
    #Replace each internal link with its display text
    if targets:
        for target in targets:
            value = re.sub(r'\[\[.+?\]\]', target[0].split('|')[-1], value, count=1)
    #Tag removal
    value = value.replace('<br />', '')
    value = re.sub(r'<ref.+?</ref>', '', value)
    value = re.sub(r'<ref.+?/>', '', value)
    basic_dict[key] = value

#API call
session = requests.Session()
params = {
    'action': 'query',
    'format': 'json',
    'prop': 'imageinfo',
    'titles': 'File:' + basic_dict['National flag image'],
    'iiprop': 'url'
}

result = session.get('https://ja.wikipedia.org/w/api.php', params=params)
res_json = result.json()
print(res_json['query']['pages']['-1']['imageinfo'][0]['url'])

Output result


https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg
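
The page id is '-1' here presumably because the file page is not local to jawiki (the image itself is hosted on Wikimedia Commons), so hardcoding it is brittle; it is safer to take whatever single page the API returns. A minimal sketch, reusing res_json from above:


#Avoid hardcoding the '-1' page id by taking the single returned page
page = next(iter(res_json['query']['pages'].values()))
print(page['imageinfo'][0]['url'])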

In conclusion

In Chapter 3, you can learn how to use regular expressions to extract and format information from Wikipedia articles.
