[Python] Challenge 100 knocks! (025-029)

About the history so far

Please refer to the first post.

Knock status

9/24 added

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

・ Information for one article is stored per line in JSON format.
・ In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format.
・ The entire file is compressed with gzip.

Create a program that performs the following processing.
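The listings in this post import uk_find from training.json_read_020, which is not shown here. As a rough sketch of how a file in the format described above could be read with the standard library alone, something like the following should work. The function name find_article and its arguments are my own placeholders, not the helper used in this series.

```python
import gzip
import json


def find_article(path, title):
    """Return the body text of the article whose "title" equals `title`, or None."""
    # gzip.open in text mode decompresses on the fly, one line per article
    with gzip.open(path, mode="rt", encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)  # each line is one JSON object
            if article["title"] == title:
                return article["text"]
    return None
```

Reading line by line keeps memory use small because only one article is decoded at a time.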

025. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

basic_info_025.py


from training.json_read_020 import uk_find
import re

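def basic_info_find(lines):
    # Lines that open a template, e.g. {{redirect ... or {{Basic information ...
    # (a group is needed here; square brackets would define a character class)
    pattern1 = re.compile(r'^\{\{(?:redirect|Basic information).*')
    # Field lines start with a pipe: |name = value
    pattern2 = re.compile(r'^\|.*')
    # A line consisting only of }} closes the template
    pattern3 = re.compile(r'^\}\}$')

    basic_dict = {}
    for line in lines.split('\n'):
        if pattern1.match(line):
            continue

        elif pattern2.match(line):
            # Split at the first '=' only; skip pipe lines that carry no '='
            title, sep, data = line.partition('=')
            if not sep:
                continue
            basic_dict[title.lstrip('|').rstrip(' ')] = data.lstrip('= ')

        elif pattern3.match(line):
            break
    return basic_dict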

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    for key,value in basic_dict.items():
        print(key+':'+value)

result


Established form 4:Changed to the current country name "'''United Kingdom of Great Britain and Northern Ireland'''"
National emblem image:[[File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms]]
National emblem link:([[British coat of arms|National emblem]])
(Omitted because it is long)
Process finished with exit code 0

Impressions: I extracted the lines starting with | from the basic information template and looped over them, storing the text before and after = as the key and value of a dictionary. The printed result is formatted so that it is easy to read.
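The same "split each | line at =" idea can also be written as a single multi-line regular expression. This is only a sketch under my own assumptions: the name parse_fields and the sample template text are made up for illustration, and every field is assumed to fit on one line.

```python
import re

# ^|name = value, matched line by line; the name stops at the first '='
FIELD_RE = re.compile(r'^\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$', re.MULTILINE)


def parse_fields(template_text):
    """Return {field name: value} for every '|name = value' line."""
    return {name: value for name, value in FIELD_RE.findall(template_text)}


# Hypothetical sample, shaped like the template but not taken from the real article
sample = (
    "{{Basic information country\n"
    "|Abbreviated name = United Kingdom\n"
    "|Motto = Dieu et mon droit\n"
    "|URL = http://example.com/?a=b\n"
    "}}"
)
for key, value in parse_fields(sample).items():
    print(key + ':' + value)
```

Because the name part may not contain =, a value that itself contains = (such as a URL with a query string) is kept whole.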

026. Removal of highlighted markup

While performing the processing of 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (reference: markup quick reference table).

emphasize_remove_026.py


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import re

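def emphasize_remove(basic_dict):
    # A run of two or more apostrophes covers weak emphasis (''),
    # emphasis (''') and strong emphasis (''''').
    pattern = re.compile(r"'{2,}")
    for key,value in basic_dict.items():
        if pattern.search(value):
            # Remove only the markup runs so that a single apostrophe in the text survives
            value = pattern.sub('', value)
            basic_dict.update({key:value})
    return basic_dict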


if __name__ == "__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict = emphasize_remove(basic_dict)
    for key,value in emphasize_remove_dict.items():
        print(key+':'+value)

result


GDP statistics year yuan:2012
Established form 4:Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"
Area size:1 E11
(Omitted because it is long)
Process finished with exit code 0

Impressions: Only one value was affected, but the pattern looks for a run of consecutive apostrophes so that every kind of emphasis markup is found. Once a match is found, the markup is simply stripped out.
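A slightly stricter variant is to remove the apostrophes only when they come in a matching pair around some text. The sketch below uses a backreference for that; the function name remove_emphasis is my own, and it assumes the markup is not nested.

```python
import re

# ('{2,5}) captures the opening run, \1 requires the same run to close it
EMPHASIS_RE = re.compile(r"('{2,5})(.+?)\1")


def remove_emphasis(value):
    """Drop paired MediaWiki emphasis markers and keep the text inside them."""
    return EMPHASIS_RE.sub(r"\2", value)


print(remove_emphasis("'''United Kingdom of Great Britain and Northern Ireland'''"))
print(remove_emphasis("King's ''Speech''"))
```

A lone apostrophe such as the one in King's never forms a pair, so it is left untouched.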

027. Removal of internal links

In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (reference: markup quick reference table).

link_remove_027.py


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
import re

def link_remove(emphasize_remove_dict):
    pattern = re.compile(r".*\[{2}.*")
    for key,value in emphasize_remove_dict.items():
        if pattern.match(value):
            value = value.replace('[[','').replace(']]','')
            emphasize_remove_dict.update({key: value})
    return emphasize_remove_dict

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict=emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)

    for key,value in link_remove_dict.items():
        print(key+':'+value)

result


National emblem image:File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms
Official country name:{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
Founding form:Founding of the country
(Omitted because it is long)
Process finished with exit code 0

Impressions: As in problem 026, when I found an internal link starting with [[, I simply removed the [[ and ]] with replace.
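Removing only the brackets leaves the link target and the pipe in place for links written as [[target|label]]. As a hedged sketch of one way to keep just the visible text, the following substitutes each link with its label. The name remove_links is hypothetical, and File: links are deliberately left alone here because their pipes separate image options rather than a label.

```python
import re

# [[label]] or [[target|label]]; links starting with File: are skipped
LINK_RE = re.compile(r"\[\[(?!File:)(?:[^\[\]|]+\|)?([^\[\]|]+)\]\]")


def remove_links(value):
    """Replace internal links with the text a reader would see."""
    return LINK_RE.sub(r"\1", value)


print(remove_links("([[British coat of arms|National emblem]])"))
print(remove_links("[[London]]"))
```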

028. MediaWiki markup removal

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

markup_remove_028.py


from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
from training.emphasize_remove_026 import emphasize_remove
from training.link_remove_027 import link_remove
import re

#A function that removes the (&pound;) entity.
def pound_check(value):
    pattern = re.compile(r".*pound.*")
    if pattern.match(value):
        value = value.replace("(&pound;)",'')
        return value
    else:
        return  value

#A function that removes the br tag.
def br_check(value):
    pattern1 = re.compile(r".*<br.*")
    if pattern1.match(value):
        value = value.replace("<br />", '').replace("<br/>", '')
        return value
    else:
        return value

#A function that removes the ref tag and reference description.
def ref_check(value):
    pattern2 = re.compile(r".*<ref.*")
    if pattern2.match(value):
        start_point = value.find("<ref")
        value = value[0:start_point]
        return value
    else:
        return value

#A function that removes {{ and }}.
def brackets_check(value):
    pattern3 = re.compile(r".*\{\{.*")
    if pattern3.match(value):
        value = value.replace("{{","").replace("}}","")
        # For a value such as lang|en|United ~, skip the 4 characters starting at the first pipe (|en|)
        start_point = value.find("|")+4
        value = value[start_point:len(value)]
        return value
    else:
        return value

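#A function that removes the File: prefix.
def file_check(value):
    pattern4 = re.compile(r".*File.*")
    if pattern4.match(value):
        value = value.replace('File:','')
        # Keep only the file name, i.e. the text before the first pipe (if there is one)
        start_point = value.find("|")
        if start_point != -1:
            value = value[0:start_point]
        return value
    else:
        return value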

#A function that removes everything from the half-width | onward.
#It handles two patterns: | alone, and | together with ( ).
def pipe_check(value):
    pattern5 = re.compile(r".*\|.*")
    pattern6 = re.compile(r".*\(.*")
    if pattern5.match(value) and pattern6.match(value):
        end_point = value.find("|")
        # Put back the closing parenthesis that was cut off with the text after the pipe
        value = value[0:end_point] + ")"
        return value
    elif pattern5.match(value):
        end_point = value.find("|")
        value = value[0:end_point]
        return value
    else:
        return value

#A function that removes a leading full-width (.
def other_check(value):
    pattern7 = re.compile(r"^\(")
    if pattern7.match(value):
        value = value.replace("(","")
        return value
    else:
        return value

def markup_remove(link_remove_dict):
    for key,value in link_remove_dict.items():
        value = pound_check(value)
        value = br_check(value)
        value = ref_check(value)
        value = brackets_check(value)
        value = file_check(value)
        value = pipe_check(value)
        value = other_check(value)
        link_remove_dict.update({key:value})

    return link_remove_dict


if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    emphasize_remove_dict=emphasize_remove(basic_dict)
    link_remove_dict = link_remove(emphasize_remove_dict)
    markup_remove_dict = markup_remove(link_remove_dict)

    for key,value in markup_remove_dict.items():
        print(key+':'+value)

    print(len(markup_remove_dict.items()))

result


Date of establishment 1:927/843
Official country name:United Kingdom of Great Britain and Northern Ireland
Established form 1:Kingdom of England / Kingdom of Scotland (Both countries are Acts of Union)(1707))
Position image:Location_UK_EU_Europe_001.svg
Motto:Dieu et mon droit (French:God and my rights)
ccTLD:.uk / .gb
National flag image:Flag of the United Kingdom.svg
currency:Sterling pound
(Omitted because it is long)
Process finished with exit code 0

Impressions: First I looked for the markup, wrote a compiled pattern for it, and repeatedly checked what kind of markup was caught... In the end I decided to run every pattern against each value. However, the full-width characters did not work... I really could not work out why they were not being caught... I'm tired.
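For comparison, several of these clean-up steps can be expressed as substitutions instead of find-and-slice. The following is a minimal sketch, not a replacement for the functions above: strip_markup is a hypothetical name, it only handles ref tags, br tags and the {{lang|xx|text}} template, and it assumes each value is a single line.

```python
import re


def strip_markup(value):
    """Remove ref tags, br tags and {{lang|xx|...}} wrappers from one value."""
    value = re.sub(r"<ref[^>]*/>", "", value)          # self-closing <ref ... />
    value = re.sub(r"<ref[^>]*>.*?</ref>", "", value)  # paired <ref>...</ref>
    value = re.sub(r"<ref[^>]*>.*$", "", value)        # unclosed <ref>: cut to the end
    value = re.sub(r"<br\s*/?>", "", value)            # <br>, <br/>, <br />
    # {{lang|en|text}} -> text
    value = re.sub(r"\{\{lang\|[^|{}]+\|([^{}]*)\}\}", r"\1", value)
    return value.strip()


print(strip_markup(
    "{{lang|en|United Kingdom of Great Britain and Northern Ireland}}"
    "<ref>Official country name other than English:<br/>"
))
```

The order matters: self-closing ref tags are removed first so that the paired pattern does not mistake one for an opening tag.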

029. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: Call imageinfo in the MediaWiki API to convert file references to URLs)

get_url_029.py


# -*- coding: utf-8 -*-

from training.json_read_020 import uk_find
from training.basic_info_025 import basic_info_find
import requests
import urllib.parse
import json
import re

def image_query(filename):
    url = "https://commons.wikimedia.org/w/api.php?"
    action = "action=query&"
    titles = "titles=File:"+urllib.parse.quote(filename)+"&"
    prop = "prop=imageinfo&"
    iiprop="iiprop=url&"
    format = "format=json"
    parameter = url +action+titles+prop+iiprop+format
    return parameter

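def get_request(parameter):
    pattern = re.compile(r".*\"url\".*")
    r = requests.get(parameter)
    data = r.json()
    # The key under "pages" is the page ID and differs per file,
    # so take the first page instead of hard-coding "347935"
    page = next(iter(data["query"]["pages"].values()))
    json_data = json.dumps(page["imageinfo"], indent=4)
    # Stays None when no "url" line is found, so the return below cannot fail
    url_data = None
    for temp in json_data.split('\n'):
        if pattern.search(temp):
            url_data = temp.replace(" ","")

    return url_data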

if __name__=="__main__":
    lines = uk_find()
    basic_dict = basic_info_find(lines)
    parameter=image_query(basic_dict['National flag image'])
    get_url = get_request(parameter)
    print(get_url)

result


"url":"https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg"

Process finished with exit code 0

Impressions: At first I had no idea what to do. After a lot of googling, I understood that the point is to send a request to Wikimedia for the data related to the file name and to find, in the response, the URL where the image file is uploaded. It took me a long time to understand this task... I learned a lot from this problem.
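The request and the lookup described here can also be sketched with a parameter dictionary and direct access to the decoded JSON, using only the standard library. The function names below are my own, and sample_response is a hand-written stand-in shaped like an imageinfo response (page ID and URL taken from the result above), not real API output.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query_url(filename):
    """Build the imageinfo request URL; urlencode takes care of escaping."""
    params = {
        "action": "query",
        "titles": "File:" + filename,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_url(response):
    """Return the first image URL in a decoded API response, or None."""
    # The keys under "pages" are page IDs, so iterate instead of naming one
    for page in response.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            if "url" in info:
                return info["url"]
    return None


def fetch_image_url(filename):
    """Send the request (needs network access) and return the image URL."""
    with urllib.request.urlopen(build_query_url(filename)) as res:
        return extract_url(json.load(res))


# Hand-written sample for illustration only
sample_response = {
    "query": {"pages": {"347935": {"imageinfo": [
        {"url": "https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg"}
    ]}}}
}
print(extract_url(sample_response))
```

fetch_image_url is not called here because it needs network access; with a connection, fetch_image_url(basic_dict['National flag image']) would be the equivalent of the two calls in the listing above.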
