[Python] Challenge 100 knocks! (020-024)

About the history so far

Please refer to the first post.

Knock status

9/24 added

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format:

- Information for one article is stored per line, in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out as JSON
- The entire file is compressed with gzip

Create a program that performs the following processing.
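For illustration, after json.loads each line parses into a dict of the following shape (placeholder values, not actual data):

{"title": "<article name>", "text": "<article body in MediaWiki markup>"}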

020. Reading JSON data

Read the JSON file of the Wikipedia articles and display the article text for "UK". For problems 21-29, run the processing on the article text extracted here.

json_read_020.py


#-*- coding:utf-8 -*-

import json
import gzip
import re

def uk_find():
    basepath = '/Users/masassy/PycharmProjects/Pywork/training/'
    filename = 'jawiki-country.json.gz'
    pattern = r"England"
    with gzip.open(basepath + filename, 'rt') as gf:
        for line in gf:
            # json.loads is str → dict, json.load is file → dict
            json_data = json.loads(line)
            # re.match takes the pattern first, then the string to match against
            if re.match(pattern, json_data['title']):
                return json_data['text']

if __name__ == "__main__":
    json_data = uk_find()
    print(json_data)

result


{{redirect|UK}}
{{Basic information Country
|Abbreviated name=England

(Omitted because it is long)

[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]

Process finished with exit code 0

Impressions: It took me some time to understand the data format of the file read by gzip.open and the data format of json_data.
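As a side note on the json.loads / json.load distinction mentioned in the comment above, here is a minimal sketch (the file name sample.json is hypothetical):

import json

# json.loads: parse a JSON string that is already in memory
record = json.loads('{"title": "UK", "text": "..."}')
print(record['title'])

# json.load: parse JSON directly from an open file object
# (assumes a file named sample.json containing one JSON object)
with open('sample.json', 'rt') as f:
    record = json.load(f)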

021. Extract rows containing category names

Extract the line that declares the category name in the article.

category_021.py


from training.json_read_020 import uk_find
import re

if __name__=="__main__":
    pattern = re.compile(r'.*Category.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            print(line)

result


[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]

Process finished with exit code 0

Impressions: It took me a while to realize that combining a regular expression of the form .*search-string.* with lines.split('\n') lets you return the lines that contain the search string.
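As an aside, the leading and trailing .* are only needed because pattern.match() anchors at the start of the line; with search(), which scans anywhere in the line, the wrapping is unnecessary. A minimal sketch of the same filtering:

from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    pattern = re.compile(r'Category')   # no .* wrapping needed with search()
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.search(line):        # search() matches anywhere in the line
            print(line)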

022. Extraction of category name

Extract the article category names (by name, not line by line).

category_str_022.py


from training.json_read_020 import uk_find
import re

if __name__=="__main__":
    pattern = re.compile(r'.*Category:.*')
    pattern2 = re.compile(r'.*\|.*')
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            # note: lstrip/rstrip strip *sets of characters*, not literal prefixes/suffixes;
            # this happens to work here because the category names do not start or end
            # with any of the stripped characters
            strip_line = line.lstrip("[[Category:").rstrip("]]")
            if pattern2.match(strip_line):
                N = strip_line.find('|')
                strip_line2 = strip_line[:N]
                print(strip_line2)
            else:
                print(strip_line)

result


England
Commonwealth Kingdom
G8 member countries
European Union member states
Maritime nation
Sovereign country
Island country
States / Regions Established in 1801

Process finished with exit code 0

Impressions: The part I had to be clever about was that, after extracting a Category line, if it contains a |, I slice the string at the position of that |.
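For reference, a more compact alternative (a sketch, not the author's code) is to let the regular expression capture the category name directly, reusing uk_find from 020:

from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    # capture everything after "Category:" up to an optional "|..." or the closing "]]"
    pattern = re.compile(r'\[\[Category:(.*?)(?:\|.*)?\]\]')
    lines = uk_find()
    for line in lines.split('\n'):
        m = pattern.search(line)
        if m:
            print(m.group(1))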

023. Section structure

Display the section name and its level contained in the article (for example, 1 if "== section name ==").

section_023.py


import re
from training.json_read_020 import uk_find

if __name__=="__main__":
    pattern = re.compile(r'^=.*')
    pattern2 = re.compile(r'^={2}')
    pattern3 = re.compile(r'^={3}')
    pattern4 = re.compile(r'^={4}')

    lines=uk_find()
    for line in lines.split('\n'):
        if pattern.match(line):
            if pattern4.match(line):
                print(line.lstrip('====').rstrip('====')+':Level 4')
            elif pattern3.match(line):
                print(line.lstrip('===').rstrip('===')+':Level 3')
            elif pattern2.match(line):
                print(line.lstrip('==').rstrip('==')+':Level 2')
            else:
                print('no match')

result


Country name:Level 2
history:Level 2
Geography:Level 2
climate:Level 3
(Omitted because it is long)

Process finished with exit code 0

Impressions: After compiling four patterns, I first extracted the lines starting with =, and then branched the processing according to the level, but it feels quite brute-force. There is probably a better way.
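One such alternative (a sketch, following the same level numbering as the output above) is to capture the run of leading = characters with a single pattern and compute the level from its length:

from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    # "== name ==" is printed as level 2, "=== name ===" as level 3, and so on
    pattern = re.compile(r'^(={2,})\s*(.+?)\s*=+$')
    lines = uk_find()
    for line in lines.split('\n'):
        m = pattern.match(line)
        if m:
            print(m.group(2) + ':Level ' + str(len(m.group(1))))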

024. Extracting file references

Extract all the media files referenced in the article.

media_024.py


from training.json_read_020 import uk_find
import re

if __name__=="__main__":
    pattern = re.compile(r".*(File|File).*")
    pattern2 = re.compile(r"^|.*")
    lines = uk_find()
    for line in lines.split('\n'):
        if pattern2.search(line):
            line = line.lstrip('|')
        if pattern.search(line):
            start = line.find(':')+1
            end = line.find('|')
            print(line[start:end])

result


Royal Coat of Arms of the United Kingdom.svg
Battle of Waterloo 1815.PNG
The British Empire.png
Uk topo en.jpg
BenNevis2005.jpg
(Omitted because it is long)

Impressions: The part I had to be clever about was checking the markup quick reference to understand the format of the file references, and then choosing the slice positions to match that format.
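An alternative (a sketch, not the author's code) is to pull the file names out with re.findall over the whole text; it may return a slightly different set of matches than the line-based slicing above:

from training.json_read_020 import uk_find
import re

if __name__ == "__main__":
    text = uk_find()
    # capture the file name between "File:"/"ファイル:" and the next "|" or "]"
    for name in re.findall(r'(?:File|ファイル):(.+?)[|\]]', text):
        print(name)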
