100 Language Processing Knock with Python (Chapter 3)

Introduction

I'm working through 100 Language Processing Knock, a set of exercises published on the Web page of the Inui-Okazaki Laboratory at Tohoku University for training in natural language processing and Python. I plan to keep notes here on the code I implemented and on the techniques worth remembering along the way. The code will also be published on GitHub.

This is a continuation of Chapter 1, Chapter 2 Part 1, and Chapter 2 Part 2.

Excuses

Since I slacked off for a while, I ended up writing this article while rereading code I had written quite some time before, very much in the "the me of three days ago is a stranger" style. My proficiency changed considerably in the meantime, so reading my own code felt like a conversation with another person. There has been a gap between updates, but I hope you can still learn something from my missteps.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

- One article per line is stored in JSON format
- In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format
- The entire file is gzip-compressed

Create a program that performs the following processing.

20. Read JSON data

Read the JSON file of the Wikipedia articles and display the article text for "UK". In problems 21-29, execute against the article text extracted here.

Answer

20.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 20.py

import json

with open("jawiki-country.json") as f:
    article_json = f.readline()
    while article_json:
        article_dict = json.loads(article_json)
        if article_dict["title"] == u"England":
            print(article_dict["text"])
        article_json = f.readline()

Comment

The jawiki-country.json.gz used this time is 9.9 MB, which is fairly heavy, so I read it line by line with readline() and print only the "UK" article (other articles are passed over). My feeling is that readlines() would stall the program for a while, and even if the uses widened, this code would only ever need to operate on the "UK" article, so I implemented it like this.

In this data, each line of the file is written in JSON format. However, just loading the whole file with json.load() does not work well and the advantages of JSON cannot be exploited, so I used json.loads() to convert each line into the corresponding Python object (in this case, a plain dictionary).
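As an aside, the same line-by-line read can be written more idiomatically by iterating over the file object directly. A minimal sketch (assuming the decompressed jawiki-country.json, as above):

# A sketch of the same read as 20.py, iterating the file object directly
# instead of using an explicit readline() loop.
import json

with open("jawiki-country.json") as f:
    for line in f:                      # one JSON document per line
        article = json.loads(line)      # parse this line into a dict
        if article["title"] == u"England":
            print(article["text"])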

Modularization

From here on, the work of extracting just the "UK" article recurs for a while, so I modularized it as follows.

extract_from_json.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# extract_from_json.py

import json


def extract_from_json(title):
    with open("jawiki-country.json") as f:
        json_data = f.readline()
        while json_data:
            article_dict = json.loads(json_data)
            if article_dict["title"] == title:
                return article_dict["text"]
            else:
                json_data = f.readline()
    return ""

Unlike 20.py, this function takes the title as an argument and returns the article body as a string (or an empty string if no such title exists). Note that the later scripts import it under the name mymodule.
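Usage looks like this, a minimal sketch (the module must be importable as mymodule to match the imports in the later scripts):

# A sketch of using the module; later scripts import it as mymodule.
from mymodule import extract_from_json

text = extract_from_json(u"England")   # article body, or "" if not found
print(text.split("\n")[0])             # peek at the first line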

21. Extract lines containing category names

Extract the lines that declare the category names in the article.

Answer

21.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 21.py

from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")

for line in lines:
    if "Category" in line:
        print(line)

#In Python 3, this one-liner also displays readably (though as a list)
# print([line for line in lines if "Category" in line])

Comment

This is the regular expression chapter, but this solution uses no regular expressions. Well, this way is easier to understand... In short, it just prints the lines containing the string "Category".

Written as a list comprehension it fits in neatly, but in Python 2, simply printing a list containing Unicode strings escapes them, so the output is not readable as Japanese. This code also runs under Python 3, where it is displayed properly.
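For reference, a sketch of the list comprehension style that also displays readably in Python 2, by joining the matching lines into a single string before printing:

# Join the matching lines into one string before printing, so Python 2
# does not escape the Unicode elements of a list.
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")
print("\n".join(line for line in lines if "Category" in line))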

22. Extraction of category name

Extract the article category names (by name, not line by line).

Answer

22.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 22.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")

for line in lines:
    category_line = re.search("^\[\[Category:(.*?)(|\|.*)\]\]$", line)
    if category_line is not None:
        print(category_line.group(1))

Comment

First, extract the category lines as in 21., then pull out just the name using re.search(). re.search() returns a MatchObject instance if any part of the string given as the second argument matches the regular expression pattern given as the first argument. Leaving aside what a MatchObject looks like inside, you can use .group() to get the matched string. Here, category_line.group(0) holds the entire matched string (e.g. "[[Category:UK|*]]"), while category_line.group(1) holds the string matched by the first parenthesized group (e.g. "UK").
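A minimal sketch of the group numbering, using the pattern from 22.py on a single made-up category line:

import re

# The pattern from 22.py applied to one category line.
m = re.search(r"^\[\[Category:(.*?)(|\|.*)\]\]$", "[[Category:UK|*]]")
print(m.group(0))   # the whole match:         [[Category:UK|*]]
print(m.group(1))   # the first group (name):  UK
print(m.group(2))   # the second group:        |*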

As for the regular expression itself, which is the important part, the details are left to the official documentation; here I just follow the specific cases applied on this page. The category lines to be processed this time (the output of 21.py) are as follows.

Execution result of 21.py


$ python 21.py
[[Category:England|*]]
[[Category:Commonwealth Kingdom|*]]
[[Category:G8 member countries]]
[[Category:European Union member states]]
[[Category:Maritime nation]]
[[Category:Sovereign country]]
[[Category:Island country|Kureito Furiten]]
[[Category:States / Regions Established in 1801]]

Basically the form is [[Category:category name]], but some lines also specify a reading, separated by |. So the policy is:

- First, it starts with [[Category:
- Some string (the category name) comes next
- In some cases, a reading separated by | comes next
- Finally, it closes with ]]

Expressing this as a regular expression (I'm not sure it's optimal) gives "^\[\[Category:(.*?)(\|.*)*\]\]$".

| Intent | Actual regular expression | Commentary |
|---|---|---|
| Begins with `[[Category:` | `^\[\[Category:` | `^` anchors the beginning |
| Some string (the category name) comes | `(.*?)` | Shortest (non-greedy) match of any string |
| In some cases, a reading separated by `\|` | `(\|.*)*` | `(\|.*)*?` may be more appropriate |
| Finally, close with `]]` | `\]\]$` | `$` marks the end and may not be strictly necessary |
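As a quick check, a sketch running both the pattern from the table and the (\|.*)*? variant mentioned in the commentary over one of the lines above; both recover the category name:

import re

line = "[[Category:Island country|Kureito Furiten]]"
# Greedy repetition, as written in the table above.
print(re.search(r"^\[\[Category:(.*?)(\|.*)*\]\]$", line).group(1))
# Lazy repetition, the variant suggested in the commentary column.
print(re.search(r"^\[\[Category:(.*?)(\|.*)*?\]\]$", line).group(1))
# Both print: Island country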

23. Section structure

Display the section names contained in the article together with their levels (for example, level 1 for "== section name ==").

Answer

23.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 23.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")

for line in lines:
    section_line = re.search("^(=+)\s*(.*?)\s*(=+)$", line)
    if section_line is not None:
        print(section_line.group(2), len(section_line.group(1)) - 1)

Comment

The basic structure is the same as 22., but this time the targets are section names (e.g. == section name ==), so we pick those up. Since the notation fluctuates slightly (==section name==, == section name ==), the whitespace pattern \s* is inserted on both sides to absorb the difference. The section level corresponds to the length of the run of = (== is level 1, === is level 2, ...), so it is computed by taking that length and subtracting 1.
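A minimal sketch of how the pattern absorbs the spacing variants and how the level falls out of the length of the = run (the sample headings are made up):

import re

for line in ["==Section==", "== Section ==", "===Subsection==="]:
    m = re.search(r"^(=+)\s*(.*?)\s*(=+)$", line)
    print(m.group(2), len(m.group(1)) - 1)
# => Section 1 / Section 1 / Subsection 2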

24. Extracting file references

Extract all the media files referenced from the article.

Answer

24.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 24.py

import re
from mymodule import extract_from_json

lines = extract_from_json(u"England").split("\n")

for line in lines:
    file_line = re.search(u"(File|ファイル):(.*?)\|", line)
    if file_line is not None:
        print(file_line.group(2))

Comment

At first I extracted only the ones starting with File: and missed the rest... silly me.

The pattern is a Unicode string because Japanese (ファイル) appears in it; Unicode strings are allowed as regular expression patterns in Python. You often see examples written as raw strings like r"hogehoge"; at the very least they are not a must, but they do keep the escaping from doubling up and becoming hard to read. Furthermore, if you want to reuse a regular expression pattern repeatedly, it seems more efficient to compile it with re.compile(). However, recently used patterns are cached, so there was no need to worry about it this time.
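For completeness, a sketch of the precompiled form (same behavior as 24.py):

import re
from mymodule import extract_from_json

# Compile the pattern once and reuse it on every line; equivalent to
# calling re.search() with the pattern string, minus the cache lookup.
file_pattern = re.compile(u"(File|ファイル):(.*?)\|")

for line in extract_from_json(u"England").split("\n"):
    file_line = file_pattern.search(line)
    if file_line is not None:
        print(file_line.group(2))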

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

Answer

25.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 25.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))

for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = temp_line.group(2)

for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)

Comment

~~The template fields are included in the form |field name = value, so this is a regular expression that matches that. As above, writing ^\|(.*?)\s=\s(.*) makes the first parenthesized group the field name and the second the value, and they are stored into the dictionary.~~

Basically, each template field is stored **on its own line** in the form |field name = value, but the official country name was a little troublesome.

Official country name


|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>
*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>
*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>
*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>
*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>
**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>

As shown above, it spans multiple lines, line breaks included, so this part has to be handled well; in 25.py that is the job of re.split() and the re.S flag, as in the sketch below.
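A minimal sketch of the two pieces in 25.py that handle this: re.split() cuts the template apart at a newline followed by | or }, and the re.S flag lets . match newlines inside a multi-line value (the toy input string here is made up):

import re

# Made-up toy input: one single-line field and one field whose value
# spans two lines, terminated by "}}" as in the template.
text = "|a = 1\n|b = first\nsecond\n}}"

for chunk in re.split(r"\n[\|}]", text):            # cut at "\n|" or "\n}"
    m = re.search(r"^(.*?)\s=\s(.*)", chunk, re.S)  # re.S: "." matches "\n"
    if m is not None:
        print(repr(m.group(1)), repr(m.group(2)))
# => '|a' '1'
# => 'b' 'first\nsecond'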

In the end

I got it working through a fair amount of trial and error.

For the time being, I check the contents by looping with for and printing, and here too Python 3 is recommended. Python 3 is just convenient in all sorts of ways...

26. Removal of emphasis markup

In addition to the processing of 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (Reference: Markup quick reference, https://ja.wikipedia.org/wiki/Help:早見表).

Answer

26.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 26.py

import re
from mymodule import extract_from_json

temp_dict = {}
lines = re.split(r"\n[\|}]", extract_from_json(u"England"))

for line in lines:
    temp_line = re.search("^(.*?)\s=\s(.*)", line, re.S)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = re.sub(r"'{2,5}", r"", temp_line.group(2))

# As with 25.py, Python 3 is recommended here too
for k, v in sorted(temp_dict.items(), key=lambda x: x[1]):
    print(k, v)

Comment

re.sub() is a function that replaces the parts that match a regular expression. This time it deletes runs of 2 to 5 ' characters: writing {n,m} in a regular expression means the preceding character repeated at least n and at most m times. ~~Well, I feel like I could simply have removed every ' this time...~~
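A one-line illustration (the sample string is made up):

import re

# "{2,5}" = the preceding character repeated 2 to 5 times, so runs of
# quotes ('' weak emphasis, ''' emphasis, ''''' strong emphasis) are
# deleted while a lone apostrophe inside a word survives.
print(re.sub(r"'{2,5}", r"", "'''bold''' and it's fine"))
# => bold and it's fine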

27. Removal of internal links

In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: Markup quick reference, https://ja.wikipedia.org/wiki/Help:早見表).

Answer

27.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 27.py

import re
from mymodule import extract_from_json


def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")

for line in lines:
    category_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if category_line is not None:
        temp_dict[category_line.group(1)] = remove_markup(category_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)
    

Comment

I created a function remove_markup() that removes the markup.

| Line | Target removed |
|---|---|
| 1st line | Emphasis (as in 26.) |
| 2nd line | Internal links |

How to write internal links

There are three forms, but all of them follow the rule that "the article name starts after [[ and ends at some symbol (]], |, or #)", so I wrote the regular expression around that rule; a sketch of it in action follows.
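A sketch of the internal-link substitution from remove_markup() applied to the three forms (the sample link texts are made up):

import re

# Group 2 keeps the displayed text (the article name itself when there
# is no "|display text" part); everything before the last "|" is dropped.
for s in ["[[Article name]]",
          "[[Article name|Display text]]",
          "[[Article name#Section|Display text]]"]:
    print(re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", s))
# => Article name / Display text / Display text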

28. Removal of MediaWiki markup

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible, and format the basic country information.

Answer

28.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 28.py

import re
from mymodule import extract_from_json


def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")

for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

for k, v in sorted(temp_dict.items(), key=lambda x: x[0]):
    print(k, v)

Comment

In addition to 27.,

| Line | Target removed |
|---|---|
| 1st line | Emphasis (as in 26.) |
| 2nd line | Internal links (as in 27.) |
| 3rd line | Language-specified notation such as `{{lang\|en\|...}}` (though it is not in the markup chart) |
| 4th line | Comments (anything of the form <...>) |
| 5th line | External links |

I rewrote remove_markup() to remove all of these.

29. Get the URL of the national flag image

Use the contents of the template to obtain the URL of the national flag image. (Hint: call imageinfo of the MediaWiki API to convert the file reference into a URL.)

Answer

29.py


#!/usr/bin/env python
# -*- coding: utf-8 -*-
# 29.py

import re
import requests
from mymodule import extract_from_json


def json_search(json_data):
    ret_dict = {}
    for k, v in json_data.items():
        if isinstance(v, list):
            for e in v:
                ret_dict.update(json_search(e))
        elif isinstance(v, dict):
            ret_dict.update(json_search(v))
        else:
            ret_dict[k] = v
    return ret_dict


def remove_markup(str):
    str = re.sub(r"'{2,5}", r"", str)
    str = re.sub(r"\[{2}([^|\]]+?\|)*(.+?)\]{2}", r"\2", str)
    str = re.sub(r"\{{2}.+?\|.+?\|(.+?)\}{2}", r"\1 ", str)
    str = re.sub(r"<.*?>", r"", str)
    str = re.sub(r"\[.*?\]", r"", str)
    return str

temp_dict = {}
lines = extract_from_json(u"England").split("\n")

for line in lines:
    temp_line = re.search("^\|(.*?)\s=\s(.*)", line)
    if temp_line is not None:
        temp_dict[temp_line.group(1)] = remove_markup(temp_line.group(2))

url = "https://en.wikipedia.org/w/api.php"
payload = {"action": "query",
           "titles": "File:{}".format(temp_dict[u"National flag image"]),
           "prop": "imageinfo",
           "format": "json",
           "iiprop": "url"}

json_data = requests.get(url, params=payload).json()

print(json_search(json_data)["url"])

Comment

How do you call an API from Python? When I looked into it, it seemed rather involved...

...Well, to get straight to the conclusion, requests seems to be the recommendation. From the official Python 3 documentation:

For a higher-level HTTP client interface, the Requests package is recommended.

Or, from the official Requests documentation:

Requests: HTTP for Humans (omitted) Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken.

So Requests is recommended, in strong terms.

For details on how to use it, refer to the official documentation; this time I received the result of the API call as JSON and processed it. The structure of the returned JSON was complicated, so I searched through the whole thing recursively to find the part where the URL was written.
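A toy run of json_search() (the nested structure below is made up, shaped roughly like the API response):

def json_search(json_data):
    # Same recursive flattener as in 29.py above: walk nested dicts and
    # lists and collect every scalar value into one flat dict.
    ret_dict = {}
    for k, v in json_data.items():
        if isinstance(v, list):
            for e in v:
                ret_dict.update(json_search(e))
        elif isinstance(v, dict):
            ret_dict.update(json_search(v))
        else:
            ret_dict[k] = v
    return ret_dict

# Made-up nested structure for illustration.
nested = {"query": {"pages": {"123": {"imageinfo": [{"url": "https://example.org/flag.png"}]}}}}
print(json_search(nested)["url"])   # => https://example.org/flag.png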

In conclusion

Continued in Chapter 4.
