[Python] I tried to solve the 2020 version of 100 language processing knocks [Chapter 3: Regular expressions 25-29]

This is the 6th article in a series solving "[Language processing 100 knocks 2020 edition](https://nlp100.github.io/ja/)" with Python (3.7). It is the teaching material of the programming basics study session, one of the new-employee training programs, created by Tohoku University's Inui/Okazaki Lab (currently the Inui/Suzuki Lab).

Since I studied Python on my own, there may be mistakes and more efficient ways of doing things. I would appreciate it if you could point out any improvements you find.

From Chapter 3 onward, there are many parts where I am not sure whether my answers are even correct, so please point out not only improvements but also anything that is wrong.

The source code is also available on GitHub.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

--Information for one article per line is stored in JSON format

--In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format

--The entire file is gzipped

Create a program that performs the following processing.

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

25.py


import pandas as pd
import re


def basic_info_extraction(text):
    # Collect the lines of the "basic information" template,
    # i.e. everything between "{{Basic information Country" and the closing "}}"
    texts = text.split("\n")
    index = texts.index("{{Basic information Country")
    basic_info = []
    for i in texts[index + 1:]:
        if i == "}}":
            break
        if i.find("|") != 0:
            # Not a new field: merge this continuation line into the previous entry
            basic_info[-1] += ", " + i
            continue
        basic_info.append(i)

    # Split each "|field = value" line into a field name and a value
    pattern = r"\|(.*)\s=(.*)"
    ans = {}
    for i in basic_info:
        result = re.match(pattern, i)
        ans[result.group(1)] = result.group(2).lstrip(" ")
    return ans


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    print(key, ":", value)

In order to extract the "basic information" template, we define basic_info_extraction(), which takes the text of the United Kingdom article as its argument.

This function collects the lines after "{{Basic information Country" into a list as the basic information, until the closing "}}" appears. However, there were cases where a single field's data spanned multiple lines. Since each field line starts with "|", I check this with .find("|"); a line that does not start with "|" is joined to the previous entry with a comma, so that each field ends up on one line.

Then, for each line of the extracted list, the part between "|" and the whitespace before "=" is taken as the "field name" and the part after "=" as the "value". After removing any leading spaces from the "value", the pair is stored in a dictionary object, which is returned.
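As a quick check, here is a minimal sketch of how the field pattern splits a single line (the sample line is hypothetical):

import re

# Hypothetical field line from the "basic information" template
line = "|Abbreviated name = United Kingdom"

result = re.match(r"\|(.*)\s=(.*)", line)
print(result.group(1))              # Abbreviated name
print(result.group(2).lstrip(" "))  # United Kingdom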

26. Removal of emphasis markup

As part of the processing of 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

26.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    pattern = r"(.*?)'{1,3}(.+?)'{1,3}(.*)"
    result = re.match(pattern, value)
    if result:
        return "".join(result.group(1, 2, 3))
    return value


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_emphasis(value)  # added
    print(key, ":", value)

We have prepared a function to remove the specified emphasis markup. Emphasis markup is represented by enclosing text in one to three consecutive quotes ('). Therefore, we specify r"(.*?)'{1,3}(.+?)'{1,3}(.*)" as the regular expression pattern, enclosing everything other than the quotes in parentheses to capture the text before, inside, and after the markup. The matched groups are listed, joined with the join method, and returned as the value.
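Note that re.match removes only the first emphasized span in a value. As an alternative, here is a minimal sketch using re.sub that strips every occurrence in one pass; it relies on the fact that MediaWiki emphasis uses two, three, or five consecutive quotes (weak emphasis, strong emphasis, and both combined). The sample value is hypothetical:

import re

def remove_emphasis_all(value):
    # Strip every run of 2-5 quotes used by MediaWiki emphasis markup
    return re.sub(r"'{2,5}", "", value)

print(remove_emphasis_all("established by the '''Acts of Union''' of ''1707''"))
# established by the Acts of Union of 1707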

27. Removal of internal links

In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

27.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # Links written as [[article|display]]: replace them with the display text
    pipe_pattern = r"(.*\[\[(.*?)\|(.+)\]\])"
    result = re.findall(pipe_pattern, value)
    if len(result) != 0:
        for i in result:
            pattern = "[[{}|{}]]".format(i[1], i[2])
            value = value.replace(pattern, i[2])
    # Links written as [[article]]: drop the brackets (except for file links)
    pattern = r"(\[\[(.+?)\]\])"
    result = re.findall(pattern, value)
    if len(result) != 0:
        for i in result:
            if "[[File:" not in value:
                value = value.replace(i[0], i[-1])
    return value


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_emphasis(value)
    value = remove_innerlink(value)  # added
    print(key, ":", value)

Next, I prepared a function to remove internal link markup. An internal link is represented by enclosing text in two consecutive square brackets, [[~]]. The markup may also contain a pipe, with the article name before it and the display text after it. Therefore, we specify r"(.*\[\[(.*?)\|(.+)\]\])" as the regular expression pattern and first extract the links that contain a pipe. Each such internal link is replaced with an internal link containing only the display text, which is then passed to the next step.

Since the internal links containing a pipe are now gone and only display text remains inside the brackets, we specify the pattern r"(\[\[(.+?)\]\])" and search for matches. If there are any, the markup is removed (except for file links) and the value is returned.
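As a quick check, the two steps behave as follows when the function above is called on hypothetical values (expected output in the comments):

print(remove_innerlink("[[English|English language]]"))  # English language
print(remove_innerlink("[[London]]"))                    # London
print(remove_innerlink("[[File:Uk map.svg]]"))           # left unchanged: file links are kept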

28. Removal of MediaWiki markup

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

28.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # See "27. Removal of internal links"
    ...


def remove_footnote(value):
    # See below
    ...


def remove_langage(value):
    # See below
    ...


def remove_temporarylink(value):
    # See below
    ...


def remove_zero(value):
    # See below
    ...


def remove_br(value):
    # See below
    ...


def remove_pipe(value):
    # See below
    ...


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_footnote(value)
    value = remove_emphasis(value)
    value = remove_innerlink(value)
    value = remove_langage(value)        # added
    value = remove_temporarylink(value)  # added
    value = remove_zero(value)           # added
    value = remove_br(value)             # added
    value = remove_pipe(value)           # added
    print(key, ":", value)

Removal of footnote comments

remove_footnote() function


def remove_footnote(value):
    # <ref ...>...</ref>: join the text before and after the footnote
    pattern = r"(.*?)(<ref.*?</ref>)(.*)"
    result = re.match(pattern, value)
    if result:
        return "".join(result.group(1, 3))
    # Self-closing footnotes such as <ref ... />
    pattern = r"<ref.*/>"
    value = re.sub(pattern, "", value)
    return value

First is the removal of footnotes. If the line received as an argument contains <ref ~ </ref>, the parts before and after it are joined and returned as the value. There is also a self-closing footnote notation, <ref ~ />, that this does not cover, so if one remains it is removed before returning.
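Note that the re.match branch joins around only the first <ref ~ </ref> pair. A minimal sketch with re.sub removes every footnote, both the paired and the self-closing form, in one call (the sample value is hypothetical):

import re

def remove_footnote_all(value):
    # Drop self-closing <ref ... /> tags and <ref ...>...</ref> pairs
    return re.sub(r"<ref[^>]*?/>|<ref.*?</ref>", "", value)

print(remove_footnote_all("244,820 km<ref>2015 estimate</ref><ref name=area />"))
# 244,820 km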

Language tag removal

remove_langage() function


def remove_langage(value):
    pattern = r"{{lang\|.*?\|(.*?)[}}|)]"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

Next is the language tag. A language tag is the part enclosed in {{lang ~}}. The display text comes after the pipe in the middle, so we enclose it in parentheses in the regular expression, extract it with the group() method, and return it as the value.
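As a quick check, calling the function above on a hypothetical value (the expected output is shown in the comment):

print(remove_langage("{{lang|fr|Dieu et mon droit}}"))
# Dieu et mon droit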

Removal of temporary links

remove_temporarylink() function


def remove_temporarylink(value):
    pattern = r"{{Temporary link\|.*\|(.*)}}"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

The removal of temporary links is almost the same as for language tags, although the pattern is slightly different. The part enclosed in {{Temporary link ~}} is used as the pattern, and if there is a match, the display text is extracted with the group() method.
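One caveat: re.match anchors at the start of the string, so a value with text before the template would not match. As an alternative, a minimal sketch using re.sub replaces every occurrence wherever it appears, keeping the last pipe-separated segment as the display text (the template name follows the wording used above; the sample value is hypothetical):

import re

def remove_temporarylink_all(value):
    # Replace {{Temporary link|...|display}} with its display text anywhere in the value
    return re.sub(r"{{Temporary link\|(?:[^|}]*\|)*([^|}]*)}}", r"\1", value)

print(remove_temporarylink_all("see {{Temporary link|Some article|en|display text}} here"))
# see display text here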

Removal of enclosed zeros

remove_zero() function


def remove_zero(value):
    pattern = r"\{\{0\}\}"
    value = re.sub(pattern, "", value)
    return value

There were many occurrences of something called {{0}}, whose purpose I did not understand, so if there is a match it is replaced with an empty string.

Removal of <br />

remove_br() function


def remove_br(value):
    pattern = r"<br />"
    value = re.sub(pattern, "", value)
    return value

The line feed tag <br /> was sometimes left at the end of a value, so, as with {{0}}, it is replaced with an empty string when it matches.

Pipe removal

remove_pipe() function


def remove_pipe(value):
    pattern = r".*\|(.*)"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

There were parts where a pipe remained, so if the pattern matches, only the display part after the pipe is returned.

If you do all this, you've removed the markup!

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API and convert the file reference to a URL.)

29.py


import pandas as pd
import re
import requests  # added


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # See "27. Removal of internal links"
    ...


def remove_footnote(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_langage(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_temporarylink(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_zero(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_br(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_pipe(value):
    # See "28. Removal of MediaWiki markup"
    ...


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
uk_data = {}  # holds the formatted field name -> value pairs
for key, value in ans.items():
    value = remove_footnote(value)
    value = remove_emphasis(value)
    value = remove_innerlink(value)
    value = remove_langage(value)
    value = remove_temporarylink(value)
    value = remove_zero(value)
    value = remove_br(value)
    value = remove_pipe(value)
    uk_data[key] = value

S = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "titles": "File:{}".format(uk_data["National flag image"])
}

R = S.get(url=url, params=params)
data = R.json()

pages = data["query"]["pages"]
for k, v in pages.items():
    print(v['imageinfo'][0]['url'])

We solve the problem using the data formatted up to 28. We use the Requests module, installed with $ pip install requests. Session is not strictly necessary here, but I use it anyway. We describe the URL and the parameters needed to solve the problem, then simply send a GET request and display the returned data.

...or so I would like to say, but the URL of the image I want is not included in the retrieved data, and an error occurs.

At the moment I have not been able to work out whether my code is wrong or the data has changed...
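For reference, one likely cause is that the request does not specify which imageinfo properties to return: by default the MediaWiki API does not include the file URL. Below is a minimal sketch that adds "iiprop": "url" to the parameters; I have not verified it against the data above, so treat it as an assumption:

params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",  # explicitly request the image URL
    "titles": "File:{}".format(uk_data["National flag image"])
}

R = S.get(url=url, params=params)
data = R.json()
for k, v in data["query"]["pages"].items():
    print(v["imageinfo"][0]["url"])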

Summary

In this article, I tried to solve the 100 language processing knocks 2020 edition, Chapter 3: Regular expressions, problems 25 to 29.

I was genuinely unsure how far the data should be formatted... It is hard to confirm that everything has been done properly, so I personally feel this is the most difficult part of language processing so far...

Since I am self-taught, I suspect my code differs considerably from how professionals would write it. I would appreciate it if you could show me better ways to write it.

Thank you!

Until last time

-I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
-I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
-Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
-I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
-Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
