[Python] I tried to solve the 2020 version of 100 language processing knocks [Chapter 3: Regular expressions 25-29]

This is the 6th article in a series solving "[Language processing 100 knocks 2020 edition](https://nlp100.github.io/ja/)" with Python (3.7). It is the teaching material of the programming basics study session, one of the new-employee training programs, created by Tohoku University's Inui/Okazaki Lab (currently the Inui/Suzuki Lab).

Since I studied Python on my own, there may be mistakes and more efficient ways of doing things. I would appreciate it if you could point out any improvements you find.

From Chapter 3 onward, there are many parts where I am not sure whether my answers are even correct, so please point out not only improvements but also anything that is wrong.

The source code is also available on GitHub.

Chapter 3: Regular Expressions

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

--Information for one article per line is stored in JSON format

--In each line, the article name is stored under the "title" key and the article body under the "text" key of a dictionary object, and that object is written out in JSON format

--The entire file is gzipped

Create a program that performs the following processing.

25. Template extraction

Extract the field names and values of the "basic information" template included in the article and store them as a dictionary object.

25.py


import pandas as pd
import re


def basic_info_extraction(text):
    # Collect the lines of the "basic information" template,
    # i.e. everything between "{{Basic information Country" and the closing "}}"
    texts = text.split("\n")
    index = texts.index("{{Basic information Country")
    basic_info = []
    for i in texts[index + 1:]:
        if i == "}}":
            break
        if i.find("|") != 0:
            # Not a new field: merge this continuation line into the previous entry
            basic_info[-1] += ", " + i
            continue
        basic_info.append(i)

    # Split each "|field = value" line into a field name and a value
    pattern = r"\|(.*)\s=(.*)"
    ans = {}
    for i in basic_info:
        result = re.match(pattern, i)
        ans[result.group(1)] = result.group(2).lstrip(" ")
    return ans


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    print(key, ":", value)

In order to extract the "basic information" template, we define basic_info_extraction(), which takes the text of the United Kingdom article as its argument.

This function collects the lines after "{{Basic information Country" into a list as the basic information, until the closing "}}" appears. However, there were cases where a single field's data spanned multiple lines. Since each field line starts with "|", I check this with .find("|"); a line that does not start with "|" is joined to the previous entry with a comma, so that each field ends up on one line.

Then, for each line of the extracted list, the part between "|" and the whitespace before "=" is taken as the "field name" and the part after "=" as the "value". After removing any leading spaces from the "value", the pair is stored in a dictionary object, which is returned.
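As a quick check, here is a minimal sketch of how the field pattern splits a single line (the sample line is hypothetical):

import re

# Hypothetical field line from the "basic information" template
line = "|Abbreviated name = United Kingdom"

result = re.match(r"\|(.*)\s=(.*)", line)
print(result.group(1))              # Abbreviated name
print(result.group(2).lstrip(" "))  # United Kingdom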

26. Removal of emphasis markup

As part of the processing of 25, remove MediaWiki's emphasis markup (all of weak emphasis, emphasis, and strong emphasis) from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

26.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    pattern = r"(.*?)'{1,3}(.+?)'{1,3}(.*)"
    result = re.match(pattern, value)
    if result:
        return "".join(result.group(1, 2, 3))
    return value


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_emphasis(value)  # added
    print(key, ":", value)

We have prepared a function to remove the specified emphasis markup. Emphasis markup is represented by enclosing text in one to three consecutive quotes ('). Therefore, we specify r"(.*?)'{1,3}(.+?)'{1,3}(.*)" as the regular expression pattern, enclosing everything other than the quotes in parentheses to capture the text before, inside, and after the markup. The matched groups are listed, joined with the join method, and returned as the value.
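Note that re.match removes only the first emphasized span in a value. As an alternative, here is a minimal sketch using re.sub that strips every occurrence in one pass; it relies on the fact that MediaWiki emphasis uses two, three, or five consecutive quotes (weak emphasis, strong emphasis, and both combined). The sample value is hypothetical:

import re

def remove_emphasis_all(value):
    # Strip every run of 2-5 quotes used by MediaWiki emphasis markup
    return re.sub(r"'{2,5}", "", value)

print(remove_emphasis_all("established by the '''Acts of Union''' of ''1707''"))
# established by the Acts of Union of 1707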

27. Removal of internal links

In addition to the processing of 26, remove MediaWiki's internal link markup from the template values and convert them to text (Reference: [Markup Quick Reference](http://ja.wikipedia.org/wiki/Help:%E6%97%A9%E8%A6%8B%E8%A1%A8)).

27.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # Links written as [[article|display]]: replace them with the display text
    pipe_pattern = r"(.*\[\[(.*?)\|(.+)\]\])"
    result = re.findall(pipe_pattern, value)
    if len(result) != 0:
        for i in result:
            pattern = "[[{}|{}]]".format(i[1], i[2])
            value = value.replace(pattern, i[2])
    # Links written as [[article]]: drop the brackets (except for file links)
    pattern = r"(\[\[(.+?)\]\])"
    result = re.findall(pattern, value)
    if len(result) != 0:
        for i in result:
            if "[[File:" not in value:
                value = value.replace(i[0], i[-1])
    return value


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_emphasis(value)
    value = remove_innerlink(value)  # added
    print(key, ":", value)

Next, I prepared a function to remove internal link markup. An internal link is represented by enclosing text in two consecutive square brackets, [[~]]. The markup may also contain a pipe, with the article name before it and the display text after it. Therefore, we specify r"(.*\[\[(.*?)\|(.+)\]\])" as the regular expression pattern and first extract the links that contain a pipe. Each such internal link is replaced with an internal link containing only the display text, which is then passed to the next step.

Since the internal links containing a pipe are now gone and only display text remains inside the brackets, we specify the pattern r"(\[\[(.+?)\]\])" and search for matches. If there are any, the markup is removed (except for file links) and the value is returned.
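As a quick check, the two steps behave as follows when the function above is called on hypothetical values (expected output in the comments):

print(remove_innerlink("[[English|English language]]"))  # English language
print(remove_innerlink("[[London]]"))                    # London
print(remove_innerlink("[[File:Uk map.svg]]"))           # left unchanged: file links are kept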

28. Removal of MediaWiki markup

In addition to the processing of 27, remove MediaWiki markup from the template values as much as possible and format the basic country information.

28.py


import pandas as pd
import re


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # See "27. Removal of internal links"
    ...


def remove_footnote(value):
    # See below
    ...


def remove_langage(value):
    # See below
    ...


def remove_temporarylink(value):
    # See below
    ...


def remove_zero(value):
    # See below
    ...


def remove_br(value):
    # See below
    ...


def remove_pipe(value):
    # See below
    ...


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
for key, value in ans.items():
    value = remove_footnote(value)
    value = remove_emphasis(value)
    value = remove_innerlink(value)
    value = remove_langage(value)        # added
    value = remove_temporarylink(value)  # added
    value = remove_zero(value)           # added
    value = remove_br(value)             # added
    value = remove_pipe(value)           # added
    print(key, ":", value)

Removal of footnote comments

remove_footnote() function


def remove_footnote(value):
    # <ref ...>...</ref>: join the text before and after the footnote
    pattern = r"(.*?)(<ref.*?</ref>)(.*)"
    result = re.match(pattern, value)
    if result:
        return "".join(result.group(1, 3))
    # Self-closing footnotes such as <ref ... />
    pattern = r"<ref.*/>"
    value = re.sub(pattern, "", value)
    return value

First is the removal of footnotes. If the line received as an argument contains <ref ~ </ref>, the parts before and after it are joined and returned as the value. There is also a self-closing footnote notation, <ref ~ />, that this does not cover, so if one remains it is removed before returning.
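Note that the re.match branch joins around only the first <ref ~ </ref> pair. A minimal sketch with re.sub removes every footnote, both the paired and the self-closing form, in one call (the sample value is hypothetical):

import re

def remove_footnote_all(value):
    # Drop self-closing <ref ... /> tags and <ref ...>...</ref> pairs
    return re.sub(r"<ref[^>]*?/>|<ref.*?</ref>", "", value)

print(remove_footnote_all("244,820 km<ref>2015 estimate</ref><ref name=area />"))
# 244,820 km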

Language tag removal

remove_langage() function


def remove_langage(value):
    pattern = r"{{lang\|.*?\|(.*?)[}}|)]"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

Next is the language tag. A language tag is the part enclosed in {{lang ~}}. The display text comes after the pipe in the middle, so we enclose it in parentheses in the regular expression, extract it with the group() method, and return it as the value.
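As a quick check, calling the function above on a hypothetical value (the expected output is shown in the comment):

print(remove_langage("{{lang|fr|Dieu et mon droit}}"))
# Dieu et mon droit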

Removal of temporary links

remove_temporarylink() function


def remove_temporarylink(value):
    pattern = r"{{Temporary link\|.*\|(.*)}}"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

The removal of temporary links is almost the same as for language tags, although the pattern is slightly different. The part enclosed in {{Temporary link ~}} is used as the pattern, and if there is a match, the display text is extracted with the group() method.
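One caveat: re.match anchors at the start of the string, so a value with text before the template would not match. As an alternative, a minimal sketch using re.sub replaces every occurrence wherever it appears, keeping the last pipe-separated segment as the display text (the template name follows the wording used above; the sample value is hypothetical):

import re

def remove_temporarylink_all(value):
    # Replace {{Temporary link|...|display}} with its display text anywhere in the value
    return re.sub(r"{{Temporary link\|(?:[^|}]*\|)*([^|}]*)}}", r"\1", value)

print(remove_temporarylink_all("see {{Temporary link|Some article|en|display text}} here"))
# see display text here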

Removal of enclosed zeros

remove_zero() function


def remove_zero(value):
    pattern = r"\{\{0\}\}"
    value = re.sub(pattern, "", value)
    return value

There were many occurrences of something called {{0}}, whose purpose I did not understand, so if there is a match it is replaced with an empty string.

Removal of <br />

remove_br() function


def remove_br(value):
    pattern = r"<br />"
    value = re.sub(pattern, "", value)
    return value

The line feed tag <br /> was sometimes left at the end of a value, so, as with {{0}}, it is replaced with an empty string when it matches.

Pipe removal

remove_pipe() function


def remove_pipe(value):
    pattern = r".*\|(.*)"
    result = re.match(pattern, value)
    if result:
        return result.group(1)
    return value

There were parts where a pipe remained, so if the pattern matches, only the display part after the pipe is returned.

If you do all this, you've removed the markup!

29. Get the URL of the national flag image

Use the contents of the template to get the URL of the national flag image. (Hint: call imageinfo in the MediaWiki API and convert the file reference to a URL.)

29.py


import pandas as pd
import re
import requests  # added


def basic_info_extraction(text):
    # See "25. Template extraction"
    ...


def remove_emphasis(value):
    # See "26. Removal of emphasis markup"
    ...


def remove_innerlink(value):
    # See "27. Removal of internal links"
    ...


def remove_footnote(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_langage(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_temporarylink(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_zero(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_br(value):
    # See "28. Removal of MediaWiki markup"
    ...


def remove_pipe(value):
    # See "28. Removal of MediaWiki markup"
    ...


file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query("title == 'England'")['text'].values[0]

ans = basic_info_extraction(uk_text)
uk_data = {}  # holds the formatted field name -> value pairs
for key, value in ans.items():
    value = remove_footnote(value)
    value = remove_emphasis(value)
    value = remove_innerlink(value)
    value = remove_langage(value)
    value = remove_temporarylink(value)
    value = remove_zero(value)
    value = remove_br(value)
    value = remove_pipe(value)
    uk_data[key] = value

S = requests.Session()
url = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "titles": "File:{}".format(uk_data["National flag image"])
}

R = S.get(url=url, params=params)
data = R.json()

pages = data["query"]["pages"]
for k, v in pages.items():
    print(v['imageinfo'][0]['url'])

We solve the problem using the data formatted up to 28. We use the Requests module, installed with $ pip install requests. Session is not strictly necessary here, but I use it anyway. We describe the URL and the parameters needed to solve the problem, then simply send a GET request and display the returned data.

...or so I would like to say, but the URL of the image I want is not included in the retrieved data, and an error occurs.

At the moment I have not been able to work out whether my code is wrong or the data has changed...
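For reference, one likely cause is that the request does not specify which imageinfo properties to return: by default the MediaWiki API does not include the file URL. Below is a minimal sketch that adds "iiprop": "url" to the parameters; I have not verified it against the data above, so treat it as an assumption:

params = {
    "action": "query",
    "format": "json",
    "prop": "imageinfo",
    "iiprop": "url",  # explicitly request the image URL
    "titles": "File:{}".format(uk_data["National flag image"])
}

R = S.get(url=url, params=params)
data = R.json()
for k, v in data["query"]["pages"].items():
    print(v["imageinfo"][0]["url"])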

Summary

In this article, I tried to solve the 100 language processing knocks 2020 edition, Chapter 3: Regular expressions, problems 25 to 29.

I was genuinely unsure how far the data should be formatted... It is hard to confirm that everything has been done properly, so I personally feel this is the most difficult part of language processing so far...

Since I am self-taught, I suspect my code differs considerably from how professionals would write it. I would appreciate it if you could show me better ways to write it.

Thank you!

Until last time

-I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
-I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
-Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
-I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
-Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
