[PYTHON] I tried to solve the 2020 version of 100 language processing knocks [Chapter 3: Regular expressions 20 to 24]

This is the 5th article in a series where I solve "100 Language Processing Knocks 2020" (https://nlp100.github.io/ja/), the teaching material for a programming fundamentals study session (part of our training for newcomers) created by Tohoku University's Inui/Okazaki Lab (currently the Inui/Suzuki Lab), in Python (3.7).

Since I studied Python on my own, there may be mistakes and more efficient ways of doing things. I would appreciate it if you could point out any improvements you find.

From Chapter 3 onward, there are many parts where I am not sure whether my answers are correct, so please point out not only improvements but also whether the answers themselves are right.

The source code is also available on GitHub.

Chapter 3: Regular Expressions

The file jawiki-country.json.gz exports Wikipedia articles in the following format:

- The information for one article is stored per line in JSON format

- In each line, a dictionary object holds the article name under the "title" key and the article body under the "text" key, and that object is written out as JSON

- The entire file is gzipped

Create a program that performs the following processing.

20. Read JSON data

Read the JSON file of Wikipedia articles and display the article text about "UK". For problems 21-29, run them against the article text extracted here.

20.py


import pandas as pd

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]  # "イギリス" (= UK): article titles in the dump are Japanese
print(uk_text)

In my experience pandas is very useful, so I use it here. If you pass a JSON-format string or a file path as the first argument of the pandas.read_json() function, it is read in as a pandas.DataFrame.

Since the file being read this time is compressed, `'infer'` is specified for the `compression` argument. With `compression='infer'`, the compression format is inferred from the file extension; `.gz`, `.bz2`, `.zip`, and `.xz` are supported.

By passing `lines=True`, each line is read as a separate JSON object.

I used the DataFrame's query() method to find the UK article. The query() method is convenient because it makes it easy to extract rows that satisfy a condition. Here, the rows whose `title` column equals "イギリス" are selected, and the value of their `text` column is output.
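
As a minimal sketch of these two steps, here is a hypothetical two-line JSON Lines sample (the "title"/"text" layout matches jawiki-country.json.gz, but the contents are made up):


import io
import pandas as pd

# Hypothetical JSON Lines data: one JSON object per line
sample = io.StringIO(
    '{"title": "A", "text": "article A"}\n'
    '{"title": "B", "text": "article B"}\n'
)
df = pd.read_json(sample, lines=True)              # parse line by line
print(df.query('title == "B"')['text'].values[0])  # -> article B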

21. Extract rows containing category names

Extract the lines that declare category names in the article.

21.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'\[\[Category:(.*)\]\]'
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is not None:
        print(line)

From here we will use regular expressions.

In `pattern`, a regular expression is specified to match lines of the form `[[Category:~]]`. To use regular expressions in Python, import the `re` module. The re.match() function takes the pattern you want to match as its first argument and the text you want to check as its second argument.

If the matching result is not None, the condition is met, so that line is output.
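
As a minimal sketch of this behavior (the category name here is made up):


import re

pattern = r'\[\[Category:(.*)\]\]'
# re.match() anchors at the start of the string and returns a match object
print(re.match(pattern, '[[Category:England]]'))
# <re.Match object; span=(0, 20), match='[[Category:England]]'>
# When nothing matches, it returns None
print(re.match(pattern, 'plain text'))  # None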

22. Extraction of category name

Extract the article category names (by name, not line by line).

22.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'\[\[Category:(.*)\]\]'
remove_pattern = r'\|.*'
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is None:
        continue
    ans = re.sub(remove_pattern, '', result.group(1))
    print(ans)

Earlier, whole category lines were output, but this time the task is to extract only the names. I do two things for each line.

First, check whether the line has a part matching `[[Category:~]]`, as in problem 21. Then, if it matches, take the captured name with the `|~` part removed.

With result.group(1), you can get the `(.*)` part of what matched `pattern = r'\[\[Category:(.*)\]\]'`. Answering with just this might be acceptable, but it includes some extras: some lines contain a pipe, such as `[[Category:欧州連合加盟国|元]]` ("European Union member states | former"). In such a case, result.group(1) alone yields `欧州連合加盟国|元`, so I want to remove the pipe part, which is unnecessary as a category name.

So, with the re.sub() function, specify the pattern you want to replace as the first argument, the replacement string as the second argument, and the target string as the third argument; the pipe part is replaced with an empty string and the result is output.
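
A minimal sketch of this two-step extraction, using the pipe example above:


import re

line = '[[Category:欧州連合加盟国|元]]'
name = re.match(r'\[\[Category:(.*)\]\]', line).group(1)
print(name)                       # 欧州連合加盟国|元 (still contains the sort key)
print(re.sub(r'\|.*', '', name))  # 欧州連合加盟国 (pipe part removed)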

23. Section structure

Display the section names contained in the article and their levels (for example, level 1 for "== section name ==").

23.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'^(==+)(.+)=+$'  # headings start with two or more '='
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is None:
        continue
    print(result.group(2).strip(' ='), len(result.group(1)) - 1)  # level = number of '=' minus 1

Headings are represented by putting two or more equals signs at the start of the line, such as `== heading ==`. (Reference: the MediaWiki markup quick reference.) Since the number of equals signs expresses the heading level (a level-1 heading has two), I match the heading, extract the run of equals at the start of the line with result.group(1) and the heading name with result.group(2), and print the name together with `len(result.group(1)) - 1` as the level.
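
A minimal sketch with made-up headings:


import re

pattern = r'^(==+)(.+)=+$'
for line in ['== History ==', '=== Early period ===']:
    result = re.match(pattern, line)
    if result:
        # strip() removes the spaces and trailing '=' left in group(2)
        print(result.group(2).strip(' ='), len(result.group(1)) - 1)
# History 1
# Early period 2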

24. Extracting file references

Extract all the media files referenced from the article.

24.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'(?:File|ファイル):(.+?)\|'  # the Japanese dump uses both prefixes
for line in uk_text.split("\n"):
    # re.finditer() never returns None; it yields one match per occurrence
    for match in re.finditer(pattern, line):
        print(match.group(1))

A file reference is expressed in the form `[[File:Wikipedia-logo-v2-ja.png|thumb|explanatory text]]`. Since it is not always at the beginning of the line, matching uses the prefix string as a clue; in the Japanese dump both `File:` and `ファイル:` appear, so the pattern accepts either.

This time, one line may contain multiple files, so the re.finditer() function is used. finditer() returns an iterator that yields a match object for every occurrence, and the media file name is extracted from each match in the line.
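
A minimal sketch with a made-up line containing two file references:


import re

line = '[[File:a.png|thumb|x]] text [[ファイル:b.svg|thumb|y]]'
for match in re.finditer(r'(?:File|ファイル):(.+?)\|', line):
    print(match.group(1))
# a.png
# b.svg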

Summary

In this article, I solved problems 20 to 24 of Chapter 3 (Regular Expressions) of the 2020 edition of 100 Language Processing Knocks.

When I was a student I handled regular expressions in Ruby, and I got the impression that Python's are a little quirkier. I struggled with shortest (non-greedy) matching every time, but I felt I was gradually getting used to it as I went along, so I am grateful for these carefully designed problems.
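
As an aside, a minimal illustration of the greedy vs. shortest match distinction, using a made-up line:


import re

s = '[[File:a.png|thumb|x]] [[File:b.png|thumb|y]]'
print(re.findall(r'File:(.+)\|', s))   # greedy .+ overshoots to the last '|'
# ['a.png|thumb|x]] [[File:b.png|thumb']
print(re.findall(r'File:(.+?)\|', s))  # non-greedy .+? stops at the first '|'
# ['a.png', 'b.png']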

I'm still immature, so if you have a better answer, please let me know! Thank you.

Continued

- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]

Until last time

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
