[PYTHON] I tried to solve the 2020 version of 100 language processing knocks [Chapter 3: Regular expressions 20 to 24]

This is the 5th article in a series where I solve "100 Language Processing Knocks 2020" (https://nlp100.github.io/ja/), the teaching material for a programming fundamentals study session (part of our training for newcomers) created by Tohoku University's Inui/Okazaki Lab (currently the Inui/Suzuki Lab), in Python (3.7).

Since I studied Python on my own, there may be mistakes and more efficient ways of doing things. I would appreciate it if you could point out any improvements you find.

From Chapter 3 onward, there are many parts where I am not sure whether my answers are correct, so please point out not only improvements but also whether the answers themselves are right.

The source code is also available on GitHub.

Chapter 3: Regular Expressions

The file jawiki-country.json.gz exports Wikipedia articles in the following format:

- The information for one article is stored per line in JSON format

- In each line, a dictionary object holds the article name under the "title" key and the article body under the "text" key, and that object is written out as JSON

- The entire file is gzipped

Create a program that performs the following processing.

20. Read JSON data

Read the JSON file of Wikipedia articles and display the article text about "UK". For problems 21-29, run them against the article text extracted here.

20.py


import pandas as pd

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]  # "イギリス" (= UK): article titles in the dump are Japanese
print(uk_text)

In my experience pandas is very useful, so I use it here. If you pass a JSON-format string or a file path as the first argument of the pandas.read_json() function, it is read in as a pandas.DataFrame.

Since the file being read this time is compressed, `'infer'` is specified for the `compression` argument. With `compression='infer'`, the compression format is inferred from the file extension; `.gz`, `.bz2`, `.zip`, and `.xz` are supported.

By passing `lines=True`, each line is read as a separate JSON object.

I used the DataFrame's query() method to find the UK article. The query() method is convenient because it makes it easy to extract rows that satisfy a condition. Here, the rows whose `title` column equals "イギリス" are selected, and the value of their `text` column is output.
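
As a minimal sketch of these two steps, here is a hypothetical two-line JSON Lines sample (the "title"/"text" layout matches jawiki-country.json.gz, but the contents are made up):


import io
import pandas as pd

# Hypothetical JSON Lines data: one JSON object per line
sample = io.StringIO(
    '{"title": "A", "text": "article A"}\n'
    '{"title": "B", "text": "article B"}\n'
)
df = pd.read_json(sample, lines=True)              # parse line by line
print(df.query('title == "B"')['text'].values[0])  # -> article B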

21. Extract rows containing category names

Extract the lines that declare category names in the article.

21.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'\[\[Category:(.*)\]\]'
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is not None:
        print(line)

From here we will use regular expressions.

In `pattern`, a regular expression is specified to match lines of the form `[[Category:~]]`. To use regular expressions in Python, import the `re` module. The re.match() function takes the pattern you want to match as its first argument and the text you want to check as its second argument.

If the matching result is not None, the condition is met, so that line is output.
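
As a minimal sketch of this behavior (the category name here is made up):


import re

pattern = r'\[\[Category:(.*)\]\]'
# re.match() anchors at the start of the string and returns a match object
print(re.match(pattern, '[[Category:England]]'))
# <re.Match object; span=(0, 20), match='[[Category:England]]'>
# When nothing matches, it returns None
print(re.match(pattern, 'plain text'))  # None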

22. Extraction of category name

Extract the article category names (by name, not line by line).

22.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'\[\[Category:(.*)\]\]'
remove_pattern = r'\|.*'
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is None:
        continue
    ans = re.sub(remove_pattern, '', result.group(1))
    print(ans)

Earlier, whole category lines were output, but this time the task is to extract only the names. I do two things for each line.

First, check whether the line has a part matching `[[Category:~]]`, as in problem 21. Then, if it matches, take the captured name with the `|~` part removed.

With result.group(1), you can get the `(.*)` part of what matched `pattern = r'\[\[Category:(.*)\]\]'`. Answering with just this might be acceptable, but it includes some extras: some lines contain a pipe, such as `[[Category:欧州連合加盟国|元]]` ("European Union member states | former"). In such a case, result.group(1) alone yields `欧州連合加盟国|元`, so I want to remove the pipe part, which is unnecessary as a category name.

So, with the re.sub() function, specify the pattern you want to replace as the first argument, the replacement string as the second argument, and the target string as the third argument; the pipe part is replaced with an empty string and the result is output.
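
A minimal sketch of this two-step extraction, using the pipe example above:


import re

line = '[[Category:欧州連合加盟国|元]]'
name = re.match(r'\[\[Category:(.*)\]\]', line).group(1)
print(name)                       # 欧州連合加盟国|元 (still contains the sort key)
print(re.sub(r'\|.*', '', name))  # 欧州連合加盟国 (pipe part removed)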

23. Section structure

Display the section names contained in the article and their levels (for example, level 1 for "== section name ==").

23.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'^(==+)(.+)=+$'  # headings start with two or more '='
for line in uk_text.split("\n"):
    result = re.match(pattern, line)
    if result is None:
        continue
    print(result.group(2).strip(' ='), len(result.group(1)) - 1)  # level = number of '=' minus 1

Headings are represented by putting two or more equals signs at the start of the line, such as `== heading ==`. (Reference: the MediaWiki markup quick reference.) Since the number of equals signs expresses the heading level (a level-1 heading has two), I match the heading, extract the run of equals at the start of the line with result.group(1) and the heading name with result.group(2), and print the name together with `len(result.group(1)) - 1` as the level.
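
A minimal sketch with made-up headings:


import re

pattern = r'^(==+)(.+)=+$'
for line in ['== History ==', '=== Early period ===']:
    result = re.match(pattern, line)
    if result:
        # strip() removes the spaces and trailing '=' left in group(2)
        print(result.group(2).strip(' ='), len(result.group(1)) - 1)
# History 1
# Early period 2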

24. Extracting file references

Extract all the media files referenced from the article.

24.py


import pandas as pd
import re

file_name = "jawiki-country.json.gz"
data_frame = pd.read_json(file_name, compression='infer', lines=True)
uk_text = data_frame.query('title == "イギリス"')['text'].values[0]

pattern = r'(?:File|ファイル):(.+?)\|'  # the Japanese dump uses both prefixes
for line in uk_text.split("\n"):
    # re.finditer() never returns None; it yields one match per occurrence
    for match in re.finditer(pattern, line):
        print(match.group(1))

A file reference is expressed in the form `[[File:Wikipedia-logo-v2-ja.png|thumb|explanatory text]]`. Since it is not always at the beginning of the line, matching uses the prefix string as a clue; in the Japanese dump both `File:` and `ファイル:` appear, so the pattern accepts either.

This time, one line may contain multiple files, so the re.finditer() function is used. finditer() returns an iterator that yields a match object for every occurrence, and the media file name is extracted from each match in the line.
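
A minimal sketch with a made-up line containing two file references:


import re

line = '[[File:a.png|thumb|x]] text [[ファイル:b.svg|thumb|y]]'
for match in re.finditer(r'(?:File|ファイル):(.+?)\|', line):
    print(match.group(1))
# a.png
# b.svg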

Summary

In this article, I solved problems 20 to 24 of Chapter 3 (Regular Expressions) of the 2020 edition of 100 Language Processing Knocks.

When I was a student I handled regular expressions in Ruby, and I got the impression that Python's are a little quirkier. I struggled with shortest (non-greedy) matching every time, but I felt I was gradually getting used to it as I went along, so I am grateful for these carefully designed problems.
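
As an aside, a minimal illustration of the greedy vs. shortest match distinction, using a made-up line:


import re

s = '[[File:a.png|thumb|x]] [[File:b.png|thumb|y]]'
print(re.findall(r'File:(.+)\|', s))   # greedy .+ overshoots to the last '|'
# ['a.png|thumb|x]] [[File:b.png|thumb']
print(re.findall(r'File:(.+?)\|', s))  # non-greedy .+? stops at the first '|'
# ['a.png', 'b.png']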

I'm still immature, so if you have a better answer, please let me know! Thank you.

Continued

- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]

Until last time

- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 00-04]
- I tried to solve the 2020 version of 100 language processing knocks [Chapter 1: Preparatory movement 05-09]
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
