[Python] Regular Expressions Learned in Chapter 3 of 100 Language Processing Knock

Introduction

I am working through the 100 Language Processing Knock exercises in a study session centered on members of my company. This article is a summary of my answer code and the tricks I found useful while solving the problems. Most of the content was investigated and verified by myself, but it also includes information shared by other study group members.

This time the topic is regular expressions, and the problems ranged from basic material to tough, difficult ones.

I hadn't had a chance to learn regular expressions in earnest before, so I reviewed the basics in advance. The official Python documentation for the re module and the WWW Creators pages are well organized and easy to read, which helped me. There are also tutorial-style articles by the Ruby expert Mr. Ito that may be helpful when you get stuck.

Series

- Unix Commands Learned in Chapter 2 of 100 Language Processing Knock
- Regular Expressions Learned in Chapter 3 of 100 Language Processing Knock (this article)
- Morphological Analysis Learned in Chapter 4 of 100 Language Processing Knock

Environment

Code

Preprocessing

Place jawiki-country.json.gz in the same directory as your .ipynb (or .py) file and run:

!gzip -d jawiki-country.json.gz

This decompresses the gzip archive, leaving jawiki-country.json in the same directory.
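Alternatively, if you prefer to decompress from Python rather than the shell, here is a minimal sketch using only the standard library (assuming the archive is in the current directory):

import gzip
import shutil

# Decompress jawiki-country.json.gz into jawiki-country.json
with gzip.open('jawiki-country.json.gz', 'rb') as f_in, \
        open('jawiki-country.json', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)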

20. Read JSON data

import json

def load_uk():
    with open('jawiki-country.json') as file:
        for item in file:
            d = json.loads(item)

            if d['title'] == 'England':
                return d['text']

print(load_uk())

result


{{redirect|UK}}
{{Basic information Country
|Abbreviated name=England
|Japanese country name=United Kingdom of Great Britain and Northern Ireland
|Official country name= {{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>
...

21. Extract rows containing category names

import re

def extract_categ_lines():
    return re.findall(r'.*Category.*', load_uk())

print(*extract_categ_lines())

result


[[Category:England|*]] [[Category:Commonwealth Kingdom|*]] [[Category:G8 member countries]] [[Category:European Union member states]] [[Category:Maritime nation]] [[Category:Sovereign country]] [[Category:Island country|Kureito Furiten]] [[Category:States / Regions Established in 1801]]

Since the problem statement says "extract lines", you might feel like using split() to separate the return value of load_uk() into lines, but such processing is not essential. Since . represents "any character except a newline", the pattern above retrieves only the lines containing the string Category.
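As a quick sanity check (on a toy string, not the dataset), you can confirm that . stops at newlines, so the pattern matches at most one line at a time:

import re

text = 'first line\n[[Category:England|*]]\nlast line'
print(re.findall(r'.*Category.*', text))
# => ['[[Category:England|*]]']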

22. Extraction of category name

def extract_categs():
    return re.findall(r'.*Category:(.*?)\]', load_uk())
    
print(*extract_categs())

result


England|*Commonwealth Kingdom|*G8 member states European Union member states Maritime nations Sovereign countries Island countries|Kureito Furiten A state / region established in 1801

Enclosing part of the pattern in () captures the string immediately after Category:. However, I added ? after * to make it non-greedy (shortest match) so that the ] after it would not be captured together.
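To see the difference, here is a comparison of greedy and non-greedy matching on a single category line (a toy example):

import re

line = '[[Category:Island country|Kureito Furiten]]'
print(re.findall(r'Category:(.*)\]', line))   # greedy: ['Island country|Kureito Furiten]']
print(re.findall(r'Category:(.*?)\]', line))  # non-greedy: ['Island country|Kureito Furiten']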

23. Section structure

def extract_sects():
    tuples = re.findall(r'(={2,})\s*([^\s=]+).*', load_uk())

    sects = []
    for t in tuples:
        if t[0] == '==':
            sects.append([t[1], 1])
        elif t[0] == '===':
            sects.append([t[1], 2])
        elif t[0] == '====':
            sects.append([t[1], 3])

    return sects

print(*extract_sects())

result


['Country name', 1] ['history', 1] ['Geography', 1] ['climate', 2] ['Politics', 1] ['Diplomacy and military', 1] ['Local administrative division', 1] ['Major cities', 2] ['Science and technology', 1] ['Economy', 1] ['Mining', 2] ['Agriculture', 2] ['Trade', 2] ['currency', 2] ['Company', 2] ['traffic', 1] ['road', 2] ['Railroad', 2] ['Shipping', 2] ['Aviation', 2] ['communication', 1] ['People', 1] ['language', 2] ['religion', 2] ['marriage', 2] ['education', 2] ['culture', 1] ['Food culture', 2] ['literature', 2] ['philosophy', 2] ['music', 2] ['British popular music', 3] ['movies', 2] ['comedy', 2] ['national flower', 2] ['world Heritage', 2] ['Holidays', 2] ['Sports', 1] ['Football', 2] ['Horse racing', 2] ['Motorsport', 2] ['footnote', 1] ['Related item', 1] ['External link', 1]

Since [^\s=] means "any character that is neither whitespace nor =", [^\s=]+ means "one or more such characters". This is the section name, the first value we actually want. In addition, enclosing ={2,} in () captures "two or more = in a row".

For each match, findall() returns one tuple containing the above two values, so I used that to build the return value with a for statement. The format of the return value is not specified, so I chose a list of lists.
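For example, applying the pattern to a single heading line (a toy input, not from the dataset) returns one such tuple:

import re

print(re.findall(r'(={2,})\s*([^\s=]+).*', '==History=='))
# => [('==', 'History')]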

By the way, the above code first prepares a list called sects, fills it with elements, and returns it at the end, but using yield lets us write this a little more concisely.

def extract_sects_2():
    tuples = re.findall(r'(={2,})\s*([^\s=]+).*', load_uk())

    for t in tuples:
        if t[0] == '==':
            yield [t[1], 1]
        elif t[0] == '===':
            yield [t[1], 2]
        elif t[0] == '====':
            yield [t[1], 3]

print(*extract_sects_2())

A function using yield is called a generator and returns an iterator (a generator iterator; see the Python documentation's glossary entry for generator for details). As above, the iterator can be expanded with a * in front of it, or converted to a list by passing it to list().
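Here is a toy generator, just to illustrate both usages (not part of the problem's code):

def count_up():
    yield 1
    yield 2
    yield 3

print(*count_up())       # => 1 2 3
print(list(count_up()))  # => [1, 2, 3]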

24. Extracting file references

def extract_media_files():
    return re.findall(r'(?:File|file):(.+?)\|', load_uk())

extract_media_files()

result


['Royal Coat of Arms of the United Kingdom.svg',
 'Battle of Waterloo 1815.PNG',
 'The British Empire.png',
 'Uk topo en.jpg',
 'BenNevis2005.jpg',
 ...

I had to use () to express "File or file", but I didn't want to capture this group, so I wrote ?: immediately after the ( to make it non-capturing.
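The difference is easy to see with a toy string: with a capturing group, findall() returns tuples that include the File/file part, while (?:...) keeps the group out of the results:

import re

s = 'File:a.svg| file:b.png|'
print(re.findall(r'(File|file):(.+?)\|', s))    # => [('File', 'a.svg'), ('file', 'b.png')]
print(re.findall(r'(?:File|file):(.+?)\|', s))  # => ['a.svg', 'b.png']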

By the way, the reason print() is not used on the last line is that if you return a list or the like here, Jupyter formats it nicely before outputting it.

25. Template extraction

def extract_template():
    data = re.search(r'\{\{Basic information.*\n\}\}', load_uk(), re.DOTALL).group()
    tuples = re.findall(r'\n\|(.+?)\s=\s(.+?)(?:(?=\n\|)|(?=\}\}\n))', data, re.DOTALL)

    return dict(tuples)

extract_template()

result



{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>\n*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>\n*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>\n*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>\n*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>\n*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>\n**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>',
 ...
 'International call number': '44'}

Personally, this was the hardest problem in this chapter. Since the struggle taught me a lot, I will write a bit more about the core of the problem, the re.findall() call. When thinking about what regular expression to use here, probably the first thing that comes to mind is code like this:

re.findall(r'\n\|(.+)\s=\s(.+)\n\|', data)

However, if you run this, you will find that the even-numbered entries you wanted, such as 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland', are missing. The cause becomes clear when you consider a string like the following:

\n|Abbreviated name=England\n|Japanese country name

Against this string, the above regular expression matches \n|Abbreviated name=England\n|, and the search for the next match resumes from the string beginning with "Japanese country name". The \n| in front of "Japanese country name" has already been "consumed" by the previous match, so it is not available again, and that is why the regex fails to match this entry.

So what can we do? There is a technique called "lookahead" that prevents the trailing \n| from being consumed. In code, enclose \n\| in (?=...) and write the following:

re.findall(r'\n\|(.+)\s=\s(.+)(?=\n\|)', data)

If you run this version, you should now get entries such as "Japanese country name" properly. However, if you look closely at the resulting strings, the value explaining the official country name contains only its first line. This is because the .+ written in the second () matches "one or more arbitrary characters except a newline (\n)", so it cannot match across lines.

So pass a flag called re.DOTALL to findall(). Now .+ matches "one or more arbitrary characters" including newlines, but if you leave + greedy it will match too much, so let's add a ? at the end:

re.findall(r'\n\|(.+?)\s=\s(.+?)(?=\n\|)', data, re.DOTALL)

I think this is roughly OK, but if you look closely at the returned result, the last entry still matches too much. Therefore, add a second lookahead so that the pattern also stops where }}\n follows, and modify it as below:

re.findall(r'\n\|(.+?)\s=\s(.+?)(?:(?=\n\|)|(?=\}\}\n))', data, re.DOTALL)

With this, I finally had code that satisfies the specification in the problem statement. If it feels like too much is packed into one line, you can use re.compile() to split it across lines like this:

pattern = re.compile(r'\n\|(.+?)\s=\s(.+?)(?:(?=\n\|)|(?=\}\}\n))', re.DOTALL)
tuples = pattern.findall(data)

If you also need ^ and $ to match at each line break within the pattern, you can add + re.MULTILINE after re.DOTALL.
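Incidentally, flags are bit flags, so the idiomatic way to combine them is with | (bitwise OR); + also works as long as each flag appears only once. A minimal sketch:

pattern = re.compile(r'\n\|(.+?)\s=\s(.+?)(?:(?=\n\|)|(?=\}\}\n))',
                     re.DOTALL | re.MULTILINE)
tuples = pattern.findall(data)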

26. Removal of highlighted markup

def remove_emphases():
    d = extract_template()
    return {key: re.sub(r'\'{2,5}', '', val) for key, val in d.items()}

remove_emphases()

result


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>\n*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}([[Scottish Gaelic]])<br/>\n*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}([[Welsh]])<br/>\n*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}([[Irish]])<br/>\n*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}([[Cornish]])<br/>\n*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}([[Scots]])<br/>\n**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>',
 ...
 'Established form 4': 'Changed to the current country name "United Kingdom of Great Britain and Northern Ireland"',
 ...

Compared to 25, this problem can be solved with basic regular expression knowledge alone, but there is one caveat: if you put a space after the comma in {2,5}, the braces are no longer treated as a quantifier and the pattern will not work.
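You can check this caveat directly: with a space inside the braces, Python's re no longer parses them as a quantifier and instead matches the literal text (a toy example):

import re

print(re.sub(r"'{2,5}", '', "''bold''"))   # quantifier: prints bold
print(re.sub(r"'{2, 5}", '', "''bold''"))  # literal braces: prints ''bold'' unchanged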

Dictionary comprehensions are, I believe, a Python-specific notation; once you get used to them they let you write this kind of transformation concisely, so I personally use them a lot.

27. Removal of internal links

def remove_links():
    d = remove_emphases()
    return {key: re.sub(r'\[\[.*?\|?(.+?)\]\]', r'\1', val) for key, val in d.items()}

remove_links()

result


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}<ref>Official country name other than English:<br/>\n*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}(Scottish Gaelic)<br/>\n*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}(Welsh)<br/>\n*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}(Irish)<br/>\n*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}(Cornish)<br/>\n*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}(Scots)<br/>\n**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)</ref>',
 'National flag image': 'Flag of the United Kingdom.svg',
 'National emblem image': 'File:Royal Coat of Arms of the United Kingdom.svg|85px|British coat of arms',
 ...

What we want to do: whenever there is a part enclosed in [[ ]], extract its contents, and if the contents contain a |, take out only the part after the |. Since we cannot know in advance whether a | will appear, I added a ? after it so that it matches zero or one occurrence.

28. Removal of MediaWiki markup

def remove_markups():
    d = remove_links()
    # Remove external links
    d = {key: re.sub(r'\[http:.+?\s(.+?)\]', '\\1', val) for key, val in d.items()}
    # Remove ref (start and end tags together) and br
    d = {key: re.sub(r'</?(ref|br).*?>', '', val) for key, val in d.items()}
    return d

remove_markups()

result


{'Abbreviated name': 'England',
 'Japanese country name': 'United Kingdom of Great Britain and Northern Ireland',
 'Official country name': '{{lang|en|United Kingdom of Great Britain and Northern Ireland}}Official country name other than English:\n*{{lang|gd|An Rìoghachd Aonaichte na Breatainn Mhòr agus Eirinn mu Thuath}}(Scottish Gaelic)\n*{{lang|cy|Teyrnas Gyfunol Prydain Fawr a Gogledd Iwerddon}}(Welsh)\n*{{lang|ga|Ríocht Aontaithe na Breataine Móire agus Tuaisceart na hÉireann}}(Irish)\n*{{lang|kw|An Rywvaneth Unys a Vreten Veur hag Iwerdhon Glédh}}(Cornish)\n*{{lang|sco|Unitit Kinrick o Great Breetain an Northren Ireland}}(Scots)\n**{{lang|sco|Claught Kängrick o Docht Brätain an Norlin Airlann}}、{{lang|sco|Unitet Kängdom o Great Brittain an Norlin Airlann}}(Ulster Scots)',
 'National flag image': 'Flag of the United Kingdom.svg',
 ...
 'Population value': '63,181,775United Nations Department of Economic and Social Affairs>Population Division>Data>Population>Total Population',
 ...

I thought patterns like {{lang|.+?|.+?}} should also be removed, but ~~I was getting tired and~~ they were not listed in the quick reference table, so I left them as they are this time.

29. Get the URL of the national flag image

import requests
import json

def get_flag_url():
    d = remove_markups()['National flag image']
    
    url = 'https://www.mediawiki.org/w/api.php'
    params = {'action': 'query',
              'titles': f'File:{d}',
              'format': 'json',
              'prop': 'imageinfo',
              'iiprop': 'url'}
    res = requests.get(url, params)

    return res.json()['query']['pages']['-1']['imageinfo'][0]['url']

get_flag_url()

result


'https://upload.wikimedia.org/wikipedia/commons/a/ae/Flag_of_the_United_Kingdom.svg'

A problem that suddenly demands knowledge of HTTP right at the end. If you don't know about requests and responses, it will be a little difficult to solve. If you only want a rough understanding, the video here can help; if you want to know a little more, the [MDN documentation](https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview) may be helpful.

In this problem, you extract the URL of the flag image from the response returned when you send a GET request to the MediaWiki API from Python, and for that you need to import a suitable module.

It seems you can use urllib.request if you don't want to bother installing an external package, but I decided to use requests, which I had installed before. It makes for simpler code and is widely used, including in the samples on the MediaWiki API page, so I think it is the easier option.
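For reference, here is a sketch of the same request using only the standard library, with the flag file name hardcoded for brevity (the parameters are the same as above):

import json
import urllib.parse
import urllib.request

# Build the query string from the same parameters
params = urllib.parse.urlencode({'action': 'query',
                                 'titles': 'File:Flag of the United Kingdom.svg',
                                 'format': 'json',
                                 'prop': 'imageinfo',
                                 'iiprop': 'url'})

with urllib.request.urlopen(f'https://www.mediawiki.org/w/api.php?{params}') as res:
    data = json.loads(res.read())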

By the way, the 7 lines in the middle can be written in 2 lines as shown below, but then the URL grows too long horizontally, so it is better to split things up as above.

url = f'https://www.mediawiki.org/w/api.php?action=query&titles=File:{d}&format=json&prop=imageinfo&iiprop=url'    
res = requests.get(url)

Summary

Regular expressions are a tool used not only in language processing but also in web development and elsewhere, and since this chapter covers such a wide range of them, I felt it was good teaching material.

As mentioned above, I aimed for accurate and concise code, but if you spot any mistakes, please leave a comment.
