Extract arbitrary strings using Python regular expressions / Use named groups

This is the day 18 article of the Takumi Akashiro Alone Advent Calendar 2020.

Introduction

Do you use regular expressions?!

The last time I used one was about a month ago. I use them when I need them.

Regular expressions may be that kind of tool, but used well, they let you neatly extract any group of strings you like.

TL;DR

Use named groups to retrieve strings!

>>> text = "environ/house-food/apple-pie02.fbx"
>>> import re
>>> reg_text = r'(?P<main>(chara|environ))/(?P<sub>[^-/]*)-?(?P<sub_sub>[^/]*)/(?P<filling>[^-]*)-pie'
>>> match = re.search(reg_text, text)
>>> print(match.groupdict())
{'main': 'environ', 'sub': 'house', 'sub_sub': 'food', 'filling': 'apple'}

Regular expression basics

First, the basics of regular expressions: let's try a simple match.

#! python3
import re

def main():
    # NOTE: Beside the point, but apple pie shows up a lot in regular expression samples.
    text = "environ/house-food/apple-pie02.fbx"

    match_obj = re.search(r'pie', text)
    if match_obj:
        print("Matched!")
    else:
        print("No match!")

if __name__ == '__main__':
    main()

Well, it's like this. It's super easy.

If speed matters and matching from the start of the string is enough, use re.match; if you want to replace text, use re.sub. For the time being, though, just using regular expressions as shown above is no problem. [^1]

[^1]: Addendum: If you care about speed and use the same regular expression inside a for loop, it is slightly faster to compile it with re.compile outside the loop and reuse it.
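
As a rough sketch of the three functions mentioned above: the sample path is the one used throughout this article, while the replacement string 'cake' and the second path in the list are made up purely for illustration.

```python
import re

text = "environ/house-food/apple-pie02.fbx"

# re.match only tries to match at the beginning of the string.
starts_with_environ = re.match(r'environ', text) is not None
starts_with_pie = re.match(r'pie', text) is not None
print(starts_with_environ, starts_with_pie)

# re.sub returns a new string with every match replaced.
replaced = re.sub(r'pie', 'cake', text)
print(replaced)

# Compile once outside the loop and reuse the pattern object inside it.
pie_pattern = re.compile(r'pie')
paths = [text, "chara/hero/body01.fbx"]  # second path is hypothetical
hits = [path for path in paths if pie_pattern.search(path)]
print(hits)
```

Here re.match succeeds for 'environ' but not for 'pie', because 'pie' does not sit at the start of the string.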


Now, how would you extract the string apple that comes before pie? You would probably get it by trimming away the other parts of the string.
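
For comparison, here is a minimal sketch of that trimming approach without regular expressions. It assumes the file name always contains "-pie" and is only one of several ways to do this.

```python
text = "environ/house-food/apple-pie02.fbx"

# Take the last path component, then cut at the "-pie" marker.
file_name = text.split("/")[-1]        # "apple-pie02.fbx"
filling = file_name.split("-pie")[0]   # "apple"
print(filling)
```

This works for a single value, but it gets clumsy quickly as the number of pieces to pull out grows.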

But what if there are several things to extract? For example, what if you want to get environ, house, food and apple all at once?

Groups can be used in such cases. Let's read the official documentation for a moment.

Regular expression syntax

(Omitted) (...) Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(], [)].

From "re --- Regular expression operations", Python 3.9.1 documentation

...That alone doesn't really tell you how to use it, so here is a sample.

Try using a group

#! python3
import re

def main():
    text = "environ/house-food/apple-pie02.fbx"

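    # The pattern must have balanced parentheses, otherwise re.search raises re.error.
    match_obj = re.search(r'([^/-]*)-?pie', text)
    if match_obj:
        print(match_obj.groups())  # ('apple',)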

if __name__ == '__main__':
    main()


Calling match_obj.groups() on the match object returns a tuple of the strings that matched each group in the regular expression.
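
Besides groups(), a single group can be read with group(). The sketch below uses a one-group pattern in the spirit of the sample above to show how the two relate.

```python
import re

text = "environ/house-food/apple-pie02.fbx"
match_obj = re.search(r'([^/-]*)-?pie', text)

if match_obj:
    whole = match_obj.group(0)       # the entire matched text
    first = match_obj.group(1)       # the first parenthesized group
    all_groups = match_obj.groups()  # every group, as a tuple
    print(whole, first, all_groups)
```

group(0) is always the whole match, and numbered groups start at 1.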

So, to extract environ, house, food and apple from the text above, we write:

#! python3
import re

def main():
    text = "environ/house-food/apple-pie02.fbx"

    match = re.search(r'([^/-]*)/([^/-]*)-?([^/-]*)/([^/-]*)-?pie', text)
    if match:
        print(match.groups())

if __name__ == '__main__':
    main()

and we get ('environ', 'house', 'food', 'apple').

We got them all!

It's so convenient!


But you might want a dict instead of a list. And since () gets used a lot in regular expressions, you might want only the parts you actually need.

That is where named groups come in!

As usual, read the official docs.

Regular expression syntax

(Omitted) (?P<name>...) Similar to regular parentheses, but the substrings matched by this group can be accessed by the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a group that is numbered as if it had not been named.

Try using named groups

I'll just write one, on the theory that you'll understand it once you use it.

#! python3
import re

def main():
    text = "environ/house-food/apple-pie02.fbx"

    match = re.search(r'(?P<main>[^/-]*)/(?P<sub>[^/-]*)-?(?P<sub_sub>[^/-]*)/(?P<filling>[^/-]*)-?pie', text)
    if match:
        print(match.groupdict())

if __name__ == '__main__':
    main()

Calling match.groupdict() on the match object returns a dictionary of the named groups.


You can take it out nicely!
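
As a small sketch of the other ways to read named groups, the pattern from the TL;DR can be reused. Its inner (chara|environ) group has no name, which makes it a convenient way to see the difference between groups() and groupdict().

```python
import re

text = "environ/house-food/apple-pie02.fbx"
# The inner (chara|environ) group is unnamed; the others are named.
pattern = (r'(?P<main>(chara|environ))/(?P<sub>[^-/]*)-?'
           r'(?P<sub_sub>[^/]*)/(?P<filling>[^-]*)-pie')
match = re.search(pattern, text)

if match:
    by_name = match.group('filling')  # access by group name
    by_index = match['filling']       # subscript access (Python 3.6+)
    by_number = match.group(5)        # named groups are still numbered
    print(by_name, by_index, by_number)
    print(match.groups())             # includes the unnamed group
    print(match.groupdict())          # named groups only
```

The unnamed group appears in groups() but not in groupdict(), so naming only the parts you need keeps the dictionary free of groups used purely for alternation.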

In closing

~~I can't think of anything......~~

It's convenient!
