Things that make you think Python regular expressions are great

While working through the "100 Language Processing Knocks" exercises, I learned a few new things about Python's re module for regular expressions, so I'll summarize them here.

Difference between search and match

For example, suppose you want to match the pattern ape against the string grape. With search, it matches because grape contains ape. With match, it does not match: ape is contained in the string, but not at the beginning.

import re

m = re.match('ape', 'grape')   # Does not match (m is None)
m = re.search('ape', 'grape')  # Matches
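
As a small sketch of the difference, the snippet below checks what each call returns. The variable names (text, at_start, anywhere, found) are only illustrative.

```python
import re

text = "grape"

# match() only tries the pattern at position 0, so "ape" is not found.
at_start = re.match("ape", text)

# search() scans forward through the string and finds "ape" at index 2.
anywhere = re.search("ape", text)

print(at_start)          # None
print(anywhere.group())  # ape
print(anywhere.span())   # (2, 5)

# Both functions return None on failure, so guard before calling .group().
found = anywhere.group() if anywhere else None
```

Because a failed match is None, calling .group() on it raises AttributeError, which is why the guard on the last line is worth the habit.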

Extracting the matched part

When you match with a regular expression, you often want to retrieve the part that matched.

In the following example, we want to retrieve '1993', '7', and '2'.

s = "Born 1993/7/2"
p = "([0-9]+)/([0-9]+)/([0-9]+)"

m = re.search(p, s)

The parts of the target string (s in the snippet above) that match the sub-patterns enclosed in () can be retrieved with the m.group(n) method. The argument n works as follows: 0 returns the entire match, and 1 or greater returns the part matched by the n-th pair of parentheses, counted from the left.

In the example above, that gives:

m.group(0) # -> '1993/7/2'
m.group(1) # -> '1993'
m.group(2) # -> '7'
m.group(3) # -> '2'

As a caveat, .group() returns the matched text as a string, never as a number, so if you want to treat it as a number, convert it with int() or similar.
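
A minimal sketch of that conversion, assuming a date written as year/month/day. The sample string and the names year, month, day, and birthday are made up for illustration.

```python
import re
from datetime import date

s = "Born 1993/7/2"  # hypothetical sample string
m = re.search(r"([0-9]+)/([0-9]+)/([0-9]+)", s)

# m.groups() returns all captured groups as a tuple of strings,
# so each one is converted with int() before use.
year, month, day = (int(g) for g in m.groups())

birthday = date(year, month, day)
print(birthday)  # 1993-07-02
```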

Naming the parts to match

With numbers alone, as in the previous section, it can be hard to tell which group is which. In that case, write the group as (?P<name>regex). You can then pass that name as the argument to .group() to extract the corresponding part.

Specifically, do as follows.

s = "Born 1993/7/2"
p = "(?P<year>[0-9]+)/(?P<month>[0-9]+)/(?P<day>[0-9]+)"

m = re.search(p, s)

m.group(0) # -> '1993/7/2'
m.group('year') # -> '1993'
m.group('month') # -> '7'
m.group('day') # -> '2'
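
If you want all the named parts at once, the match object's groupdict() method returns them as a dictionary. The following is a sketch under the assumption of a year/month/day sample string; fields and numbers are illustrative names.

```python
import re

s = "Born 1993/7/2"  # hypothetical sample string
p = r"(?P<year>[0-9]+)/(?P<month>[0-9]+)/(?P<day>[0-9]+)"
m = re.search(p, s)

# groupdict() maps each group name to the text it matched.
fields = m.groupdict()
print(fields)  # {'year': '1993', 'month': '7', 'day': '2'}

# The values are still strings, so convert them if numbers are needed.
numbers = {name: int(value) for name, value in fields.items()}

# Named groups keep their positional number as well.
same = m.group(1) == m.group("year")
```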

Extracting all the matched parts

When searching a long sentence, you may want to extract every match, for example all the words that start with "con". In that case, use re.findall().

s = 'It\'s convenient to conclude you are conservative.'
p = r'\bcon\w+'  # raw string; \b keeps "con" in the middle of a word from matching

m = re.findall(p, s)

m # -> ['convenient', 'conclude', 'conservative']
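
Two related behaviours are worth knowing here, sketched below: re.finditer() yields match objects instead of strings, and findall() returns the captured group rather than the whole match when the pattern contains one capturing group. The names words, starts, and suffixes are illustrative.

```python
import re

s = "It's convenient to conclude you are conservative."

# finditer() yields match objects, so positions are available as well.
words = []
starts = []
for m in re.finditer(r"\bcon\w+", s):
    words.append(m.group())
    starts.append(m.start())

# With exactly one capturing group, findall() returns only that group.
suffixes = re.findall(r"\bcon(\w+)", s)
print(suffixes)  # ['venient', 'clude', 'servative']
```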

Matching newline characters

In a pattern string, . matches any character except the newline character (\n).

For example, you might expect the pattern below to match all of 'abc=def\nghi\njkl', but it only matches up to 'abc=def'.

s = 'abc=def\nghi\njkl'
p = '^abc=.+'

m = re.search(p, s)
m.group() # -> 'abc=def'

This is because the metacharacter . does not match \n by default. To change that, pass the re.DOTALL flag as the third argument of search().

s = 'abc=def\nghi\njkl'
p = '^abc=.+'

m = re.search(p, s, re.DOTALL)
m.group() # -> 'abc=def\nghi\njkl'
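
As a sketch of equivalent spellings, the same effect can be had with the short alias re.S or with the inline flag (?s) placed at the start of the pattern. The variable names below are illustrative.

```python
import re

s = "abc=def\nghi\njkl"

# Default: . stops at the first newline.
default = re.search(r"^abc=.+", s).group()

# re.S is the short alias of re.DOTALL.
with_flag = re.search(r"^abc=.+", s, re.S).group()

# (?s) at the start of the pattern turns the flag on inline.
inline = re.search(r"(?s)^abc=.+", s).group()

print(repr(default))    # 'abc=def'
print(repr(with_flag))  # 'abc=def\nghi\njkl'
```

The inline form can be handy when the pattern is passed around as a plain string and there is nowhere to supply a flags argument.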

Right? Isn't it easy?

Matching multiple lines of text

When web scraping, you may want to retrieve only the lines that start with a specific tag. (I've never actually done this myself.)

s = """<p>Pieter Pipar piked a peck of pickled pepers.</p>
<hr>
<p>A pek of pickled pepers Pieter Pipar piked.</p>
<p>If Pieter Pipar piked a pek of pickled pepers,<p>
<hr>
<p>Where's the pek of pickled pepers that Pieter Pipar picked?</p>"""

p = "^<p>.+$"

Each sentence is wrapped in <p>, with <hr> horizontal rules in between. Suppose you want to extract only the lines that start with <p>.

findall() is used here, but because the text contains newline characters, the re.MULTILINE flag is also needed. It makes ^ and $ match at the start and end of every line, rather than only at the start and end of the whole string.

m = re.findall(p, s, re.MULTILINE)
m
# ['<p>Pieter Pipar piked a peck of pickled pepers.</p>',
#  '<p>A pek of pickled pepers Pieter Pipar piked.</p>',
#  '<p>If Pieter Pipar piked a pek of pickled pepers,<p>',
#  "<p>Where's the pek of pickled pepers that Pieter Pipar picked?</p>"]

Only the last item is shown in double quotes because it contains an apostrophe (Where's), and Python displays such strings with double quotes. Either way, it's convenient.
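
To see what the flag changes, here is a minimal sketch on a shorter, made-up string, run once without and once with re.MULTILINE. The names without_flag and with_flag are illustrative.

```python
import re

s = "<p>first</p>\n<hr>\n<p>second</p>"  # hypothetical three-line sample
p = r"^<p>.+$"

# Without the flag, ^ and $ anchor to the whole string, and . cannot
# cross the newline, so nothing matches.
without_flag = re.findall(p, s)

# With re.MULTILINE, ^ and $ also match at every line boundary.
with_flag = re.findall(p, s, re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['<p>first</p>', '<p>second</p>']
```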
