Python regular expression basics and tips to learn from scratch

This article is about Python regular expressions. Until now I looked things up and implemented them only when I needed to, but I decided it was time to deepen my understanding. Searching when the need arises and implementing from the results works well enough, but it keeps you at a beginner level. I wrote this with two things in mind: **what beginners should learn from scratch** and **what occasional users should relearn**. The article organizes what I learned from ["Chapter 3: Regular Expressions"](http://www.cl.ecei.tohoku.ac.jp/nlp100/#ch3) of the 100 Language Processing Knocks 2015.

Reference links

| Link | Remarks |
| --- | --- |
| Regular Expression HOWTO | The official Python regular expression how-to |
| re - Regular expression operations | The official Python description of the re module |

Basics

Python implements regular expressions in the re module. The import re line is omitted from the Python code in the rest of this article.

import re

Two ways to use it

1. Use the functions directly

Call module-level functions such as re.match and re.sub.

# The first argument is the regular expression pattern (search term); the second is the string to search
result = re.match('Hel', 'Hello python')

print(result)
# <re.Match object; span=(0, 3), match='Hel'>

print(result.group())
# Hel

2. Compile and use

Compile the regular expression pattern first, then call methods such as match and sub on the compiled object.

# Compile the regular expression pattern in advance
regex = re.compile('Hel')

result = regex.match('Hello python')

print(result)
# <re.Match object; span=(0, 3), match='Hel'>

print(result.group())
# Hel

Which one to use

**If you use several regular expression patterns many times, compile them.** The official documentation says:

Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program. The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.

So if you only reuse the same few patterns, compiling them yourself does not seem to bring a speed advantage. I have not checked how many patterns are cached.
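
As a minimal sketch of the two styles side by side, the example below filters a list once with a compiled pattern and once with the module-level function. The word list and variable names are made up for illustration, and the sketch only shows that both styles give the same result, not how fast either one is.

```python
import re

# Hypothetical sample data, for illustration only
words = ['Hello', 'Help', 'World', 'Helsinki']

# Compile once, then reuse the pattern object for every item
regex = re.compile(r'Hel')
compiled_hits = [w for w in words if regex.match(w)]

# The module-level function gives the same result; re caches recent patterns
direct_hits = [w for w in words if re.match(r'Hel', w)]

print(compiled_hits)  # ['Hello', 'Help', 'Helsinki']
print(direct_hits)    # ['Hello', 'Help', 'Helsinki']
```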

Definition of regular expression patterns (search terms)

Disabling escape sequences with raw strings

Raw strings are not specific to regular expressions, but they let you **disable escape sequences**.

In the first line of the example below, \t becomes a tab and \n becomes a line feed. In the second line, \t and \n are kept as literal characters.

print('a\tb\nA\tB')
print(r'a\tb\nA\tB')

Terminal output result


a	b
A	B

a\tb\nA\tB

**I use raw strings for regular expression patterns so that I do not have to escape every backslash.**

result = re.match(r'\d', '329')

The articles "Writing regular expressions using raw Python strings" and "Ignore (disable) escape sequences in Python: raw strings" explain this in detail.
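
A small sketch of why the r prefix matters for patterns. The sample text is made up for illustration. In a normal string literal '\b' is the backspace character, so the word-boundary pattern silently stops working.

```python
import re

# Without the r prefix, every backslash has to be doubled
assert r'\d' == '\\d'

# '\b' in a normal string is a backspace character, not a word boundary
plain = re.findall('\bcat\b', 'cat concat cat')
raw = re.findall(r'\bcat\b', 'cat concat cat')

print(plain)  # []
print(raw)    # ['cat', 'cat']
```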

Ignoring line breaks, comments, and whitespace with triple quotes and re.VERBOSE

Enclosing the pattern in triple quotes (''' or """) lets you write it across several lines (a pattern without line breaks works too). Passing re.VERBOSE makes whitespace, including those line breaks, and comments in the pattern ignored. **Triple quotes and re.VERBOSE make a pattern much more readable.** A pattern written as follows is easy to follow.

a = re.compile(r'''\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits''', re.VERBOSE)

You can read more about triple quotes in the article "String generation in Python (quotes, str constructor)" (https://note.nkmk.me/python-str-literal-constructor/).

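By the way, to pass several compile flags in the flags parameter, combine them with | (bitwise OR), which is the way the official documentation describes. Adding them with + gives the same result as long as each flag appears only once.

a = re.compile(r'''\d''', re.VERBOSE | re.MULTILINE)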
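
To see the flags in action, this sketch reuses the verbose decimal-number pattern from above, combines two flags with | (bitwise OR, which has the same effect as + for distinct flags), and compares it with the same pattern written compactly. The sample text and the variable names are assumptions for illustration.

```python
import re

# The verbose pattern from above, combined with a second flag via |
number = re.compile(r'''\d +  # the integral part
                        \.    # the decimal point
                        \d *  # some fractional digits''',
                    re.VERBOSE | re.MULTILINE)

# The same pattern written compactly, without re.VERBOSE
compact = re.compile(r'\d+\.\d*')

text = 'pi is 3.14 and e is 2.718'
print(number.findall(text))   # ['3.14', '2.718']
print(compact.findall(text))  # ['3.14', '2.718']
```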

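Special characters

| Character | Description | Remarks | Example | Matches | Does not match |
| --- | --- | --- | --- | --- | --- |
| \d | Digit | Same as [0-9] | | | |
| \D | Non-digit | Same as [^0-9] | | | |
| \s | Whitespace character | Same as [ \t\n\r\f\v] | | | |
| \S | Non-whitespace character | Same as [^ \t\n\r\f\v] | | | |
| \w | Alphanumeric character or underscore | Same as [a-zA-Z0-9_] | | | |
| \W | Anything other than an alphanumeric character or underscore | Same as [^a-zA-Z0-9_] | | | |
| \A | Beginning of the string | Similar to ^ | | | |
| \Z | End of the string | Similar to $ | | | |
| \b | Word boundary (for example, next to a space) | | | | |
| . | Any single character | - | 1.3 | 123, 133 | 1223 |
| ^ | Beginning of the string | - | ^123 | 1234 | 0123 |
| $ | End of the string | - | 123$ | 0123 | 1234 |
| * | Repeat 0 or more times | - | 12* | 1, 12, 122 | 11, 22 |
| + | Repeat 1 or more times | - | 12+ | 12, 122 | 1, 11, 22 |
| ? | 0 or 1 time | - | 12? | 1, 12 | 122 |
| {m} | Repeat m times | - | 1{3} | 111 | 11, 1111 |
| {m,n} | Repeat m to n times | - | 1{2,3} | 11, 111 | 1, 1111 |
| [] | Character set | [^5] means anything other than 5 | [1-3] | 1, 2, 3 | 4, 5 |
| \| | Alternation (or) | - | 1\|2 | 1, 2 | 3 |
| () | Grouping | - | (12)+ | 12, 1212 | 1, 123 |

For str patterns, \d, \s, and \w also match the corresponding non-ASCII Unicode characters; the bracket expressions in the Remarks column are the ASCII equivalents.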
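
The rows with examples can be checked mechanically. The sketch below assumes that "Matches" and "Does not match" refer to the whole string, so it uses re.fullmatch; the ^ and $ rows are left out because they only make sense with a search from the start or the end. The rows list is simply the table above retyped.

```python
import re

# (pattern, strings expected to match, strings expected not to match)
rows = [
    (r'1.3',    ['123', '133'], ['1223']),
    (r'12*',    ['1', '12', '122'], ['11', '22']),
    (r'12+',    ['12', '122'], ['1', '11', '22']),
    (r'12?',    ['1', '12'], ['122']),
    (r'1{3}',   ['111'], ['11', '1111']),
    (r'1{2,3}', ['11', '111'], ['1', '1111']),
    (r'[1-3]',  ['1', '2', '3'], ['4', '5']),
    (r'1|2',    ['1', '2'], ['3']),
    (r'(12)+',  ['12', '1212'], ['1', '123']),
]

for pattern, good, bad in rows:
    # fullmatch requires the whole string to match the pattern
    assert all(re.fullmatch(pattern, s) for s in good), pattern
    assert not any(re.fullmatch(pattern, s) for s in bad), pattern

print('all rows behave as listed')
```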

Matching functions

I often use the following functions.

| Function | Purpose |
| --- | --- |
| match | Determines whether the **beginning** of the string matches the regular expression |
| search | Finds the first position where the regular expression matches |
| findall | Finds all matching substrings and returns them as a list |
| sub | Replaces matching substrings |

match and search

re.match matches only at the beginning of the string, while re.search matches at any position in the string. See "search() vs. match()" in the official documentation for details. Both return only the first match; the second and later matches are not returned.

>>> re.match("c", "abcdef")    # No match: the string does not begin with "c"
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

The result is read from the match object with group(). group(0) contains the entire match, and the parenthesized groups are numbered in order from 1.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1)       # The first parenthesized subgroup.
'Isaac'
>>> m.group(2)       # The second parenthesized subgroup.
'Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')
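
Both match and search return None when nothing matches, so calling group() directly on the result can raise AttributeError. A minimal sketch of guarding against that follows; first_hit is a hypothetical helper name used only for illustration.

```python
import re

def first_hit(pattern, text):
    """Return the matched text, or None when nothing matches (hypothetical helper)."""
    m = re.search(pattern, text)
    # search and match return None on failure, so check before calling group()
    return m.group() if m else None

print(first_hit(r'c', 'abcdef'))         # c
print(first_hit(r'z', 'abcdef'))         # None
print(re.match(r'c', 'abcdef') is None)  # True
```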

findall

findall returns all the substrings that match the pattern as a list.

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

You can specify what to capture with (). If the pattern contains more than one group, each match is returned as a tuple with one element per group.

>>> print(re.findall(r'''(1st)(2nd)''', '1st2nd1st2nd'))
[('1st', '2nd'), ('1st', '2nd')]

sub

sub replaces text. The arguments are, in order: 1. the regular expression pattern, 2. the replacement string, 3. the string to process.

>>> re.sub(r'Before replacement', 'After replacement', 'Before replacement Not applicable Before replacement')
'After replacement Not applicable After replacement'
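
The replacement string can also refer to captured groups, and the optional count argument limits the number of replacements. A small sketch with made-up sample text:

```python
import re

text = 'Newton Isaac, Curie Marie'

# \1 and \2 in the replacement refer to the captured groups
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', text)
print(swapped)     # Isaac Newton, Marie Curie

# count limits how many replacements are made
first_only = re.sub(r'(\w+) (\w+)', r'\2 \1', text, count=1)
print(first_only)  # Isaac Newton, Curie Marie
```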

Compile flags

The following compile flags are the ones used most often. Pass them in the flags parameter of the function.

| Flag | Meaning |
| --- | --- |
| DOTALL | Makes . match any character, including a newline |
| IGNORECASE | Matches case-insensitively |
| MULTILINE | Makes ^ and $ match at the beginning and end of each line |
| VERBOSE | Ignores comments and whitespace in the pattern |

VERBOSE has already been covered, so below I explain only the ones that are easy to confuse in a little more detail.

DOTALL

re.DOTALL makes the wildcard . (dot) match newlines as well.

# The backslash right after ''' suppresses the first line break (this needs a non-raw string)
string = '''\
1st line
2nd line'''

print(re.findall(r'1st.*2nd', string, re.DOTALL))
# ['1st line\n2nd']

print(re.findall(r'1st.*2nd', string))
# []  (no match)

See the article Python: Replacing multi-line matching with regular expressions for details.

MULTILINE

Use this when you want to search each line of a multi-line string separately. In the example below, re.MULTILINE makes the second line ("Line start: 2nd line") a target as well. Note that re.MULTILINE does not help with the match function, because match only tries the beginning of the string.

# The backslash right after ''' suppresses the first line break (this needs a non-raw string)
string = '''\
Line start: 1st line
Line start: 2nd line'''

print(re.findall(r'^Line start.*', string, re.MULTILINE))
# ['Line start: 1st line', 'Line start: 2nd line']

print(re.findall(r'^Line start.*', string))
# ['Line start: 1st line']

See the article Python: Replacing multi-line matching with regular expressions for details.
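
The note about match can be checked directly. In this sketch (the sample text is made up), match still fails with re.MULTILINE because it only tries position 0, while search finds the start of the second line.

```python
import re

text = 'first line\nsecond line'

# match only ever tries position 0, even with re.MULTILINE
no_hit = re.match(r'^second', text, re.MULTILINE)
print(no_hit)  # None

# search scans forward, so ^ can match right after the newline
m = re.search(r'^second', text, re.MULTILINE)
print(m.group(), m.span())  # second (11, 17)
```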

Tips

Non-capturing groups

**A group written as (?:...) is not included in the search results** and is not captured. The official Regular Expression Syntax explains:

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In the example below, the 4 is part of the regular expression pattern, but it does not appear in the result.

>>> re.findall(r'(.012)(?:4)', 'A0123 B0124 C0123')
['B012']
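
For comparison, here is a sketch of the same search with the second group capturing, non-capturing, and with no capturing group at all. The variable names are made up for illustration.

```python
import re

text = 'A0123 B0124 C0123'

# Both groups capturing: findall returns a tuple per match
both = re.findall(r'(.012)(4)', text)
print(both)        # [('B012', '4')]

# Second group non-capturing: only the first group is returned
first_only = re.findall(r'(.012)(?:4)', text)
print(first_only)  # ['B012']

# No capturing groups at all: the whole match is returned
whole = re.findall(r'.012(?:4)', text)
print(whole)       # ['B0124']
```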

Greedy / non-greedy match

**You can control how long the matched string is.** A greedy match takes the longest possible string, and a non-greedy match takes the shortest. Greedy is the default; to make a match non-greedy, append ? to a quantifier (*, +, ?). Examples of both follow.

# Greedy match
>>> print(re.findall(r'.0.*2',  'A0123 B0123'))
['A0123 B012']

# Non-greedy match (? after the *)
>>> print(re.findall(r'.0.*?2', 'A0123 B0123'))
['A012', 'B012']

See the article "Greedy and non-greedy matches" for more information.
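
A typical place where the difference shows up is text with repeated delimiters. The sketch below uses a made-up HTML-like string; it illustrates greedy versus non-greedy matching only and is not a recommendation to parse HTML with regular expressions.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)
print(greedy)      # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' it can
non_greedy = re.findall(r'<.*?>', html)
print(non_greedy)  # ['<b>', '</b>', '<i>', '</i>']
```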

Back reference

You can use \number to match the contents of an earlier group. The official syntax description says:

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

Concretely, in the example below the \1 part must match the same text that the preceding group (ab) matched. abcab matches, but abddd does not, because its 4th and 5th characters are not ab.

>>> print(re.findall(r'''(ab).\1''', 'abcab abddd'))
['ab']
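
One common use of a back reference is finding accidentally doubled words. A minimal sketch with a made-up sentence:

```python
import re

text = 'Paris in the the spring is is lovely'

# \b(\w+) captures a word; \1 requires the same word again after a space
doubled = re.findall(r'\b(\w+) \1\b', text)
print(doubled)  # ['the', 'is']

# In a replacement string, \1 inserts the captured word once
fixed = re.sub(r'\b(\w+) \1\b', r'\1', text)
print(fixed)    # Paris in the spring is lovely
```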

Look-ahead / look-behind assertions

These constructs are not included in the matched text themselves, but they add a condition on the text that does or does not come right after or right before the current position. There are four of them.

- Positive lookahead assertion
- Negative lookahead assertion
- Positive lookbehind assertion
- Negative lookbehind assertion

Arranged as a matrix:

| | Positive | Negative |
| --- | --- | --- |
| Look-ahead | (?=...) Matches if ... matches next | (?!...) Matches if ... does not match next |
| Look-behind | (?<=...) Matches if the current position is preceded by a match for ... | (?<!...) Matches if the current position is not preceded by a match for ... |

A concrete example is easier to understand than a detailed explanation.

>>> string = 'A01234 B91235 C01234'

# Positive lookahead assertion
# Strings where '123' is followed by '5' (the '5' is in the result only because of the trailing '.'; '(?=5)' itself consumes nothing)
>>> print(re.findall(r'..123(?=5).', string))
['B91235']

# Negative lookahead assertion
# Strings where '123' is not followed by '5' (the following character is in the result only because of the trailing '.')
>>> print(re.findall(r'..123(?!5).', string))
['A01234', 'C01234']

# Positive lookbehind assertion
# Strings where '0' comes right before '123' (the '0' is in the result only because of the leading '..')
>>> print(re.findall(r'..(?<=0)123', string))
['A0123', 'C0123']

# Negative lookbehind assertion
# Strings where '0' does not come right before '123' (the preceding characters are in the result only because of the leading '..')
>>> print(re.findall(r'..(?<!0)123', string))
['B9123']
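
As a slightly more practical sketch, the assertions can pick out numbers by their surroundings without returning the surrounding text. The price string is made up for illustration.

```python
import re

text = 'apple $30, pear 25 yen, melon $120'

# Positive look-behind: digits preceded by '$' (the '$' itself is not returned)
dollars = re.findall(r'(?<=\$)\d+', text)
print(dollars)      # ['30', '120']

# Negative look-behind: digit runs not preceded by '$' or another digit
not_dollars = re.findall(r'(?<![$\d])\d+', text)
print(not_dollars)  # ['25']

# Positive look-ahead: digits followed by ' yen'
yen = re.findall(r'\d+(?= yen)', text)
print(yen)          # ['25']
```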
