The regular expression operators, patterns, and matching rules in Python seem to be almost the same as in Perl and PHP.
The way the regular expression functions are used is quite different, however, so I am writing this up for my own study and reference.
To use regular expressions, import the following module.
Import
import re
There are two ways to use regular expressions. The first is to compile the pattern in advance. When you search for the same pattern many times, this lets you reuse the compiled pattern object, so searches can be faster and you do not have to specify the pattern each time. http://docs.python.jp/3/howto/regex.html#compiling-regular-expressions
It is also recommended to prefix the pattern with r to make it a raw string. Simple patterns work without it, but in a raw string a backslash is passed through as a literal backslash, which makes the pattern easier to write and read.
http://docs.python.jp/3/howto/regex.html#the-backslash-plague
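To make the effect of the r prefix concrete, here is a minimal sketch (Python 3 syntax; the sample text is an assumption chosen for illustration). It compares the same pattern written as a raw string and as a normal string.

```python
import re

text = "a cat sat"  # hypothetical sample text

# Raw string: \b reaches the regex engine as a word-boundary assertion.
raw_match = re.search(r"\bcat\b", text)

# Normal string: Python turns "\b" into a backspace character (0x08)
# before the regex engine sees it, so the pattern looks for backspaces.
plain_match = re.search("\bcat\b", text)

print(raw_match is not None)  # True
print(plain_match is None)    # True
```

Without the r prefix you would have to write "\\bcat\\b" to get the same pattern.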
compile
pattern = r"ca"
text = "caabsacasca"
repatter = re.compile(pattern)
matchOB = repatter.match(text)
The second way is to pass the pattern directly to the search function without compiling it first. Use this when you do not need to reuse the pattern.
NoCompile
pattern = r"ca"
text = "caabsacasca"
matchOB = re.match(pattern , text)
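As a rough illustration of when compiling pays off, the sketch below (Python 3 syntax) compiles the pattern once and reuses it for several strings. The list `texts` is a made-up input, not part of the original examples.

```python
import re

# Compile once, then reuse the same pattern object for every string.
repatter = re.compile(r"ca")
texts = ["caabsacasca", "abc", "cat"]  # hypothetical inputs

# match() only succeeds when the string starts with "ca".
matched = [t for t in texts if repatter.match(t)]
print(matched)  # ['caabsacasca', 'cat']
```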
There are four main search methods. http://docs.python.jp/3/howto/regex.html#performing-matches
Method/attribute | Purpose |
---|---|
match(pattern, string) | Determines if it matches the regular expression at the beginning of the string. |
search(pattern, string) | Scans the string to find the first place where the regular expression matches. |
findall(pattern, string) | Finds all the substrings that match the regular expression and returns it as a list. |
finditer(pattern, string) | Finds all the substrings that match the regular expression and returns it as an iterator. |
There are also functions for splitting and replacing strings. http://docs.python.jp/2.7/library/re.html#re.split
Method/attribute | Purpose |
---|---|
split(pattern, string) | Split each time there is a part that matches the regular expression. |
sub(pattern, repl, string) | Replaces each part that matches the regular expression with the string repl. |
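split() and sub() are not demonstrated elsewhere in this article, so here is a small sketch (Python 3 syntax) that applies them to the same sample text used in the other examples. The replacement string "X" is an arbitrary choice.

```python
import re

text = "caabsacasca"

# Split at every "ca". Matches at the very start and end of the text
# produce empty strings at the ends of the list.
parts = re.split(r"ca", text)
print(parts)  # ['', 'absa', 's', '']

# Replace every "ca" with "X".
replaced = re.sub(r"ca", "X", text)
print(replaced)  # XabsaXsX
```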
Let's look at the search methods one by one.
match() determines whether the pattern matches at the beginning of the string. If it does, a match object is returned (stored in matchOB here); if not, None is returned. Use the group() method to extract the matched part from this object as a str. The match object has other methods for extracting information besides group(), which are listed below.
match
pattern = r"ca"
text = "caabsacasca"
matchOB = re.match(pattern , text)
if matchOB:
    print(matchOB.group())  # ca
Method/attribute | Purpose |
---|---|
group() | Returns a string that matches the regular expression. |
start() | Returns the start position of the match. |
end() | Returns the end position of the match. |
span() | Returns a tuple (start, end) containing the match position. |
search() determines whether any part of the string matches the pattern. Unlike match(), it also matches when the pattern is not at the beginning of the string. However, even if there are multiple matches, only the first one is returned.
search
pattern = r"ca"
text = "caabsacasca"
matchOB = re.search(pattern , text)
if matchOB:
    print(matchOB)          # the match object itself
    print(matchOB.group())  # the matched string: ca
    print(matchOB.start())  # the start position of the match: 0
    print(matchOB.end())    # the end position of the match: 2
    print(matchOB.span())   # a tuple of (start, end): (0, 2)
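The sample text above starts with "ca", so match() and search() give the same result. The following sketch (Python 3 syntax) uses a different, made-up text that does not start with the pattern, to show where the two differ.

```python
import re

text = "absacasca"  # hypothetical input that does not start with "ca"

match_result = re.match(r"ca", text)    # anchored at the beginning
search_result = re.search(r"ca", text)  # scans the whole string

print(match_result)          # None
print(search_result.span())  # (4, 6)
```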
findall() returns all the parts of a string that match the pattern as a list. Unlike search(), you get every matching part. However, the return value is not a match object but a plain list of strings, so group() and the other match object methods cannot be used.
findall
pattern = r"ca"
text = "caabsacasca"
# Returns everything that matches the pattern as a list
matchedList = re.findall(pattern, text)
if matchedList:
    print(matchedList)  # ['ca', 'ca', 'ca']
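One related behavior worth knowing, which goes slightly beyond the example above: when the pattern contains capture groups, findall() returns the groups rather than the whole match. A minimal sketch (Python 3 syntax, with a made-up input string):

```python
import re

text = "x=1, y=22"  # hypothetical input

# No groups: each item is the whole matched substring.
whole = re.findall(r"\w=\d+", text)
print(whole)  # ['x=1', 'y=22']

# Two groups: each item is a tuple of the captured groups.
pairs = re.findall(r"(\w)=(\d+)", text)
print(pairs)  # [('x', '1'), ('y', '22')]
```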
finditer() returns the parts of a string that match the pattern as an iterator. By looping over the return value with a for loop, you get all the matching parts, just as with findall(). The difference is that findall() returns a list of strings, while finditer() yields a match object on each iteration, so group(), start(), end(), and so on are available.
finditer
pattern = r"ca"
text = "caabsacasca"
# Returns everything that matches the pattern as an iterator
iterator = re.finditer(pattern, text)
for match in iterator:
    print(match.group())  # 1st: ca      2nd: ca      3rd: ca
    print(match.start())  # 1st: 0       2nd: 6       3rd: 9
    print(match.end())    # 1st: 2       2nd: 8       3rd: 11
    print(match.span())   # 1st: (0, 2)  2nd: (6, 8)  3rd: (9, 11)
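Because each item from finditer() is a match object, positions can be collected in one pass. A short sketch (Python 3 syntax) on the same sample text:

```python
import re

text = "caabsacasca"

# Collect the (start, end) position of every match.
spans = [m.span() for m in re.finditer(r"ca", text)]
print(spans)  # [(0, 2), (6, 8), (9, 11)]

# The matched strings are the same ones findall() returns.
groups = [m.group() for m in re.finditer(r"ca", text)]
print(groups == re.findall(r"ca", text))  # True
```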
In Perl and similar languages, you can set flags at the end of the pattern, such as `/pattern/s` (. also matches newlines) and `/pattern/i` (case-insensitive). In Python you do the following:
match
pattern = r"avSCSA"
text = "AVscsa"
# ------------------------
# With a compiled pattern
repatter = re.compile(pattern, re.IGNORECASE)  # case-insensitive
matchOB = repatter.match(text)
# ------------------------
# Without compiling
matchOB = re.match(pattern, text, re.IGNORECASE)  # case-insensitive
# ------------------------
if matchOB:
    print(matchOB.group())  # AVscsa
Be sure to prefix the following flags with `re.`, as in `re.DOTALL` or `re.L`.
Property | meaning |
---|---|
ASCII, A | Makes \w, \b, \s, \d, and so on match only ASCII characters with the respective property. |
DOTALL, S | Makes . match any character, including newlines. |
IGNORECASE, I | Performs a case-insensitive match. |
LOCALE, L | Matches according to the locale. |
MULTILINE, M | Makes ^ and $ match at the start and end of each line. |
VERBOSE, X (for 'extended') | Allows verbose regular expressions, which can be laid out to be cleaner and easier to understand. |
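The following sketch (Python 3 syntax) shows a few of these flags in action. The two-line sample text and the decimal-number pattern are assumptions made up for illustration.

```python
import re

text = "first line\nSecond line"  # hypothetical two-line input

# DOTALL: "." also matches the newline, so one match spans both lines.
dotall_match = re.match(r"first.*line", text, re.DOTALL)
print(dotall_match.group() == text)  # True

# MULTILINE: "^" also matches right after each newline.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
print(line_starts)  # ['first', 'Second']

# Several flags can be combined with "|".
combined = re.findall(r"^second", text, re.IGNORECASE | re.MULTILINE)
print(combined)  # ['Second']

# VERBOSE: whitespace and "#" comments inside the pattern are ignored.
number = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", re.VERBOSE)
verbose_result = number.search("pi is 3.14").group()
print(verbose_result)  # 3.14
```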
Finally, I will explain how to handle Japanese, which always trips up Japanese Pythonistas.
A plain string (str) seems to work in simple cases:
あ
matchOB = re.match("あ", "あ")
print(matchOB.group())
#あ
But what happens with a character class like the following? The output is garbled.
[あ-ゞ]
matchOB = re.match("[あ-ゞ]", "あ")
print(matchOB.group())
#?
It works if you use unicode.
u[あ-ゞ]
matchOB = re.match(u"[あ-ゞ]", u"あ")
print(matchOB.group())
#あ
So, when dealing with Japanese in Python 2, convert the string to the unicode type first.
str→unicode
u = "日本語".decode("utf-8")
print(type(u))
#<type 'unicode'>
print(u)
#日本語
unicode→str
s = u"日本語".encode("utf-8")
print(type(s))
#<type 'str'>
print(s)
#日本語
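The conversions above apply to Python 2. For comparison, here is a rough sketch of how the same ideas look in Python 3, where str is already Unicode text and encode()/decode() convert between str and bytes. The sample strings are assumptions chosen for illustration.

```python
import re

# Python 3: a str literal is already Unicode, so no u prefix is needed
# for a hiragana character class to work on whole characters.
m = re.match("[ぁ-ゞ]", "あ")
hiragana = m.group()
print(hiragana)  # あ

# encode() gives bytes, decode() gives str back.
data = "日本語".encode("utf-8")
print(type(data).__name__)  # bytes
restored = data.decode("utf-8")
print(restored)  # 日本語
```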
Some sources say to add re.U as an option when dealing with unicode, but in my tests the result was the same with or without it. re.U only changes how \w, \W, \b, \B, \d, \D, \s, and \S behave, so a pattern that uses none of them gives the same result either way. http://pepper.is.sci.toho-u.ac.jp/index.php?%A5%CE%A1%BC%A5%C8%2FPython%2F%B4%C1%BB%FA%A4%CE%C0%B5%B5%AC%C9%BD%B8%BD
re.U
s = u'今日は《いい》天気ですね'
# Flags go to re.compile(); the third positional argument of r.sub() is count
r = re.compile(u'《[^》]*》', re.U)
news = r.sub(u'*', s)
print(s + u' > ' + news)
#今日は《いい》天気ですね > 今日は*天気ですね
not(re.U)
s = u'今日は《いい》天気ですね'
r = re.compile(u'《[^》]*》')
news = r.sub(u'*', s)
print(s + u' > ' + news)
#今日は《いい》天気ですね > 今日は*天気ですね
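One detail that is easy to miss when comparing the two versions: the third positional argument of a compiled pattern's sub() is count, not flags, so flags have to be given to re.compile(). The sketch below (Python 3 syntax) illustrates this with re.IGNORECASE and a made-up input, since the effect is easier to see than with re.U.

```python
import re

# Flags are given when the pattern is compiled.
pat = re.compile(r"a", re.IGNORECASE)

all_replaced = pat.sub("*", "aAaA")
print(all_replaced)  # ****

# The extra argument of sub() limits the number of replacements.
two_replaced = pat.sub("*", "aAaA", count=2)
print(two_replaced)  # **aA

# A flag is just an integer, so passing it where count is expected
# silently means "replace at most that many times".
flag_value = int(re.U)
print(flag_value)  # 32
```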
The following pages were used as references:
Regular expression HOWTO - Python 3.4.1 documentation: http://docs.python.jp/3/howto/regex.html
A brief summary of regular expressions in Python - minus9d's diary: http://minus9d.hatenablog.com/entry/20120713/1342188160
7.2. re - Regular expression operation - Python 2.7ja1 documentation: http://docs.python.jp/2.7/library/re.html
Handling of Japanese in Python regular expression module - Notes on Linux during trial operation: http://d.hatena.ne.jp/kakurasan/20090424/p1
Handle Japanese with python regular expressions | taichino.com: http://taichino.com/programming/1272