SoFunction
Updated on 2024-11-12

Python re module (regular expressions)

The re module is mainly used for string and text processing: finding, searching, replacing, and so on.

Let's first review the basic regular expression metacharacters:

. : matches any single character except a newline

*: matches zero or more repetitions of the preceding character; it is greedy, so it takes as many as possible

+: matches one or more repetitions of the character before the +

|: matches the expression before or after the |

^: matches at the beginning of the string (and of each line in multi-line mode)

$: matches at the end of the string (and of each line in multi-line mode)

?: matches zero or one occurrence of the character before the ?

\: escapes the character after the \

[]: matches any single character listed in the [], e.g. [0-9] matches any digit from 0 to 9

(): treats the contents of the () as a single group

{}: repeats the preceding item the number of times given in the {}, e.g. 100[0-9]{3} matches 100 followed by any three digits (100000 to 100999)
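To make the list above concrete, here is a small sketch that runs a few of these metacharacters through re.findall(). The sample strings are made up for illustration and are not from the original article.

```python
import re

# "." needs exactly one character between "a" and "c"
dot = re.findall(r"a.c", "abc a-c ac")                 # ['abc', 'a-c']

# "*" allows zero or more "b", "+" requires at least one
star = re.findall(r"ab*", "a ab abbb")                 # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")                 # ['ab', 'abbb']

# "?" makes the "u" optional
optional = re.findall(r"colou?r", "color colour")      # ['color', 'colour']

# "|" matches either alternative
either = re.findall(r"cat|dog", "cat dog bird")        # ['cat', 'dog']

# "{3}" requires exactly three digits after the literal 100
counted = re.findall(r"100[0-9]{3}", "100123 1009 100999")  # ['100123', '100999']

for result in (dot, star, plus, optional, either, counted):
    print(result)
```

Note that "1009" is skipped in the last case because only one digit follows the literal 100.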

Special sequences in Python that begin with \:

Special sequence : Meaning
\A : matches only at the beginning of the string
\Z : matches only at the end of the string
\b : matches the empty string at the beginning or end of a word
\B : matches the empty string that is not at the beginning or end of a word
\d : any digit, equivalent to [0-9] for ASCII text
\D : any non-digit, equivalent to [^0-9]
\s : any whitespace character, equivalent to [ \t\n\r\f\v]
\S : any non-whitespace character, equivalent to [^ \t\n\r\f\v]
\w : any letter, digit or underscore, equivalent to [a-zA-Z0-9_] for ASCII text
\W : any character that is not a letter, digit or underscore, equivalent to [^a-zA-Z0-9_]
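A quick way to get a feel for these sequences is to try them on one string. The sketch below uses an invented sample string; the variable names are only for illustration.

```python
import re

text = "order_42 shipped on 2024-11-12"

digits = re.findall(r"\d+", text)   # ['42', '2024', '11', '12']
words = re.findall(r"\w+", text)    # ['order_42', 'shipped', 'on', '2024', '11', '12']
parts = re.split(r"\s+", text)      # ['order_42', 'shipped', 'on', '2024-11-12']

# \b only matches "on" when it stands alone as a word
whole_word = re.findall(r"\bon\b", "on and on, not onion")   # ['on', 'on']

# \B only matches "on" when it sits inside a longer word
inside_word = re.findall(r"\Bon\B", "on bonus onion")        # ['on'] (from "bonus")

# \A and \Z anchor to the very start and end of the string
starts = re.search(r"\Aorder", text) is not None             # True
ends = re.search(r"12\Z", text) is not None                  # True

print(digits, words, parts, whole_word, inside_word, starts, ends)
```

Because \w includes the underscore, "order_42" comes back as a single word.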

Regular expression syntax table

Syntax : Meaning : Example
"." : any character except a newline
"^" : start of the string : '^hello' matches 'helloworld' but not 'aaaahellobbbb'
"$" : end of the string : works like "^", but anchored at the end
"*" : 0 or more of the preceding character (greedy) : '<.*>' matches all of '<title>chinaunix</title>'
"+" : 1 or more of the preceding character (greedy) : same idea as "*"
"?" : 0 or 1 of the preceding character (greedy) : same idea as "*"
*?, +?, ?? : non-greedy versions of the three above, which stop at the shortest match : '<.*?>' matches only '<title>'
{m,n} : repeat the preceding character m to n times; {m} is also allowed : a{6} matches 6 a's, a{2,4} matches 2 to 4 a's
{m,n}? : repeat the preceding character m to n times, taking as few as possible : on 'aaaaaa', a{2,4}? matches only 2 a's
"\" : escapes a special character or introduces a special sequence
[] : a character set : [0-9], [a-z], [A-Z], [^0]
"|" : A|B matches either A or B (the "or" operation)
(...) : groups the expression in parentheses and captures what it matches
(?#...) : a comment; it is ignored
(?=...) : matches if ... matches next, but does not consume any of the string : 'hello(?=test)' matches hello in hellotest
(?!...) : matches if ... does not match next : 'hello(?!test)' matches hello only if it is not followed by test
(?<=...) : matches if preceded by ... (must be fixed length) : '(?<=hello)test' matches test in hellotest
(?<!...) : matches if not preceded by ... (must be fixed length) : '(?<!hello)test' does not match test in hellotest
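The greedy, non-greedy and lookaround entries are the ones that usually cause confusion, so here is a minimal sketch that runs the table's own examples. The variable names are chosen just for this illustration.

```python
import re

html = "<title>chinaunix</title>"

# Greedy: ".*" grabs as much as possible before the last ">"
greedy = re.match(r"<.*>", html).group()     # '<title>chinaunix</title>'
# Non-greedy: ".*?" stops at the first ">"
lazy = re.match(r"<.*?>", html).group()      # '<title>'

# Counted repetition, greedy and non-greedy
most = re.match(r"a{2,4}", "aaaaaa").group()    # 'aaaa'
least = re.match(r"a{2,4}?", "aaaaaa").group()  # 'aa'

# Lookahead: check what follows without consuming it
ahead = re.search(r"hello(?=test)", "hellotest").group()    # 'hello'
not_ahead = re.search(r"hello(?!test)", "hellotest")        # None

# Lookbehind: check what comes before without consuming it
behind = re.search(r"(?<=hello)test", "hellotest").group()  # 'test'
not_behind = re.search(r"(?<!hello)test", "hellotest")      # None

print(greedy, lazy, most, least, ahead, not_ahead, behind, not_behind)
```

In the lookahead case the match is only 'hello': the 'test' part is checked but stays out of the result.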

Matching flags and their meanings

Flag : Meaning
re.I (re.IGNORECASE) : ignore case
re.L (re.LOCALE) : make \w, \W, \b, \B, \s, \S depend on the current locale settings
re.M (re.MULTILINE) : multi-line mode; "^" and "$" also match at the start and end of each line
re.S (re.DOTALL) : make the "." metacharacter match a newline as well
re.U (re.UNICODE) : make \w, \W, \b, \B, \d, \D, \s, \S follow Unicode character properties
re.X (re.VERBOSE) : ignore whitespace in the pattern and allow comments starting with "#"
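Flags are passed as the last argument of the re functions. The following is a small sketch of the most commonly used ones (re.I, re.M, re.S, re.X) on an invented two-line string.

```python
import re

text = "Hello\nhello world"

# re.I: case-insensitive matching
any_case = re.findall(r"hello", text, re.I)        # ['Hello', 'hello']

# re.M: "^" matches at the start of every line, not just the string
first_only = re.findall(r"^\w+", text)             # ['Hello']
each_line = re.findall(r"^\w+", text, re.M)        # ['Hello', 'hello']

# re.S: "." also matches the newline between the two lines
without_s = re.search(r"Hello.hello", text)        # None
with_s = re.search(r"Hello.hello", text, re.S)     # a match object

# re.X: whitespace in the pattern is ignored and "#" starts a comment
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    """, re.X)
year_month = date_pattern.search("2024-11").groups()   # ('2024', '11')

print(any_case, first_only, each_line, without_s, with_s is not None, year_month)
```

Several flags can be combined with "|", for example re.I | re.M.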

Sample text (a line taken from the password file under Linux)

man:x:6:12:man:/var/cache/man:/bin/nologin

The re module has three search functions. Each accepts three parameters: the pattern, the string to search, and optional flags. search() and match() return a match object if the pattern matches and None if it does not; findall() returns a list.

findall(): finds every substring that matches the regular expression and returns them as a list of strings.

search(): scans the whole string and returns a match object for the first match.

match(): matches only at the beginning of the string and does not look any further; returns a match object.
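The difference between the three functions can be sketched on the sample line directly, without reading it from a file. The line is hard-coded here as an assumption to keep the example self-contained.

```python
import re

# Sample line, hard-coded instead of being read from the "text" file
line = "man:x:6:12:man:/var/cache/man:/bin/sh"

# findall() returns every match as a list of strings
all_man = re.findall("man", line)       # ['man', 'man', 'man']

# match() only looks at the beginning of the string
at_start = re.match("man", line)        # a match object
not_at_start = re.match("bin", line)    # None, because the line starts with "man"

# search() scans the whole string for the first match
anywhere = re.search("bin", line)       # a match object

print(all_man)
print(at_start is not None, not_at_start, anywhere is not None)
```

Since match() and search() can return None, it is safer to test the result before calling methods on it.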

lovelinux@LoveLinux:~/py/boke$ cat text 
man:x:6:12:man:/var/cache/man:/bin/sh
lovelinux@LoveLinux:~/py/boke$ cat 
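#!/usr/bin/env python
# coding: utf-8
import re

with open('text', 'r') as txt:
    # Read the whole sample file into one string
    f = txt.read()
    # match() only checks the start of the string, so this prints None
    print(re.match('bin', f))
    # search() scans the whole string; end() is the index just past the match
    print(re.search('bin', f).end())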
lovelinux@LoveLinux:~/py/boke$ python  
None
34
lovelinux@LoveLinux:~/py/boke$ vim 
lovelinux@LoveLinux:~/py/boke$ python  
None
<_sre.SRE_Match object at 0x7f12fc9f9ed0>

The returned match object has, among others, these two methods:

start(): returns the index in the string where the match begins

end(): returns the index just past the last character of the match

lovelinux@LoveLinux:~/py/boke$ python  
None
31
34
lovelinux@LoveLinux:~/py/boke$ cat  
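#!/usr/bin/env python
# coding: utf-8
import re

with open('text', 'r') as txt:
    # Read the whole sample file into one string
    f = txt.read()
    # match() only checks the start of the string, so this prints None
    print(re.match('bin', f))
    # search() finds 'bin' later in the line; print where it starts and ends
    print(re.search('bin', f).start())
    print(re.search('bin', f).end())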
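Calling .start() or .end() directly on the result raises an AttributeError when nothing matches, because the result is None. A slightly more defensive sketch, again with the sample line hard-coded as an assumption, stores the match object first and checks it:

```python
import re

# Sample line, hard-coded instead of being read from the "text" file
line = "man:x:6:12:man:/var/cache/man:/bin/sh"

m = re.search("bin", line)
if m is not None:
    begin = m.start()          # 31
    finish = m.end()           # 34
    both = m.span()            # (31, 34), the same two values as a tuple
    found = line[begin:finish] # 'bin', slicing with start/end gives the matched text
    print(begin, finish, both, found)
else:
    print("no match")
```

span() is a standard match object method that returns start() and end() together.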