Regular expressions in Python explained in detail

I. Introduction to Regular Expressions

A regular expression is a special sequence of characters that lets you conveniently check whether a string matches a given pattern. Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.

Regular expressions (REs) are essentially a small, highly specialized programming language embedded in Python and made available through the re module. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C. The compiled patterns can then be used to match, search, split, and replace text.

The re module gives the Python language full regular expression capabilities. The compile function generates a regular expression object based on a pattern string and optional flags. This object has a set of methods for regular expression matching and replacement. The re module also provides functions that are identical to these methods, using a pattern string as their first argument.
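As a minimal sketch of the equivalence described above (the pattern and sample text are made up for illustration), a method of a compiled object and the module-level function of the same name should return the same result:

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "cat bat rat"
pattern = re.compile(r"[cbr]at")

# Method on the compiled regular expression object.
from_method = pattern.findall(text)

# Module-level function: same operation, pattern string as the first argument.
from_function = re.findall(r"[cbr]at", text)

print(from_method)    # ['cat', 'bat', 'rat']
print(from_function)  # ['cat', 'bat', 'rat']
```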

II. Character Matching

1. Ordinary characters: most characters and letters will match themselves

>>> import re
>>> re.findall("alexsel","gtuanalesxalexselericapp")
['alexsel']
>>> re.findall("alexsel","gtuanalesxalexswxericapp")
[]
>>> re.findall("alexsel","gtuanalesxalexselwupeiqialexsel")
['alexsel', 'alexsel']

2. Metacharacters: . ^ $ * + ? { } [ ] | ( ) \

-. : matches any single character except a newline.

>>> re.findall("alexsel.w","aaaalexselaw")
['alexselaw']
# A dot matches exactly one character

-^ : matches only when the pattern that follows it is at the beginning of the string

>>> re.findall("^alexsel","gtuanalesxalexselgeappalexsel")
[]
>>> re.findall("^alexsel","alexselgtuanalesxalexselwgtappqialexsel")
['alexsel']
# "^" anchors the match to the beginning of the string, which is why it is written at the start of the pattern.

-$ : matches only when the pattern before it is at the end of the string being searched

>>> re.findall("alexsel$","alexselseguanalesxalexselganapp")
[]
>>> re.findall("alexsel$","alexselgtaanalesxalexsssiqialexsel")
['alexsel']

-* : applies to the character before it, which may be matched zero or more times.

>>> re.findall("alexsel*","aaaalexse")
['alexse']
>>> re.findall("alexsel*","aaaalexsel")
['alexsel']
>>> re.findall("alexsel*","aaaalexsellllll")
['alexsellllll']

-+ : matches the previous character one or more times

>>> re.findall("alexsel+","aaaalexselll")
['alexselll']
>>> re.findall("alexsel+","aaaalexsel")
['alexsel']
>>> re.findall("alexsel+","aaaalexse")
[]

-? : matches the previous character zero or one time; when the character is repeated, only one occurrence is consumed

>>> re.findall("alexsel?","aaaalexse")
['alexse']
>>> re.findall("alexsel?","aaaalexsel")
['alexsel']
>>> re.findall("alexsel?","aaaalexsellll")
['alexsel']

-{} : controls how many times the character before it is matched. A range {m,n} is inclusive at both ends, and within a range as many repetitions as possible are matched.

>>> re.findall("alexsel{3}","aaaalexselllll")
['alexselll']
>>> re.findall("alexsel{3}","aaaalexsell")
[]
>>> re.findall("alexsel{3}","aaaalexse")
[]
>>> re.findall("alexsel{3}","aaaalexselll")
['alexselll']
>>> re.findall("alexsel{3,5}","aaaalexsellllllll")
['alexselllll']
>>> re.findall("alexsel{3,5}","aaaalexselll")
['alexselll']
>>> re.findall("alexsel{3,5}","aaaalexsell")
[]
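To compare the quantifiers *, +, ? and {} side by side, here is a small sketch that runs each of them against the same text. The sample string and variable names are illustrative and not taken from the examples above.

```python
import re

# Illustrative sample text: an "a" followed by zero to four "b" characters.
text = "a ab abb abbb abbbb"

star = re.findall(r"ab*", text)              # b repeated 0 or more times
plus = re.findall(r"ab+", text)              # b repeated 1 or more times
optional = re.findall(r"ab?", text)          # b repeated 0 or 1 time
exactly_two = re.findall(r"ab{2}", text)     # b repeated exactly 2 times
two_to_three = re.findall(r"ab{2,3}", text)  # b repeated 2 to 3 times, as many as possible

print(star)          # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(plus)          # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)      # ['a', 'ab', 'ab', 'ab', 'ab']
print(exactly_two)   # ['abb', 'abb', 'abb']
print(two_to_three)  # ['abb', 'abbb', 'abbb']
```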

-\ :

Followed by a metacharacter, it removes that character's special meaning.
Followed by certain ordinary characters, it gives them a special meaning (for example \d).
Followed by a number, it refers back to the string matched by the group with that number (one group per pair of parentheses).
Prefix the pattern with r (a raw string) so that Python does not process the backslashes itself.

# \2 refers to the second group (eric)
>>> re.search(r"(alexsel)(eric)com\2","alexselericcomeric").group()
'alexselericcomeric'
>>> re.search(r"(alexsel)(eric)com\1","alexselericcomalexsel").group()
'alexselericcomalexsel'
>>> re.search(r"(alexsel)(eric)com\1\2","alexselericcomalexseleric").group()
'alexselericcomalexseleric'
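One common use of a numbered backreference is spotting an accidentally doubled word. The following is only a sketch under that assumption; the sentence and the variable names are hypothetical.

```python
import re

# Illustrative sentence with an accidentally doubled word.
sentence = "this is is a test"

# (\w+) captures a word; \1 requires exactly the same text again after a space.
doubled = re.search(r"\b(\w+) \1\b", sentence)

print(doubled.group())   # is is
print(doubled.group(1))  # is
```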

\d : matches any decimal digit; for ASCII text it is equivalent to the class [0-9]

>>> re.findall(r"\d","aaazz1111344444c")
['1', '1', '1', '1', '3', '4', '4', '4', '4', '4']
>>> re.findall(r"\d\d","aaazz1111344444c")
['11', '11', '34', '44', '44']
>>> re.findall(r"\d0","aaazz1111344444c")
[]
>>> re.findall(r"\d3","aaazz1111344444c")
['13']
>>> re.findall(r"\d4","aaazz1111344444c")
['34', '44', '44']

\D : matches any non-digit character; for ASCII text it is equivalent to the class [^0-9]

>>> re.findall(r"\D","aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', 'c']
>>> re.findall(r"\D\D","aaazz1111344444c")
['aa', 'az']
>>> re.findall(r"\D\d\D","aaazz1111344444c")
[]
>>> re.findall(r"\D\d\D","aaazz1z111344444c")
['z1z']

\s : matches any whitespace character; for ASCII text it is equivalent to the class [ \t\n\r\f\v]

>>> re.findall(r"\s","aazz1 z11..34c")
[' ']

\S :matches any non-white space character; it is equivalent to the class [^ \t\n\r\f\v]

\w : matches any alphanumeric character or underscore; for ASCII text it is equivalent to the class [a-zA-Z0-9_]

>>> re.findall(r"\w","aazz1z11..34c")
['a', 'a', 'z', 'z', '1', 'z', '1', '1', '3', '4', 'c']

\W :matches any non-alphanumeric character; it is equivalent to the class [^a-zA-Z0-9_]

\b : matches a word boundary, that is, the position between a word character (\w) and a non-word character, or the start or end of the string.

>>> re.findall(r"\babc\b","abc sdsadasabcasdsadasdabcasdsa")
['abc']
>>> re.findall(r"\balexsel\b","abc alexsel abcasdsadasdabcasdsa")
['alexsel']
>>> re.findall("\\balexsel\\b","abc alexsel abcasdsadasdabcasdsa")
['alexsel']
>>> re.findall("\balexsel\b","abc alexsel abcasdsadasdabcasdsa")  # without r, Python reads "\b" as a backspace character
[]

() : treats the characters inside the parentheses as a single group.

>>> re.search(r"a(\d+)","a222bz1144c").group()
'a222'
>>> re.findall("(ab)*","aabz1144c")
['', 'ab', '', '', '', '', '', '', '']
# (ab)* treats "ab" as one unit that may repeat zero or more times, and findall returns what the group captured.
# At index 0 the "a" is followed by another "a", so the group matches zero times and the result is ''.
# At index 1 "ab" matches. At each remaining position, and at the end of the string, the group again matches zero times, giving ''.
>>> re.search(r"a(\d+)","a222bz1144c").group()
'a222'
>>> re.search(r"a(\d+?)","a222bz1144c").group()  # +? is non-greedy: the minimum for + is 1 repetition
'a2'
>>> re.search(r"a(\d*?)","a222bz1144c").group()  # *? is non-greedy: the minimum for * is 0 repetitions
'a'

# Adding ? after a quantifier makes it non-greedy. If more of the pattern follows the quantifier,
# the non-greedy quantifier must still expand until the rest can match, so the result can look greedy.
>>> re.findall(r"a(\d+?)b","aa2017666bz1144c")
['2017666']
>>> re.search(r"a(\d*?)b","a222bz1144c").group()
'a222b'
>>> re.search(r"a(\d+?)b","a277722bz1144c").group()
'a277722b'
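The difference between greedy and non-greedy matching is easiest to see when the text contains more than one possible end point. The sketch below uses a made-up sentence with two quoted words; the names are illustrative only.

```python
import re

# Illustrative text containing two quoted words.
text = 'say "yes" or "no"'

# Greedy: .+ runs to the LAST closing quote it can reach.
greedy = re.findall(r'".+"', text)

# Non-greedy: .+? stops at the FIRST closing quote that lets the pattern match.
lazy = re.findall(r'".+?"', text)

# When a fixed character must follow, the non-greedy group still has to
# consume every digit before the "b" can match.
forced = re.search(r"a(\d+?)b", "a123b").group(1)

print(greedy)  # ['"yes" or "no"']
print(lazy)    # ['"yes"', '"no"']
print(forced)  # 123
```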

Inside a character set [ ], metacharacters lose their special meaning and stand for themselves (with a few exceptions).

>>> re.findall("a[.]d","aaaacd")
[]
>>> re.findall("a[.]d","aaaa.d")
['a.d']

The exceptions are:
[-] [^] [\]

[-]

# Matches any single character in the range a to z
>>> re.findall("[a-z]","aaaa.d")
['a', 'a', 'a', 'a', 'd']
>>> re.findall("[a-z]","aaazzzzzaaccc")
['a', 'a', 'a', 'z', 'z', 'z', 'z', 'z', 'a', 'a', 'c', 'c', 'c']
>>> re.findall("[1-3]","aaazz1111344444c")
['1', '1', '1', '1', '3']

[^]

# Matches every character except those in the range (here ^ means "not")
>>> re.findall("[^1-3]","aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', '4', '4', '4', '4', '4', 'c']
>>> re.findall("[^1-4]","aaazz1111344444c")
['a', 'a', 'a', 'z', 'z', 'c']

[\]

# Escape sequences such as \d keep their meaning inside a set
>>> re.findall(r"[\d]","aazz1144c")
['1', '1', '4', '4']

The metacharacters "[" and "]" specify a character class, which is the set of characters you want to match. Characters can be listed individually, or a range can be given as two characters separated by "-". For example, [abc] matches any of the characters "a", "b", or "c"; the range [a-c] expresses the same set. To match only lowercase letters, write [a-z]. Metacharacters are not active inside a class. For example, [akm$] matches any of the characters "a", "k", "m", or "$"; "$" is normally a metacharacter, but inside a character class it loses that role and is treated as an ordinary character.
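The character class examples named in the paragraph above ([abc], [a-c] and [akm$]) can be checked with a short sketch. The input strings are made up for illustration.

```python
import re

# Listing characters and giving a range describe the same set.
listed = re.findall(r"[abc]", "abcdef")
ranged = re.findall(r"[a-c]", "abcdef")

# Inside a class, "$" is an ordinary character rather than an end-of-string anchor.
literal_dollar = re.findall(r"[akm$]", "a$km^z")

print(listed)          # ['a', 'b', 'c']
print(ranged)          # ['a', 'b', 'c']
print(literal_dollar)  # ['a', '$', 'k', 'm']
```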

III. Python Regular Expression Functions and Their Parameters

match: re.match(pattern,string,flags=0)

flags: compile flags that modify how the regular expression is matched, e.g. whether matching is case sensitive.

>>> re.match("com","comwww.runcomoob").group()
'com'
>>> re.match("com","Comwww.runcomoob",re.I).group()
'Com'

flags: compile flags

re.I  makes matching case insensitive

>>> re.match("com","COM",re.I).group()
  'COM'

re.L  performs locale-aware matching

re.M  multi-line matching, affecting ^ and $

re.S  makes . match every character, including newlines

>>> re.findall(".","abc\nde")
['a', 'b', 'c', 'd', 'e']
>>> re.findall(".","abc\nde",re.S)
['a', 'b', 'c', '\n', 'd', 'e']

re.U  interprets characters according to the Unicode character set. This flag affects \w, \W, \b, \B

re.X  verbose mode: lets you write regular expressions in a more flexible, easier-to-read format
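Two of the compile flags, multi-line matching (re.M) and the verbose format (re.X), are easier to understand with an example. The sketch below is illustrative; the sample text and the digit pattern are assumptions, not part of the original examples.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ only matches at the very start of the string.
default_start = re.findall(r"^\w+", text)

# With re.M, ^ also matches right after each newline.
multiline_start = re.findall(r"^\w+", text, re.M)

# With re.X, whitespace in the pattern is ignored and "#" starts a comment.
verbose = re.compile(r"""
    (\d{3})   # first three digits
    -         # a literal hyphen
    (\d{4})   # last four digits
""", re.X)
parts = verbose.search("call 555-1234").groups()

print(default_start)    # ['first']
print(multiline_start)  # ['first', 'second']
print(parts)            # ('555', '1234')
```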

search: re.search(pattern,string,flags=0)

>>> re.search(r"\dcom","www.4comrunoob.5com").group()
'4com'

Difference between re.match and re.search

re.match only matches at the beginning of the string; if the beginning of the string does not match the regular expression, the match fails and the function returns None. re.search instead scans the entire string until it finds a match.
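The difference between matching only at the beginning and scanning the whole string can be sketched with the re.match and re.search functions on the same input. The string is the one used in the search example; the variable names are illustrative.

```python
import re

text = "www.4comrunoob.5com"

# re.match only tries position 0, where "w" is not a digit, so it returns None.
at_start = re.match(r"\dcom", text)

# re.search scans forward and stops at the first place the pattern fits.
anywhere = re.search(r"\dcom", text)

# re.match succeeds when the pattern does fit at the very beginning.
prefix = re.match(r"www", text)

print(at_start)          # None
print(anywhere.group())  # 4com
print(prefix.group())    # www
```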

When match or search succeeds, it returns a match object, which has the following methods:
-group() returns the string matched by the RE
-start() returns the start position of the match
-end() returns the end position of the match
-span() returns a tuple containing the position of the match (start, end)
-group() can also take one or more group numbers, in which case it returns the strings matched by those groups:
-a. group() returns the string matched by the whole RE.
-b. group(n,m) returns the strings matched by groups n and m. If a group number does not exist, an IndexError exception is raised.
-c. groups() returns a tuple containing the strings of all the groups in the regular expression, from group 1 up to the last group. groups() is usually called without arguments.
a = "123abc456"
re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(0)   #123abc456, the whole match
re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(1)   #123
re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(2)   #abc
re.search("([0-9]*)([a-z]*)([0-9]*)",a).group(3)   #456

In the code above, many calls are followed by .group(); here is what it does.

m = re.match("([abc])+", "abc")

In general, m.group(N) returns the characters matched by the Nth set of parentheses, and m.group() == m.group(0) == all matched characters, independent of parentheses; this is specified by the API. m.groups() returns the characters matched by every set of parentheses as a tuple: m.groups() == (m.group(1), m.group(2), ...). For the m above, m.group() is 'abc', while m.group(1) is 'c' because a repeated group keeps only its last match.
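The behaviour of group() and groups() on a match object can be checked with a short sketch. It reuses the two patterns from this section; the variable names m2 and missing are illustrative.

```python
import re

# One group repeated three times: only the last repetition is kept.
m = re.match(r"([abc])+", "abc")
whole = m.group()      # same as m.group(0)
last = m.group(1)
all_groups = m.groups()

# Three separate groups.
m2 = re.search(r"([0-9]*)([a-z]*)([0-9]*)", "123abc456")
first_and_third = m2.group(1, 3)
position = m2.span()

# Asking for a group number that does not exist raises IndexError.
try:
    m2.group(4)
    missing = False
except IndexError:
    missing = True

print(whole, last, all_groups)  # abc c ('c',)
print(m2.groups())              # ('123', 'abc', '456')
print(first_and_third)          # ('123', '456')
print(position)                 # (0, 9)
print(missing)                  # True
```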

sub and subn:

re.sub(pattern, repl, string, count=0)

>>> re.sub("g.t","have","I get A, I got B, I gut C")  # replace every match with have (. matches any character except a newline)
'I have A, I have B, I have C'
>>> re.sub("got","have","I get A, I got B, I gut C")
'I get A, I have B, I gut C'
>>> re.sub("g.t","have","I get A, I got B, I gut C",count=2)  # replace only the first two matches
'I have A, I have B, I gut C'
>>> re.sub("g.t","have","I get A, I got B, I gut C",count=1)
'I have A, I got B, I gut C'
>>> re.subn("g.t","have","I get A, I got B, I gut C")  # subn also returns the number of replacements made
('I have A, I have B, I have C', 3)
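In a script, the tuple returned by re.subn is usually unpacked, and the replacement string of re.sub may refer back to a group with \1. The sketch below assumes the pattern g.t and the same sentence as the session above; the group reference in the replacement is an extra illustration.

```python
import re

text = "I get A, I got B, I gut C"

# count limits how many matches are replaced (passed by keyword here).
first_only = re.sub(r"g.t", "have", text, count=1)

# subn returns (new_string, number_of_replacements).
new_text, replaced = re.subn(r"g.t", "have", text)

# \1 in the replacement inserts whatever the first group matched.
swapped = re.sub(r"g(.)t", r"h\1t", text)

print(first_only)  # I have A, I got B, I gut C
print(new_text)    # I have A, I have B, I have C
print(replaced)    # 3
print(swapped)     # I het A, I hot B, I hut C
```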

re.compile(strPattern[, flag]):

This function is a factory for Pattern objects: it compiles a regular expression given as a string into a Pattern object.

The second parameter, flag, is the matching mode. Values can be combined with the bitwise OR operator '|' so that several take effect at once, for example re.I | re.M. Compiling frequently used regular expressions into regular expression objects can improve efficiency to some extent.

An example of a regular expression object:

>>> text = "JGood is a handsome boy, he is cool, clever, and so on..."
>>> regex = re.compile(r"\w*oo\w*")
>>> print(regex.findall(text))
['JGood', 'cool']
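Flags can be combined with | when calling re.compile, and the resulting object can be reused for several operations. A minimal sketch, with a hypothetical pattern and sample text:

```python
import re

# Case-insensitive and multi-line at the same time.
line_start = re.compile(r"^line", re.I | re.M)

sample = "Line one\nLINE two\nnot a line"

# The same compiled object is reused for two different operations.
matches = line_start.findall(sample)
first_span = line_start.search(sample).span()

print(matches)     # ['Line', 'LINE']
print(first_span)  # (0, 4)
```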

split:

>>> p = re.compile(r"\d+")    # +: matches the previous character one or more times
>>> p.split("one1two2three3four4")
['one', 'two', 'three', 'four', '']
>>> re.split(r"\d+","one1two2three3four4")
['one', 'two', 'three', 'four', '']
>>> re.split(r"\d+","4one1two2three3four4")  # a separator at either end of the string leaves an empty string on that side
['', 'one', 'two', 'three', 'four', '']
>>> re.split("[bc]","abcd")  # two adjacent separators leave an empty string between them
['a', '', 'd']
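The empty strings that re.split produces at the edges are often filtered out afterwards, and the maxsplit argument limits the number of cuts. The sketch below shows both; the variable names are illustrative.

```python
import re

text = "one1two2three3four4"

parts = re.split(r"\d+", text)

# Drop the empty strings left by separators at the edges.
non_empty = [part for part in parts if part]

# maxsplit=2 cuts at the first two separators and keeps the rest intact.
limited = re.split(r"\d+", text, maxsplit=2)

print(parts)      # ['one', 'two', 'three', 'four', '']
print(non_empty)  # ['one', 'two', 'three', 'four']
print(limited)    # ['one', 'two', 'three3four4']
```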

finditer():

>>> p = re.compile(r"\d+")
>>> iterator = p.finditer("12 drumm44ers drumming, 11 ... 10 ...")
>>> iterator
<callable-iterator object at 0x02626990>
>>> for match in iterator:
...   print((match.group(), match.span()))  # each number and where it appears
...
('12', (0, 2))
('44', (8, 10))
('11', (24, 26))
('10', (31, 33))
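Outside the interactive interpreter, the match objects produced by finditer are typically collected in a loop or comprehension. A minimal sketch using the same text as the session above:

```python
import re

p = re.compile(r"\d+")
text = "12 drumm44ers drumming, 11 ... 10 ..."

# finditer yields one match object per match, lazily, in order of appearance.
results = [(m.group(), m.span()) for m in p.finditer(text)]

for number, span in results:
    print(number, span)
```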

Because Python string literals also use the backslash as an escape character, special sequences in a regular expression would otherwise need to be escaped twice. With a raw string (r"..."), you write the regular expression exactly as it is, with no extra escaping.

>>> re.findall(r"\d","www4dd6")
['4', '6']
>>> re.findall("\\d","www4dd6")
['4', '6']
>>> re.findall("\d","www4dd6")
['4', '6']
# "\d" works here only because \d has no special meaning as a Python string escape, so the backslash is kept as is. The first two forms are the proper way to write it; recent Python versions warn about invalid escapes such as "\d".
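The effect of the r prefix can be checked directly by comparing the strings that Python actually builds. This is a small sketch; the variable names are illustrative.

```python
import re

raw = r"\d"        # two characters: a backslash and d
escaped = "\\d"    # the same two characters, written with an escaped backslash

same = raw == escaped
digits = re.findall(raw, "www4dd6")

# "\b" is different: Python turns it into a single backspace character,
# so only the raw (or double-escaped) form reaches the regex engine as \b.
backspace = "\b"
boundary = r"\b"
lengths = (len(backspace), len(boundary))

print(same)     # True
print(digits)   # ['4', '6']
print(lengths)  # (1, 2)
```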

Word boundaries

>>> re.findall(r"\babc","abcsd abc")
['abc', 'abc']
>>> re.findall(r"abc\b","abcsd abc")
['abc']
>>> re.findall(r"abc\b","abcsd abc*")
['abc']
>>> re.findall(r"\babc","*abcsd*abc")
['abc', 'abc']
# A word boundary does not have to be a space; it can also be any special character other than a letter, digit, or underscore

Summary

That concludes this short introduction to regular expressions in Python. I hope it helps; if you have any questions, feel free to leave me a message and I will reply promptly.