Getting Started with Regular Expression Functions in Python [Classic]

This article introduces the regular expression functions available in Python. It is shared here for your reference:

Preface:

First, what is a regular expression?

Suppose we want to determine whether the string "adi_e32fv,Ls" contains the substring "e32f". Or suppose we have a txt file containing millions of names, and we want to find every name whose surname is "Wang" and whose given name ends in "Wu" ("five"), then print them out. The results would be names such as "Wang Wu", "Wang Xiaowu", "Wang Dawu", and so on.

Previously we would use string functions to find them, but the code would be complicated. With regular expressions, a single statement such as re.findall('Wang.*?Wu', txt1) is enough. Regular expressions are also fundamental knowledge for writing a web crawler, where they can be used to collect the URLs in an HTML page that meet certain string requirements. The following is a personal summary of some of the basics of regular expressions.
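
As a minimal sketch of the two tasks described above: the list of names is made up for illustration, and because the names are romanized here, the pattern is written a little differently from the one-line statement in the text.

```python
import re

# Task 1: does the string contain the substring "e32f"?
s = "adi_e32fv,Ls"
contains = re.search("e32f", s) is not None

# Task 2: find names with the surname "Wang" whose given name ends in "wu".
# These names are illustrative only, not read from a real file.
names = ["Wang Wu", "Li Si", "Wang Xiaowu", "Zhang Wu", "Wang Dawu", "Wang Wei"]
pattern = re.compile(r"^Wang .*[Ww]u$")
matches = [name for name in names if pattern.match(name)]

print(contains)
print(matches)
```

The functions used here (search, compile, match) are each explained later in the article.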

(Environment: 32-bit Windows 8; tools: Python 2.7.9 + Eclipse.)

Main article:

1. First, import Python's re module:

import re

2. Metacharacters: . ^ $ * + ? {} [] \ | ()

The re.findall(pattern, string) function returns a list of all substrings of string that match pattern. For example, matching 'dit' in the string 'dit dot det,dct dit dot' gives:

str1 = 'dit dot det,dct dit dot'
print(re.findall('dit', str1))

Results:

['dit', 'dit']

Role of |: 'dit|dct' means dit or dct.

str1 = 'dit dot det,dct dit dot'
print(re.findall('dit|dct', str1))

Results:

['dit', 'dct', 'dit']

Role of []: [ic] denotes i or c, so 'd[ic]t' matches both dit and dct and is equivalent to 'dit|dct':

str1 = 'dit dot det,dct dit dot'
print(re.findall('d[ic]t', str1))

Results:

['dit', 'dct', 'dit']
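
The equivalence between a character class and alternation can be checked directly. The sketch below also shows one case where only alternation works; the pattern 'dot|det,' is an illustrative choice, not part of the original examples.

```python
import re

str1 = 'dit dot det,dct dit dot'

# A character class picks one character from a set...
by_class = re.findall('d[ic]t', str1)
# ...which gives the same matches as spelling out each alternative.
by_alternation = re.findall('dit|dct', str1)

# Alternation also handles alternatives of different lengths,
# which a single character class cannot express.
mixed = re.findall('dot|det,', str1)

print(by_class == by_alternation)
print(mixed)
```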

Role 1 of ^: inside brackets, as in [^ic], ^ denotes negation, i.e., any character except i and c:

str1 = 'dit dot det,dct dit dot'
print(re.findall('d[^ic]t', str1))

Results:

['dot', 'det', 'dot']

Role 2 of ^: outside brackets, ^dit matches dit only at the beginning of the string. dct is not at the beginning, so ^dct finds nothing:

str1 = 'dit dot det,dct dit dot'
print(re.findall('^dit', str1))
print(re.findall('^dct', str1))

Results:

['dit']
[]

Role of $: dot$ matches dot only at the end of the string. dct is not at the end, so dct$ finds nothing:

str1 = 'dit dot det,dct dit dot'
print(re.findall('dot$', str1))
print(re.findall('dct$', str1))

Results:

['dot']
[]
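
The two anchors can be wrapped in small helper functions for checking how a string begins or ends. This is only a sketch: the helper names starts_with_dit and ends_with_dot are hypothetical, and it relies on an empty result list being false.

```python
import re

def starts_with_dit(s):
    # '^' anchors the pattern to the beginning of the string.
    return bool(re.findall('^dit', s))

def ends_with_dot(s):
    # '$' anchors the pattern to the end of the string.
    return bool(re.findall('dot$', s))

first = 'dit dot det,dct dit dot'
second = 'dot dit'

print(starts_with_dit(first))
print(ends_with_dot(first))
print(starts_with_dit(second))
print(ends_with_dot(second))
```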

Role of .: 'd.t' matches d and t with any single character between them:

str1 = 'dit dot det,dct dit dot'
print(re.findall('d.t', str1))

Results:

['dit', 'dot', 'det', 'dct', 'dit', 'dot']
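
One detail worth knowing is that '.' does not match a newline by default. The sketch below uses made-up input strings to show this, together with the re.DOTALL flag that changes the behavior.

```python
import re

# '.' matches any single character, including spaces and punctuation...
any_char = re.findall('d.t', 'd t d-t d9t')
# ...but by default it does not match a newline character.
across_newline = re.findall('d.t', 'd\nt')
# Passing the re.DOTALL flag lets '.' match a newline as well.
with_dotall = re.findall('d.t', 'd\nt', flags=re.DOTALL)

print(any_char)
print(across_newline)
print(with_dotall)
```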

Role of +: 'di+t' means that there are one or more 'i's between d and t:

str1 = 'd dt dit diit det'
print(re.findall('di+t', str1))

Results:

['dit', 'diit']

Role of *: 'di*t' means that there are zero or more 'i's between d and t:

str1 = 'd dt dit diit det'
print(re.findall('di*t', str1))

Results:

['dt', 'dit', 'diit']

Often, '.' is combined with '+' or '*'. '.+' matches one or more arbitrary characters, and '.*' matches zero or more arbitrary characters:

str1 = 'd dt dit diit det'
print(re.findall('d.+t', str1))
print(re.findall('d.*t', str1))

Results:

['d dt dit diit det']
['d dt dit diit det']

Role 1 of ?: '+' and '*' are greedy. For the string 'dit dot det,dct dit dot', the substrings 'dit' and 'dot' also satisfy 'd.+t', but findall returns the longest substring that satisfies the pattern, which is the whole string 'dit dot det,dct dit dot'. This is called greedy matching. To get the shortest matches instead, add '?' after '+' (the same applies to '*'):

str1 = 'dit dot det,dct dit dot'
print(re.findall('d.+?t', str1))

Results:

['dit', 'dot', 'det', 'dct', 'dit', 'dot']
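
The difference between greedy and non-greedy matching matters when collecting links from HTML, as mentioned in the preface. The sketch below assumes a small made-up HTML fragment with placeholder URLs.

```python
import re

# Illustrative HTML fragment; the URLs are placeholders.
html = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'

# Greedy: '.*' runs on to the last double quote in the string,
# so both links end up inside a single match.
greedy = re.findall('href=".*"', html)

# Non-greedy: '.*?' stops at the first closing quote after each href.
lazy = re.findall('href=".*?"', html)

print(len(greedy))
print(lazy)
```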

Role 2 of ?: 'di?t' means that the i is optional (zero or one occurrence), so both dt and dit satisfy the pattern:

str1 = 'd dt dit diit det'
print(re.findall('di?t', str1))

Results:

['dt', 'dit']

Role 1 of {}: 'di{n}t' means that there are exactly n 'i's between d and t:

str1 = 'dt dit diit diiit diiiit'
print(re.findall('di{2}t', str1))

Results:

['diit']

Role 2 of {}: 'di{n,m}t' means that there are between n and m 'i's between d and t:

str1 = 'dt dit diit diiit diiiit'
print(re.findall('di{1,3}t', str1))

Results:

['dit', 'diit', 'diiit']

Both n and m can be omitted: {n,} means n or more; {,m} means 0 to m; {,} means any number, the same as '*':

str1 = 'dt dit diit diiit diiiit'
print(re.findall('di{1,}t', str1))
print(re.findall('di{,3}t', str1))
print(re.findall('di{,}t', str1))

Results:

['dit', 'diit', 'diiit', 'diiiit']
['dt', 'dit', 'diit', 'diiit']
['dt', 'dit', 'diit', 'diiit', 'diiiit']
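
The braces are the general form of the shorter quantifiers seen earlier. As a quick check of that relationship, the following sketch compares each shorthand with its brace equivalent on the same string.

```python
import re

str1 = 'dt dit diit diiit diiiit'

# '+' is shorthand for {1,}: one or more.
plus_same = re.findall('di+t', str1) == re.findall('di{1,}t', str1)
# '*' is shorthand for {0,}: zero or more.
star_same = re.findall('di*t', str1) == re.findall('di{0,}t', str1)
# '?' is shorthand for {0,1}: zero or one.
opt_same = re.findall('di?t', str1) == re.findall('di{0,1}t', str1)

print(plus_same)
print(star_same)
print(opt_same)
```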

Role 1 of \: escape a metacharacter so that it is treated as an ordinary character:

str1 = '^abc ^abc'
print(re.findall('^abc', str1))
print(re.findall(r'\^abc', str1))

Results:

[]
['^abc', '^abc']
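
When the text to match contains several metacharacters, escaping each one by hand is error-prone. A minimal sketch, assuming a made-up price string, compares manual escaping with the standard re.escape() function.

```python
import re

price = 'cost: $3.50 (approx.)'  # illustrative input

# '$' and '.' are metacharacters, so they must be escaped to match literally.
manual = re.findall(r'\$3\.50', price)

# re.escape() escapes the metacharacters in a literal string for us.
literal = '$3.50'
auto = re.findall(re.escape(literal), price)

# Without escaping, '$' means "end of string", so nothing can match.
unescaped = re.findall('$3.50', price)

print(manual)
print(auto)
print(unescaped)
```

Writing patterns as raw strings (r'...') keeps Python itself from interpreting the backslashes before the re module sees them.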

Role 2 of \: predefined character classes, such as \d (a digit) and \w (a letter, digit, or underscore):

str1 = '12 abc 345 efgh'
print(re.findall(r'\d+', str1))
print(re.findall(r'\w+', str1))

Results:

['12', '345']
['12', 'abc', '345', 'efgh']
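
A few more predefined classes follow the same idea. The sketch below applies \D (non-digit) and \s (whitespace) to the same string, and uses an explicit character class to pick out letters only.

```python
import re

str1 = '12 abc 345 efgh'

digits = re.findall(r'\d+', str1)      # runs of digits
non_digits = re.findall(r'\D+', str1)  # runs of anything except digits
spaces = re.findall(r'\s', str1)       # single whitespace characters
letters = re.findall(r'[a-z]+', str1)  # letters only, via a character class

print(digits)
print(non_digits)
print(spaces)
print(letters)
```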

Role of (): after a match, findall outputs only the content captured inside the '()' rather than the whole match. With more than one group, each result is a tuple:

str1 = '12abcd34'
print(re.findall('12abcd34', str1))
print(re.findall('1(2a)bcd34', str1))
print(re.findall('1(2a)bc(d3)4', str1))

Results:

['12abcd34']
['2a']
[('2a', 'd3')]
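
Groups are most useful for pulling fields out of structured text. The sketch below uses a made-up 'name=value' string, and also shows the (?:...) form, which groups without capturing.

```python
import re

str1 = 'width=640, height=480'  # illustrative input

# No group: findall returns each whole match.
whole = re.findall(r'\w+=\d+', str1)
# One group: findall returns only the captured part.
values = re.findall(r'\w+=(\d+)', str1)
# Two groups: findall returns one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', str1)
# (?:...) groups without capturing, so the whole match is returned again.
non_capturing = re.findall(r'(?:\w+)=\d+', str1)

print(whole)
print(values)
print(pairs)
print(non_capturing)
```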

3. Main functions in the re module: findall(), finditer(), match(), search(), compile(), split(), sub(), subn().

re.findall(pattern, string, flags=0)

Function: searches string from left to right for all substrings that match pattern and returns them as a list.

str1 = 'ab cd'
print(re.findall(r'\w+', str1))

Result: ['ab', 'cd']

re.finditer(pattern, string, flags=0)

Function: same as re.findall(), but the results are returned as an iterator of match objects.

str1 = 'ab cd'
iter1 = re.finditer(r'\w+', str1)
for a in iter1:
  print('%s %s' % (a.group(), a.span()))

Results:

ab (0, 2)
cd (3, 5)

(Note: a.group() returns the string that satisfies the pattern, and a.span() returns the start and end positions of that string.)
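
Because each item is a match object, the matched text and its position can be collected together in one pass. A small sketch with a made-up input string:

```python
import re

text = 'ab cd ef'  # illustrative input

# Each item yielded by finditer is a match object, so the matched text
# and its start and end positions are all available from it.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\w+', text)]

print(positions)
```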

re.search(pattern, string, flags=0)

Function: searches string from left to right for the first substring that matches pattern. Returns None if there is no match; otherwise returns a match object.

str1 = 'ab cd'
result = re.search('cd', str1)
if result is None:
  print('None')
else:
  print('%s %d %d' % (result.group(), result.start(), result.end()))

Result: cd 3 5

re.match(pattern, string, flags=0)

Function: checks whether the beginning of string matches pattern. If it does, returns a match object; otherwise returns None.

str1 = 'ab cd'
result = re.match('cd', str1)
if result is None:
  print('None')
else:
  print('%s %d %d' % (result.group(), result.start(), result.end()))

Result: None
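
The difference between the two functions is easiest to see side by side. In this sketch, the last line shows that search() with a leading ^ behaves like match() for this single-line string.

```python
import re

str1 = 'ab cd'

search_hit = re.search('cd', str1)   # scans the whole string
match_miss = re.match('cd', str1)    # only tries the beginning
match_hit = re.match('ab', str1)     # 'ab' is at the beginning
anchored = re.search('^cd', str1)    # '^' restricts search to the beginning

print(search_hit.span())
print(match_miss)
print(match_hit.group())
print(anchored)
```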

re.compile(pattern, flags=0)

Function: compiles pattern and returns a pattern object. Compiling a regular expression once and reusing the object can improve matching speed when the same pattern is applied many times.

str1 = 'ab cd'
pre = re.compile('ab')
print(pre.findall(str1))

Result: ['ab']
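
A typical use is to compile a pattern once and apply it to many strings. The sketch below assumes a small made-up list of lines and counts the words in each.

```python
import re

# Compile once, then reuse the pattern object on many strings.
word = re.compile(r'\w+')

lines = ['ab cd', 'ef', 'gh ij kl']  # illustrative data
counts = [len(word.findall(line)) for line in lines]

# A pattern object offers the same operations as the module-level
# functions, e.g. search(), match(), split() and sub().
first = word.search('  xy z').group()

print(counts)
print(first)
```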

re.split(pattern, string, maxsplit=0, flags=0)

Function: splits string at every position where pattern matches:

str1 = 'ab.c.de'
str2 = '12+34-56*78/90'
print(re.split(r'\.', str1))
print(re.split(r'[\+\-\*/]', str2))

Results:

['ab', 'c', 'de']
['12', '34', '56', '78', '90']
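
Two further behaviors of re.split() are worth a short sketch: the maxsplit parameter from the signature above, and the effect of wrapping the pattern in a capturing group.

```python
import re

str2 = '12+34-56*78/90'

# maxsplit limits how many splits are made; the rest stays in one piece.
limited = re.split(r'[+\-*/]', str2, maxsplit=2)

# Wrapping the pattern in () keeps the separators in the result.
with_ops = re.split(r'([+\-*/])', str2)

print(limited)
print(with_ops)
```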

re.sub(pattern, repl, string, count=0, flags=0)

Function: replaces every substring of string that matches pattern with repl:

str1 = 'abcde'
print(re.sub('bc', '123', str1))

Result: a123de

re.subn(pattern, repl, string, count=0, flags=0)

Function: same as re.sub(), but the result is a tuple that also contains the number of replacements made.

str1 = 'abcdebce'
print(re.subn('bc', '123', str1))

Result: ('a123de123e', 2)
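
The count and repl parameters allow a few useful variations. The following sketch shows a limited replacement, a replacement that refers back to captured groups, and a replacement computed by a function.

```python
import re

str1 = 'abcdebce'

# count limits the number of replacements (0 means replace all).
once = re.sub('bc', '123', str1, count=1)

# repl may refer to captured groups: \1 is the text matched by group 1.
swapped = re.sub(r'(b)(c)', r'\2\1', str1)

# repl may also be a function that receives each match object.
upper = re.sub('bc', lambda m: m.group().upper(), str1)

# subn returns the new string together with the replacement count.
new_text, n = re.subn('bc', '123', str1)

print(once)
print(swapped)
print(upper)
print(n)
```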

I hope that what I have said in this article will help you in Python programming.