Python 3 Regular Expression Basics (Xuefeng Liao)

Strings are among the most commonly used data structures in programming, and the need to manipulate them is almost ubiquitous. For example, to determine whether a string is a legitimate Email address, you could programmatically extract the substrings before and after the @ and then check whether they are a word and a domain name, respectively. But this is troublesome, and the code is difficult to reuse.

Regular expressions are a powerful weapon for matching strings. The idea is to use a descriptive language to define a rule for strings: any string that satisfies the rule is considered "matched"; otherwise, the string is not legal.

So the way we determine if a string is a legitimate Email is:

Create a regular expression that matches Email;
Use this regular expression to match the user's input to determine if it is legal.

Because regular expressions are also represented as strings, we need to first understand how to describe characters in terms of characters.

In a regular expression, a character given directly is matched exactly. \d matches a digit, and \w matches a letter, digit, or underscore, so:

'00\d' can match '007' but cannot match '00A';
'\d\d\d' can match '010';
'\w\w\d' can match 'py3'.

. matches any character (except a newline, by default), so:

'py.' can match 'pyc', 'pyo', 'py!', and so on.
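The patterns above can be tried out with a short sketch. It uses Python's re module, which this article covers in a later section; the helper name matches is made up for illustration, and re.match only checks that the pattern fits at the start of the string.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if `pattern` matches at the start of `text`."""
    return re.match(pattern, text) is not None

print(matches(r'00\d', '007'))     # a digit follows '00'
print(matches(r'00\d', '00A'))     # 'A' is not a digit
print(matches(r'\d\d\d', '010'))
print(matches(r'\w\w\d', 'py3'))

# '.' needs exactly one character, so the bare 'py' is filtered out.
print([s for s in ('pyc', 'pyo', 'py!', 'py') if matches(r'py.', s)])
```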

To match a variable number of characters, use * for any number of characters (including 0), + for at least one character, ? for 0 or 1 character, {n} for exactly n characters, and {n,m} for n to m characters.

Take a look at a more complex example: \d{3}\s+\d{3,8}.

Let's unpack this from left to right:

\d{3} matches 3 digits, e.g. '010';
\s matches a space (and other whitespace characters such as Tab), so \s+ means at least one whitespace character, e.g. ' ' or '  ';
\d{3,8} matches 3 to 8 digits, e.g. '1234567'.

Taken together, the regular expression above matches a phone number whose area code is separated from the local number by any amount of whitespace (at least one character).
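As a rough check of the walkthrough above, the sketch below runs the pattern against a few candidates. The names PHONE_PATTERN and is_spaced_phone are illustrative only, and re.fullmatch is used so that the whole string has to fit the pattern.

```python
import re

PHONE_PATTERN = r'\d{3}\s+\d{3,8}'

def is_spaced_phone(text):
    # fullmatch succeeds only if the entire string fits the pattern.
    return re.fullmatch(PHONE_PATTERN, text) is not None

candidates = ('010 12345', '010 \t 1234567', '010-12345', '01 12345', '010 12')
for candidate in candidates:
    print(repr(candidate), is_spaced_phone(candidate))
```

The first two candidates fit; the others fail because of the dash, the two-digit area code, and the two-digit local number respectively.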

What if you want to match a number like '010-12345'? '-' has a special meaning inside [], so it is safest to escape it with '\' (outside [] the escape is optional but harmless). The regular expression becomes \d{3}\-\d{3,8}.

But it still doesn't match'010 - 12345', because of the spaces. So we need a more complex match.
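One possible way to tolerate the spaces, offered only as a sketch and not as the article's own solution, is to allow optional whitespace on both sides of the dash with \s*. The names FLEXIBLE_PHONE and is_dashed_phone are hypothetical.

```python
import re

# \s* allows zero or more whitespace characters around the dash.
FLEXIBLE_PHONE = r'\d{3}\s*\-\s*\d{3,8}'

def is_dashed_phone(text):
    return re.fullmatch(FLEXIBLE_PHONE, text) is not None

for candidate in ('010-12345', '010 - 12345', '010 -  12345', '010 12345'):
    print(repr(candidate), is_dashed_phone(candidate))
```

The last candidate is rejected because this pattern still requires the dash.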

Advanced

To match more precisely, use [] to indicate a range, for example:

[0-9a-zA-Z\_] matches one digit, letter, or underscore;
[0-9a-zA-Z\_]+ matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000', and so on;
[a-zA-Z\_][0-9a-zA-Z\_]* matches a string that starts with a letter or underscore, followed by any number of digits, letters, or underscores, i.e. a legal Python variable name;
[a-zA-Z\_][0-9a-zA-Z\_]{0,19} limits the length of the variable name more precisely to 1-20 characters (1 leading character plus up to 19 following characters).
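The variable-name rule above can be sketched as a small check. The function name looks_like_identifier is made up for illustration. Note that the quantifier is written {0,19} with no space inside the braces, since Python's re module treats a form such as {0, 19} as literal text rather than a repetition count.

```python
import re

# One leading letter or underscore, then up to 19 more word characters.
IDENTIFIER = r'[a-zA-Z_][0-9a-zA-Z_]{0,19}'

def looks_like_identifier(name):
    return re.fullmatch(IDENTIFIER, name) is not None

for name in ('a100', '_tmp', 'Py3000', '0_Z', 'x' * 20, 'x' * 21):
    print(repr(name), looks_like_identifier(name))
```

This only checks the shape described in the text; it does not reject Python keywords such as 'for'.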

A|B matches either A or B, so (P|p)ython can match 'Python' or 'python'.

^ indicates the beginning of a line, so ^\d means the line must begin with a digit.

$ indicates the end of a line, so \d$ means the line must end with a digit.

As you may have noticed, py can also match 'python', but ^py$ turns it into a whole-line match that can only match 'py'.
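A brief sketch of alternation and the two anchors follows. It uses re.search, which looks for the pattern anywhere in the string, so the effect of ^ and $ is visible; the helper name found is hypothetical.

```python
import re

def found(pattern, text):
    """Hypothetical helper: True if `pattern` occurs anywhere in `text`."""
    return re.search(pattern, text) is not None

print(found(r'(P|p)ython', 'Python'), found(r'(P|p)ython', 'python'))
print(found(r'^\d', '3 apples'), found(r'^\d', 'apples 3'))  # must start with a digit
print(found(r'\d$', 'apples 3'), found(r'\d$', '3 apples'))  # must end with a digit
print(found(r'py', 'python'))     # unanchored: 'py' is found inside 'python'
print(found(r'^py$', 'python'))   # anchored: the whole line must be 'py'
print(found(r'^py$', 'py'))
```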

re module

With this background, we are ready to use regular expressions in Python. Python provides the re module, which contains all the regular expression functionality. Because Python strings themselves also use \ as the escape character, pay special attention:

s = 'ABC\\-001' # Python strings
# The corresponding regular expression string becomes:
# 'ABC\-001'

We therefore strongly recommend using Python's r prefix, so that you don't have to think about escaping:

s = r'ABC\-001' # Python strings
# The corresponding regular expression string remains unchanged:
# 'ABC\-001'
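A quick sketch can confirm that the two spellings above produce the same string; the variable names escaped and raw are illustrative.

```python
import re

escaped = 'ABC\\-001'   # backslash doubled inside a normal string literal
raw = r'ABC\-001'       # raw string: the backslash is kept as written

print(escaped == raw)   # both literals hold the same 8 characters
print(len(raw))
print(re.match(raw, 'ABC-001') is not None)  # the pattern matches 'ABC-001'
```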

Let's first look at how to determine if a regular expression matches:

>>> import re
>>> re.match(r'^\d{3}\-\d{3,8}$', '010-12345')
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> re.match(r'^\d{3}\-\d{3,8}$', '010 12345')
>>>

The match() method determines whether there is a match: if the match succeeds, it returns a Match object; otherwise it returns None. A common way to test this is:

import re

test = 'The string entered by the user'
if re.match(r'regular expression', test):
  print('ok')
else:
  print('failed')
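The same test can be wrapped in a function that returns a plain boolean. This is only a sketch: the name is_phone_number is hypothetical, and the pattern is the phone-number one from the example above.

```python
import re

def is_phone_number(text):
    # re.match returns a Match object on success and None on failure,
    # so comparing against None turns the result into True or False.
    return re.match(r'^\d{3}\-\d{3,8}$', text) is not None

for text in ('010-12345', '010 12345'):
    print(text, '->', 'ok' if is_phone_number(text) else 'failed')
```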

Splitting strings

Splitting strings with regular expressions is more flexible than splitting on fixed characters. Consider the normal splitting code:

>>> 'a b   c'.split(' ')
['a', 'b', '', '', 'c']

Hmm, it cannot handle consecutive spaces. Try a regular expression:

>>> re.split(r'\s+', 'a b   c')
['a', 'b', 'c']

It splits properly no matter how many spaces there are. Add , and try:

>>> re.split(r'[\s\,]+', 'a,b, c  d')
['a', 'b', 'c', 'd']

Then add ; and try:

>>> re.split(r'[\s\,\;]+', 'a,b;; c  d')
['a', 'b', 'c', 'd']

If users enter a set of tags, remember to use a regular expression next time to convert the irregular input into a proper list.
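For tag input specifically, a minimal sketch might look like the following. The function name split_tags is made up for illustration. It also drops the empty strings that re.split produces when the input starts or ends with a separator.

```python
import re

def split_tags(user_input):
    """Split on runs of whitespace, commas, or semicolons; drop empty pieces."""
    pieces = re.split(r'[\s,;]+', user_input)
    return [piece for piece in pieces if piece]

print(split_tags('a,b;; c  d'))
print(split_tags('  python, regex ;'))
```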

Groups

Besides simply determining whether a string matches, regular expressions have the powerful ability to extract substrings. Parentheses () mark the groups to be extracted. For example:

^(\d{3})-(\d{3,8})$ defines two groups, which extract the area code and the local number directly from the matched string:

>>> m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
>>> m
<_sre.SRE_Match object; span=(0, 9), match='010-12345'>
>>> m.group(0)
'010-12345'
>>> m.group(1)
'010'
>>> m.group(2)
'12345'

If groups are defined in the regular expression, you can call the group() method on the Match object to extract the substrings.

Note that group(0) is always the whole matched string, while group(1), group(2), ... are the 1st, 2nd, ... substrings.
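Because a failed match returns None, calling group() on the result directly would raise an error. A hedged sketch of a safer wrapper, with the hypothetical name split_phone, could look like this:

```python
import re

def split_phone(text):
    """Return (area_code, local_number), or None if `text` does not match."""
    m = re.match(r'^(\d{3})-(\d{3,8})$', text)
    if m is None:
        return None
    return m.group(1), m.group(2)

print(split_phone('010-12345'))
print(split_phone('010 12345'))
```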

Extracting substrings is very useful. Take a look at a more extreme example:

>>> t = '19:05:30'
>>> m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
>>> m.groups()
('19', '05', '30')

This regular expression can directly recognize legitimate times. But there are times when full validation is not possible with regular expressions, such as recognizing dates:

'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

For illegal dates such as '2-30' and '4-31', a regular expression either cannot identify them or becomes very hard to write, so program logic is needed to finish the validation.
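One way to combine the date pattern with program logic is sketched below. The names DAYS_IN_MONTH and is_valid_month_day are illustrative. Since the pattern carries no year, this sketch assumes February may have 29 days.

```python
import re

# Assumption: no year is given, so February is allowed 29 days.
DAYS_IN_MONTH = {1: 31, 2: 29, 3: 31, 4: 30, 5: 31, 6: 30,
                 7: 31, 8: 31, 9: 30, 10: 31, 11: 30, 12: 31}

MONTH_DAY = r'^(0[1-9]|1[0-2]|[0-9])-(0[1-9]|1[0-9]|2[0-9]|3[0-1]|[0-9])$'

def is_valid_month_day(text):
    m = re.match(MONTH_DAY, text)
    if m is None:
        return False                      # wrong shape, rejected by the regex
    month, day = int(m.group(1)), int(m.group(2))
    # The regex still lets '0' through as a month or day, so check ranges here.
    return 1 <= month <= 12 and 1 <= day <= DAYS_IN_MONTH[month]

for text in ('2-30', '4-31', '4-30', '12-31', '13-01', '0-5'):
    print(text, is_valid_month_day(text))
```

The regular expression checks the shape of the input, and the lookup table handles the per-month day limits that would be awkward to encode in the pattern.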

Greedy matching

Finally, it is important to note that regular expression matching is greedy by default, that is, it matches as many characters as possible. For example, to match the 0s at the end of a number:

>>> re.match(r'^(\d+)(0*)$', '102300').groups()
('102300', '')

Because \d+ is greedy, it matches all the trailing 0s as well, so 0* can only match the empty string.

To let 0* capture the trailing 0s, \d+ must use non-greedy matching (that is, match as little as possible). Adding a ? after it makes \d+ non-greedy:

>>> re.match(r'^(\d+?)(0*)$', '102300').groups()
('1023', '00')
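The difference can be seen side by side in a short sketch that applies both patterns to a few numbers. The function names greedy_split and lazy_split are hypothetical.

```python
import re

def greedy_split(number):
    # \d+ takes every digit, leaving nothing for 0*.
    return re.match(r'^(\d+)(0*)$', number).groups()

def lazy_split(number):
    # \d+? takes as few digits as possible, so 0* gets the trailing zeros.
    return re.match(r'^(\d+?)(0*)$', number).groups()

for number in ('102300', '100', '123'):
    print(number, greedy_split(number), lazy_split(number))
```

Both functions assume the input consists only of digits; for any other input re.match returns None and the call to groups() would fail.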

Compiling

When we use regular expressions in Python, the re module does two things internally:

Compiles the regular expression, reporting an error if the regular expression string itself is not legal;
Uses the compiled regular expression to match the string.

If a regular expression is going to be reused thousands of times, we can precompile it for efficiency. Later uses then skip the compile step and match directly:

>>> import re
# Compile:
>>> re_telephone = re.compile(r'^(\d{3})-(\d{3,8})$')
# Use:
>>> re_telephone.match('010-12345').groups()
('010', '12345')
>>> re_telephone.match('010-8086').groups()
('010', '8086')

Compiling generates a Regular Expression object. Because the object already contains the regular expression, you don't need to pass the pattern string when calling its methods.
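A small sketch of both steps follows: reusing one compiled object across several inputs, and the error raised when the pattern itself is not legal. The names RE_TELEPHONE and parse_numbers are illustrative.

```python
import re

# Compile once, then reuse the same object for every input.
RE_TELEPHONE = re.compile(r'^(\d{3})-(\d{3,8})$')

def parse_numbers(numbers):
    parsed = {}
    for number in numbers:
        m = RE_TELEPHONE.match(number)
        parsed[number] = m.groups() if m else None
    return parsed

print(parse_numbers(['010-12345', '010-8086', '010 12345']))

# An illegal pattern (here, an unclosed parenthesis) fails at compile time.
try:
    re.compile(r'(\d{3}')
except re.error as exc:
    print('invalid pattern:', exc)
```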

Summary

Regular expressions are so powerful that it is impossible to cover them all in one short section. A thick book could be written about everything they can do. If you run into regular expression problems often, you probably need a regular expression reference book.

Exercise

Please try to write a regular expression that validates an Email address. Version 1 should validate Emails like these:

someone@
@

# -*- coding: utf-8 -*-
import re
def is_valid_email(addr):
  return True

# Testing.
assert is_valid_email('someone@')
assert is_valid_email('@')
assert not is_valid_email('bob#')
assert not is_valid_email('mr-bob@')
print('ok')

Version 2 should also extract the name from an Email address that carries one:

<Tom Paris> tom@ => Tom Paris
bob@ => bob

# -*- coding: utf-8 -*-
import re
def name_of_email(addr):
  return None

# Testing.
assert name_of_email('<Tom Paris> tom@') == 'Tom Paris'
assert name_of_email('tom@') == 'tom'
print('ok')

Reference source code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re

print('Test: 010-12345')
m = re.match(r'^(\d{3})-(\d{3,8})$', '010-12345')
print(m.group(1), m.group(2))

t = '19:05:30'
print('Test:', t)
m = re.match(r'^(0[0-9]|1[0-9]|2[0-3]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])\:(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9]|[0-9])$', t)
print(m.groups())
