Shell Programming Notes: Regular Expressions and Text Processors Explained

1. Introduction

Regular expressions and text processors are core tools in shell programming. A regular expression describes a pattern, so it can locate exactly the strings that follow a given rule; text processors such as grep, sed, and awk use those patterns to search, edit, and reformat text efficiently. Knowing both is essential for writing efficient, flexible shell scripts, and these notes cover them in turn.

2. Regular expressions

2.1 Definition and Use

A regular expression, usually abbreviated in code as regex, regexp, or RE, is a single pattern string that describes and matches a whole set of strings conforming to specific syntactic rules. Regular expressions are central to scripting, text editors, and many programming languages. In shell programming they are used to search, delete, and replace text, which makes text processing far more efficient.

2.2 Basic regular expressions

The examples in this section run against a test file with the following contents (line 10 is blank):

he was short and fat.
He was wearing a blue polo shirt with black pants.
The home of Football on BBC Sport online.
the tongue is boneless but it breaks bones.12!
google is the best tools for search keyword.
The year ahead will test our political establishment to the limit.
PI=3.141592653589793238462643383249901429
a wood cross!
Actions speak louder than words

#woood #
#woooooood #
AxyzxyzxyzxyzC
I bet this place is really spooky late at night!
Misfortunes never come alone/single.
I shouldn't have lett so tast.

2.2.1 Finding specific characters

Use the grep command to find specific characters. The -n option displays line numbers, -i makes the match case-insensitive, and -v inverts the selection (it prints the lines that do not contain the pattern).

Find the lines containing "the":

[root@localhost ~]# grep -n 'the' 
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword. 
6:The year ahead will test our political establishment to the limit.

Search for "the" case-insensitively:

[root@localhost ~]# grep -in 'the' 
3:The home of Football on BBC Sport online. 
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword. 
6:The year ahead will test our political establishment to the limit.

Find lines that do not contain "the":

[root@localhost ~]# grep -vn 'the' 
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
7:PI=3.141592653589793238462643383249901429
8:a wood cross!
9:Actions speak louder than words
10:
11:#woood #
12:#woooooood #
13:AxyzxyzxyzxyzC
14:I bet this place is really spooky late at night!
15:Misfortunes never come alone/single.
16:I shouldn't have lett so tast.
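The effect of the three options can be checked on a small sample. The sketch below is not part of the original notes: it copies four lines of the test file into a scratch file (the `sample` path returned by `mktemp` is a hypothetical stand-in for the real file) and counts matching lines with `-c` instead of printing them.

```shell
# Scratch copy of four lines from the test file (hypothetical stand-in path).
sample=$(mktemp)
printf '%s\n' \
  'he was short and fat.' \
  'The home of Football on BBC Sport online.' \
  'the tongue is boneless but it breaks bones.12!' \
  'a wood cross!' > "$sample"

exact=$(grep -c 'the' "$sample")      # case-sensitive: only the "the tongue" line
nocase=$(grep -ci 'the' "$sample")    # -i also counts the "The home" line
inverted=$(grep -cv 'the' "$sample")  # -v counts the lines without "the"

echo "exact=$exact nocase=$nocase inverted=$inverted"
rm -f "$sample"
```

Note that the count for `-v` is simply the total number of lines minus the case-sensitive count.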

2.2.2 Use brackets "[]" to match a set of characters

A bracket expression "[]" matches exactly one character out of the set listed inside it, no matter how many characters the set contains.

Find "shirt" and "short":

[root@localhost ~]# grep -n 'sh[io]rt' 
1:he was short and fat. 
2:He was wearing a blue polo shirt with black pants.

Find lines containing two consecutive "o" characters ("oo"):

[root@localhost ~]# grep -n 'oo' 
3:The home of Football on BBC Sport online. 
5:google is the best tools for search keyword. 
8:a wood cross!
11:#woood #
12:#woooooood #
14:I bet this place is really spooky late at night!

Find lines where "oo" is preceded by a character other than "w". Lines 11 and 12 still match because, in a run of three or more "o", the first "o" is itself a non-"w" character in front of the following "oo":

[root@localhost ~]# grep -n '[^w]oo' 
3:The home of Football on BBC Sport online.
5:google is the best tools for search keyword.
11:#woood #
12:#woooooood #
14:I bet this place is really spooky late at night!

Find lines where "oo" is preceded by something other than a lowercase letter:

[root@localhost ~]# grep -n '[^a-z]oo' 
3:The home of Football on BBC Sport online.

Find lines containing digits:

[root@localhost ~]# grep -n '[0-9]' 
4:the tongue is boneless but it breaks bones.12!
7:PI=3.141592653589793238462643383249901429
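The bracket rules above can be tried on a short word list. This is a minimal sketch with made-up input (the words and the variable names are assumptions for illustration, not taken from the test file):

```shell
# Candidate words, one per line (illustrative input).
words=$(printf '%s\n' short shirt shart Football google wood v2)

# [io] matches exactly one character, either "i" or "o".
set_hits=$(printf '%s\n' "$words" | grep 'sh[io]rt' | tr '\n' ' ')
# [^w]oo needs some character other than "w" directly before "oo".
not_w=$(printf '%s\n' "$words" | grep '[^w]oo' | tr '\n' ' ')
# [0-9] matches any single digit.
digits=$(printf '%s\n' "$words" | grep -c '[0-9]')

echo "sh[io]rt : $set_hits"
echo "[^w]oo   : $not_w"
echo "[0-9]    : $digits line(s)"
```

"shart" is rejected because "a" is not in the set, and "wood" is rejected because the only character in front of its "oo" is "w".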

2.2.3 Find the beginning of the line "^" and the end of the line "$"

"^" indicates the beginning of the line, and "$" indicates the end of the line.

Find the line with "the" as the beginning:

[root@localhost ~]# grep -n '^the' 
4:the tongue is boneless but it breaks bones.12!

Find lines starting with lowercase letters:

[root@localhost ~]# grep -n '^[a-z]' 
1:he was short and fat. 
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword. 
8:a wood cross!

Find lines starting with capital letters:

[root@localhost ~]# grep -n '^[A-Z]' 
2:He was wearing a blue polo shirt with black pants. 
3:The home of Football on BBC Sport online. 
6:The year ahead will test our political establishment to the limit.
7:PI=3.141592653589793238462643383249901429
9:Actions speak louder than words
13:AxyzxyzxyzxyzC
14:I bet this place is really spooky late at night!
15:Misfortunes never come alone/single. 
16:I shouldn't have lett so tast.

Find lines that do not start with letters:

[root@localhost ~]# grep -n '^[^a-zA-Z]' 
11:#woood #
12:#woooooood #

Find lines ending with a period "." ("." is a metacharacter, so it must be escaped with "\"):

[root@localhost ~]# grep -n '\.$' 
1:he was short and fat. 
2:He was wearing a blue polo shirt with black pants. 
3:The home of Football on BBC Sport online. 
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit. 
15:Misfortunes never come alone/single. 
16:I shouldn't have lett so tast.

Find blank lines:

[root@localhost ~]# grep -n '^$' 
10:
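The anchors are easiest to see on a tiny sample. The following sketch uses an invented four-line input (the content of `conf` is an assumption for illustration) and counts lines with `-c`; the last command shows what happens when the period is left unescaped.

```shell
# Four lines: a comment, a blank line, and two settings (illustrative input).
conf=$(printf '%s\n' '# comment' '' 'port=80' 'name=web.')

starts_hash=$(printf '%s\n' "$conf" | grep -c '^#')  # lines beginning with "#"
blank=$(printf '%s\n' "$conf" | grep -c '^$')        # empty lines
ends_dot=$(printf '%s\n' "$conf" | grep -c '\.$')    # lines ending in a literal "."
any_last=$(printf '%s\n' "$conf" | grep -c '.$')     # unescaped "." accepts any last character

echo "starts_hash=$starts_hash blank=$blank ends_dot=$ends_dot any_last=$any_last"
```

Without the backslash, `.$` matches every non-empty line, which is why the escape matters.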

2.2.4 Match any character "." and repetition "*"

"." matches any single character, and "*" matches zero or more repetitions of the single character before it.

Find four-character strings that start with "w" and end with "d" (w??d):

[root@localhost ~]# grep -n 'w..d' 
5:google is the best tools for search keyword.
8:a wood cross!
9:Actions speak louder than words

Find lines containing two or more consecutive "o" characters:

[root@localhost ~]# grep -n 'ooo*' 
3:The home of Football on BBC Sport online. 
5:google is the best tools for search keyword. 
8:a wood cross!
11:#woood #
12:#woooooood #
14:I bet this place is really spooky late at night!

Find a string that starts with "w" and ends with "d" with at least one "o" in the middle:

[root@localhost ~]# grep -n 'woo*d' 
8:a wood cross!
11:#woood #
12:#woooooood #

Find a string that starts with "w" and ends with "d", with any characters (or none) in between:

[root@localhost ~]# grep -n 'w.*d' 
1:he was short and fat. 
5:google is the best tools for search keyword. 
8:a wood cross!
9:Actions speak louder than words
11:#woood #
12:#woooooood #

Find lines containing a run of one or more digits:

[root@localhost ~]# grep -n '[0-9][0-9]*' 
4:the tongue is boneless but it breaks bones.12!
7:PI=3.141592653589793238462643383249901429
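To see exactly which text "." and "*" pick up, `grep -o` prints only the matched part of each line. A small sketch on an invented input line (the words in `line` are assumptions for illustration):

```shell
line='wd wod wood woood wild'

# "o*" may match zero "o" characters, so wo*d also accepts "wd".
zero_or_more=$(printf '%s\n' "$line" | grep -o 'wo*d' | tr '\n' ' ')
# "oo*" is one "o" followed by zero or more, so at least one is required.
one_or_more=$(printf '%s\n' "$line" | grep -o 'woo*d' | tr '\n' ' ')
# Each "." stands for exactly one arbitrary character.
two_any=$(printf '%s\n' "$line" | grep -o 'w..d' | tr '\n' ' ')

echo "wo*d  : $zero_or_more"
echo "woo*d : $one_or_more"
echo "w..d  : $two_any"
```

This is the reason the notes write `ooo*` and `woo*d` rather than `oo*` and `wo*d` when a minimum number of "o" characters is required.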

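2.2.5 Repetition counts "{}"

"{}" limits how many times the preceding character repeats. In basic regular expressions the braces must be escaped as "\{" and "\}"; without the backslashes they are treated as literal characters.

Find lines containing two consecutive "o" characters (the pattern is not anchored, so longer runs of "o" match as well):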

[root@localhost ~]# grep -n 'o\{2\}' 
3:The home of Football on BBC Sport online. 
5:google is the best tools for search keyword. 
8:a wood cross!
11:#woood #
12:#woooooood #
14:I bet this place is really spooky late at night!

Find strings that start with "w" and end with "d" and contain 2 - 5 "o" in the middle:

[root@localhost ~]# grep -n 'wo\{2,5\}d' 
8:a wood cross!
11:#woood #

Find a string that starts with "w" and ends with "d" and contains 2 or more "o" in the middle:

[root@localhost ~]# grep -n 'wo\{2,\}d' 
8:a wood cross!
11:#woood #
12:#woooooood #
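The three interval forms can be compared side by side. The sketch below uses an invented word list (an assumption for illustration) in which the surrounding "w" and "d" pin down the number of "o" characters, so the limits behave as exact bounds:

```shell
# Words with 1, 2, 3 and 7 "o" characters (illustrative input).
words=$(printf '%s\n' wod wood woood woooooood)

exactly_two=$(printf '%s\n' "$words" | grep -c 'wo\{2\}d')    # only "wood"
two_to_five=$(printf '%s\n' "$words" | grep -c 'wo\{2,5\}d')  # "wood" and "woood"
two_or_more=$(printf '%s\n' "$words" | grep -c 'wo\{2,\}d')   # everything except "wod"

echo "exactly_two=$exactly_two two_to_five=$two_to_five two_or_more=$two_or_more"
```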

2.3 Metacharacter summary

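This table is a general regular-expression reference. Several entries (such as \d, the non-greedy "?", and the (?...) groups) are Perl-style syntax that grep's basic and extended modes do not support; GNU grep accepts them only with the -P option.

Character	Description
\	Marks the next character as a special character, a literal character, a back-reference, or an octal escape. For example, 'n' matches the character "n", '\n' matches a newline character, '\\' matches "\", and '\(' matches "(".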
^	Matches the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after '\n' or '\r'.
$	Matches the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'.
*	Matches the previous subexpression zero or more times. For example, 'zo*' matches "z" and "zoo". * is equivalent to {0,}.
+	Matches the previous subexpression one or more times. For example, 'zo+' matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
?	Matches the previous subexpression zero or one time. For example, 'do(es)?' matches "do" or "does". ? is equivalent to {0,1}.
{n}	n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}	n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" but matches all the o's in "fooooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m}	m and n are non-negative integers with n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three o's in "fooooooood". 'o{0,1}' is equivalent to 'o?'. There must be no spaces between the comma and the two numbers.
?	When this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), the match becomes non-greedy. A non-greedy pattern matches as little of the string as possible, while the default greedy pattern matches as much as possible. For example, in the string "oooo", 'o+?' matches a single "o", and 'o+' matches all the o's.
.	Matches any single character except line breaks (\n, \r). To match any character including '\n', use a pattern like '(.|\n)'.
(pattern)	Matches pattern and captures the match. The captured matches can be retrieved from the resulting Matches collection, using the SubMatches collection in VBScript or the $0…$9 properties in JScript. To match literal parentheses, use '\(' or '\)'.
(?:pattern)	Matches pattern but does not capture the match; it is a non-capturing group that is not stored for later use. This is useful when combining parts of a pattern with the "or" character (|). For example, 'industr(?:y|ies)' is a shorter expression than 'industry|industries'.
(?=pattern)	Positive lookahead: matches the search string at any position where a string matching pattern begins. It is a non-capturing match. For example, 'Windows(?=95|98|NT|2000)' matches "Windows" in "Windows2000", but not "Windows" in "Windows3.1". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?!pattern)	Negative lookahead: matches the search string at any position where a string not matching pattern begins. It is a non-capturing match. For example, 'Windows(?!95|98|NT|2000)' matches "Windows" in "Windows3.1", but not "Windows" in "Windows2000". Lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?<=pattern)	Positive lookbehind: like positive lookahead, but looking in the opposite direction. For example, '(?<=95|98|NT|2000)Windows' matches "Windows" in "2000Windows", but not "Windows" in "3.1Windows".
(?<!pattern)	Negative lookbehind: like negative lookahead, but looking in the opposite direction. For example, '(?<!95|98|NT|2000)Windows' matches "Windows" in "3.1Windows", but not "Windows" in "2000Windows".
x|y	Matches x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food".
[xyz]	Character set. Matches any one of the characters listed. For example, '[abc]' matches the 'a' in "plain".
[^xyz]	Negated character set. Matches any character not listed. For example, '[^abc]' matches 'p', 'l', 'i', 'n' in "plain".
[a-z]	Character range. Matches any character in the specified range. For example, '[a-z]' matches any lowercase letter from 'a' to 'z'.
[^a-z]	Negated character range. Matches any character outside the specified range. For example, '[^a-z]' matches any character that is not in the range 'a' to 'z'.
\b	Matches a word boundary, that is, the position between a word and a space. For example, 'er\b' matches the 'er' in "never" but not the 'er' in "verb".
\B	Matches a non-word boundary. 'er\B' matches the 'er' in "verb", but not the 'er' in "never".
\cx	Matches the control character specified by x. For example, \cM matches Control-M, or carriage return. The value of x must be one of A-Z or a-z; otherwise c is treated as a literal 'c' character.
\d	Matches a digit. Equivalent to [0-9].
\D	Matches a non-digit character. Equivalent to [^0-9].
\f	Matches a form feed. Equivalent to \x0c and \cL.
\n	Matches a newline character. Equivalent to \x0a and \cJ.
\r	Matches a carriage return. Equivalent to \x0d and \cM.
\s	Matches any whitespace character, including spaces, tabs, form feeds, and so on. Equivalent to [ \f\n\r\t\v].
\S	Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t	Matches a tab character. Equivalent to \x09 and \cI.
\v	Matches a vertical tab. Equivalent to \x0b and \cK.
\w	Matches a letter, digit, or underscore. Equivalent to '[A-Za-z0-9_]'.
\W	Matches any character that is not a letter, digit, or underscore. Equivalent to '[^A-Za-z0-9_]'.
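A few of these entries can be tried directly with grep. The sketch below assumes GNU grep, which accepts `\b` and `\B` as extensions in its default mode; for digits it uses the POSIX class `[[:digit:]]`, the portable spelling of what the table lists as `\d`. The input words are invented for illustration.

```shell
# \b and \B are GNU grep extensions (assumption: GNU grep is installed).
at_boundary=$(printf '%s\n' never verb | grep 'er\b')   # "er" at the end of a word
inside_word=$(printf '%s\n' never verb | grep 'er\B')   # "er" followed by another word character

# POSIX character class: portable way to match a digit in grep.
with_digit=$(printf '%s\n' a1 bb c3 | grep -c '[[:digit:]]')

echo "er\\b -> $at_boundary"
echo "er\\B -> $inside_word"
echo "lines with a digit: $with_digit"
```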

2.4 Extended regular expressions

Extended regular expressions simplify many patterns. By default grep interprets a pattern as a basic regular expression; to use the extended syntax, run grep -E (egrep is the older equivalent) or use awk. egrep is used the same way as grep and can search a file for any string or symbol.

Metacharacter	Meaning	Example
+	Matches the previous subexpression one or more times, that is, it must occur at least once.	`go+gle` matches `gogle`, `google`, `gooogle`, and so on, but not `ggle`.
?	Matches the previous subexpression zero or one time, that is, the character is optional.	`colou?r` matches `color` and `colour`.
|	Alternation ("or"): matches any one of several alternatives.	`cat|dog` matches `cat` or `dog`.
()	Grouping: combines several characters into one unit for later operations; groups can also be used for back-references.	`(ab)+` matches `ab`, `abab`, and so on; `([0-9]{3})-([0-9]{4})` splits a phone number into an area code and a number so that each part can be referenced later.
{m,n}	Specifies how many times the previous subexpression occurs: m is the lower limit and n the upper limit, both non-negative integers with m <= n. If m is omitted (a GNU extension), it means 0 to n times; if n is omitted, it means at least m times.	`a{2,4}` matches `aa`, `aaa`, `aaaa`; `a{3,}` matches `aaa`, `aaaa`, and so on.

For example, to list the lines of a file that are neither blank nor comments beginning with "#", basic regular expressions need two searches joined by a pipe:

[root@localhost ~]# grep -v '^$' |grep -v '^#'

With extended regular expressions this becomes a single command:

[root@localhost ~]# egrep -v '^$|^#'
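The equivalence of the two forms, and the "+" and "?" rows of the table, can be checked with `grep -E`. This is a sketch on invented input (the config lines and word lists are assumptions for illustration):

```shell
# A small config-like sample with comments and a blank line (illustrative input).
conf=$(printf '%s\n' '# comment' '' 'port=80' '#debug=1' 'name=web')

# Basic syntax: two passes joined by a pipe.
two_pass=$(printf '%s\n' "$conf" | grep -v '^$' | grep -v '^#')
# Extended syntax: one pass, where "|" means "or".
one_pass=$(printf '%s\n' "$conf" | grep -E -v '^$|^#')
echo "$one_pass"

# "+" needs at least one "o"; "?" makes the "u" optional.
plus=$(printf '%s\n' ggle gogle google | grep -E -c 'go+gle')
optional=$(printf '%s\n' color colour colouur | grep -E -c 'colou?r')
echo "go+gle: $plus, colou?r: $optional"
```

Both filters leave only the two setting lines, so the single extended pattern is a drop-in replacement for the two-stage pipe.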

3. Text processor

3.1 sed tool

sed (Stream EDitor) is a non-interactive editor that reads text, edits it according to a list of commands, and writes the result. It is widely used in shell scripts.

3.1.1 Workflow

sed processes its input in a cycle of three steps: read, execute, and display.

Read: sed reads one line from the input stream (a file, a pipe, or standard input) into a temporary buffer called the pattern space.
Execute: by default, every sed command is applied in order to the pattern space. Unless a line address restricts a command, it runs on every line in turn.
Display: the modified content is sent to the output stream and the pattern space is cleared. The cycle repeats until all input has been processed. The input file itself is left unchanged; to keep the result, redirect the output to another file (or edit in place with -i).
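The last point, that sed writes to its output stream and leaves the input alone, can be confirmed with two scratch files. In this sketch the `src` and `out` paths come from `mktemp` and are hypothetical names used only for illustration:

```shell
src=$(mktemp)   # hypothetical scratch input file
out=$(mktemp)   # hypothetical scratch output file
printf '%s\n' 'one cat' 'two cats' > "$src"

# Each line is read into the pattern space, the s command is applied,
# and the result goes to standard output, which is redirected here.
sed 's/cat/dog/' "$src" > "$out"

original=$(cat "$src")  # still says "cat": the input file was not modified
edited=$(cat "$out")    # holds the substituted text

echo "$original"
echo "$edited"
rm -f "$src" "$out"
```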

3.1.2 Command Options

Common sed command options:

Option	Long form	Description
-e	--expression=	Add the given command or script to the commands run on the input
-f	--file=	Read the commands from the given script file
-h	--help	Show help
-n	--quiet, --silent	Suppress automatic printing; only lines printed explicitly (for example with p) are output
-i	--in-place	Edit the file directly instead of writing to standard output

3.1.3 Operation commands

An operation specifies the action applied to the text. The usual format is "[n1[,n2]] operation parameters", where n1 and n2 are optional line numbers that select the line or range of lines to act on. Common operations:

Operation	Description
a	Append: add a line with the specified content below the current line
c	Change: replace the selected lines with the specified content
d	Delete the selected lines
i	Insert: add a line with the specified content above the selected line
p	Print: with an address, print the selected lines; without one, print every line. Usually combined with the "-n" option so that lines are not printed twice
s	Substitute: replace the text that matches a pattern
y	Transliterate: convert characters one for one
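A short sketch of several operations from the table, run on an invented three-line input (the words and variable names are assumptions for illustration). The append command is written in the POSIX form, with a backslash and a newline before the text.

```shell
input=$(printf '%s\n' alpha beta gamma)

deleted=$(printf '%s\n' "$input" | sed '2d')            # d: drop line 2
replaced=$(printf '%s\n' "$input" | sed 's/a/A/')       # s: first "a" on each line
replaced_all=$(printf '%s\n' "$input" | sed 's/a/A/g')  # g flag: every "a" on each line
mapped=$(printf '%s\n' "$input" | sed 'y/abg/ABG/')     # y: map a->A, b->B, g->G
# a: append a new line after line 1
appended=$(printf '%s\n' "$input" | sed '1a\
inserted')

printf '%s\n' "$deleted" '--' "$replaced" '--' "$mapped" '--' "$appended"
```

The difference between `s` and `y` shows on "beta": `s/a/A/` edits one matched string ("betA"), while `y/abg/ABG/` converts every listed character ("BetA").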

3.1.4 Usage example

The following examples use the test file.

Output text that meets given criteria:

Output all content:

[root@localhost ~]# sed -n 'p'

Output line 3:

[root@localhost ~]# sed -n '3p'

Output lines 3 to 5:

[root@localhost ~]# sed -n '3,5p'

Output all odd lines:

[root@localhost ~]# sed -n 'p;n'

Output all even lines:

[root@localhost ~]# sed -n 'n;p'

Output the odd lines within lines 1 to 5:

[root@localhost ~]# sed -n '1,5{p;n}'

Output every second line from line 10 to the end of the file. Counting starts at line 10, so the lines printed are file lines 11, 13, 15, and so on:

[root@localhost ~]# sed -n '10,${n;p}'

Output a line containing "the":

[root@localhost ~]# sed -n '/the/p'

Output from line 4 through the next line containing "the" (the search for the end of the range starts on line 5):

[root@localhost ~]# sed -n '4,/the/p'
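The line-selection commands above are easy to verify on numbered input. This sketch assumes GNU sed and the `seq` utility, and uses generated numbers plus one invented five-line input instead of the test file:

```shell
odd=$(seq 1 10 | sed -n 'p;n' | tr '\n' ' ')    # print a line, then skip the next one
even=$(seq 1 10 | sed -n 'n;p' | tr '\n' ' ')   # skip a line, then print the next one
# Within the range 10..$, the first line of each pair is skipped: 11, 13, 15.
tail_alt=$(seq 1 16 | sed -n '10,${n;p}' | tr '\n' ' ')
# Range ending in a regex: the search for "the" starts on the line after line 2.
range=$(printf '%s\n' a 'the x' b 'the y' c | sed -n '2,/the/p' | tr '\n' ' ')

echo "odd      : $odd"
echo "even     : $even"
echo "tail_alt : $tail_alt"
echo "range    : $range"
```

In the last command line 2 itself contains "the", yet the range runs on to "the y", because the end pattern is only tested from the following line onward.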

3.2 awk tool

3.2.1 Awk Tool Overview and Command Format

awk is a powerful text-processing tool. It reads text line by line and, for the lines that match a pattern, searches, filters, or formats the output. The command formats are:

awk options 'pattern or condition {action}' file1 file2 ...
awk -f script-file file1 file2 ...

awk splits each line into fields. The default field separator is a space or a tab, and fields can be used with logical operators and in arithmetic.
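A minimal sketch of field splitting, using invented input lines (the names, numbers, and passwd-style record are assumptions for illustration):

```shell
# Default separator: runs of spaces or tabs both split fields.
second=$(printf 'alice  30\tadmin\n' | awk '{print $2}')
# -F sets another separator, here ":".
login_shell=$(printf 'root:x:0:0:root:/root:/bin/bash\n' | awk -F: '{print $7}')
# Fields can take part in comparisons and arithmetic.
total=$(printf '%s\n' '3 4' '10 20' | awk '$1 > 5 {print $1 + $2}')

echo "second=$second login_shell=$login_shell total=$total"
```

The condition `$1 > 5` plays the role of the "pattern or condition" in the command format, and `{print $1 + $2}` is the action run on the lines that pass it.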

3.2.2 awk built-in variables

Variable	Description
`FS`	The field separator for each line of text; defaults to a space or tab
`NF`	The number of fields in the line being processed
`NR`	The line number (ordinal) of the line being processed
`$0`	The entire content of the line being processed
`$n`	The nth field (nth column) of the line being processed
`FILENAME`	The name of the file being processed
`RS`	The record separator; defaults to `\n`, that is, each line is one record
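The variables are easiest to understand by printing them. A small sketch on invented input (the letters and the comma-separated record are assumptions for illustration):

```shell
# NR is the line number, NF the field count, and $NF the last field.
report=$(printf '%s\n' 'a b c' 'd e' | awk '{print NR ": " NF " fields, last=" $NF}')
# FS can be set in a BEGIN block before any input is read.
second_csv=$(printf 'x,y,z\n' | awk 'BEGIN { FS = "," } { print $2 }')
# $0 is the whole line, unchanged.
whole=$(printf 'a b c\n' | awk '{print $0}')

echo "$report"
echo "second_csv=$second_csv whole=$whole"
```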

3.2.3 Awk usage example

Output text by line.

Output all content:

awk '{print}' 
awk '{print $0}'

Output lines 1-3:

awk 'NR==1, NR==3{print}' 
awk '(NR>=1)&&(NR<=3){print}'
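The two forms for lines 1 to 3 select the same lines, which can be checked on numbered input. This sketch assumes the `seq` utility and uses generated numbers rather than the test file:

```shell
# Range pattern: from the line where NR==1 through the line where NR==3.
range_form=$(seq 1 5 | awk 'NR==1,NR==3 {print}' | tr '\n' ' ')
# Logical condition: every line whose number lies between 1 and 3.
logic_form=$(seq 1 5 | awk '(NR>=1)&&(NR<=3) {print}' | tr '\n' ' ')

echo "range form: $range_form"
echo "logic form: $logic_form"
```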
