Regular Expression Examples
As I stated before, a regular expression is a string that can represent many sequences of characters. Thus the simplest regular expression is one that exactly represents the sequence of characters that you need to match. For example, the following expression
/peach/
matches the string peach exactly. If this expression were used in awk or sed, any line that contains the string peach would be selected by this expression. This includes lines such as the following:
We have a peach tree in the backyard
I prefer peaches to plums
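To try this at the command line, here is a minimal sketch using awk. The input is supplied with printf, and the third sample line is an assumption added only to show a line that is not selected.

```shell
# Feed three sample lines to awk; only lines containing "peach" are printed.
matches=$(printf '%s\n' \
    'We have a peach tree in the backyard' \
    'I prefer peaches to plums' \
    'The plum tree is in bloom' |
    awk '/peach/')
printf '%s\n' "$matches"
```

The line containing peaches is selected as well, because the expression only has to match somewhere within the line.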
Matching Characters
Look at a few more expressions to demonstrate the use of the metacharacters. For example, the following pattern
/a.c/
matches lines that contain strings such as a+c, a-c, abc, match, and a3c, whereas the pattern
/a*c/
matches the same strings along with strings such as ace, in which the c immediately follows the a. (Strings such as yacc and arctic match both patterns.) It also matches the following line:
close the window
Notice that there is no letter a in this sentence. The * metacharacter matches zero or more occurrences of the character immediately preceding it. In this case it matched zero occurrences of the letter a.
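The difference between the two metacharacters can be checked with grep. This is a small sketch; the list of sample strings is an assumption chosen for illustration.

```shell
# Sample strings, one per line; "close the window" contains a c but no a.
words=$(printf '%s\n' 'a+c' 'abc' 'ace' 'close the window' 'window')

# a.c needs an a, then exactly one character of any kind, then a c.
dot_matches=$(printf '%s\n' "$words" | grep 'a.c')

# a*c needs zero or more a characters followed by a c,
# so any line that contains a c is selected.
star_matches=$(printf '%s\n' "$words" | grep 'a*c')

printf 'a.c selects:\n%s\n' "$dot_matches"
printf 'a*c selects:\n%s\n' "$star_matches"
```

Here a.c selects only a+c and abc, whereas a*c also selects ace and close the window. The line window is selected by neither pattern because it contains no c.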
Another important thing to note about the * is that it tries to make the longest possible match. For example, consider the pattern
/a.*a/
and the following line:
able was I, ere I saw elba
Here you have asked to match lines that contain a string that starts with the letter a and ends with the letter a. In the sample line, there are several possibilities:
able wa
able was I, ere I sa
able was I, ere I saw elba
Because the * metacharacter makes the longest possible match, the last possibility is selected.
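One way to see which of the candidate strings is actually chosen is to have sed mark the matched text. The sketch below assumes the pattern a.*a (an a, any run of characters, then a final a) and uses & in the replacement, which stands for whatever the pattern matched.

```shell
line='able was I, ere I saw elba'

# Wrap the text matched by a.*a in square brackets.
marked=$(printf '%s\n' "$line" | sed 's/a.*a/[&]/')
printf '%s\n' "$marked"
```

The brackets end up around the entire line, which shows that the match ran from the first a to the last a rather than stopping at an earlier one.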
You can combine the . and the * metacharacters to obtain behavior equivalent to the * filename expansion wildcard. For example, the following expression
/ch.*doc/
matches the strings ch01.doc, ch02.doc, and chdoc. The shell's * wildcard, as in ch*doc, matches files by the same names.
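The comparison can be tried without creating any files by testing names against a case pattern, which uses the same wildcard rules as filename expansion. This is only a sketch; the name notes.txt is an assumption added as a non-matching example.

```shell
regex_hits=''
glob_hits=''
for name in ch01.doc ch02.doc chdoc notes.txt; do
    # Regular expression: ch, then any run of characters, then doc.
    if printf '%s\n' "$name" | grep -q 'ch.*doc'; then
        regex_hits="$regex_hits $name"
    fi
    # Shell pattern: the same idea written with the * wildcard.
    case $name in
        ch*doc) glob_hits="$glob_hits $name" ;;
    esac
done
printf 'regex:%s\n' "$regex_hits"
printf 'glob:%s\n' "$glob_hits"
```

Both tests accept the same three names. One difference to keep in mind is that the shell pattern must cover the whole name, while the regular expression only has to match somewhere within the line.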
Specifying Sets of Characters
One of the major limitations of the . metacharacter is that it does not enable you to specify which characters you want to match. It matches any character. To specify a particular set of characters in a regular expression, use the bracket characters, [ and ], as follows:
/[chars]/
Here a single character in the set given by chars is matched. The use of sets in regular expressions is almost identical to the shell's use of sets in filename substitution.
Here is an example of using sets. The following expression matches the strings The and the:
/[tT]he/
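A quick check of this set with grep might look like the following sketch. The three sample lines are assumptions made up for the example.

```shell
# Only lines containing "The" or "the" are selected; "THE" is not,
# because the set covers just the first letter.
the_lines=$(printf '%s\n' 'The rain fell' 'over the hills' 'THE END' |
    grep '[tT]he')
printf '%s\n' "$the_lines"
```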
Table 16.2 shows some frequently used sets of characters.
Set | Description |
---|---|
[a-z] | Matches a single lowercase letter |
[A-Z] | Matches a single uppercase letter |
[a-zA-Z] | Matches a single letter |
[0-9] | Matches a single digit |
[a-zA-Z0-9] | Matches a single letter or digit |
Sometimes it is hard to determine the exact set of characters that you need to match. Say that you needed to match every character except the letter T. In this case, constructing a set of characters that includes every character except the letter T is error prone. You might forget a space or a punctuation character while trying to construct the set.
Fortunately, you can specify a set that is the negation of the set that matches T as follows:
[^T]
Here the ^ character precedes the letter T. When the ^ character is the first character in the set, any character not given in the set is matched. This is called reversing or negating a set. Any set, including those given in Table 16.2, can be reversed or negated if you give ^ as the first character. For example, the following pattern
/ch[^0-9]/
matches the beginnings of the strings chapter and chocolate, but not the strings ch01 or ch02.
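The negated set can be tried with grep, as in this short sketch that uses the four strings mentioned above as input.

```shell
names=$(printf '%s\n' chapter chocolate ch01 ch02)

# ch followed by any single character that is not a digit.
non_digit=$(printf '%s\n' "$names" | grep 'ch[^0-9]')
printf '%s\n' "$non_digit"
```

Only chapter and chocolate are printed. Note that a negated set still has to match one character, so a line that ends in ch with nothing after it is not selected either.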
You can combine the sets with the * character to extend their functionality. For example, the following expression
/ch0[0-9]*\.doc/
matches the strings ch01.doc and ch02.doc but not the strings chdoc or changedoc. The backslash makes the . match a literal period rather than any character.
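Here is a sketch of a set combined with * in grep. It assumes the pattern includes an escaped period, \., so that the literal dot in names such as ch01.doc is accounted for between the digits and doc.

```shell
names=$(printf '%s\n' ch01.doc ch02.doc chdoc changedoc)

# ch0, then zero or more digits, then a literal period, then doc.
numbered=$(printf '%s\n' "$names" | grep 'ch0[0-9]*\.doc')
printf '%s\n' "$numbered"
```

Only ch01.doc and ch02.doc are printed, because chdoc and changedoc contain neither the 0 nor the period that the pattern requires.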
Anchoring Patterns
Now say that you are looking for lines that start with the word the, such as the following:
the plains were rich with crops
If you use the following pattern
/the/
it matches the line given previously along with the following lines:
there were many orchards of fruit tree
in the dark it was like summer lightning
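The two main problems are that the pattern matches longer words that merely begin with the, such as there, and that it matches the anywhere in a line rather than only at the start.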
To solve the first problem, add a space as follows:
/the /
To solve the second problem you need the ^ metacharacter, which matches the beginning of a line. In a regular expression, it anchors the expression to the beginning of the line. By anchor, I mean that an expression matches a line only if that line starts with this expression. Normally, any line that contains an expression is matched.
By adding the ^ metacharacter as follows
/^the /
you cause this expression to match only those lines that start with the word the. Some examples are
the forest of oak trees on the mountain
the hillside where the chestnut forest grew
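The effect of the anchor can be compared against the unanchored pattern with grep. The sketch below reuses three of the sample lines from this section; grep -c prints the number of selected lines instead of the lines themselves.

```shell
lines=$(printf '%s\n' \
    'the plains were rich with crops' \
    'there were many orchards of fruit tree' \
    'in the dark it was like summer lightning')

# Unanchored: every line that contains "the" anywhere is counted.
loose=$(printf '%s\n' "$lines" | grep -c 'the')

# Anchored, with a trailing space: only lines that start with the word "the".
anchored=$(printf '%s\n' "$lines" | grep '^the ')

printf 'unanchored count: %s\n' "$loose"
printf 'anchored match: %s\n' "$anchored"
```

The unanchored pattern selects all three lines, while the anchored pattern selects only the line about the plains.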
You can also anchor expressions to the end of the line using the $ metacharacter. For example, the following expression
/friend$/
matches this line:
I have been and always will be your friend
But it doesn't match this line:
What are friends for
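The same two lines can be run through grep as a quick check. The pattern is kept in single quotes so that the shell does not treat the $ as the start of a variable name.

```shell
# Only lines whose last six characters are "friend" are selected.
ending=$(printf '%s\n' \
    'I have been and always will be your friend' \
    'What are friends for' |
    grep 'friend$')
printf '%s\n' "$ending"
```

The second line is skipped because friends is followed by more text, so friend is not at the end of the line.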