The most powerful text filtering tools in the UNIX environment are a pair of oddly named programs, awk and sed. They let shell programmers easily edit text files and filter the output of other commands using regular expressions. A regular expression is a string that can be used to describe several sequences of characters.
sed (which stands for stream editor) was created as an editor exclusively for executing scripts. As its name implies, sed is stream oriented: all the input you feed into it passes through and goes to STDOUT. It does not change the input file. In this chapter I will show you how to use sed in shell scripts.
I will cover awk programming in Chapter 17, Filtering Text with awk, but I'll discuss some of the many similarities between awk and sed at the beginning of this chapter.
There are many similarities between awk and sed:
For those readers who are not familiar with patterns and regular expressions, I will explain them shortly.
The invocation syntax for awk and sed is as follows:
command 'script' filenames
Here command is either awk or sed, script is a list of commands understood by awk or sed, and filenames is a list of files that the command acts on.
The single quotes around script are required to prevent the shell from accidentally performing substitution. The actual contents of script differ greatly between awk and sed. The commands understood by awk and sed are covered in separate sections later in this chapter.
If filenames are not given, both awk and sed read input from STDIN. This enables them to be used as output filters on other commands.
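As a minimal sketch of both invocation forms, the following creates a small sample file, runs sed on it by name, and then uses awk as a filter on piped input. The file name fruit.txt, its contents, and the variable names are made up for illustration; the s and print commands used in the scripts are only placeholders here.

```shell
# Create a throwaway sample file; the name and contents are illustrative only.
tmpdir=$(mktemp -d)
printf 'apple 3\nbanana 7\n' > "$tmpdir/fruit.txt"

# Form 1: command 'script' filenames
# sed writes the edited text to STDOUT; fruit.txt itself is not changed.
from_file=$(sed 's/apple/cherry/' "$tmpdir/fruit.txt")

# Form 2: no filenames, so awk reads STDIN and acts as an output filter.
from_pipe=$(cat "$tmpdir/fruit.txt" | awk '{ print $1 }')

# Read the file back to confirm that sed left it untouched.
unchanged=$(cat "$tmpdir/fruit.txt")

printf '%s\n' "$from_file"
printf '%s\n' "$from_pipe"
rm -rf "$tmpdir"
```

The single quotes keep the shell from interpreting $1 in the awk script before awk sees it.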
When an awk or sed command runs, it performs the following operations:
These operations illustrate the main feature of awk and sed: they provide a method of acting on every record or line in a file using a single script. When every record has been read, the input file is closed. If the input file is the last file specified in filenames, the command exits.
Script Structure and Execution
The script specified to awk or sed consists of one or more lines of the following form:
/pattern/ action
Here pattern is a regular expression, and action is the action that either awk or sed should take when the pattern is encountered. Regular expressions will be covered shortly. The slash characters (/) that surround the pattern are required because they are used as delimiters.
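To make the /pattern/ action form concrete, here is a small sketch that applies the same pattern in both tools. The sample text and variable names are arbitrary, and the actions chosen (d in sed, print in awk) are just two simple examples of what an action can be.

```shell
# Sample input is supplied inline; the words are arbitrary.
input=$(printf 'red apple\ngreen pear\nred plum\n')

# sed: for every line matching /red/, run the d (delete) action.
sed_out=$(printf '%s\n' "$input" | sed '/red/ d')

# awk: for every line matching /red/, run the { print } action.
awk_out=$(printf '%s\n' "$input" | awk '/red/ { print }')

printf 'sed kept:\n%s\n' "$sed_out"
printf 'awk printed:\n%s\n' "$awk_out"
```

With this input, sed is left with only the line that does not contain red, and awk prints only the two lines that do.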
When awk or sed is executing a script, it uses the following procedure on each record:
Just before step 4 is performed, sed displays the modified record. In awk you must manually display the record.
The actions taken in awk and sed are quite different. In sed, the actions consist of single-letter editing commands, whereas in awk the action is usually a set of programming statements.
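The difference in how records are displayed can be seen in a short sketch. It assumes a two-line sample input and uses sed's s command and awk's sub() function to make the same edit; the variable names are illustrative.

```shell
# Arbitrary two-line sample input.
input=$(printf 'one cat\ntwo dogs\n')

# sed displays every record after the script has run, edited or not.
sed_out=$(printf '%s\n' "$input" | sed '/cat/ s/cat/cow/')

# awk edits the matching record with sub() but displays nothing,
# because the action never calls print.
awk_silent=$(printf '%s\n' "$input" | awk '/cat/ { sub(/cat/, "cow") }')

# Adding an explicit print displays the record awk just edited.
awk_out=$(printf '%s\n' "$input" | awk '/cat/ { sub(/cat/, "cow"); print }')

printf 'sed:\n%s\n' "$sed_out"
printf 'awk without print: [%s]\n' "$awk_silent"
printf 'awk with print:\n%s\n' "$awk_out"
```

Note that sed shows both lines, including the one the pattern did not match, whereas awk shows only what the action explicitly prints.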
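The basic building blocks of a regular expression are ordinary characters and metacharacters.
Ordinary characters, such as letters and numerals, are characters that match themselves.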
Metacharacters are characters that have a special meaning inside a regular expression: They are expanded to match ordinary characters. By using metacharacters, you need not explicitly specify all the different combinations of ordinary characters that you want to match. The basic set of metacharacters understood by both sed and awk is given in Table 16.1.
Character | Description |
---|---|
. | Matches any single character except a newline. |
* | Matches zero or more occurrences of the character immediately preceding it. |
[chars] | Matches any one of the characters given in chars, where chars is a sequence of characters. You can use the - character to indicate a range of characters. If the ^ character is the first character in chars, one occurrence of any character that is not specified by chars is matched. |
^ | Matches the beginning of a line. |
$ | Matches the end of a line. |
\ | Treats the character that immediately follows the \ literally. This is used to specify patterns that contain one of the preceding wildcards. |
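The following sketch tries each metacharacter from the table against a handful of sample lines. The sample words and the helper function show are hypothetical, chosen only for illustration; show uses sed -n with the p command to print just the lines that match a pattern.

```shell
# Arbitrary sample lines to match against.
lines=$(printf 'cat\ncot\ncoat\nct\n3.5\n325\n')

# Hypothetical helper: print only the lines that match the given pattern.
show() { printf '%s\n' "$lines" | sed -n "/$1/p"; }

dot=$(show 'c.t')         # . stands for exactly one character
star=$(show 'co*t')       # o* stands for zero or more o characters
set_match=$(show 'c[ao]t')   # [ao] stands for either a or o
set_negate=$(show 'c[^a]t')  # [^a] stands for any one character except a
line_start=$(show '^co')  # ^ anchors the match to the start of the line
line_end=$(show '5$')     # $ anchors the match to the end of the line
unescaped=$(show '3.5')   # here . is a metacharacter, so 325 matches too
escaped=$(show '3\.5')    # \. is a literal period, so only 3.5 matches

printf '%s\n' "$dot" "$star" "$set_match" "$set_negate"
printf '%s\n' "$line_start" "$line_end" "$unescaped" "$escaped"
```

Comparing the last two patterns shows why the backslash matters: without it, the period matches the 2 in 325 as readily as a real period.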
Regular expressions are frequently referred to as patterns. In Chapter 8, Substitution, I described the shell feature known as filename substitution, which uses a subset of regular expressions to produce lists of files.
Note:
In the context of filename substitution, I referred to metacharacters as wildcards. You might see these two terms used interchangeably in books and reference materials.