Sams Teach Yourself Shell Programming in 24 Hours
(Publisher: Macmillan Computer Publishing)
Author(s): Sriranga Veeraraghavan
ISBN: 0672314819
Publication Date: 01/01/99



Hour 16
Filtering Text Using Regular Expressions


The most powerful text filtering tools in the UNIX environment are a pair of oddly named programs, awk and sed. They let shell programmers easily edit text files and filter the output of other commands using regular expressions. A regular expression is a string that can be used to describe several sequences of characters.

sed (which stands for stream editor) was created as an editor exclusively for executing scripts. As its name implies, sed is stream oriented: all the input you feed into it passes through and goes to STDOUT. It does not change the input file. In this chapter I will show you how to use sed in shell scripts.

I will cover awk programming in Chapter 17, “Filtering Text with awk,” but I’ll discuss some of the many similarities between awk and sed at the beginning of this chapter.

The Basics of awk and sed

There are many similarities between awk and sed:

  They have similar invocation syntax.
  They enable you to specify instructions that execute for every line in an input file.
  They use regular expressions for matching patterns.

For those readers who are not familiar with patterns and regular expressions, I will explain them shortly.

Invocation Syntax

The invocation syntax for awk and sed is as follows:

command 'script' filenames

Here command is either awk or sed, script is a list of commands understood by awk or sed, and filenames is a list of files that the command acts on.

The single quotes around script prevent the shell from performing substitution on characters such as $ and * before awk or sed sees them. The actual contents of script differ greatly between awk and sed. The commands understood by sed are covered later in this chapter, and those understood by awk are covered in Chapter 17.
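
The effect of the quotes can be seen in a minimal sketch. The input string and the use of set -- to clear the shell's positional parameters are assumptions made for illustration only.

```shell
#!/bin/sh
# Clear the positional parameters so that the shell's own $1 is empty.
# (Illustrative setup only; a real script might have arguments here.)
set --

# Single quotes: the shell passes { print $1 } to awk untouched,
# so awk prints the first field of the line.
single=$(echo 'alpha beta' | awk '{ print $1 }')

# Double quotes: the shell substitutes its own (empty) $1 first,
# so awk receives { print  } and prints the whole line instead.
double=$(echo 'alpha beta' | awk "{ print $1 }")

echo "single quotes: $single"
echo "double quotes: $double"
```

With single quotes the script reaches awk exactly as written, which is why that form is the safe default.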

If filenames are not given, both awk and sed read input from STDIN. This enables them to be used as output filters on other commands.
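
Both ways of supplying input can be sketched as follows. The sample file name and its contents are hypothetical, and the sed s (substitute) command and the awk print statement are used here only as convenient one-line scripts.

```shell
#!/bin/sh
# Hypothetical sample data; the file name and contents are illustrative only.
sample=${TMPDIR:-/tmp}/fruit_prices.$$
printf '%s\n' 'apple 0.50' 'banana 0.25' 'cherry 3.00' > "$sample"

# Form 1: command 'script' filenames
# sed reads the named file and writes the edited lines to STDOUT.
from_file=$(sed 's/apple/APPLE/' "$sample")

# Form 2: no filenames, so awk reads STDIN and filters the
# output of another command (here, sort -r).
from_pipe=$(sort -r "$sample" | awk '{ print $1 }')

printf '%s\n' "$from_file"
printf '%s\n' "$from_pipe"

rm -f "$sample"
```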

Basic Operation

When an awk or sed command runs, it performs the following operations:

1.  Reads a line from an input file
2.  Makes a copy of this line
3.  Executes the given script on this line
4.  Repeats step 1 for the next line

These operations illustrate the main feature of awk and sed—they provide a method of acting on every record or line in a file using a single script. When every record has been read, the input file is closed. If the input file is the last file specified in filenames, the command exits.
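
As a rough illustration of this line-by-line cycle, the sketch below runs a one-line script over a three-line file with each tool and then checks that the file itself is untouched. The file name and its contents are hypothetical.

```shell
#!/bin/sh
# Hypothetical three-line input file.
sample=${TMPDIR:-/tmp}/lines.$$
printf '%s\n' 'one' 'two' 'three' > "$sample"
before=$(cat "$sample")

# sed applies the same script to each line in turn:
# here it inserts "> " at the start of every line.
sed_out=$(sed 's/^/> /' "$sample")

# awk does the same; NR is awk's built-in count of records read so far.
awk_out=$(awk '{ print NR ": " $0 }' "$sample")

# Both commands worked on copies of the lines, so the file is unchanged.
after=$(cat "$sample")
unchanged=no
[ "$before" = "$after" ] && unchanged=yes

printf '%s\n' "$sed_out" "$awk_out" "file unchanged: $unchanged"
rm -f "$sample"
```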

Script Structure and Execution

The script specified to awk or sed consists of one or more lines of the following form:

/pattern/ action

Here pattern is a regular expression, and action is the action that either awk or sed should take when the pattern is encountered. Regular expressions will be covered shortly. The slash characters (/) that surround the pattern are required because they are used as delimiters.

When awk or sed is executing a script, it uses the following procedure on each record:

1.  Sequentially searches each pattern until a match is found.
2.  When a match is found, the corresponding action is performed on the input line.
3.  When the action is completed, it goes to the next pattern and repeats step 1.
4.  When all patterns have been exhausted, it reads in the next line.

Just before step 4 is performed, sed displays the modified record. In awk you must manually display the record.

The actions taken in awk and sed are quite different. In sed, the actions consist of single-letter editing commands, whereas in awk the action is usually a set of programming statements.
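
A small sketch of the /pattern/ action form in both tools, assuming some made-up log lines as input. It also shows the difference in display behavior: sed prints each record automatically, whereas awk prints only when the action says so.

```shell
#!/bin/sh
# Hypothetical input lines.
input='error: disk full
ok: backup done
error: no route'

# sed: the action d deletes every line matching /error/.
# The remaining line is displayed automatically.
sed_out=$(printf '%s\n' "$input" | sed '/error/d')

# awk: the action explicitly prints every line matching /error/.
awk_print=$(printf '%s\n' "$input" | awk '/error/ { print }')

# awk: this action only counts matches and never prints,
# so nothing at all is displayed.
awk_silent=$(printf '%s\n' "$input" | awk '/error/ { count++ }')

printf 'sed: %s\n' "$sed_out"
printf 'awk (print): %s\n' "$awk_print"
printf 'awk (no print): [%s]\n' "$awk_silent"
```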

Regular Expressions

The basic building blocks of a regular expression are

  Ordinary characters
  Metacharacters

Ordinary characters are

  Uppercase and lowercase letters such as A or b
  Numerals such as 1 or 2
  Characters such as a space or an underscore


Metacharacters are characters that have a special meaning inside a regular expression: They are expanded to match ordinary characters. By using metacharacters, you need not explicitly specify all the different combinations of ordinary characters that you want to match. The basic set of metacharacters understood by both sed and awk is given in Table 16.1.

Table 16.1 Metacharacters Used in Regular Expressions

Character   Description

.           Matches any single character except a newline.
*           Matches zero or more occurrences of the character immediately
            preceding it.
[chars]     Matches any one of the characters given in chars, where chars
            is a sequence of characters. You can use the - character to
            indicate a range of characters. If the ^ character is the first
            character in chars, one occurrence of any character that is not
            specified by chars is matched.
^           Matches the beginning of a line.
$           Matches the end of a line.
\           Treats the character that immediately follows the \ literally.
            This is used to specify patterns that contain one of the
            preceding wildcards.
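
The metacharacters in Table 16.1 can be tried out with a short sketch. The word list is hypothetical, and sed -n with the p command is used here simply as a way to print only the lines that match each pattern.

```shell
#!/bin/sh
# Hypothetical word list, one word per line.
words='cat
cot
ct
coat
cart
a.b
axb'

# .  any single character between c and t      (cat, cot)
dot=$(printf '%s\n' "$words" | sed -n '/c.t/p')

# *  zero or more of the preceding o            (cot, ct)
star=$(printf '%s\n' "$words" | sed -n '/co*t/p')

# [chars]  either a or o between c and t        (cat, cot)
class=$(printf '%s\n' "$words" | sed -n '/c[ao]t/p')

# [^chars]  any one character except o          (cat)
negated=$(printf '%s\n' "$words" | sed -n '/c[^o]t/p')

# ^ and $  whole line starts with c, ends with t (cat, cot, ct, coat, cart)
anchored=$(printf '%s\n' "$words" | sed -n '/^c.*t$/p')

# \  a literal dot rather than "any character"  (a.b only)
escaped=$(printf '%s\n' "$words" | sed -n '/a\.b/p')

printf '%s\n' "$dot" "$star" "$class" "$negated" "$anchored" "$escaped"
```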


Regular expressions are frequently referred to as patterns. In Chapter 8, “Substitution,” I described the shell feature known as filename substitution, which uses a related but simpler pattern-matching syntax to produce lists of files.


Note:  
In the context of filename substitution, I referred to metacharacters as wildcards. You might see these two terms used interchangeably in books and reference materials.
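
Although the two terms overlap, the same character can behave differently in the two settings. The sketch below, which assumes the hypothetical test word cart, compares the shell wildcard * (tested with a case statement) with the regular expression metacharacter * (tested with sed).

```shell
#!/bin/sh
word=cart   # hypothetical test string

# Shell wildcard: * on its own stands for any run of characters,
# so c*t means "c, then anything, then t".
case $word in
    c*t) glob_result=match ;;
    *)   glob_result=no-match ;;
esac

# Regular expression: * repeats the character before it, so ^c*t$
# means "zero or more c characters, then t". That does not describe cart.
regex_star=$(echo "$word" | sed -n '/^c*t$/p')

# The regular expression closest to the wildcard c*t is ^c.*t$
regex_dotstar=$(echo "$word" | sed -n '/^c.*t$/p')

echo "wildcard c*t:   $glob_result"
echo "regex ^c*t\$:    [$regex_star]"
echo "regex ^c.*t\$:   [$regex_dotstar]"
```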

