Sams Teach Yourself Shell Programming in 24 Hours: Filtering Text with awk

Sams Teach Yourself Shell Programming in 24 Hours
(Publisher: Macmillan Computer Publishing)
Author(s): Sriranga Veeraraghavan
ISBN: 0672314819
Publication Date: 01/01/99

Hour 17
Filtering Text with awk

In Chapter 16, “Filtering Text Using Regular Expressions,” you looked at the sed command and used regular expressions to filter text. In this chapter you will look at another powerful text filtering command called awk.

The awk command is a complete programming language that enables you to search many files for patterns and conditionally modify files without having to worry about opening files, reading lines, or closing files. It’s found on all UNIX systems and is quite fast, easy to learn, and extremely flexible.

This chapter concentrates on the awk elements that are most commonly used in shell scripts. Specifically, these features are

• Field editing

• Variables

• Flow control statements

What is awk?

The awk command is a programming language that enables you to search through files and modify records within these files based on patterns. The name awk comes from the last names of its creators: Alfred Aho, Peter Weinberger, and Brian Kernighan. It has been a part of UNIX since 1978, when it was added to UNIX Version 7.

Currently three main versions are available:

• The original awk

• A newer version nawk

• The POSIX/GNU version gawk

The original awk has remained almost the same since its first introduction to UNIX in 1978. Originally it was intended to be a small programming language for filtering text and producing reports.

By the mid-1980s, people were using awk for large programs, so in 1985 its authors decided to extend it. This version, called nawk (as in new awk), was released to the public in 1987 and became a part of SunOS 4.1.x. Its developers intended for nawk to replace awk eventually. This has yet to happen. Most commercial UNIX versions such as HP-UX and Solaris still ship with both awk and nawk.

In 1992 the Institute of Electrical and Electronics Engineers (IEEE) standardized awk as part of its Portable Operating System Interface standard (POSIX). gawk, the GNU version of awk, is based on this standard. Most Linux systems ship with gawk.

The examples in this chapter work with any version of awk.

Basic Syntax

The basic syntax of an awk command is

awk 'script' files

Here files is a list of one or more files, and script is one or more commands of the form:

/pattern/ { actions }

Here pattern is a regular expression, and actions is one or more of the commands that are covered later in this chapter. If pattern is omitted, awk performs the specified actions for each input line.

Look at the simplest task in awk, displaying all the input lines from a file. In this case you use a modified version of the file fruit_prices.txt from the previous chapter:

$ awk '{ print ; }' fruit_prices.txt
Fruit           Price/lbs       Quantity
Banana          $0.89           100
Peach           $0.79           65
Kiwi            $1.50           22
Pineapple       $1.29           35
Apple           $0.99           78

Here you use the awk command print to print each line of the input. When the print command is given without arguments, it prints the input line exactly as it was read.

Notice that there is a semicolon (;) after the print command. The semicolon marks the end of a command and is needed to separate commands that appear on the same line. Strictly speaking, it is optional after the last command before the closing brace, but it is good practice to include it anyway.
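
The pattern part of a command can be tried out in the same way. The following is a minimal sketch, assuming the same fruit_prices.txt data; the here-document only recreates the file so that the example is self-contained, and the two patterns are chosen purely for illustration:

```shell
# Recreate the sample data (assumed to match the chapter's fruit_prices.txt).
cat > fruit_prices.txt <<'EOF'
Fruit           Price/lbs       Quantity
Banana          $0.89           100
Peach           $0.79           65
Kiwi            $1.50           22
Pineapple       $1.29           35
Apple           $0.99           78
EOF

# Print only the lines that match the regular expression Peach.
awk '/Peach/ { print ; }' fruit_prices.txt

# Print only the lines that contain "pple" (the Pineapple and Apple lines).
awk '/pple/ { print ; }' fruit_prices.txt
```

Lines that do not match the pattern are skipped, so neither command prints the header line.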

Field Editing

One of the nicest features available in awk is that it automatically divides input lines into fields. A field is a set of characters that are separated by one or more field separator characters. The default field separator characters are tab and space.

When a line is read, awk places the fields that it has parsed into numbered variables: 1 for the first field, 2 for the second field, and so on. To access a field, use the field operator, $. Thus, the first field is $1.

Note:
The use of the $ in awk is slightly different from its use in the shell. The $ is required only when accessing the value of a field variable; it is not required when accessing the values of other variables. I will explain creating and using variables in awk in depth later in this chapter.
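
Because the shell also treats $ specially, the way the awk script is quoted matters. The following is a minimal sketch, assuming the same fruit_prices.txt data (recreated here so that the example is self-contained). It prints the second field twice, once with each quoting style:

```shell
# Recreate the sample data (assumed to match the chapter's fruit_prices.txt).
cat > fruit_prices.txt <<'EOF'
Fruit           Price/lbs       Quantity
Banana          $0.89           100
Peach           $0.79           65
Kiwi            $1.50           22
Pineapple       $1.29           35
Apple           $0.99           78
EOF

# Single quotes: the shell leaves $2 alone, so awk sees the field operator.
awk '{ print $2 ; }' fruit_prices.txt

# Double quotes: escape the $ so that the shell does not expand $2 itself.
awk "{ print \$2 ; }" fruit_prices.txt
```

Both commands print the price column. Single quotes are usually the simpler choice because nothing inside them needs to be escaped.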

As an example of using fields, you can print only the name of a fruit and its quantity using the following awk command:

$ awk '{ print $1 $3 ; }' fruit_prices.txt

Here you use awk to print two fields from every input line:

• The first field, which contains the fruit name

• The third field, which contains the quantity

The output looks like the following:

FruitQuantity
Banana100
Peach65
Kiwi22
Pineapple35
Apple78

Notice that the output has no separation between the fields. This happens because print joins expressions that are simply placed next to each other. To print a space between fields, separate them with the , operator, which by default makes print insert a single space between them:

$ awk '{ print $1 , $3 ; }' fruit_prices.txt
Fruit Quantity
Banana 100
Peach 65
Kiwi 22
Pineapple 35
Apple 78
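
Since print joins adjacent expressions, a quoted string placed between two fields can serve as a custom separator. The following is a minimal sketch, assuming the same fruit_prices.txt data (recreated here so that the example is self-contained); the ": " separator is an arbitrary choice for illustration:

```shell
# Recreate the sample data (assumed to match the chapter's fruit_prices.txt).
cat > fruit_prices.txt <<'EOF'
Fruit           Price/lbs       Quantity
Banana          $0.89           100
Peach           $0.79           65
Kiwi            $1.50           22
Pineapple       $1.29           35
Apple           $0.99           78
EOF

# Join the first field, a literal ": ", and the third field into one string.
awk '{ print $1 ": " $3 ; }' fruit_prices.txt
```

With this data, the Banana line is printed as Banana: 100.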

You can format the output by using the awk printf command instead of the print command as follows:

$ awk '{ printf "%-15s %s\n" , $1 , $3 ; }' fruit_prices.txt
Fruit           Quantity
Banana          100
Peach           65
Kiwi            22
Pineapple       35
Apple           78

All the features of the printf command discussed in Chapter 13, “Input/Output,” are available in the awk command printf.
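
As one illustration of those features, a field width without the - flag right-justifies a value. The following is a minimal sketch, assuming the same fruit_prices.txt data (recreated here so that the example is self-contained); the widths 12 and 8 are arbitrary choices:

```shell
# Recreate the sample data (assumed to match the chapter's fruit_prices.txt).
cat > fruit_prices.txt <<'EOF'
Fruit           Price/lbs       Quantity
Banana          $0.89           100
Peach           $0.79           65
Kiwi            $1.50           22
Pineapple       $1.29           35
Apple           $0.99           78
EOF

# %-12s left-justifies the name in 12 columns;
# %8s right-justifies the quantity in 8 columns.
awk '{ printf "%-12s %8s\n" , $1 , $3 ; }' fruit_prices.txt
```

The quantities then line up on their last digit rather than their first, which is often easier to read for columns of numbers.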
