Squeezing Out Spaces
At this point, several of the lines have multiple spaces separating the words. You need to reduce or squeeze these multiple spaces into single spaces to avoid problems with counting later in this example. To do this, you need to use the -s (s as in squeeze) option to the tr command. The basic syntax is
tr -s 'set1'
When tr encounters multiple consecutive occurrences of a character in set1, it replaces these with only one occurrence of the character. For example,
$ echo "feed me" | tr -s 'e'
produces the output
fed me
Here the two es in feed were reduced to a single e.
If you specify more than one character in set1, the replacement is character specific. For example:
$ echo "Shell Programming" | tr -s 'lm'
produces the following output:
Shel Programing
As you can see the two ls in Shell were reduced to a single l. Also, the two ms in Programming were reduced to a single m.
Now you can squeeze multiple spaces in the output into single spaces using the command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' '
To get a count of how many times each word is used, you need to sort the file using the sort command. In its simplest form, the sort command sorts each of its input lines. Thus you need to have only one word per line. You can do this by changing all the spaces into newlines as follows:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n'
Now you can sort the output, by adding the sort command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort
At this point, you could eliminate all the repeated words by using the -u (u as in unique) option of the sort command. Because you also need a count of the number of times each word is repeated, however, you should use the uniq command instead.
By default, the uniq command discards all but one of the repeated lines. For example, the commands
$ echo '
peach
peach
peach
apple
apple
orange
' > ./fruits.txt
$ uniq fruits.txt
produce the output
peach
apple
orange
As you can see, uniq discarded all but one of the repeated lines.
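For comparison, the sort command's -u option mentioned earlier produces a similar de-duplicated list (in sorted order), but sort has no option for counting how many times each word appeared:
$ sort -u fruits.txt
apple
orange
peach
A leading blank line may also appear in this output, because the echo command that created the file included empty lines.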
The uniq command produces a list of the unique items in a file by comparing consecutive lines. To function properly, its input needs to be a sorted file. For example, if you change fruits.txt as follows
$ echo '
peach
peach
orange
apple
apple
peach
' > ./fruits.txt
$ uniq fruits.txt
the output is incorrect for your purposes:
peach
orange
apple
peach
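If you sort the file first, the repeated words become consecutive lines and uniq once again produces the list you expect:
$ sort fruits.txt | uniq
apple
orange
peach
(As before, a leading blank line may appear because of the empty lines echo wrote to the file.)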
Returning to the original problem, you need uniq to print not only a list of the unique words in this chapter but also the number of times a word occurs. You can do this by specifying the -c (c as in count) option to the uniq command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c
At this point the output is sorted alphabetically. Although this output is useful, it is much easier to determine the most frequently used words if the list is sorted by the number of times a word occurs.
To obtain such a list, you need sort to compare numeric values instead of strings. It would also be nice if the largest number were printed first; by default, sort prints the largest number last. To satisfy both of these requirements, you specify the -n (n as in numeric) and -r (r as in reverse) options to the sort command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn
By piping the output to head, you can get an idea of what the ten most repeated words are:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn | head 389 the 164 to 127 of 115 is 115 and 111 a 80 files 70 file 69 in 65 '
Sorting Numbers in a Different Column
In the preceding output, you were able to use the sort -rn command to sort the output by number because the numbers occurred in the first column. If the numbers occurred in any other column, this would not be possible with those options alone.
Suppose the output looked like the following:
$ cat switched.txt
files 80
file 70
is 115
and 115
a 111
in 69
' 65
the 389
to 164
of 127
Now you need to tell sort to sort on the second column; you cannot simply use the -r and -n options. You also need the -k (k as in key) option.
The sort command constructs a key for each line in the file, and then it arranges these keys into sorted order. By default, the key spans the entire line. The -k option gives you the flexibility of telling sort where the key should begin and where it should end, in terms of columns.
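You can see the default behavior by sorting switched.txt with no options; each entire line is compared as text, so the lines come out in alphabetical order of the words rather than in order of the counts (the exact position of the ' line depends on your locale):
$ sort switched.txt
' 65
a 111
and 115
file 70
files 80
in 69
is 115
of 127
the 389
to 164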
The number of columns in a line is the number of individual words on that line. For example, the following line contains three columns:
files 80 100
The basic syntax of the -k option is
sort -k start,end files
Here start is the starting column for the key, and end is the ending column for the key. The first column is 1, the second column is 2, and so on.
For the switched.txt file, start and end are both 2 because there are only two columns and you want to sort on the second one. The command you use is
$ sort -rn -k 2,2 switched.txt
the 389
to 164
of 127
is 115
and 115
a 111
files 80
file 70
in 69
' 65
Because there are only two columns, you can omit the ending column as follows:
$ sort -rn -k 2 switched.txt