Squeezing Out Spaces
At this point, several of the lines have multiple spaces separating the words. You need to reduce or squeeze these multiple spaces into single spaces to avoid problems with counting later in this example. To do this, you need to use the -s (s as in squeeze) option to the tr command. The basic syntax is
tr -s 'set1'
When tr encounters multiple consecutive occurrences of a character in set1, it replaces these with only one occurrence of the character. For example,
$ echo "feed me" | tr -s 'e'
produces the output
fed me
Here the two es in feed were reduced to a single e.
If you specify more than one character in set1, the replacement is character specific. For example:
$ echo "Shell Programming" | tr -s 'lm'
produces the following output:
Shel Programing
As you can see the two ls in Shell were reduced to a single l. Also, the two ms in Programming were reduced to a single m.
Now you can squeeze multiple spaces in the output into single spaces using the command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' '
To get a count of how many times each word is used, you need to sort the file using the sort command. In its simplest form, the sort command sorts each of its input lines. Thus you need to have only one word per line. You can do this by changing all the spaces into newlines as follows:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n'
Now you can sort the output, by adding the sort command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort
At this point, you could eliminate all the repeated words by using the -u (u as in unique) option of the sort command. Because you also need a count of the number of times each word is repeated, however, you should use the uniq command instead.
By default, the uniq command discards all but one of the repeated lines. For example, the commands
$ echo '
peach
peach
peach
apple
apple
orange
' > ./fruits.txt
$ uniq fruits.txt
produce the output
peach
apple
orange
As you can see, uniq discarded all but one of the repeated lines.
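For comparison, the sort command's -u option mentioned earlier produces a similar de-duplicated list (in sorted order), but sort has no option for counting how many times each word appeared:
$ sort -u fruits.txt
apple
orange
peach
A leading blank line may also appear in this output, because the echo command that created the file included empty lines.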
The uniq command produces a list of the unique items in a file by comparing consecutive lines. To function properly, its input needs to be a sorted file. For example, if you change fruits.txt as follows
$ echo '
peach
peach
orange
apple
apple
peach
' > ./fruits.txt
$ uniq fruits.txt
the output is incorrect for your purposes:
peach
orange
apple
peach
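If you sort the file first, the repeated words become consecutive lines and uniq once again produces the list you expect:
$ sort fruits.txt | uniq
apple
orange
peach
(As before, a leading blank line may appear because of the empty lines echo wrote to the file.)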
Returning to the original problem, you need uniq to print not only a list of the unique words in this chapter but also the number of times a word occurs. You can do this by specifying the -c (c as in count) option to the uniq command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c
At this point the output is sorted alphabetically. Although this output is useful, it is much easier to determine the most frequently used words if the list is sorted by the number of times a word occurs.
To obtain such a list, you need sort to compare numeric values instead of strings. It would also be nice if the largest number were printed first; by default, sort prints the largest number last. To satisfy both of these requirements, you specify the -n (n as in numeric) and -r (r as in reverse) options to the sort command:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn
By piping the output to head, you can get an idea of what the ten most repeated words are:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc | tr 'A-Z' 'a-z' | tr -s ' ' | tr ' ' '\n' | sort | uniq -c | sort -rn | head 389 the 164 to 127 of 115 is 115 and 111 a 80 files 70 file 69 in 65 '
Sorting Numbers in a Different Column
In the preceding output, you were able to use the sort -rn command to sort the output by number because the numbers occurred in the first column. If the numbers occurred in any other column, this would not be possible with those options alone.
Suppose the output looked like the following:
$ cat switched.txt
files 80
file 70
is 115
and 115
a 111
in 69
' 65
the 389
to 164
of 127
Now you need to tell sort to sort on the second column; you cannot simply use the -r and -n options. You also need the -k (k as in key) option.
The sort command constructs a key for each line in the file, and then it arranges these keys into sorted order. By default, the key spans the entire line. The -k option gives you the flexibility of telling sort where the key should begin and where it should end, in terms of columns.
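You can see the default behavior by sorting switched.txt with no options; each entire line is compared as text, so the lines come out in alphabetical order of the words rather than in order of the counts (the exact position of the ' line depends on your locale):
$ sort switched.txt
' 65
a 111
and 115
file 70
files 80
in 69
is 115
of 127
the 389
to 164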
The number of columns in a line is the number of individual words on that line. For example, the following line contains three columns:
files 80 100
The basic syntax of the -k option is
sort -k start,end files
Here start is the starting column for the key, and end is the ending column for the key. The first column is 1, the second column is 2, and so on.
For the switched.txt file, start and end are both 2 because there are only two columns and you want to sort on the second one. The command you use is
$ sort -rn -k 2,2 switched.txt
the 389
to 164
of 127
is 115
and 115
a 111
files 80
file 70
in 69
' 65
Because there are only two columns, you can omit the ending column as follows:
$ sort -rn -k 2 switched.txt