Sometimes you don't care about the actual lines in a file that match a particular word; you just want a list of all the files that contain that word.
For example, the following commands look for the word delete in all the files in my projects directory:
$ cd /home/ranga/docs/projects
$ grep delete *
In my case, it produces the following output:
pqops.c:/* Function to delete a node from the heap. Adapted from Introduction
pqops.c:void heap_delete(binary_heap *a,int i) {
pqops.c: node deleted;
pqops.c: /* return with an error if the input is invalid, ie trying to delete
pqops.c: sprintf(messages,"heap_delete(): %d, no such element.",i);
pqops.c: /* switch the item to be deleted with the last item, and then
pqops.c: deleted = a->elements[i];
pqops.c: /* (compare_priority(a->elements[i],deleted)) ? heap_up(a,i) : heap_down(a,i); */
pqops.h:extern void heap_delete(binary_heap *a,int i);
scheduler.c: /* if the requested id is in the heap, delete it */
scheduler.c: heap_delete(&my_heap,node_num);
As you look at the output, you see that only three files (pqops.c, pqops.h, and scheduler.c) contain the word delete.
Here you had to generate a list of matching lines and then manually look at the filenames in which those lines were contained. By using the -l option of the grep command, you reach this conclusion much faster. For example, the following command
$ grep -l delete *
pqops.c
pqops.h
scheduler.c
produces the list you wanted.
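To try the -l option without touching real project files, something like the following sketch can be used. The scratch directory, the file names a.txt, b.txt, and c.txt, and their contents are made up for illustration, and the mktemp command is assumed to be available on your system.

```shell
# Create a throwaway directory with three small sample files (hypothetical names).
dir=$(mktemp -d)
printf 'delete this line\nkeep this line\n' > "$dir/a.txt"
printf 'nothing to see here\n' > "$dir/b.txt"
printf 'please delete me too\n' > "$dir/c.txt"

# -l prints the name of each matching file once instead of the matching lines.
matches=$(cd "$dir" && grep -l delete *)
echo "$matches"

# Clean up the scratch directory.
rm -rf "$dir"
```

Only a.txt and c.txt are listed, because b.txt does not contain the word delete.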
Counting words is an essential capability in shell scripts. There are many ways to do it, with the easiest being the wc command. Unfortunately, it displays only the number of characters, words, or lines.
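As a quick reminder of what wc reports, here is a minimal sketch. The two-line sample text is invented for this example; the -l, -w, and -c options select lines, words, and bytes respectively.

```shell
# Hypothetical sample text: two lines, five words.
sample='the quick brown
fox jumps'

lines=$(printf '%s\n' "$sample" | wc -l)
words=$(printf '%s\n' "$sample" | wc -w)
chars=$(printf '%s\n' "$sample" | wc -c)

# Some wc implementations pad the number with spaces;
# leaving the variables unquoted lets the shell trim that padding.
echo lines=$lines words=$words chars=$chars
```

Each count is a total for the whole input; none of them says how often an individual word appears.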
What about when you need to count the number of occurrences of a word in a file? The wc command falls short. In this section, you will solve this problem using the following commands:
The tr command (tr for transliterate) changes all the characters in one set into characters in a second set. It can also delete sets of characters.
The sort command sorts the lines in an input file. If you don't specify an input file, it sorts the lines given on STDIN.
The uniq command (uniq for unique) prints all the unique lines in a file. If a line occurs multiple times in a row, only one copy of the line is printed out. Because uniq compares only adjacent lines, its input is normally sorted first. It can also list the number of times a particular line was duplicated.
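As a small illustration of how sort and uniq cooperate, the following sketch counts duplicate lines in a made-up word list, one word per line. The -c option asks uniq to prefix each line with its count; the exact width of the padding in front of the count varies between implementations.

```shell
# uniq only collapses adjacent duplicate lines, so sort the input first.
# The word list here is invented for illustration.
counts=$(printf 'to\nthe\nto\nfiles\nthe\nto\n' | sort | uniq -c)
echo "$counts"
```

The result has one line per distinct word: files with a count of 1, the with 2, and to with 3.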
I will use the text of this chapter, ch15.doc, as the input file for this example.
First, you need to eliminate all the punctuation and delimiters in the input file because the word end. and the word end are the same. You accomplish this task using the tr command. Its basic syntax is
tr 'set1' 'set2'
Here tr takes all the characters in set1 and transliterates them to the characters in set2. Usually, the characters themselves are used, but the standard C language escape sequences also work.
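A minimal sketch of both forms, using sample strings invented for this example: the first command uses the \n escape sequence in set2, and the second maps literal characters one for one.

```shell
# Translate each colon into a newline; \n is a C-style escape that tr understands.
fields=$(printf 'usr:local:bin\n' | tr ':' '\n')
echo "$fields"

# Character-for-character mapping: a becomes x, b becomes y, c becomes z.
mapped=$(echo 'aabbcc' | tr 'abc' 'xyz')
echo "$mapped"
```

The first echo prints usr, local, and bin on separate lines; the second prints xxyyzz.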
To accomplish my first task, I used the following command:
$ tr '!?":;\[\]{}(),.' ' ' < /home/ranga/docs/ch15.doc
Here I specified set2 as the space character because words separated by the characters in set1 need to remain separate after the punctuation is removed.
Notice that the characters [ and ] are given as \[ and \]. As you will see later in this chapter, these two characters have a special meaning in tr and need to be escaped using the backslash character (\) in order to be handled correctly.
At this point most of the words are separated by spaces, but some of the words are separated by tabs and newlines. To get an accurate count, all the words should be separated by spaces, so you need to convert all tabs and newlines to spaces:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc
The next step is to transliterate all capitalized versions of words to a lowercase version because the words To and to, The and the, and Files and files are really the same word. To do this, you tell tr to change all the capital characters 'A-Z' into lowercase characters 'a-z' as follows:
$ tr '!?":;\[\]{}(),.\t\n' ' ' < /home/ranga/docs/ch15.doc |
tr 'A-Z' 'a-z'
I broke the command into two lines, with the pipe character as the last character in the first line, so that the shell does the right thing and uses the next line as the command to pipe to. This also makes the command easier to read and to cut and paste.
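To see the effect of these two tr steps without the chapter file at hand, the sketch below runs them on a short sample. The sample text is made up, and the example assumes a tr that expands a one-character set2 to cover all of set1 (GNU and BSD tr behave this way).

```shell
# Hypothetical two-line sample standing in for ch15.doc.
sample='The end. To be, or not to be?
Files and files!'

# Step 1: punctuation, tabs, and newlines become spaces.
# Step 2: uppercase letters become lowercase.
result=$(printf '%s\n' "$sample" |
    tr '!?":;\[\]{}(),.\t\n' ' ' |
    tr 'A-Z' 'a-z')
echo "$result"
```

Every word now appears in lowercase with no punctuation attached, separated only by one or more spaces.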
Note: Differences between tr versions
In this example, you are using a single space for set2. Most versions of tr interpret this to mean transliterating all the characters in set1 to a space. Some versions of tr do not do this.
You can determine whether your tr works in this manner using the test code:
$ echo "Hello, my dear!" | tr ',!' ' '
Most versions of tr produce the following output:
Hello  my dear
Some versions produce the following output instead:
Hello  my dear!
To obtain the desired behavior from these versions of tr, make sure that set1 and set2 have the same number of characters. In this case, set2 needs to contain two spaces:
$ echo "Hello, my dear!" | tr ',!' '  '
In the case of the sample problem, set2 would need to contain 15 spaces.
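Typing 15 spaces by hand is error prone. One possible way around this, shown here as a sketch rather than as the chapter's own method, is to let printf build the string of spaces; the variable names and the short test input are hypothetical.

```shell
# set1 has two characters, so set2 is given two spaces explicitly.
# This works on tr versions that pad set2 and on those that do not.
cleaned=$(echo "Hello, my dear!" | tr ',!' '  ')
echo "$cleaned"

# Build a string of exactly 15 spaces to match the 15 characters in set1.
spaces=$(printf '%15s' '')
padded=$(printf 'a,b.\tc\n' | tr '!?":;\[\]{}(),.\t\n' "$spaces")
echo "$padded"
```

Because set1 and set2 have the same length in both commands, the result does not depend on how a particular tr treats a short set2.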