sorting multi-line blocks

with sed and awk


We encountered a way to use the sort command, which sorts lines, to sort blocks of lines. Each multi-line block is reduced to a single line, sort is applied to those lines, and the sorted lines are then broken back up into their original constituent lines. This is done by temporarily replacing newline characters with substitutes before the sort and restoring them afterwards. Voila, block sort.

 

 

The idea was introduced in the Robbins book. It depends on inserting into the file, as the first line of each block, a line containing whatever key value you wish to sort on. While this is artificial, and requires explicit effort to put the sort keys in place, it also offers the flexibility to sort on anything you want. Robbins used sample text in a file named my-friends, and a shell pipeline with awk doing all the character substitutions. We, however, wanted to avoid awk and get the job done with tr and sed instead. This was our strategy:

 

 
 - replace all \n's with a first control character (^X, octal 030, hex 18)
   use tr (not sed, because sed reads its input line by line and never sees the \n's)

 - replace pairs of the first with a second one (^Y, octal 031, hex 19)
   use sed (not tr, because tr handles only individual characters, not pairs or strings)

 - replace remaining firsts (those that were single) with a third (^Z, octal 032, hex 1A)
   use tr or sed

 - replace seconds (where there was a double \n, i.e. a blank line between blocks) with \n
   use tr or sed

 - sort (by lines; each whole block is now reduced to a single line)

 - double space (to restore the blank line between blocks)

 - replace thirds with \n (to turn lines back into the blocks from which they came)


and this was our implementation:


 cat my-friends | tr "\n" "\030"| sed 's/\o030\o030/\o031/g' | sed 's/\o030/\o032/g' | sed 's/\o031/\n/g' | sort -f | sed G | sed 's/\o032/\n/g'
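To see what sort actually receives, the stages before it can be run on a tiny stand-in for my-friends. This is only a sketch: the two blocks and the function name sample_blocks are made up for illustration, and it assumes GNU sed (for \o030 and for \n in the replacement) and a cat that supports -v to make the control characters visible.

```shell
# Hypothetical two-block sample; the real data is in my-friends.
sample_blocks() {
    printf '# SORTKEY: Zed\nZed Z\n1 Elm St\n\n# SORTKEY: Abe\nAbe A\n9 Oak Ave\n'
}

# Run only the stages that precede sort, then show control characters
# in caret notation with cat -v.
collapsed=$(sample_blocks |
    tr "\n" "\030" |
    sed 's/\o030\o030/\o031/g' |
    sed 's/\o030/\o032/g' |
    sed 's/\o031/\n/g' |
    cat -v)

printf '%s\n' "$collapsed"
# # SORTKEY: Zed^ZZed Z^Z1 Elm St
# # SORTKEY: Abe^ZAbe A^Z9 Oak Ave^Z
```

Each block is now one line, ready for sort. Note the trailing ^Z on the last input block: it comes from the file's final newline, and it becomes one extra blank line after that block in the final output.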


Please do it both our way and Robbins' way. Obtain copies of the sample data in the file my-friends. There is also a file with the SORTKEY lines eliminated, named my-friends.no-sortkey-line. Both are in my-friends.zip. Download it and unzip it. Then perform the sort using sed/tr:

cat my-friends | tr "\n" "\030"| sed 's/\o030\o030/\o031/g' | sed 's/\o030/\o032/g' | sed 's/\o031/\n/g' | sort -f | sed G | sed 's/\o032/\n/g' | grep -v '# SORTKEY'

and again using Robbins' awk based shell pipeline:

cat my-friends | gawk -v RS="" '{ gsub("\n", "^Z"); print }' | sort -f | gawk -v ORS="\n\n" '{ gsub("^Z", "\n"); print }' | grep -v '# SORTKEY'

(Tip: to insert the ctrl-Z characters literally from the keyboard, type ctrl-V followed in quick succession by ctrl-Z. A single ctrl-Z character is inserted into the text. You could feed the text to xxd to reveal such characters, where they appear as 1a.) In doing it both ways, I suggest breaking the pipelines down into their components. Start with just the first command. Then add the second command with a pipe and run it again. Then the third, and so on, one at a time, to grasp what each stage adds toward the goal.
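If typing a literal ctrl-Z is awkward, one possible alternative is to have printf generate the byte and hand it to awk with -v. The sketch below is not Robbins' code: the variable z, the function sample_blocks, and its two blocks are invented for illustration, and plain awk stands in for gawk.

```shell
# Hypothetical two-block sample; the real data is in my-friends.
sample_blocks() {
    printf '# SORTKEY: Zed\nZed Z\n1 Elm St\n\n# SORTKEY: Abe\nAbe A\n9 Oak Ave\n'
}

# One ctrl-Z byte (octal 032, hex 1a), produced without typing it.
z=$(printf '\032')

sorted=$(sample_blocks |
    awk -v RS="" -v z="$z" '{ gsub("\n", z); print }' |       # block -> one line
    sort -f |
    awk -v ORS="\n\n" -v z="$z" '{ gsub(z, "\n"); print }' |  # line -> block
    grep -v '# SORTKEY')

printf '%s\n' "$sorted"
# Abe A
# 9 Oak Ave
#
# Zed Z
# 1 Elm St
```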

Yet a third variation on this task appears in the classic and definitive book on awk by its authors and namesakes Aho, Weinberger, and Kernighan, The AWK Programming Language. It uses awk and its specific features more fully than did Robbins. It does so in script msort.sh:

# pipeline to sort address list by last names

awk '
BEGIN { RS = ""; FS = "\n" }
      { printf("%s!!#", x[split($1, x, " ")])
        for (i = 1; i <= NF; i++)
            printf("%s%s", $i, i < NF ? "!!#" : "\n")
      }
' |
sort |
awk '
BEGIN { FS = "!!#" }
      { for (i = 2; i <= NF; i++)
            printf("%s\n", $i)
        printf("\n")
      }
'

The AWK Programming Language p84
(Appreciate that this is a shell script, not an awk script. The code within it that awk executes is the text between each pair of single quotes.)


You have it; it was included in the zip file you downloaded earlier. So was a modified version of the sample text file, one that dispenses with the SORTKEY lines. That's because msort.sh doesn't rely on such a line. It picks out what to sort on from within the text itself. Sort again:

cat my-friends.no-sortkey-line | ./msort.sh

All three sorts produced the same output using similar methods. msort.sh gives us the opportunity to look at several awk-specific features. Here are several that figure in the above script.

RS - input record separator

FS - input field separator

NF - number of fields in current record

arrays

split( ) - distribute fields-in-record into elements-in-array

for loops

conditional expression    expr1 ? expr2 : expr3
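As a small illustration of the for loop, NF, and the conditional expression working together, here is a sketch in the style of msort.sh's printf line. The input and the comma separator are made up for the example; msort.sh itself uses "!!#".

```shell
# Join the fields of each line with commas: the conditional expression
# picks "," between fields and "\n" after the last one.
joined=$(printf 'a b c\nd e\n' |
    awk '{ for (i = 1; i <= NF; i++)
               printf("%s%s", $i, i < NF ? "," : "\n") }')

printf '%s\n' "$joined"
# a,b,c
# d,e
```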

 

  Multiline Records

    By default, records are separated by newlines, so the terms
    "line" and "record" are normally synonymous. The default record
    separator can be changed in a limited way, however, by assigning
    a new value to the built-in record-separator variable RS. If
    RS is set to the null string, as in

          BEGIN { RS = "" }

    then records are separated by one or more blank lines and each
    record can therefore occupy several lines. Setting RS back to
    newline with the assignment RS = "\n" restores the default
    behavior. With multiline records, no matter what value FS has,
    newline is always one of the field separators.
       A common way to process multiline records is to use

          BEGIN { RS = ""; FS = "\n" }

    to set the record separator to one or more blank lines and the
    field separator to a newline alone; each line is thus a separate
    field.

    The AWK Programming Language, pp. 60-61
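A quick way to watch this behavior is to print the record number, the field count, and the first field of each record. The sketch below uses an invented two-entry address list (the function name addresses is hypothetical), not the my-friends data.

```shell
# Hypothetical address list: two records separated by a blank line.
addresses() {
    printf 'Ann Lee\n12 Pine Rd\n\nBob Ray\n7 Hill St\nApt 3\n'
}

# With RS = "" each block is one record; with FS = "\n" each line is one field.
summary=$(addresses |
    awk 'BEGIN { RS = ""; FS = "\n" }
         { printf("record %d: %d fields, first = %s\n", NR, NF, $1) }')

printf '%s\n' "$summary"
# record 1: 2 fields, first = Ann Lee
# record 2: 3 fields, first = Bob Ray
```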
 

Understand what the script does. To help, create a truncated version of msort.sh that ends before the "sort" command it contains. (Copy msort.sh under another name, remove the pipe symbol at the end of the line above sort, then remove the sort line and everything after it.) Feed the input to that, and observe the intermediate data that sort will see.

What about the role of split($1, x, " ")? It splits the first line of each record into the array x and returns the number of elements created. What, therefore, does x[split($1, x, " ")] signify?
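One way to explore the question is to call split on a single line and print what it returns alongside the array it fills. This is just an experiment on a made-up name, with n as an illustrative variable; the final step of the reasoning is left to you.

```shell
# Split one line on blanks, then show the return value and every element.
parts=$(echo 'Mary Ann Jones' |
    awk '{ n = split($0, x, " ")
           print "split returned " n
           for (i = 1; i <= n; i++)
               print "x[" i "] = " x[i] }')

printf '%s\n' "$parts"
# split returned 3
# x[1] = Mary
# x[2] = Ann
# x[3] = Jones
```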