extract data from structured text files


Inspiration: Classic Shell Scripting  Arnold Robbins and Nelson A. F. Beebe, O'Reilly & Associates, 2005; Chapter 5 "Pipelines Can Do Amazing Things"

"In this chapter we solve several relatively simple text processing jobs. What's interesting about all the examples here is that they are scripts build from simple pipelines: chains of one command hooked into another. Yet each one accomplishes a significant task."

----------


Sometimes we may wish to perform a text-to-text transformation, producing a pretty-format report for human consumption from an existing storage-format data source. In this example the passwd file is used from which to produce a desired report. It demonstrates the power of pipelines, regular expressions, sed for search-and-replace, and gawk for text field manipulation.

The passwd file has a well known, 7-field stucture. All but the 5th field have well defined content. That field is in effect a comment. Standard utilities don't use it for anything. You can put whatever you want in it. Imagine the passwd records for several U.S. presidents, in which the comment field is utilized to hold a president's name, home state, and party affiliation. From it, we wish to programmatically produce a report with certain content and format. Both the passwd file (raw material) and derived report (finished product) are shown below. (Obtain the abbreviated passwd file shown, holding four past U.S. presidents.)


How do we get from here to there? We will first produce a reduced starter file that gets rid of the parts of the password file we don't want. Then from that, we'll make 3 intermediate files corresponding to the 3 columns above: name, state, political party. We can use gawk to select out just the parts of the /etc/passwd records we do want, which are the 1st and 5th fields. Test it manually on a single record by explicitly feeding gawk just the one record:

echo truman:x:664:666:Harry S. Truman/Missouri/Democrat:/home/truman:/bin/bash | gawk -F: '{ print $1 ":" $5 }'

It extracts just the fields you specify. Or do it for all records, the whole file, by handing gawk the file on the command line. Please execute:

gawk -F: '{ print $1 ":" $5 }' ppasswd

(While you could manually type these commands, that's tedious and error prone. I suggest instead you copy them from this page and paste them into your shell. After copying to the clipboard a simple right click within your terminal window may suffice to paste, depending which terminal window program you are running. Paste at the prompt, hit enter, and you just executed the indicated command without risk of typo.)

This is the data and format we want at this stage:


Now we want to disaggregate this into 3 intermediate working files. :One will hold the person's name, another his state, and finally the political party. Each will additionally retain the user account name ("truman" as opposed to "Harry S. Truman"), for use as a key field for associating the right records between files. We will use pipelines to do it, liberally employing regular expressions (in sed). First, you will do it on a manual, single-record basis. Once you understand what is happening to each record, you will do it on a full file-wide basis covering all records in the file.

We use President Truman's record as our sample. Carrying the above "truman:Harry S. Truman/Missouri/Democrat" format a stage further, execute:

echo truman:Harry S. Truman/Missouri/Democrat | sed 's=/.*=='

This turns  truman:Harry S. Truman/Missouri/Democrat  into  truman:Harry S. Truman

and then, additively (you could use command history to recover the last command, then pipe-extend it):

echo truman:Harry S. Truman/Missouri/Democrat | sed 's=/.*==' | sed -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2='

This turns it further into last, first middle form as  truman:Truman, Harry S.

That's the form we want for the first of our 3 intermediate files. Detailed analysis of this regular expression is depicted below, in the section titled OPERATION 1.

The format we'll want for the second file follows this pattern:

echo truman:Harry S. Truman/Missouri/Democrat | sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2='

It turns  truman:Harry S. Truman/Missouri/Democrat  into  truman:Missouri

This regular expression is annotated below in the section titled OPERATION 2.

And for the third file's format:

echo truman:Harry S. Truman/Missouri/Democrat | sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)$=\1:\2='

which turns truman:Harry S. Truman/Missouri/Democrat  into  truman:Democrat

The regular expression is detailed in the section below titled OPERATION 3.

Now make whole files, using the same regular expressions you just tested:

gawk -F: '{ print $1 ":" $5 }' ppasswd  >  USER


sed 's=/.*==' USER | sed -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2=' | sort > PERSON

sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2=' USER | sort > STATE

sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)$=\1:\2=' USER | sort > PARTY


With these intermediate files prepared, holding the pieces of data desired for the final report in the desired form, let's join them to generate that report. (Note above that we made sure to sort our keyed intermediate files to meet the requirements for join-ing them now.) I suggest you do it in stages to reveal what is going on:

join -t: PERSON STATE

join -t: PERSON STATE | join -t: - PARTY

join -t: PERSON STATE | join -t: - PARTY | cut -d: -f 2-

join -t: PERSON STATE | join -t: - PARTY | cut -d: -f 2- | sort -t: -k1,1 -k2,2 -k3,3

join -t: PERSON STATE | join -t: - PARTY | cut -d: -f 2- | sort -t: -k1,1 -k2,2 -k3,3 | awk -F: '{ printf("%-39s\t%s\t%s\n", $1, $2, $3) }'

There is the report. Isn't it pretty?

Here is a script that packages all this for programatic execution.

 

----------
OPERATION 1 - basis for producing the intermediate file PERSON from the file USER 

Objective: account name, colon, person's name in form last, comma, first and middle

sed 's=/.*=='

applied to: 

truman:Harry S. Truman/Missouri/Democrat

COLOR MEANING CONTENT Also known as
green a slash followed by a run (*) of any characters (.) /Missouri/Democrat  

output expression is input with replacement (of matched expression, with nothing)
so in this case
input truman:Harry S. Truman/Missouri/Democrat
 yields
output truman:Harry S. Truman

then:

sed -e 's=^\([^:]*\):\(.*\) \([^ ]*\)=\1:\3, \2='

applied to: 

truman:Harry S. Truman

COLOR MEANING CONTENT Also known as
grey beginning of line    
green a run ( * ) of characters that are not colons ( [^:] ) truman \1
red a colon :  
turquoiise a run ( * ) of any characters ( . ) Harry S. \2
uncolored a space the space between . and T  
purple a run ( * ) of characters that are not spaces ( [^ ] ) Truman \3

output expression is \1:\3, \2
so in this case
input truman:Harry S. Truman
 yields
output truman:Truman, Harry S.


----------
OPERATION 2 - basis for producing the intermediate file STATE from the file USER

Objective: account name, colon, person's state

sed -e 's=^\([^:]*\):[^/]*/\([^/]*\)/.*$=\1:\2='

applied to: 

truman:Harry S. Truman/Missouri/Democrat

COLOR MEANING CONTENT Also known as
grey beginning of line    
green a run ( * ) of characters that are not colons ( [^:] ) truman \1
red a colon ( : ) followed by a run ( * ) of characters that are not slashes ( [^/] ) followed by a slash :Harry S. Truman/  
turquoiise a run ( * ) of characters that are not slashes ( [^/] ) Missouri \2
purple a slash ( / ) followed by a run ( * ) of any characters ( . ) at the end of the line ( $ ) /Democrat  

output expression is \1:\2
so in this case
input truman:Harry S. Truman/Missouri/Democrat
 yields
output truman:Missouri


----------
OPERATION 3 - basis for producing the intermediate file PARTY file from the file USER 

Objective: account name, colon, person's political party

sed -e 's=^\([^:]*\):[^/]*/[^/]*/\([^/]*\)$=\1:\2='

applied to: 

truman:Harry S. Truman/Missouri/Democrat

COLOR MEANING CONTENT Also known as
grey beginning of line    
green a run ( * ) of characters that are not colons ( [^:] ) truman \1
red a colon ( : ) followed by a run ( * ) of characters that are not slashes ( [^/] ) followed by a slash ( / ) followed by another run of characters that are not slashes followed by another slash :Harry S. Truman/Missouri/  
turquoiise a run ( * ) of characters that are not slashes ( [^/] ) at the end of the line ( $ ) Democrat \2

output expression is \1:\2
so in this case
input truman:Harry S. Truman/Missouri/Democrat
 yields
output truman:Democrat