7. Awk Interface / Awk Primer

This section covers the Awk interface, which allows preprocessing of input files, as well as postprocessing of output files.

You can even write multiple output files with it.

Feel free to mail your comments to me at: rog@NOSPAM_rs-freeware.org.



1:  The GNU Awk documentation: boring bandwidth . . . or "words of wisdom?"

2:  Awk Overview for ADB users

3:  A quick Awk Tour

4:  Awk's [s]printf for C/C++/Java programmers

5:  Overview of ADB's Awk Interface

6:  Awk Interface: small (fixed) tables and massaging for ADB users

7:  Using the ADBAWKLIB environmental variable for global functions

8:  Using the ADBAWKLIB environmental variable for global data

9:  More on Importing Global Application Data

10:  Exporting Global Application Data

11:  GAwk's "system" command (another way to output global data)

12:  How to write debugging messages when testing Awk scripts

13:  How to signal errors from within an Awk script, in order to stop the processing "*IMMEDIATELY*"

14:  Multiple output files with ADB_BldOutLineFnc and FileMux.Exe

15:  Testing your Awk scripts

16:  Adding "phantom" (calculation result) fields to input files

17:  Index of ADB's Awk Interface functions


1:  The GNU Awk documentation: boring bandwidth . . . or "words of wisdom?"

You guessed it.

If you really want to get the most out of Awk, and/or ADB's Awk interface, you absolutely must read the documentation.

This file is called GAwk.Doc, and it's part of the main distribution zip for ADB.

Although I do my best to provide you with many examples in this section, there's absolutely no question that you need to "RTFM" (read the "fine" manual).

Alas, the documentation was written by Unix aficionados, who put "pattern matching" on a pedestal.  Just ignore that aspect of it, and pay close attention to the built-in functions and variables, as well as the data types, and array processing.

Go ahead and read this documentation first (i.e. this entire section).  But if you want to get the most out of Awk, don't forget about GAwk.Doc.


2:  Awk Overview for ADB users

(If you're not interested in ADB, you can skip this section.)

Awk is a very simple scripting language that has been used for decades in the Unix environment.  The Free Software Foundation has implemented GNU Awk for DOS.  ADB has been specially designed to let you use Awk scripts in two places: to preprocess input files, and to postprocess output files.


3:  A quick Awk Tour

(You can skip this section if you're already familiar with Awk. Alternatively you can read GAwk.Doc in the GAwk zip file that comes with the distribution.)

Although Awk is typically described as a "pattern matching" language, I've found that it makes a lot more sense just to view it as a kind of script-builder for what's traditionally known as a "filter."

By a "filter," I mean a program that takes one input data file and writes one output data file.  (Later in this section, I'll show you how the utility FileMux.Exe can be used to easily "simulate" multiple output files.)

If you're used to the DOS or Unix command lines, you'll recognize this type of processing structure:


NoDups.Exe <InFile.Dat >OutFile.Dat

In this very simple example, NoDups.Exe is a (hypothetical) executable program that reads the file InFile.Dat and copies each record to OutFile.Dat, provided that the record isn't identical to the previous InFile.Dat record.  (Obviously the program won't catch every duplicate unless the input is sorted.)

GNU Awk has a syntax which is fairly similar to this:


GAwk -f AwkScript.Awk <InFile.Dat >OutFile.Dat

Here, the "-f AwkScript.Awk" part of the command line is there to specify the script file itself.

WARNING:
GNU Awk is the only program in this package which uses "Unix" command-line conventions.  That means that there must be a space between the parameter (argument) name, and any supplied value.  In other words, you can't omit the space after the -f.  However, for all other programs in this package, you must not put a space between the name of the parameter and any supplied value.


As I recommended earlier, let's look at GAwk as a simple filter and forget about all the mumbo-jumbo in the documentation that refers to "pattern matching."

If we take that view, then there are just three parts to a very simple Awk script: (1) initialization (statements that are processed before a record is read); (2) the loop processing (statements that process each record); and (3) the end-of-job processing (statements that run after all records have been read).

Let's see just how easy it is to code a simple Awk script, by diving right into a simple example.

Suppose we have an invoice file that has already been sorted by customer number.  It has just two fields: a customer number (the first 4 bytes of the record), and an invoice amount (represented as a fixed decimal set of ASC digits, which is a total of 8 bytes long).

So here's what a typical input file might look like:


1111   10.00
1111   15.00
2222   25.00
2222   10.00

For example: customer 1111 has one invoice for $10.00; and another for $15.00.

Now if you already know some variant of C (C, C++, VC++, etc.) and/or Java[Script], you may find that learning Awk takes you less time than it has ever taken you (or ever will take you) to learn a programming language!


Let's write the Awk script which will output a file of totals for each customer number.


function Print_Cust_Tots() {
         printf("Cust No=%s Total=%9.2f\n", Last_CustNo, Cust_Tot);
         Cust_Tot = 0.0;
}

BEGIN {
       FIELDWIDTHS = "4 8";   # Widths of the fields
       Rec_Seq     = 0;       # This is the rec #
       Cust_Tot    = 0.0;     # Total for current cust
}

{
      Curr_CustNo  = $1 "";   # Cust # is field #1, string
      Invoice_Amt  = $2 + 0;  # Invoice amount, field #2

      if ((Rec_Seq++ > 0) && (Last_CustNo != Curr_CustNo))
         {Print_Cust_Tots();}

      Cust_Tot    += Invoice_Amt;
      Last_CustNo  =  Curr_CustNo;
}

END  {if (Rec_Seq > 0) Print_Cust_Tots();  # Flush last
     }

The code in the BEGIN block is what gets executed before any records are read; and the code in the END block is run at EOF on input.

The block in the middle gets executed for each record.

Although I didn't really need a function to print the customer totals, I decided to code one, just in case I wanted to expand the processing a little bit (for example, to check for an unsorted input file).
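The sorted-input check mentioned above could be sketched roughly as follows.  This is only an illustration: the function name Check_Cust_Order is made up for this example, and the BEGIN block simply feeds it a few sample customer numbers instead of reading a file.

```awk
# Returns 1 if Curr_CustNo may legally follow Last_CustNo in a sorted
# file, else 0.  (Check_Cust_Order is a hypothetical helper name.)
function Check_Cust_Order(Last_CustNo, Curr_CustNo) {
         # Appending "" forces a string (not a numeric) comparison
         return ((Last_CustNo "") <= (Curr_CustNo "")) ? 1 : 0;
}

BEGIN {
       print Check_Cust_Order("1111", "1111");   # 1: same customer
       print Check_Cust_Order("1111", "2222");   # 1: ascending
       print Check_Cust_Order("2222", "1111");   # 0: out of sequence
}
```

In the per-record block, you'd call it just before Last_CustNo is updated, and report the problem whenever it returns 0.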

As you can see, the syntax is essentially identical to that of C or Java[Script].  The one aspect of Awk that might be unfamiliar to JS programmers is the Awk "printf" statement (used in the Print_Cust_Tots function).  I'll talk about how to code those later on in this documentation.

If you're a C/C++/Java programmer who's already familiar with printf, you should note that Awk doesn't handle long integer fields correctly.  You should always use %f specifications (with a precision of 0) instead of %d or %ld to get the best results.

This is such an important "booby trap" in Awk that I'll mention it again later in this section.

Note that the BEGIN block is where my field lengths and offsets get defined.

FIELDWIDTHS = "4 8" means that the first field is 4 bytes long, and the second is 8 bytes long.

Since variables in GAwk have no type until they're assigned, the input fields have no definite type until I assign them to values.

The statement:


Curr_CustNo = $1 "";

essentially declares Curr_CustNo to be the first field, and defines it as a string.

The reason for the type conversion is that Awk assumes string concatenation whenever two values (variables or constants) are placed next to each other with no operator between them.

JavaScript programmers will recognize this basic idea (dynamic variable definition and typing).

The only special nuance here is that no plus sign is required for the string concatenation.

WARNING:
if you accidentally leave out an arithmetic operator, this can cause unexpected type conversions, as in: a=1; b=2; c=a b; # whoops . . . c is now "12"!

The statement:


Invoice_Amt = $2 + 0;

does a similar thing--i.e. the Invoice_Amt variable is declared by assigning it.

Since a "+ 0" is part of the assignment, GAwk assumes that the value is numeric.

Note that if there was non-numeric data in this field, GAwk would parse as many digits as it could, discarding the rest.
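Here's a small sketch of those conversion rules in action.  It's a standalone script (it reads no input), and the values are arbitrary samples chosen just for illustration.

```awk
BEGIN {
       a = 1; b = 2;
       c = a b;             # No operator: concatenation, so c is "12"
       print c;             # 12
       print c + 1;         # 13 (the string "12" is used as a number)

       print "15.00" + 0;   # 15
       print "12abc" + 0;   # 12 (parsing stops at the "a")
       print "abc" + 0;     # 0  (no leading digits at all)
}
```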


WARNING:
Awk expects a negative sign for a number to either follow the number on the right, or to precede the number on the left (with no intervening spaces).  Although this is standard practice in most numeric formats, you'll have to write a special-purpose conversion routine and define the field as "special" if this is not so.  See the example for one such conversion routine.
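To give a rough idea of what a conversion routine can look like, here's a minimal sketch.  It assumes (purely for the sake of illustration) that negative amounts arrive in accounting style, wrapped in parentheses.  The function name Paren_To_Number is hypothetical, and this is not necessarily the routine that the example referred to above uses.

```awk
# Converts accounting-style text such as "(10.00)" to -10.
# (Paren_To_Number is a hypothetical name; the parenthesized format
# is an assumption made only for this illustration.)
function Paren_To_Number(Str) {
         gsub(/ /, "", Str);                    # Drop padding spaces
         if (Str ~ /^\(.*\)$/)                  # Wrapped in ( and )?
            return -(substr(Str, 2, length(Str) - 2) + 0);
         return Str + 0;                        # Ordinary number
}

BEGIN {
       print Paren_To_Number(" (10.00)");   # -10
       print Paren_To_Number("   15.50");   # 15.5
}
```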


Line comments are done with the pound sign (aka "hash mark"), as with the Unix Bourne shell.  Unlike C or Java, Awk offers no multiline comment convention.


Awk has only two basic data types: string and floating point.

Because strings are a primitive type in Awk, you can test for equality between a string variable and a string constant (or another string variable) "directly." This is the norm in Java[script], but not in C.  For example, you can write:


if (Cust_Name == "Smith") { ... }  # Do stuff for Smith


Other than strings and floating point numbers, only functions and arrays are recognized.

Arrays are also dynamically created, and are (nominally) only single-dimensional.  But since they're indexed by strings, you can simulate multidimensional arrays (see the documentation file GAwk.Doc in the GAwk zip file for more details.)
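As a quick sketch of the multidimensional trick: when you code two subscripts separated by a comma, Awk joins them into a single string key.  The array name Rate and the year subscript are invented for this illustration.

```awk
BEGIN {
       # Awk joins the two subscripts into one string key
       Rate["CA", "2023"] = 0.085;     # Sample values, for illustration only
       Rate["NY", "2023"] = 0.0925;

       if (("CA", "2023") in Rate)    print "CA found";
       if (!(("TX", "2023") in Rate)) print "TX missing";

       print Rate["NY", "2023"];       # 0.0925
}
```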

Awk has a special operator ("in") that allows you to get the most out of the fact that arrays are indexed by strings:


if (val in array) { ... }

For example:


if (State_Code in Sales_Tax_By_State_Array) { # Found state code
   Sales_Tax = Sales_Tax_By_State_Array[State_Code];
   }
else {
   Sales_Tax = 0; Err_Flag = "9";   # Oops, missing state code
   }


Functions can be defined with both passed parameters as well as local variables.

Unfortunately, because local variables aren't a feature that was initially built into Awk, you must declare them as extra parameters at the end of the argument list (putting in a few spaces to clue the reader in).  E.g.:


function MyFunc(p1,p2,  l1,l2)  # p? are parms, l? are locals

The disadvantage of this is that there's no special detection available when you call a function with an insufficient number of parameters.  The missing parms will be initialized to either the null string or 0 (depending on how they're later used).
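Here's a short sketch of the local-variable convention.  The function Sum_To is made up for this illustration; it shows that a local named "i" doesn't disturb a global of the same name.

```awk
# n is the real parameter; i and Total are locals (note the extra spaces)
function Sum_To(n,   i, Total) {
         for (i = 1; i <= n; i++) Total += i;
         return Total;
}

BEGIN {
       i = 99;               # A global that happens to share a local's name
       print Sum_To(4);      # 10
       print i;              # 99 (the function's i was a separate variable)
}
```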


Functions can be recursive, and naturally they can call other functions.  They should be defined before the BEGIN block.

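One little quirk about functions: when you call a function that you've defined yourself, you can't put a space between the function's name and the opening parenthesis for the argument list.  (It's safest to leave that space out of the definition, too.  Note that the word "function" may be abbreviated as "func".)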


GAwk recognizes either CR/LF or LF as the input line separator. That means you can use it to read ASC text files created on a Unix system that have been transferred to you as binary.  I'm not aware of a mechanism that allows you to go the other way (i.e. to output ASC file lines under Unix conventions: LF but no CR.)


That completes my whirlwind tour of Awk and its features.  As you can see, it's at least as powerful and flexible as most DBMSs' built-in "programming languages."

And since it's so much like C or Java[Script], a large number of people can read the code without much in the way of special training.


Note that the file GAwk.Doc contains much more information about Awk's built-in functions and other special conventions.


You may have noticed that Awk doesn't have objects, classes, or collections, nor does it have type checking or record types, and there's no (true) dynamic allocation (other than that associated with arrays).  There's no concept of pointers or pointer incrementation/arithmetic, etc.  As far as I know, there's no elegant way to test for equality between arrays, nor any recognition for a data type of "function," or anything that even comes close to JavaScript's or LISP's concept of "eval".

But these aren't necessarily limitations when it comes to the kinds of small- and medium-sized applications which ADB is designed to handle.  In fact, the more you learn about Awk, the more you'll probably be tempted to use it (instead of a typical compiled language).  This is why you need to think in terms of "using the right tool for the right (sub-)process."


Let's return to our brief example.

Assuming the obvious file name assignments, running this statement from the DOS prompt:


GAwk -f AwkScript.Awk <InFile.Dat

yields this output:


Cust No=1111 Total=    25.00
Cust No=2222 Total=    35.00


4:  Awk's [s]printf for C/C++/Java programmers

If you're a JavaScript programmer who's unfamiliar with C's printf, I've written a special section for you that's in another part of this documentation.

If you're a C/C++/Java programmer who's already familiar with printf, you should note that Awk doesn't handle long integer fields correctly.

For the best (and most predictable) results, you should always use %f specifications (with a precision of 0) instead of %d or %ld.

Also, Awk's sprintf works differently than C's.  Instead of coding:


sprintf(Result_String, Format_String, Format_Args);

you code:


Result_String = sprintf(Format_String, Format_Args);

Note that Awk has no scanf function.  If you want to "convert" a string value to a number, you simply code:


String += 0;

If you want to convert a numeric value to a string, you code:


Number = Number ""

(This is because string concatenation is implicit.)

You might also want to search for the string "CONVFMT" in the GAwk documentation (GAwk.Doc).
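The points above can be pulled together in one small sketch.  The variable names and sample values are arbitrary, chosen only for illustration.

```awk
BEGIN {
       Big = 3000000000;                 # Too big for a 32-bit integer
       Str = sprintf("%.0f", Big);       # The result is returned, not passed
       print Str;                        # 3000000000

       Pi = 3.14159265;
       print (Pi "");                    # 3.14159 (default CONVFMT is "%.6g")
       CONVFMT = "%.2f";
       print (Pi "");                    # 3.14
       print (17 "");                    # 17 (whole numbers ignore CONVFMT)
}
```

Note that CONVFMT only governs the implicit number-to-string conversion; when you need a specific layout, sprintf is the more explicit choice.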


5:  Overview of ADB's Awk Interface

(If you're not interested in ADB, you can skip this section.)

When you use ADB with Awk, ADB does most of the "heavy lifting" for you, in terms of defining fields and creating all the other required infrastructure . . . so you can focus on writing the very small number of lines of code that you'll need in order to get the job done.

In fact, for most applications you need only write one Awk function, although you may write two others.


You must code a function called "ADB_Main." This function is what goes at the end of what I've been calling the "loop" block.

WARNING:
Don't put a space between the word "function" and "ADB_Main".  Also, don't abbreviate "function" to "func".  The same goes for the other two functions, ADB_BOJ and ADB_EOJ.


If you wish to write out the current record, you may call ADB_WriteFnc() (it takes no arguments).  Note that if a record isn't actually written out, then it's going to be ignored.


Let's return to the previous example's files.

The invoice file has already been sorted by customer number.  It has just two fields: a customer number (the first 4 bytes of the record), and an invoice amount (represented as a fixed decimal set of ASC digits, which is a total of 8 bytes long).

Let's say that the input file looks like this:


1111   10.00
1111   15.00
2222   25.00
2222   10.00

Suppose the fields are defined like so in your ADB definitions files:


FLD  Cust_No       4        # Customer # (treat as string)
FLD  Invoice_Amt__ 8 2      # Invoice amount (fixed-decimal)
FSET Cust_Srt  Cust_No END  # Sort on customer#
FSET Invoice_File           # Invoice file
     Cust_No; Invoice_Amt__ END

Note that I had to end the Invoice_Amt field in two underscores, because it's a fixed-decimal field.

Let's assume that we are writing an Awk script for the input file, and we only wish to select customers who have customer numbers that begin with "1".

Assume further that the file name is "invoice.xx" (i.e. the base file name is "invoice" and the project file suffix is "xx").

In this case, the Awk file would be called "invoice.xxA" (same name but with an "A" on the end of the file name extension).

Since master files can't have duplicates on sort keys, we'd declare this as a trans file, i.e.:


TRANS Invoice_File Cust_Srt SORT AWK

Note that we would also have to code "-TRANSinvoice" on the command line, to identify the physical file name of "invoice.xx" (remember: "xx" is the project file suffix.)

And let's assume that we're using the input Awk feature to select only customers who have customer numbers beginning in "1".


function ADB_Main() {
         if (substr(Cust_No,1,1) == "1") ADB_WriteFnc();
}

In this case we used the built-in GAwk substr function (see the GAwk documentation file, GAwk.Doc, which is in the GAwk zip file).

ADB did all the rest of the work for us.


Now if you've been following this documentation closely, you might notice that there's one niggling little problem with string values.

These string values are coming from a fixed-length record.  And that means that they have spaces padded on the end.

Fortunately, ADB's Awk interface will automatically strip trailing spaces from string fields, so that you can treat them as if they were formatted like C (or Java[script]) strings; i.e. strings of variable length.
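If you're using Awk on fixed-length records without ADB, you have to do that stripping yourself.  Here's a minimal sketch; RTrim is a hypothetical helper name, and nothing here is part of ADB's interface.

```awk
# Removes trailing spaces from a fixed-length field value.
# (RTrim is a hypothetical helper; ADB itself does this for you.)
function RTrim(Str) {
         sub(/ +$/, "", Str);
         return Str;
}

BEGIN {
       Name = RTrim("Smith     ");
       print "[" Name "]";          # [Smith]
       print length(Name);          # 5
}
```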


It's also worth noting that the field names you use in the GAwk interface are identical to those that you've coded in ADB.

So as a general rule, you don't have to think about either the sequence of fields or their lengths when using the Awk interface. You just code essentially the same statements that you'd code in your DBMS--but since everything is done with functions, you have a lot more power and flexibility than you might have in many interfaces (particularly as compared to "true" SQL).


In addition to the mandatory ADB_Main function (which is run for each record), you may also code ADB_BOJ and ADB_EOJ functions.

Your ADB_BOJ function is run before any records are processed, and your ADB_EOJ function is run after all records are processed.

You may also code any additional functions that you wish to use, in order to simplify the processing.


6:  Awk Interface: small (fixed) tables and massaging for ADB users

(If you're not interested in ADB, you can skip this section.)

Let's extend our example to cover two other basic operations: simple massaging, and the inclusion of values from "small fixed" tables, i.e. those that have relatively few values (say, fewer than 1,000) and which don't change very often.

By "don't change very often" what I mean is that updates to these tables don't occur as part of regularly scheduled jobstreams. (Although there's no reason in principle why you couldn't set up an Awk script to generate Awk code).

Suppose the fields are defined like so in your ADB definitions files:


FLD  Cust_No       4        # Customer # (treat as string)
FLD  Invoice_Amt__ 8 2      # Invoice amount (fixed-decimal)
FLD  Cust_State    2        # 2-char state code
FLD  Err_Flag      1        # Error flag (0 for no error)
FSET Cust_Srt  Cust_No END  # Sort on customer#
FSET Invoice_File           # Invoice file
     Cust_No; Invoice_Amt__;
     Cust_State; Err_Flag  END

As part of the preprocessing for this input file, we're going to look up the customer's 2-character state code in a table, determine a sales tax percentage, and then apply the sales tax to the invoice amount field.

If the customer's 2-character state code isn't found in the sales tax table, then we'll set the Err_Flag to "9" which is the designated error for bad state codes.

Invoice records that are flawed in this way (i.e. those with illegal state codes) will be caught later on in the jobstream (and separated from "good" invoice records).  Manual intervention will be required before any final updating cycle for this accounting period.  (Naturally, we expect that errors of this sort will be very rare indeed, and only caused by some other serious problem.)

So here's what the input records look like:


1111   10.00CA0
1111   15.00CA0
2222   25.00NY0
2222   10.00NY0

(For those of you from outside the states, I should explain that "CA" is the abbreviation for the American province of California, and "NY" is the abbreviation for the American province of New York.)

We're going to write three functions.  First, we need a function that defines the sales tax rate table:


function Define_Sales_Tax_Rate_Table() {
         Sales_Tax_By_State["CA"] = 0.085;
         Sales_Tax_By_State["NY"] = 0.0925;
}

This function needs to be run at the beginning of processing (we don't want to run it for each input record):


function ADB_BOJ() {Define_Sales_Tax_Rate_Table();}

Finally we need to apply the tax, and let's not forget to set the Err_Flag to "9" for unrecognized state codes.


function ADB_Main(  Sales_Tax) {
         if (Cust_State in Sales_Tax_By_State)
            Sales_Tax = Sales_Tax_By_State[Cust_State];
         else {Sales_Tax = 0; Err_Flag = "9";}  # Bad state code
         Invoice_Amt__ += (Sales_Tax * Invoice_Amt__);
         ADB_WriteFnc();  # Write the record out
}

Note that Sales_Tax is a local variable for the ADB_Main function (it will be called with no arguments).


7:  Using the ADBAWKLIB environmental variable for global functions

(If you're not interested in ADB, you can skip this section.)

There are many instances in which you may wish to store "global" functions that you'll need for many different Awk processes.

The previous example is an excellent illustration of this: the Define_Sales_Tax_Rate_Table function will probably come in handy in various other jobs, and/or needs to be put someplace where there will be only one copy of it.  (State sales tax rates change all the time.)

The same goes for miscellaneous Awk utility functions that you might have.  For example, Awk string comparison is binary; there's no case-insensitive version of it.  So at some point, you might wish to write an Awk version of C's stricmp (tolower and toupper are built-in functions).
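A stricmp-style function might be sketched like this.  The name Str_ICmp is hypothetical, and the sketch simply lowercases both strings with the built-in tolower before comparing them.

```awk
# Case-insensitive compare, in the style of C's stricmp: returns
# -1, 0, or 1.  (Str_ICmp is a hypothetical name.)
function Str_ICmp(Str1, Str2) {
         Str1 = tolower(Str1) "";
         Str2 = tolower(Str2) "";
         if (Str1 < Str2) return -1;
         if (Str1 > Str2) return  1;
         return 0;
}

BEGIN {
       print Str_ICmp("Smith", "SMITH");   # 0
       print Str_ICmp("adams", "Baker");   # -1
       print Str_ICmp("Zed",   "apple");   # 1
}
```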

GAwk actually allows multiple Awk script files . . . they're simply read in one after another (in the sequence that they're coded on the command line). For example:


GAwk -f Script1.Awk -f Script2.Awk <InFile.Dat >OutFile.Dat

ADB takes advantage of this Awk flexibility by letting you code the "ADBAWKLIB" variable.  Set it equal to the file name that holds all your "global" Awk functions.

E.g., you code the following in your DOS Batch files:


Set ADBAWKLIB=GlobAwks.Awk

For reasons that should be obvious, this is the only situation in which ADB doesn't require you to use the project file suffix as part of a filename's extension (i.e. the part after the period).

You may even wish to put this Set statement in a computer's AutoExec.Bat file (although this might create a "booby trap" for someone who later tries to port your code to another computer).


8:  Using the ADBAWKLIB environmental variable for global data

(If you're not interested in ADB, you can skip this section.)

The ADBAWKLIB variable is also an excellent way to import global data into an application.  You can have a function in your "ADBAWKLIB" file like this:


function Define_Appl_Globals() {
         APPL_GLOBAL_Var1 = "whatever";
         APPL_GLOBAL_Var2 = 10;
}

You just need to make sure that you always define the ADB_BOJ function like so:


function ADB_BOJ(){
         Define_Appl_Globals();  # Set application globals
         ...                     # Maybe do some other stuff
}

If you're especially lazy (and don't care much about efficiency), you can just code the ADB_Main function in a similar way.  The globals will be reset for each record you read, but that may not matter very much.


9:  More on Importing Global Application Data

(If you're not interested in ADB, you can skip this section.)

I just described one way to get global application data into ADB, via "ADBAWKLIB".

Note how much more elegant and efficient this is, in contrast to setting up a special table in a DBMS and always making sure to join that table.

What if you need to import global variables "on the fly?"

The best way to do that is via DOS environmental variables.  Of course there are many restrictions in the type of data that you can move.  For example, you can't move hexadecimal data (although this can be converted with GAwk's sprintf construct), nor can you use fields that have characters that DOS doesn't "like" to see in batch files (e.g. the less-than or greater-than signs).

Nevertheless, typical application-dependent dynamic values tend to be either strings or numbers (flags fall into this category).

You can assign a DOS environmental variable simply by saying:


Set VARNAME=VALUE

(Note that DOS environmental variables are case-sensitive, just as Awk variables are.  These are the only case-sensitive aspects of ADB.)

To incorporate assignments of DOS environmental variables into a batch file, you simply place the assignment statements in another DOS batch file, and you code:


Call SetGlobs.Bat

in another file.  Now all those assignments have been made.


WARNING:
in many other places in this documentation as well as the on-line help, I tell you to put the following in the machine's C:\Config.Sys file:


SHELL = C:\COMMAND.COM C:\ /e:4096 /p

(For NT, you need to put this line in ...\SYSTEM32\CONFIG.NT:)


SHELL=%SYSTEMROOT%\SYSTEM32\COMMAND.COM /E:4096

This ensures a DOS environmental size of 4K, which should be enough for most of your applications.  If you need more space, you should jack up the value accordingly.  ADB needs only about 400K of free DOS memory to execute.

If you need to know how much free DOS memory you have, go to the DOS prompt and type "CHKDSK".  You'll see something like this at the end:


655,360 total bytes memory
558,352 bytes free


WARNING:
I also tell you in other places that you need to have the following line in C:\AutoExec.Bat:


Set COPYCMD=/Y

This prevents DOS from prompting you every time you replace an existing copy of a file with a new one.


So let's assume that you've built the code that creates a DOS batch file to assign your global application data.  You've called that DOS batch file, prior to running the other code in your ADB application.

How do you access the value of the environmental variable, within Awk?

The variable ENVIRON is an Awk array, indexed by the name of the environmental variable.

So, for example, this statement will access the value of the MONTHLY_UPDATE variable:


if (ENVIRON["MONTHLY_UPDATE"] == "Y") { ... } # Do monthly update
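Since a batch file might forget to set a variable, it can be handy to wrap the lookup in a function that supplies a default.  This is just a sketch: Get_Env is a hypothetical helper, and the default of "N" is an assumption made for the example.

```awk
# Returns the value of an environment variable, or Default if it
# isn't set.  (Get_Env is a hypothetical helper name.)
function Get_Env(Name, Default) {
         if (Name in ENVIRON) return ENVIRON[Name];
         return Default;
}

BEGIN {
       Monthly = Get_Env("MONTHLY_UPDATE", "N");
       if (Monthly == "Y") print "Doing the monthly update";
       else                print "Skipping the monthly update";
}
```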

IMO, the fact that ADB can access global application data in two convenient ways (both of which rely on ASC text files that can be easily generated or maintained) . . . makes it vastly more convenient in this regard than a DBMS which forces you to define a special table for your globals.

However, if you must have a form that allows the users to enter values, you can easily enough write an Awk script which reads those exported values (export them in commas-n'-quotes and use ADB).

That script can then generate a function (similar to the one we saw in the last section, i.e. a function to set your global values).  Or it could output a DOS batch file to do the same thing.
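To show the general idea of one Awk script writing Awk code for another, here's a minimal sketch that builds the text of a function setting a single global.  Make_Globals_Func is a hypothetical name, and the sketch assumes the value contains no double quotes or backslashes.

```awk
# Builds the text of an Awk function that assigns one global.
# (Make_Globals_Func is a hypothetical name; the output mimics the
# Define_Appl_Globals function shown earlier.)
function Make_Globals_Func(Name, Value) {
         return "function Define_Appl_Globals() {\n" \
                "         " Name " = \"" Value "\";\n" \
                "}";
}

BEGIN {
       print Make_Globals_Func("APPL_GLOBAL_Var1", "whatever");
}
```

You'd redirect the output to the file that you've named in ADBAWKLIB (or to a file that gets copied into it).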


10:  Exporting Global Application Data

What if you need your Awk script to compute some value, which you must make available to other job steps? For example, the total accounts receivable balance for this month, which will then get shoved into your general ledger update?

This is trivial.  You simply output a DOS batch file which will then be CALLed in a later step.

That DOS batch file consists of a single SET statement, e.g.:


Set AR_TOTAL=104328.43
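A line like that can be built with sprintf.  In this sketch, Make_Set_Line is a hypothetical function name, and the total is a stand-in value (a real script would accumulate it from the records).

```awk
# Formats one line of a DOS batch file that sets a variable to a
# dollar amount.  (Make_Set_Line is a hypothetical name.)
function Make_Set_Line(Name, Amount) {
         return sprintf("Set %s=%.2f", Name, Amount);
}

BEGIN {
       AR_Total = 104000.00 + 328.43;      # Stand-in for a computed total
       print Make_Set_Line("AR_TOTAL", AR_Total);   # Set AR_TOTAL=104328.43
}
```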

In the following sections, we'll see how to output files from within Awk.

(Again: if you're not comfortable with the [s]printf statement in C, I'll cover that in a later section of this documentation. That's your basic mechanism for formatting strings, either internally, or those which will be output as parts of reports or files.  It's much more powerful than anything offered by JavaScript, although it is present in Java.  Regrettably, it's less powerful than C++'s cout.)


11:  GAwk's "system" command (another way to output global data)

GNU Awk has an excellent feature (the "system" function, which is also part of standard POSIX Awk) that lets you invoke the DOS command processor (command.com) from _within_ GNU Awk.

For example, this Awk script implements the obvious:


BEGIN {system("Echo Jello, World!>Jello.Dat");}
{;}
END {;}

If you saved that into the file Jello.Awk, then you can invoke that with:


GAwk -f Jello.Awk <NUL

Jello.Dat is now:


Jello, World!

(The special device file "NUL" in DOS refers to an empty file. It's the equivalent of Unix's "/dev/null".)

You can use this method to print error messages or to print the value of fields when debugging Awk scripts.  You can also use it to output the value of global data that has to be used in a later step (see the previous section).

You can also use this method to output multiple output files if you wish.  Note that the ">" directive in DOS refers to overwriting a file with a new file, whereas ">>" will append to the file (if it already exists).

For example:


BEGIN {system("Echo Jello, World!>Hello.Dat");
       system("Echo A mellow herd.>>Hello.Dat");
      }
{;}
END {;}

Hello.Dat is now:


Jello, World!
A mellow herd.

Note that the DOS Echo command isn't a particularly efficient way of building files.  The operating system has to open and close the file for each such invocation.

I'll describe a more efficient way to output multiple files later on in this section.


12:  How to write debugging messages when testing Awk scripts

(If you're not interested in ADB, you can skip this section.)

Although Awk has no built-in debugger, ADB provides a special interface that lets you write messages to a special "debugging" file.

This file is named ADBDbg_A.xx (where "xx" is your project file suffix.)

To write a message to it, you call the function ADB_DebugFnc("Debug Message") (where "Debug Message" is the string that describes your debugging message).

This file is erased prior to each run of ADB, so you don't have to worry about its size.

The debug message printout will also display the current record sequence number.


13:  How to signal errors from within an Awk script, in order to stop the processing "*IMMEDIATELY*"

(If you're not interested in ADB, you can skip this section.)

You can call the function ADB_ErrorFnc("Error Message") to "signal" an error from an Awk script (where "Error Message" is the string that describes your error message).

This will write your error message to the file ADBErr_A.xx (where "xx" is your project file suffix).

This file is erased prior to every run of an Awk script, and if it exists after the Awk script has finished, ADB will print a special message to ADB's log and error files.

This message will alert the reader of those files to the fact that the Awk Script signalled an error, and instruct them to look at ADBErr_A.xx.

You can do the ADB_ErrorFnc("Error Message") at any time, either in your ADB_BOJ function, within your ADB_Main function, or within your ADB_EOJ function.

You can't write an error message longer than 70 characters: if you do, it will be truncated.  This is to keep the DOS command line from exceeding the maximum of 128.

The error message printout will also display the current record sequence number. &nbps;You can use ADB's log file to discover the contents of that record, since ADB will print out the execution line.

At most one additional record will be read after you've done the ADB_ErrorFnc("Error Message") call.

This ensures that the processing won't continue "blindly" when a large file is being read.

And since the ADB error file (ADBErr.xx) has been created, ADB won't run again, until this file is deleted.

(Or, more precisely: this is treated as a "job bombing" error.)
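Here's a small sketch of signalling an error from ADB_Main.  The ADB_ErrorFnc below is only a stand-in so that the example runs under plain Awk (ADB supplies the real one), Signal_Error is a hypothetical wrapper, and the check on a negative invoice amount is an invented rule for illustration.  The wrapper cuts the message to 70 characters itself, so you decide what gets kept rather than leaving it to the truncation.

```awk
# Stand-in for the function ADB supplies (assumption: delete this line under ADB).
function ADB_ErrorFnc(Msg) { Last_Error = Msg; print Msg }

# Hypothetical wrapper: keep only the first 70 characters of the message.
function Signal_Error(Msg) { ADB_ErrorFnc(substr(Msg, 1, 70)) }

function ADB_Main() {
         if (Invoice_Amt__ < 0)
                  Signal_Error("Negative invoice amount for customer " Cust_No);
}

# Simulate one bad record outside of ADB.
BEGIN { Cust_No = "1111"; Invoice_Amt__ = -5; ADB_Main() }
```

Since only the first 70 characters survive, it pays to put the most important part of the message first.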


14:  Multiple output files with ADB_BldOutLineFnc and FileMux.Exe

(If you're not interested in ADB, you can skip this section.)

(This is the most efficient way to create multiple output files when there are a lot of records in each file.)

FileMux.Exe is a utility that I wrote for taking a single input file and dividing the records into mutually exclusive and collectively exhaustive subsets.

(In simple language, it splits the records in the file up into groups :-)

FileMux.Exe can split records up based on the data in the input records.

Let's see a quick example.  Suppose MyFile.Inp contains these records:


A This is record 1
B This is record 2
A This is record 3

Now suppose you run FileMux like so (constants are in capital letters):


FileMux -INmyfile.inp -PREFout -EXT.dat -COL1 -LEN1

That says to split MyFile.Inp up into different parts, based on the data in column 1.  Two output files will be created: "OutA.Dat" and "OutB.Dat". "OutA.Dat" will have records 1 and 3. "OutB.Dat" will have record 2.

FileMux's default is to keep the field that you use to split the records up (in this case, the field beginning at column 1, which goes for 1 character).

If you want to remove that field, code the "-NOKEEP" argument.
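The splitting rule itself can be sketched in a few lines of Awk.  This only illustrates the behavior described above; it isn't FileMux's actual code, and the function names Mux_File and Mux_Record are made up for the example.

```awk
# Name of the output file for one record: prefix + key field + extension.
function Mux_File(Line, Col, Len, Pref, Ext) {
         return Pref substr(Line, Col, Len) Ext;
}

# The record as it would be written: unchanged when the key field is kept
# (the default), or with the key field cut out (the "-NOKEEP" behavior).
function Mux_Record(Line, Col, Len, Keep) {
         if (Keep) return Line;
         return substr(Line, 1, Col - 1) substr(Line, Col + Len);
}

BEGIN {
         Line = "A This is record 1";
         print Mux_File(Line, 1, 1, "Out", ".Dat");        # OutA.Dat
         printf("[%s]\n", Mux_Record(Line, 1, 1, 0));      # [ This is record 1]
}
```

A complete splitter would then just write each record to its file, along the lines of print Mux_Record(...) > Mux_File(...).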


(FileMux, like all the utilities supplied with ADB, prints its command line arguments when you run it from the DOS prompt with no arguments.  Note also that FileMux's command line parameter names are case-INsensitive, like all ADB's utilities.)

If you have a background in computer science (or in hardware), you'll recognize the term "Mux" as short for "multiplexor".  Strictly speaking, FileMux behaves like a demultiplexor: a hardware device that has one input and multiple outputs.  Based on a specific set of bits in an incoming word (AKA a set of bytes), a demultiplexor will "route" the rest of the word to a specific output.

(If the above paragraph made no sense to you, just ignore it.)


How do you use ADB in tandem with FileMux?

The best way is to set up a function that will "prefix" output records with "codes" that determine which output file the record is going to.

Let's return to our example of customer invoices.

Here are the ADB definitions for that record format:


FLD  Cust_No       4        # Customer # (treat as string)
FLD  Invoice_Amt__ 8 2      # Invoice amount (fixed-decimal)
FLD  Cust_State    2        # 2-char state code
FLD  Err_Flag      1        # Error flag (0 for no error)
FSET Cust_Srt; Cust_No END  # Sort on customer#
FSET Invoice_File           # Invoice file
     Cust_No; Invoice_Amt__;
     Cust_State; Err_Flag  END

As you recall, the records looked like this, with the exception of the last one:


1111   10.00CA0
1111   15.00CA0
2222   25.00NY0
2222   10.00NY0
3333   20.00IN0

(This new last record is for a customer in IN=Indiana, and it will turn out that Indiana isn't in our sales tax rate table.)

We're going to write the code which will split customers up, based on their state.  Customers from CA will go into one file, those from NY into another file, and those with unrecognized state codes into a third file (these are errors, since all customers are currently located only in NY and CA.)

For convenience, I'm going to write a little utility function called Mux_Out.

Mux_Out will be called with two arguments: an output file prefix, and the line which is to be outputted.


function Mux_Out (OutFilePref, Data) { # Route to proper file
printf("%s%s\n", OutFilePref, Data);}

Here's our old familiar friend, the function that defines sales tax rates:


function Define_Sales_Tax_Rate_Table () {
         Sales_Tax_By_State["CA"] = 0.085;
         Sales_Tax_By_State["NY"] = 0.0925;
}

Since both of these functions are useful for other jobs in our application, we'll toss them into the file GlobFncs.Awk (this is where our global application functions are going).

Let's not forget that the first two lines of our DOS batch file will define our project file suffix and our "ADBAWKLIB" file:


Set ADBPROJ=xx
Set ADBAWKLIB=GlobFncs.Awk

To keep things simple, I'm just going to be concerned with splitting the customers into categories by state.

Let's assume that the records are in the physical file called "invoice.xx" (remember, "invoice" is the name of the file, and "xx" is our project file suffix.)


We have three types of work to do now.

First, let's finish off the ADB definitions.  We need to define our input file and our output file:


TRANS Invoice_File Cust_Srt SORT
T-M   Invoice_File AWK END

Note that the TRANS statement defines the input invoice records.  The "T-M" statement is for unmatched transactions (i.e. all of them, since there is no master file for this run).

Let's assume that the output file for this run is "invsplit.xx"

So the command line for ADB is going to look like this:


ADB -TRANSinvoice -T-Minvsplit

Once we define the script for the output Awk file, we'll be done.

Since this is the output Awk file script for the file "invsplit.xx", its name will be "invsplit.xxA" (same name, but with an "A" on the end of the extension).

First, we'd better not forget to initialize the state sales tax rate table:


function ADB_BOJ() {Define_Sales_Tax_Rate_Table();}

Now we have to define ADB_Main, which is the function that will be run for each record:


function ADB_Main(  File_Code) {
         if (Cust_State in Sales_Tax_By_State) File_Code = Cust_State;
         else File_Code = "ZZ";  # A bad state code

         Mux_Out(File_Code, ADB_BldOutLineFnc());
}

File_Code is a local variable.  This is going to be the 2-character state code, for all recognized states.

If a state isn't recognized, we're going to use a File_Code of "ZZ".

The third line in this function actually calls Mux_Out.
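A note on how File_Code gets to be local: Awk has no local-variable declarations.  Any extra name in a function's parameter list that the caller doesn't pass acts as a local variable, and the extra spaces in front of it in the header are just the usual convention for marking it.  A minimal illustration (the function name Bump is made up):

```awk
# Total is a global; i is local to Bump because it's listed as an extra parameter.
function Bump(  i) {
         i = 99;          # changes only the local copy
         Total++;         # changes the global
}

BEGIN {
         i = 10; Total = 0;
         Bump();
         print i, Total;  # 10 1
}
```

Had File_Code not been listed in ADB_Main's header, it would have been a global, visible to (and changeable by) every other function in the run.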

What does ADB_BldOutLineFnc do? It actually builds the output line in the very same format as the output file.

Since the output file format is the same as the input file format, ADB_BldOutLineFnc returns a string which is identical to the input record.

This frees you from having to think about the sequence of fields, their data types, their lengths, their offsets, etc.  Just as any good DBMS product should.

So what does "invsplit.xx" look like? This is the output file that we're creating:


CA1111   10.00CA0
CA1111   15.00CA0
NY2222   25.00NY0
NY2222   10.00NY0
ZZ3333   20.00IN0

Hmmm.  Looks like we haven't done much work.  The state code is now replicated in the first two bytes of the record, right?

Not quite.

The last record has "ZZ" in the first two bytes.  That's because "IN" isn't a recognized state code (i.e. it's not in our sales tax rate table.)


Finally, we run FileMux.Exe:


FileMux -INinvsplit.xx -COL1 -LEN2 -NOKEEP -PREFInv -EXTxx

This is going to create 3 files:


InvNY.xx:
2222   25.00NY0
2222   10.00NY0


InvCA.xx:
1111   10.00CA0
1111   15.00CA0


InvZZ.xx:
3333   20.00IN0

FileMux, a function similar to Mux_Out, and ADB's built-in function ADB_BldOutLineFnc can all work together to let you split an output file into pieces, and/or to generate several output files with different layouts and/or reports from the same ADB run.

Note that ADB's built-in function ADB_WriteFnc, which we saw earlier, just looks like this:


function ADB_WriteFnc() {printf("%s\n", ADB_BldOutLineFnc());}

It's ADB_BldOutLineFnc which does the hard work of assembling all the fields in the output record, and ensuring that they're outputted with the proper lengths and offsets.


Note that ADB won't stop you from modifying any of the input fields to be longer than they should be.  Future releases may address this potential problem area.

In a "real" application, we might want to "bomb the job," i.e. to print some special message if the file InvZZ.xx exists.  We can do that with a simple DOS batch file statement:


If Exist InvZZ.xx ErrMsg BAD STATE CODES WERE FOUND!
If ErrorLevel 1 Goto ABORT_JOB_LABEL

These two statements in the DOS batch file would immediately follow the invocation of FileMux.  Presumably, ABORT_JOB_LABEL would contain a helpful message that tells the user what to do when things haven't run properly.


ErrMsg.Exe is another utility that comes with ADB.  It prints all of its command line arguments and returns 1.

You can run it with no arguments from the DOS prompt for execution examples.

ErrMsg saves you from writing:


If ErrorLevel 1 Echo BAD STATE CODES WERE FOUND!
If ErrorLevel 1 Pause
If ErrorLevel 1 Goto ABORT_JOB_LABEL

Saving one line in a batch file might not seem like much, but if you have a dozen steps in an ADB process, it can help make the file a bit more elegant.


15:  Testing your Awk scripts

(If you're not interested in ADB, you can skip this section.)

The whole point of ADB is that you can structure your larger applications as a series of two-input-file operations.

While that may often involve many processing steps, ADB frees you from the idiosyncrasies of SQL (or a particular DBMS's "shell" or programming language), and lets you break the process down into very simple components.

This idea of building application systems up with tiny pieces is familiar to Unix programmers, but seems to be less popular in the "WinTel" world (where massive compiled executables seem to be the norm).

And just as you build your ADB applications in a series of small, relatively insignificant processing steps, you can also build your Awk scripts up by incrementally increasing the complexity of the processing.

Regardless of whether you have an input file Awk script or an output file script, you can get at it for testing simply by coding a deliberate syntax error, which forces ADB to stop at that step. The following line of code:


a=*/;

is guaranteed to cause an error in Awk.

Once ADB has stopped on the error, go to the log file ("ADBLog.xx", where "xx" is your project file suffix).

ADB will show you the GAwk execution line very near the end of the log file.  Chances are, it will reference temporary files that ADB created.

For example:


GAwk -f invsplit.xxA -f $ADBAbaa.xx$ -f globawk.awk
     <ADB_daa.xx$ >invsplit.xx

(This line has been split into two physical lines for display purposes only.)

Note that the "-f" parms in GAwk use Unix conventions (there must be a space between a command line argument and its value).  The three Awk files here are the user-coded one, ADB's generated Awk file, and any global "ADBAWKLIB" file that you may have used.

The input file (preceded by the less-than sign) is a temporary file which ADB may have created after (say) sorting an input file or creating an output file.

So you just cut that line out and paste it into a temporary DOS batch file.  You can then use that DOS batch file to test your Awk script.  When the output looks good, you can continue the development process.

The same goes for tracking down bugs in a (supposedly) "finished" system.  If you think that an Awk script is the source of the problem, you can modify the DOS batch file so that the ADB step which uses that script is the last step.  Then put an error in the Awk script which will force ADB to stop.  Now the command line will be there and you can test with it (as well as all the temporary files you need).

When you're ready to get rid of the temporary files, just delete: $*.xx$ (where "xx" is the project file suffix.)


16:  Adding "phantom" (calculation result) fields to input files

(If you're not interested in ADB, you can skip this section.)

Most DBMSs let you create fields that are a result of a calculation.

You can simulate that same behavior with ADB.

Let's suppose that we have the same layout that we were working with before:


FLD  Cust_No       4        # Customer # (treat as string)
FLD  Invoice_Amt__ 8 2      # Invoice amount (fixed-decimal)
FLD  Cust_State    2        # 2-char state code
FLD  Err_Flag      1        # Error flag (0 for no error)
FSET Cust_Srt; Cust_No END  # Sort on customer#
FSET Invoice_File           # Invoice file
     Cust_No; Invoice_Amt__;
     Cust_State; Err_Flag  END

Suppose we want to compute a 5% discount amount which will be applied only to customers in California.  We want this field to be part of the input record.  (Normally you would do this as part of an output Awk script, but there are situations in which it might be more convenient as "part" of the input file).

First, define the new field and field set:


FLD  Invoice_Discount__ 8 2
FSET Invoices_With_Discount @Invoice_File; Invoice_Discount__ END

This second field set would be the _actual_ fieldset that you would use for your input file.  Note that the "@Invoice_File" simply merges in all the fields that are part of the invoice file's field set.

E.g.:


TRANS Invoices_With_Discount Cust_Srt PREP AWK

Notice that I specified "PREP" as the type of preprocessing. This ensures that "basic preprocessing" will occur.  That means that fields get aligned.  But it also means that short records will be padded with spaces, *IF* you additionally code "-LAXPRE" on the command line.

And that's all there is to it.

Keep in mind that "-LAXPRE" can be dangerous.  If you accidentally get the wrong input file, and that file is shorter than the one you really want to use, ADB won't be able to tell the difference.

This is why ADB always prints a warning message whenever you code "-LAXPRE".  It will also return a code of 1 (instead of 0), because warning messages were printed.
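The input Awk script that fills in the new field isn't shown above, so here is a minimal sketch of what it might look like.  It assumes that the fields are visible as Awk variables under their FLD names and that assigning to Invoice_Discount__ before writing is what sets the phantom field.  The ADB_WriteFnc below is only a stand-in so that the example runs under plain Awk (ADB supplies the real one), and Discount_For is a made-up helper.

```awk
# Stand-in for the function ADB supplies (assumption: delete this line under ADB).
function ADB_WriteFnc() { printf("%s %s %.2f\n", Cust_No, Cust_State, Invoice_Discount__) }

# Hypothetical helper: 5% of the invoice amount for California, zero otherwise.
function Discount_For(State, Amt) {
         return (State == "CA") ? Amt * 0.05 : 0;
}

function ADB_Main() {
         Invoice_Discount__ = Discount_For(Cust_State, Invoice_Amt__);
         ADB_WriteFnc();
}

# Simulate one record outside of ADB.
BEGIN { Cust_No = "1111"; Cust_State = "CA"; Invoice_Amt__ = 10.00; ADB_Main() }   # 1111 CA 0.50
```

Keeping the calculation in its own function means the same rule can go into your ADBAWKLIB file and be shared by other jobs.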


17:  Index of ADB's Awk Interface functions

(If you're not interested in ADB, you can skip this section.)

Here's an index of ADB's functions that are used in the Awk interface.

You need only write one such function (the first one, ADB_Main).

You may write the two optional functions, ADB_BOJ, and ADB_EOJ, if you wish.

All other functions are automatically supplied for you, by ADB itself.

Note that no records will actually be outputted if your ADB_Main function doesn't actually output them, either by calling ADB_WriteFnc, or by printing them to the output file with printf.

Most functions take no arguments (and are so indicated).  The exceptions that take one string argument are shown that way as well.
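To show how the three user-written functions fit together, here is a small sketch that writes only the error-free records and reports how many were dropped.  The ADB_WriteFnc and ADB_DebugFnc below are stand-ins so that the example runs under plain Awk (ADB supplies the real ones), the BEGIN block merely imitates the order in which the functions get called, and the filtering rule on Err_Flag is invented for illustration.

```awk
# Stand-ins for the functions ADB supplies (assumption: delete these under ADB).
function ADB_WriteFnc()    { Written++ }
function ADB_DebugFnc(Msg) { print Msg }

# Called once, before any records are processed.
function ADB_BOJ() { Skipped = 0 }

# Called for each record: a record that isn't written here is simply dropped.
function ADB_Main() {
         if (Err_Flag == 0) ADB_WriteFnc();
         else Skipped++;
}

# Called once, after all records have been read.
function ADB_EOJ() { ADB_DebugFnc("Records skipped: " Skipped) }

# Imitate a two-record run outside of ADB.
BEGIN {
         ADB_BOJ();
         Err_Flag = 0; ADB_Main();
         Err_Flag = 1; ADB_Main();
         ADB_EOJ();                    # Records skipped: 1
}
```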

Awk Function | Status | Purpose
ADB_Main() | You must supply | Called to process each input file record.  To write the record out, you need to either call ADB_WriteFnc() or use printf.
ADB_BOJ() | You may supply | Called at the beginning, before any records are processed.  You can code this function to initialize and/or read in global data.
ADB_EOJ() | You may supply | Called at the end, after all records have been read.  You can code this function to write global data, or print the final totals for reports, etc.
ADB_WriteFnc() | Supplied | Call this function with no arguments to write the current record.
ADB_BldOutLineFnc() | Supplied | If you call this function with no arguments, it will return the output line which corresponds to the current record.  The record will be properly built in fixed-length format.  The return value (a string) lacks a line feed at the end.
ADB_ErrorFnc("Msg") | Supplied | Call this function with a string argument (e.g. an error message) to stop the processing.  At most one more input record will be read.  The string argument that you pass will be written to ADBErr_A.xx.  ADB will treat this as a catastrophic error, and refer to this file in its own log and error files.
ADB_DebugFnc("Msg") | Supplied | Call this function with a string argument to help you debug your Awk scripts.  The argument you pass will be written to ADBDbg_A.xx.  ADB erases this file at the beginning of every run (but doesn't do so before calling each Awk script involved in a run).  So you don't have to worry about that file becoming too large.
ADB_StripRtSp("String") | Supplied | ADB builds and uses this function in order to strip trailing spaces off of string and special fields.  Feel free to use it for your own purposes.


Of course you could shorten these function names if you wished, simply by putting "wrappers" in your ADBAWKLIB file.

For example, to "wrap" ADB_DebugFnc, you might write something like this:


function Debug(s) {ADB_DebugFnc(s);}

To "wrap" ADB_WriteFnc, and ADB_BldOutLineFnc, you might code:


function Wrt()     {ADB_WriteFnc();}
function OutLine() {return(ADB_BldOutLineFnc());}
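ADB_StripRtSp, listed in the table above, is described only by what it does.  If you want the same effect in a script that runs outside of ADB, a function along these lines would do it.  This is my own sketch under the name Strip_Rt_Sp (made up for the example); it isn't the code that ADB generates.

```awk
# Return s with any trailing spaces removed.  Awk passes strings by value,
# so the caller's variable is left untouched.
function Strip_Rt_Sp(s) {
         sub(/ +$/, "", s);
         return s;
}

BEGIN { printf("[%s]\n", Strip_Rt_Sp("CA   ")) }   # [CA]
```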
