5. Input Files and Preprocessing

This section covers the syntax for the statements that define input files.

It also explains a great deal about ADB's "preprocessing" cycle, which can consist of:

  • Conversion from commas-n'-quotes

  • "Basic Preprocessing"

  • Sorting and/or Duplicate Removal

  • Applying an Awk Script

Feel free to mail your comments to me at: rog@NOSPAM_rs-freeware.org.

Return to the global table of contents
Frames mode, or No frames


1:  Input File Definition Syntax

2:  Excel conversion

3:  Input File commas-n'-quotes conversion

4:  Input File "Basic Preprocessing"

5:  Input preprocessing & record lengths

6:  Input preprocessing & "flexible" fields

7:  Input preprocessing & Sorting

8:  Input preprocessing & Awk Scripts

9:  Awk Scripts and "small" tables

10:  Input Preprocessiong: the Sort Criterion

11:  The Sort Criterion with just one file

12:  Input Files on the Command Line


1:  Input File Definition Syntax

Here's the syntax for a master file:

MAST  fieldset  sort_fieldset  (NONE | PREP | SORT | NODUPS)  {AWK}  {CQ}  {END}

(Let me remind you again that the concepts of "master" and "transaction" file are merely abstractions.  ADB takes two input files: one is called the "master file" and the other, the "transaction file."  They aren't perfectly symettric, and I'll describe the differences below.)

That first four items here are required (select exactly one from the parenthesized group).

The last three are optional, but must occur in that sequence.

End has no effect whatsover: the only reason I allow it is that the output file syntax does require an End.  So I found myself adding it to input files . . . and I realized that this was an easy and natural error to make.

Note that the above description is rather sloppy: by "fieldset", I really mean "fieldset_name," and by "sort_fieldset," I mean "sort_fieldset_name".  I used the shorter descriptions in order to keep the physical line from being too long :-)


ADB reads definitions sequentially, on a "one pass" basis.  That means that any fieldsets you name in the I/O file definitions must have been defined earlier.  Remember that the global definitions file is read before any step defintions file.

For single-step runs, you may have a global definitions file (e.g. "ADB.xx", where "xx" is the project file suffix), but no step definitions file. &nbps;For typical jobstreams, the I/O statements are generally in the step definitions file.

The syntax for the TRANS statement is identical.


Note that the MAST file's key can be shorter than the TRANS file's key, but not the other way around.

If the MAST's file's key is shorter than the TRANS file's key, then "matches" will be determined (case-insensitively) based on the shorter key length.

As I mentioned in the preview, this is how you can associate one record (in the MAST file) with a series of records (in the TRANS file).

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

2:  Excel conversion

There's currently no "fully automated" way to get ADB to process an Excel file.

Instead, you have to use Control-A to select all the records, paste the file into Windows Wordpad, and then run the special converter Excel.Awk to get the Excel file into commas-n'-quotes format.  See Excel.Awk for more detailed instructions.

Because Excel's proprietary format "scatters" the data, there's also no way to get it back into Excel (aside from cutting-and-pasting).

However, Excel is compatible with Access as well as other standard DBMSs, and they will import and export data using the commas-n'-quotes standard.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

3:  Input File commas-n'-quotes conversion

If CQ is specified, input files are converted from commas-n'-quotes format, according to the rules described in the previous section.

This will always happen before anything else, since ADB's "native mode" deals with fixed-length records.  This is why I often refer to commas-n'-quotes conversion (for input files) as "importation."  Likewise, commas-n'-quotes conversion for output files might be referred to as "exporting."

Remember, string fields must have double quotes around them when they're imported in commas-n'-quotes format, although integer, fixed decimal, and "special" fields need not.

If any field is longer than expected, the commas-n'-quotes phase will abort the processing, and you'll see a fairly descriptive error message.  The same goes for situations in which the input record doesn't have the correct number of fields.

commas-n'-quotes conversion is done by the utility DBCvt.

If it fails, the log file (ADBLOG.xx, where "xx" is the project file suffix) will contain the command line to invoke DBCvt.  You can then correct the error and run that command line at the DOS prompt, in order to make sure that there are no similar errors in the file (if you wish).


If you want to know how to use DBCvt by itself, you can run it from the DOS command line with no arguments.

DBCvt is an excellent tool for validating the correctness of commas-n'-quotes format.  Although ADB itself handles the conversion from fixed-length format to commas-n'-quotes format, DBCvt can go both ways (like all utilities in this package, DBCvt's quite useful on its own terms.)


After commas-n'-quotes conversion is finished (if specified), ADB generally insists that every record in the file be of the exact length that the fieldset for the file would imply.  (See the next subsection for more details.)

You can run ADB with just the "-P" command line parm, if you want to know how long ADB "thinks" a file format should be.

ADB will then "pretty print" the definitions into the log file.

You can use the "-P" argument for producing "clean" versions of file layouts . . . which can then be used as the basis for applying a compiled programming language or any other tool to ADB's fixed-length files.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

4:  Input File "Basic Preprocessing"

After any requested commas-n'-quotes conversion, ADB will process the fourth item in the MAST or TRANS statment.

This item must be one of:
"NONE", "PREP", "SORT", or "NODUPS".

If you specify "NONE", ADB will do nothing at all (other than to validate that the total file length is consistent with the record size).

If you specify "PREP", ADB will analyze each record and verify that it's the correct length.  ADB will also remove any control-Z (DOS ASC EOF) which is at the end of the file.

Windows editors don't use the control-Z anymore, but that some old-fashioned DOS command line editors (including a few that are much easier and faster to use than anything more modern :-) will always terminate an ASC text file with a control-Z.

So you do need to specify "PREP" if you think that the file has been manually edited in its fixed-length format.  (If the file has been manually edited when it's in commas-n'-quotes format, then that's another story.)

"PREP" will also re-align fields.  That means that string and "special" fields will have spaces and tabs stripped on the left and right, and then they'll be aligned so that the first byte of data (i.e. not a blank or a tab) appears in the leftmost position (with spaces padded on the left).  The same thing happens for integer and fixed decimal fields, except that they're left-aligned.

Re-allignment is particularly important if you plan to use an integer or a fixed decimal field as part of the sort key, and you suspect that such fields may not be properly aligned in the input data.


Keep in mind that sort comparisons are done according to the "true" ASCII case-INsensitive sequence.  So you need to be very careful if you want to include fixed-decimal or integer fields in the sort key.


What if you want to be absolutely certain that the format of integer and/or fixed decimal fields is appropriate for sorting?

In that case, I suggest that you run a prelimary "massaging" step in which you use an Awk script to rewrite the input file.  Because awk's printf statement uses uniform conventions, the sort sequence should come out properly.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

5:  Input preprocessing & record lengths

If you expect that the file will be edited manually (when it's not in commas-n'-quotes format), then you shouldn't code "NONE" . . . because most ASC text editors won't pad spaces on the end of a line (without the user going to at least some trouble).

What that means is that you could end up with "short" records.

To combat this problem, you code any of "PREP", "SORT", or "NODUPS" along with the command line parm of "-LAXPRE".

"-LAXPRE" is short for "reLAXed basic PREprocessing."

Essentially it means that ADB will pad "short" lines with spaces.

The cardinal rule to remember is that:
ADB has no hidden data structures.

So the only way that ADB "knows" that a particular FSET (file structure) corresponds to a specific physical file is by examining the length of each record.  (At least if the file isn't in commas-n'-quotes format.  If it is, then the number of fields and the field lengths will serve as its validation method.)

That means that you do need to keep track of which file is which.

However it's not that likely that two files which are stored in fixed-length format and which have the same length could be interchanged.  Nor is it too likely that two files stored in commas-n'-quotes format with compatible field lengths could be interchanged.

Nevertheless, the above scenarios are still possible.  That's the price one pays for the "openness" and "simplicity" of ADB.

If ADB manitained a "file table" or any other data that served to identify whether a physical file had a particular strucure, then ADB would be vulnerable to the type of "overhead data" corruption that can make a traditional database "unopenable." &nbps;ADB would also be a lot harder to use in batch mode.

WARNING:
-LAXPRE won't have the effect you desire if you've coded "NONE" for preprocessing.  You have to have one of the other three values ("PREP", "SORT" or "NODUPS").

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

6:  Input preprocessing & "flexible" fields

There's also one other "trick" that you can do with the "-LAXPRE" (the command line argument).

Suppose a file is stored in fixed-length format, and you wish to append fields to the end of the record.  You can simply ask for "PREP," "SORT," or "NODUPS" in the input file specification, and pretend hat the extra fields at end-of-record are there.

Although ADB will issue a warning message about the number of short records that were padded with spaces, the processing will continue merrily on its way.

(Warnings result in a return code of 1; catastrophic errors in a return code of 2.)


Finally, because ADB concerns itself only with overall record lengths (at least once a file is in fixed-length format), you can seamlessly concatenate several contiguous fields, or pretend that a single field is actually several distinct continguous fields.

Obviously the data involved has to be carefully defined so that the right result will ensure.  But you can gaurantee the correct format by making sure that this file is outputted via an Awk script.

So, for example, you can "split" a name field that contains a first and last name into two fields by running it through ADB and postprocessing the output with Awk.  Or you can combine a "firstname" and "lastname" field in this way.

To play any of these games, you need to have a thorough understanding of what's said both in this portion of the documentation as well as the next part (which discusses output file post processing).

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

7:  Input preprocessing & Sorting

Whenever you code "SORT" or "NODUPS," ADB will always put the file through "basic" preprocessing first.

That means ADB aligns fields and padding them with spaces (as described earlier).

It may also mean that ADB's padded short records with spaces (if you've coded "-LAXPRE" on the command line).

After those processes are complete, ADB will then invoke Robert Pirko's RPSort.Com (the sort utility) as a subprocess.

The second item on the MAST or TRANS statement (the sort fieldset) specifies the exact parameters of the sort.

As with all three of the executables that ADB calls (DBCvt, RPSort, and GAwk), ADB will write the command line into the log file.

In the (unlikely) event that the sort fails, ADB will stop processing.

You can then re-run the sort from the DOS command line if you wish to get more information.  RPSort's errors are documented in the documentation file that comes with RPSort.Zip (in the install).

As far as I can tell, the only error that's even within the realm of possibility is a "full disk" (no more space).

That's not too terribly likely with today's machines.


If you code "NODUPS", RpSort will remove all-but-the-first dupilcate in a sequence of records with identical keys.

I've reserved the finer points of removing duplicates (within the same file) for a special section devoted to duplicates.

There, I show you how to remove all-but-the-first duplicate, all-but-the-last duplicate, or how to output all records which have any "fellow duplicates."

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

8:  Input preprocessing & Awk Scripts

If you code the AWK item, the last step is to run your Awk script.

I'll cover Awk scripts in more details later on, but there are a few crucial facts I want to emphasize about Awk scripts that are used on input files.


First, these scripts can be used to filter the input records.


Second, you must maintain the input file structure (including the ordering on the sort key).


Third, Awk scripts can help you fill in table values "as if" they were already there in the input records.  (See the next subsection.)

Note that the name of the Awk script is identical to the coresponding file name (as the latter is specified on the command line), except that the Awk script name ends in the project file suffix + "A".

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

9:  Awk Scripts and "small" tables

In the section on Awk, I'll show you how you can use the ADBAWKLIB variable and the Awk preprocessing to augment input records with the values from small tables.

You define the fields corresponding to table values (e.g. a sales tax percentage) as fields on the end of input record (even tho' they're not in fact there).

The assumption is that the table keys (e.g. the state codes which correspond to the sales tax percentages) are actually in the record.

Then you use an Awk function to fill in those values.

This is very convenient and efficient for tables with fewer than (say) 1000 values . . . and with a binary look-up facility, you can extend this to maybe 5,000 or 10,000 values in a table.  For many applications, this may be "neater" than running a entirely separate ADB step to join the table value fields explicitly (i.e. treat the table as if it was the "master" file).

You have to use the "-LAXPRE" command line argument to do this (because the input records will be "short").

Of course if the table values are changing constantly, or the table is gaining and losing entries constantly, then you may be best off with an explicit join.

I'll demonstrate both techniques in a later section.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

10:  Input Preprocessiong: the Sort Criterion

Even if you haven't specifed SORT or NODUPS, the third item in the TRANS or MAST statement (the sort fieldset) is still used to verify the sort sequence of the input file during the "matching" (comparison) process.

Records in a MAST file must occur in ascending sort sequence, with no duplicates.  So by codingNONE or PREP as the fourth item, you're essentially telling ADB that the records are already sorted.

ADB will stop the processing and print an error if it discovers that the sort sequence is violated (the error will point you to the first out-of-sequence record.

The same holds true for TRANS files with just one minor change: duplicates are allowed (but the sequence must still be "ascending").

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

11:  The Sort Criterion with just one file

The sort criterion is enforced even if you just have a MAST file but no TRANS file.

However, if you have a TRANS file by itself, ADB doesn't care if it's unsorted.  This allows you to verify that a single file has no duplicates (define it as a MAST file), or to process a single file with no particular concern for duplicates or sorting (define it as a TRANS file).


What if you want to verify that a file is properly sorted, in the case where duplicates are allowed on a key?

The most obvious way is to sort the file.

But if the file's very large, and you don't want to slow down the processing, then you can create a "dummy" master file of 0 length, with the utility Create0.Exe.

You don't even need to code "-LAXPRE" since this file will have no records in it.  Then create a dummy record layout, with one field that doesn't appear in the trans file (you can use a special field name like "Invalid_Field_Name" to make sure that no file will ever have this field).

Then you just "pretend" that an ordinary mast-versus-transaction comparision is occurring.  If ADB doesn't return an error code of 0, you'll know that the input trans records are indeed properly sorted.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

12:  Input Files on the Command Line

The "-MAST" command-line argument tells ADB which physical file is referenced by your MAST statement.  This file can have an extension (the part after the period at the end), but if so, that must be identical to the project file suffix.

(In practice, there's little reason to code an extension, as this makes it more difficult to copy the code from a DOS batch file intended for one project to a DOS batch file intended for another project.)

For example, if the project file suffix is "xx", and you code "-MASTwhatever" on the command line, then ADB expects that the file named "whatever.xx" will be the "master" file for this run.

Furthermore, there must be a "MAST" statement which identifies the relevant characteristics of this file.

The same is true for the relationship between the "-TRANS" command-line argument and the "TRANS" statement.


It's perfectly legal to have a "MAST" statement, but no "-MAST" parm.  If you do, ADB will limit itself to checking the syntax of the "MAST" statement.  The same goes for the "TRANS" statement and the "-TRANS" parm.

But ADB won't let you use a "-MAST" ("-TRANS") parm without a "MAST" ("TRANS") statement, for reasons that should be fairly obvious.

Return to local table of contents
Return to global table of contents
Frames mode, or No frames

Next documentation section