12. Digging Into Duplicates

This section analyzes duplicate removal and/or detection in ADB.

Few of the techniques described here are likely to be relevant to most applications.

Feel free to mail your comments to me at: rog@NOSPAM_rs-freeware.org.



1:  Eliminating all-but-the-last duplicate.

2:  Eliminating all-but-the-first duplicate.

3:  Printing out all duplicate records


1:  Eliminating all-but-the-last duplicate.

To understand why you might want to eliminate all duplicates other than the last one, consider an example.

Suppose customers are allowed to change their phone numbers from a web page.  Your CGI interface (whatever it may be) will write a very simple file to the server.

This file has a layout of:


FLD Cust_No       4  # Customer number
FLD Update_Time_ 10  # Unix time (seconds since 1970)
FLD Phone_No_    12  # Country code plus 10 digits
FSET Ph_UpDt_Sort1 Cust_No; Update_Time_            END
FSET Ph_UpDt_Sort2 Cust_No                          END
FSET Ph_UpDt_File  Cust_No; Update_Time_; Phone_No_ END

(For simplicity, I've set up this example to use Unix time.  If you have a date field or some other temporal encoding, then you'll have to think about converting it properly, so that a case-insensitive ASC sort places "later" records after "earlier" ones.)

In this case, you definitely want the last duplicate (where "duplicates" are understood as multiple entries for the same customer number).
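
As a minimal sketch of that kind of conversion (written in Python rather than anything ADB-specific), a timestamp can be turned into a zero-padded, 10-character Unix time so that a plain ascending character sort is also a chronological sort.  The width of 10 comes from the Update_Time_ field above; the helper name to_update_time and the choice of UTC are my own assumptions for illustration.

```python
from datetime import datetime, timezone

def to_update_time(moment: datetime) -> str:
    """Return Unix time as a zero-padded, 10-character string.

    Zero padding matters: without it "99" would sort after "100"
    in a character-by-character ascending sort.
    """
    seconds = int(moment.timestamp())
    if not 0 <= seconds <= 9_999_999_999:
        raise ValueError("time does not fit in a 10-digit field")
    return f"{seconds:010d}"

earlier = to_update_time(datetime(1970, 1, 2, tzinfo=timezone.utc))
later = to_update_time(datetime(2001, 9, 9, 1, 46, 40, tzinfo=timezone.utc))
print(earlier)  # 0000086400
print(later)    # 1000000000
```

Because both strings have the same width, comparing them as text gives the same answer as comparing the underlying times.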

Otherwise, a customer could change their phone number once . . . and then realize that they'd made a mistake, and then change it a second time.

If you don't get the last change they've made, then you're going to have some very unhappy customers.

This issue can also be relevant if you're working in a distributed environment, in which data can be collected from multiple servers.

If you've got a process that puts it all together, you'd prefer to minimize the amount of data contributed by any one of the collection servers, right?

As it happens, RPSort (ADB's underlying sort utility) does have a duplicate removal option.  The only problem is that it will eliminate all-but-the-first duplicate; not all-but-the-last one.

And this makes very good sense for a 30K piece of extremely tight assembly code which can run with virtually no extra memory :-)

However, ADB does allow you to do this, provided that the incoming file is already sorted.  I.e., whenever you ask for "NODUPS" for an input file, ADB will use a buffer to make sure that all-but-the-last duplicate is eliminated.

This occurs during what I refer to as "basic" preprocessing, i.e. after a file has been converted from commas-n-quotes (if applicable), but before it gets sorted (and/or duplicates are removed).
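
ADB's internals aren't shown here, so the following is only a sketch of the general one-record-buffer technique, written in Python.  It assumes the input is already sorted so that records with the same key are adjacent; the function name keep_last_duplicate and the four sample records (laid out as in the layout above, already in customer/time order) are illustrative assumptions.

```python
from typing import Callable, Iterable, Iterator

def keep_last_duplicate(records: Iterable[str],
                        key: Callable[[str], str]) -> Iterator[str]:
    """Yield the last record of each run of equal keys.

    Assumes `records` is already sorted so that equal keys are adjacent.
    """
    held = None                      # the one-record buffer
    for record in records:
        if held is not None and key(held) != key(record):
            yield held               # key changed: held was the last of its run
        held = record                # the newer record replaces the buffered one
    if held is not None:
        yield held                   # flush the final run

# Hypothetical sample: 4-char customer, 10-char time, 12-char phone number.
sorted_input = [
    "AAAA0000000000111111111111",
    "AAAA1111111111222222222222",
    "BBBB2222222222444444444444",
    "BBBB3333333333666666666666",
]
survivors = list(keep_last_duplicate(sorted_input, key=lambda r: r[:4]))
print(survivors)
```

Only the second record for each customer survives, which is why the input has to be in time order within each customer before this kind of pass runs.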

Here's a DOS batch file that will do just the right thing for the layout you see above.  Assume that all the above statements are in ADB.xx (your global project definitions file), and that the input file of phone updates is called "Phchg.xx".


Set ADBPROJ=xx
Rem Build step definitions file
Echo TRANS Ph_UpDt_File Ph_UpDt_Sort1 SORT   >Step.xx
Echo T-M Ph_UpDt_File END                   >>Step.xx
ADB -TRANSPhchg -T-MPhnew -STEPStep
Rem Now the input is properly sorted
Move PhNew.%ADBPROJ% PhChg.%ADBPROJ%
Rem Build step definitions file
Echo TRANS Ph_UpDt_File Ph_UpDt_Sort2 NODUPS >Step.xx
Echo T-M Ph_UpDt_File END                   >>Step.xx
ADB -TRANSPhchg -T-MPhNew -STEPStep

This is the first example in which we're using a step definitions file.

As you may recall, ADB reads the definitions from the file ADB.xx (where "xx" is the project file suffix).

Then it looks to see if there's a "-STEP" parm on the command line.  If so, that file is read in to see if ADB can find more definitions.

Typically your basic definitions go in the global definitions file ("ADB.xx"), and those which define the input and output files for each step go in your step definitions file.  One of the nice things about this arrangement is that someone who's familiar with the files in your application can get a very good idea of what's going on, just by reading the one DOS batch file.  If you have Awk scripts, then you will presumably want to explain what they do in the DOS batch file comments.

The result is that a long, complex jobstream of (say) 20 steps can be easier to grasp (and debug/develop) than (say) a 3,000-line compiled executable.

You can run this example with the following input file:

(Although this may seem confusing at first, I decided to use alphabetic "customer numbers" for this example in order to make the input file easier to read.)


AAAA0000000000111111111111
BBBB2222222222444444444444
AAAA1111111111222222222222
BBBB3333333333666666666666

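(Customers are AAAA and BBBB.  AAAA's first update has a time of all 0's and a phone number of all 1's; the second has a time of all 1's and a phone number of all 2's.  BBBB's first update has a time of all 2's and a phone number of all 4's; the second has a time of all 3's and a phone number of all 6's.)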
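
To see how one of these 26-character records splits into fields, here is a small Python sketch that slices a line using the FLD widths from the layout (4, 10 and 12).  The class and function names are my own, chosen only for this illustration; ADB itself does this work from the FLD statements.

```python
from typing import NamedTuple

class PhoneUpdate(NamedTuple):
    cust_no: str      # 4 characters
    update_time: str  # 10 characters
    phone_no: str     # 12 characters

def parse_record(line: str) -> PhoneUpdate:
    """Split one 26-character record using the FLD widths 4, 10 and 12."""
    if len(line) != 26:
        raise ValueError(f"expected 26 characters, got {len(line)}")
    return PhoneUpdate(line[0:4], line[4:14], line[14:26])

for line in ["AAAA0000000000111111111111", "BBBB3333333333666666666666"]:
    print(parse_record(line))
```

The sort keys in the batch file correspond to the first 4 characters (customer number alone) and the first 14 characters (customer number plus update time).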

Note that the first sort used the customer number as the primary key and the time as the secondary key.

The second sort just used the customer number.


2:  Eliminating all-but-the-first duplicate.

It's relatively easy to eliminate all but the first duplicate.

The trick here is to remember that output file duplicate removals are done exclusively by RPSort.Com (ADB's underlying sort program).

RPSort will automatically remove all-but-the-first duplicate.

So the following batch file works for the file layouts used in the previous example:


Set ADBPROJ=xx
Rem Build step definitions file
Echo TRANS Ph_UpDt_File Ph_UpDt_Sort1 SORT       >Step.xx
Echo T-M Ph_UpDt_File Ph_UpDt_Sort2 NODUPS END  >>Step.xx
ADB -TRANSPhchg -T-MPhnew -STEPStep
Rem Now all-but-the-first dup is removed
Move PhNew.%ADBPROJ% PhChg.%ADBPROJ%

What's the difference between this batch file and the previous one?

First off, I did no duplicate removal on the input file.  I just sorted it by customer number and update time.

Instead, I asked to remove duplicates on the output file, _but_ I sorted it by customer number only.

Since RPSort is a "stable" sort (two equal records in the input file will retain their sequence in the output file), its natural tendency to remove all-but-the-first duplicate shines through.
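
The role that stability plays can be sketched outside ADB.  The Python below is not RPSort, just an illustration that leans on the fact that Python's sorted() is also a stable sort; the helper names are hypothetical, and the records are the sample input from the first example.

```python
from itertools import groupby

def cust_no(record: str) -> str:
    return record[:4]            # Cust_No occupies the first 4 characters

def cust_then_time(record: str) -> str:
    return record[:14]           # Cust_No followed by the 10-character time

unsorted_input = [
    "AAAA0000000000111111111111",
    "BBBB2222222222444444444444",
    "AAAA1111111111222222222222",
    "BBBB3333333333666666666666",
]

# Pass 1: order by customer number, then update time.
by_cust_and_time = sorted(unsorted_input, key=cust_then_time)

# Pass 2: a stable sort on customer number alone leaves each customer's
# records in time order, so the first record of every group is the earliest.
by_cust = sorted(by_cust_and_time, key=cust_no)
first_of_each = [next(group) for _, group in groupby(by_cust, key=cust_no)]
print(first_of_each)
```

With an unstable sort, the second pass would be free to reorder records that share a customer number, and "first" would no longer reliably mean "earliest".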


3:  Printing out all duplicate records

There are many applications in which duplicate records aren't allowed.  If they're present, some serious error may be occurring.

Often these errors can't be fixed automatically, and some "eyeballing" (AKA manual inspection) is required.

So what you really want to do is to print out every record in a chain (AKA "sequence") of duplicates.

One of the most obvious ways to approach the problem is to (a) eliminate duplicates (either all-but-the-first or all-but-the-last; it hardly matters).

Then, (b) compare the original file against the one without duplicates using DOS's built-in FC command.

Here's an example:


Set ADBPROJ=xx
Rem Build step definitions file
Echo TRANS Ph_UpDt_File Ph_UpDt_Sort1 SORT       >Step.xx
Echo T-M Ph_UpDt_File Ph_UpDt_Sort2 NODUPS END  >>Step.xx
ADB -TRANSPhchg -T-MPhnew -STEPStep
Rem Now we are sure that all-but-the-first dup is removed
FC /A PhNew.%ADBPROJ% PhChg.%ADBPROJ% >>Dups.Dat

This isn't a particularly satisfying solution, because FC, like most DOS programs, still doesn't return an error code.  (Microsoft: gotta love 'em!).  FC's formatting also leaves something to be desired.
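
If FC's output is too awkward to work with, the same compare-the-two-files idea can be sketched in a few lines of Python.  This is an illustrative alternative, not something ADB or FC provides; the function name removed_records is hypothetical, and the lists stand in for the lines of the original and de-duplicated files.

```python
from collections import Counter

def removed_records(original, deduplicated):
    """Return records present in `original` but dropped from `deduplicated`.

    Counter subtraction keeps only positive counts, so a record that
    appears twice in the original and once afterwards is reported once.
    """
    leftover = Counter(original) - Counter(deduplicated)
    return sorted(leftover.elements())

original = [
    "AAAA0000000000111111111111",
    "BBBB2222222222444444444444",
    "AAAA1111111111222222222222",
    "BBBB3333333333666666666666",
]
deduplicated = [original[0], original[1]]   # all-but-the-first removed
dropped = removed_records(original, deduplicated)
print(dropped)
```

Note that a comparison like this lists only the records that were removed, not the surviving first record of each chain, so it is a partial answer to "print every duplicate".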

The other solution is to use the Awk interface.

This is painstaking, but if you're a programmer, writing this kind of processing should be a familiar experience.  (Some slightly deranged programmers even enjoy writing these kinds of algorithms!)

And this is a task that you probably can't do any more easily with a traditional DBMS interface.

Here's an Awk script that can be used in tandem with ADB to output all duplicates for our example (but which will ignore unique records):


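# Remember the current record as the possible start of a chain of duplicates.
function New_Key() {
         Last_Rec_CustNo = Cust_No;
         Last_Rec_Data   = ADB_BldOutLineFnc();
         Rec_In_Buff     = 1;
}

function ADB_BOJ() {
         Rec_Seq     = 0;
         Rec_In_Buff = 0;
}

function ADB_Main() {
         if (Rec_Seq++ == 0) New_Key();
         else if (Last_Rec_CustNo == Cust_No) {
                 # Same key as the previous record: print the buffered first
                 # record of the chain (once), then write the current record.
                 if (Rec_In_Buff == 1) printf("%s\n", Last_Rec_Data);
                 ADB_WriteFnc();
                 Rec_In_Buff = 0;
         }
         else {
                 # Key changed.  A record still in the buffer is unique;
                 # uncomment the printf to output unique records as well.
                 if (Rec_In_Buff == 1) {
                    # printf("%s\n", Last_Rec_Data);
                 }
                 New_Key();
         }
}

function ADB_EOJ() {
         # A record still in the buffer at end of job is unique.
         if (Rec_In_Buff == 1) {
            # printf("%s\n", Last_Rec_Data);
         }
}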
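
For readers who want to check the intended result outside ADB, here is a Python sketch that aims at the same output: every record in a chain of duplicates, and nothing for unique records.  It uses itertools.groupby, which collects each whole chain before deciding, rather than the one-record buffer used in the Awk script.  The function name is hypothetical, and the CCCC records (and BBBB having only one update) are made up for this illustration.

```python
from itertools import groupby

def duplicate_chains(sorted_records, key):
    """Yield every record whose key occurs more than once.

    Assumes `sorted_records` is ordered so that equal keys are adjacent.
    """
    for _, group in groupby(sorted_records, key=key):
        chain = list(group)
        if len(chain) > 1:       # unique records are skipped
            yield from chain

sorted_records = [
    "AAAA0000000000111111111111",
    "AAAA1111111111222222222222",
    "BBBB2222222222444444444444",   # unique: should not be printed
    "CCCC5555555555888888888888",
    "CCCC6666666666999999999999",
]
duplicates = list(duplicate_chains(sorted_records, key=lambda r: r[:4]))
for record in duplicates:
    print(record)
```

Both AAAA records and both CCCC records are printed, while the lone BBBB record is not.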

Now at this point, the output file either contains some records (meaning that duplicates exist), or it contains no records.

To check for the former case (some duplicates exist), you can use the utility FileStat.Exe.

So the following batch file works:


Set ADBPROJ=xx
Echo TRANS Ph_UpDt_File Ph_UpDt_Sort1 SORT AWK  >Step.xx
Echo T-M Ph_UpDt_File END                         >>Step.xx
ADB -TRANSPhchg -T-MPhnew -STEPStep
FileStat /Phnew.%ADBPROJ% <NUL
ErrMsg Duplicates were found!

FileStat.Exe and ErrMsg.Exe are described in the utilities section.
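
The check itself is simple: duplicates exist exactly when the output file is non-empty.  As a hedged sketch of that test (in Python, and not a description of how FileStat.Exe or ErrMsg.Exe actually work), a helper like the hypothetical has_duplicates below would do; the temporary file names are made up for the demonstration.

```python
import os
import tempfile

def has_duplicates(output_path: str) -> bool:
    """True when the duplicates file exists and holds at least one byte."""
    return os.path.exists(output_path) and os.path.getsize(output_path) > 0

# Demonstration with two throwaway files: one empty, one with a record.
with tempfile.TemporaryDirectory() as folder:
    empty_path = os.path.join(folder, "empty.xx")
    filled_path = os.path.join(folder, "filled.xx")
    open(empty_path, "w").close()
    with open(filled_path, "w") as handle:
        handle.write("AAAA0000000000111111111111\n")
    empty_result = has_duplicates(empty_path)
    filled_result = has_duplicates(filled_path)

print(empty_result, filled_result)  # False True
```

A calling script could turn a True result into an error message and a non-zero exit code, which is the job the two utilities do in the batch file above.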
