Eliminating all-but-the-last duplicate.
To understand why you might want to
eliminate all duplicates other than the last one, consider
an example.
Suppose customers are allowed to change their phone numbers
from a web page. Your CGI interface (whatever it may be) will
write a very simple file to the server.
This file has a layout of:
FLD Cust_No 4 # Customer number
FLD Update_Time_ 10 # Unix time (seconds since 1970)
FLD Phone_No_ 12 # Country code plus 10 digits
FSET Phone_Update_Sort1 Cust_No; Update_Time_ END
FSET Phone_Update_Sort2 Cust_No END
FSET Phone_Update_File Cust_No; Update_Time_; Phone_No_ END
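To make the layout concrete, here is a minimal Python sketch (not ADB code) that slices one 26-character record into its three fields. The function name parse_record and the FIELDS table are illustrative assumptions; only the field names and widths come from the FLD lines above.

```python
# Field names and widths copied from the FLD lines: 4 + 10 + 12 = 26 characters.
FIELDS = [("Cust_No", 4), ("Update_Time_", 10), ("Phone_No_", 12)]

def parse_record(line):
    """Split one fixed-width record into a dict keyed by field name."""
    record = {}
    pos = 0
    for name, width in FIELDS:
        record[name] = line[pos:pos + width]
        pos += width
    return record

print(parse_record("AAAA0000000000111111111111"))
```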
(For simplicity, I've set up this example to use Unix time. If
you have a date field or some other temporal encoding, then
you'll have to think about converting it properly so that a
case-insensitive ASC sort yields the result that "later" records
appear after "earlier" ones.)
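As a rough illustration of that conversion, the following Python sketch turns a date string into a zero-padded, 10-character Unix time, so that a plain text sort agrees with chronological order. The function name sortable_unix_time, the date format, and the assumption that the timestamps are in UTC are all mine, not part of ADB.

```python
import calendar
import time

def sortable_unix_time(date_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert a UTC date string to a zero-padded 10-character Unix time."""
    parsed = time.strptime(date_text, fmt)
    seconds = calendar.timegm(parsed)  # treats the parsed time as UTC
    return "%010d" % seconds           # fixed width keeps text order == time order

earlier = sortable_unix_time("1999-12-31 23:59:59")
later = sortable_unix_time("2000-01-01 00:00:00")
print(earlier, later, earlier < later)
```

By contrast, a month-first text date such as "12/31/1999" sorts after "01/01/2000" when compared as text, which is exactly the problem the conversion avoids.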
In this case, you definitely want the last duplicate (where
"duplicates" are understood as multiple entries for the same
customer number).
A customer could change their phone number once, realize that
they'd made a mistake, and then change it a second time.
If you don't get the last change they've made, then you're
going to have some very unhappy customers.
This issue can also be relevant if you're working in a distributed
environment, in which data can be collected from multiple servers.
If you've got a process that puts it all together, you'd prefer
to minimize the amount contributed by any one of the collection
servers, right?
As it happens, RPSort (ADB's underlying sort utility) does have a
duplicate removal option. The only problem is that it will
eliminate all-but-the-first duplicate, not all-but-the-last one.
And this makes very good sense for a 30K piece of extremely tight
assembly code which can run with virtually no extra memory :-)
However, ADB does allow you to keep the last duplicate instead,
provided that the incoming file is already sorted. That is,
whenever you ask for "NODUPS" for an input file, ADB will use a
buffer to make sure that all-but-the-last duplicate is eliminated.
This occurs during what I refer to as "basic" preprocessing, i.e.
after a file has been converted from commas-n-quotes (if
applicable), but before it gets sorted (and/or duplicates are
removed).
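The buffering idea can be sketched in a few lines of Python. This is only an illustration of the technique under the stated precondition (input already sorted, so equal keys are adjacent); it is not ADB's actual implementation, and the name keep_last_duplicates and the key_width parameter are hypothetical.

```python
def keep_last_duplicates(sorted_lines, key_width=4):
    """Yield only the last record from each run of records with equal keys.

    Assumes the input is already sorted, so records with equal keys are adjacent.
    """
    held = None  # one-record buffer: the most recent record seen for the current key
    for line in sorted_lines:
        if held is not None and line[:key_width] != held[:key_width]:
            yield held  # key changed, so the held record was the last of its run
        held = line
    if held is not None:
        yield held  # flush the final run

print(list(keep_last_duplicates(["A1", "A2", "B1"], key_width=1)))
```

Only one record is held at a time, which is why the input must be sorted first: an unsorted file could bring back a key that has already been flushed.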
Here's a DOS batch file that will do just the right thing for the
layout you see above. Assume that all the above statements are
in ADB.xx (your global project definitions file), and that the
input file is called "PhChg.xx".
Set ADBPROJ=xx
Rem Build step definitions file (step 1: sort by customer, then time)
Echo TRANS Phone_Update_File Phone_Update_Sort1 SORT >Step.xx
Echo T-M Phone_Update_File END >>Step.xx
ADB -TRANSPhChg -T-MPhNew -STEPStep
Rem Now the input is properly sorted
Move PhNew.%ADBPROJ% PhChg.%ADBPROJ%
Rem Build step definitions file (step 2: keep last duplicate per customer)
Echo TRANS Phone_Update_File Phone_Update_Sort2 NODUPS >Step.xx
Echo T-M Phone_Update_File END >>Step.xx
ADB -TRANSPhChg -T-MPhNew -STEPStep
This is the first example in which we're using a step definitions
file.
As you may recall, ADB reads the definitions from the file ADB.xx
(where "xx" is the project file suffix).
Then it looks to see if
there's a "-STEP" parm on the command line. If so, that file is
read in to see if ADB can find more definitions.
Typically your basic definitions go in the global definitions
file ("ADB.xx"), and those which define the input and output
files for each step go in your step definitions file. One of the
nice things about this arrangement is that someone who's familiar
with the files in your application can get a very good idea of
what's going on, just by reading the one DOS batch file. If you
have Awk scripts, then you will presumably want to explain what
they do in the DOS batch file comments.
The result is that a
long, complex jobstream of (say) 20 steps can be easier to grasp
(and debug/develop) than (say) a
3,000-line compiled executable.
You can run this example with the following input file.
(Although this may seem confusing at first, I decided to use
alphabetic "customer numbers" for this example in order to make
the input file easier to read.)
AAAA0000000000111111111111
BBBB2222222222444444444444
AAAA1111111111222222222222
BBBB3333333333666666666666
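(Customers are AAAA and BBBB. AAAA's first update has a time of
all 0's and a phone number of all 1's; its second update has a
time of all 1's and a phone number of all 2's. BBBB's first update
has a time of all 2's and a phone number of all 4's; its second
has a time of all 3's and a phone number of all 6's.)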
Note that the first sort used the customer number as the primary
key and the time as the secondary key.
The second sort just used the customer number.
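If you'd like to check what the two steps should produce before running ADB, the following Python sketch simulates them on the sample input. It mimics only the logic (sort on customer number plus time, then keep the last record per customer number); it does not call ADB or RPSort, and the variable names are illustrative.

```python
records = [
    "AAAA0000000000111111111111",
    "BBBB2222222222444444444444",
    "AAAA1111111111222222222222",
    "BBBB3333333333666666666666",
]

# Step 1: sort on customer number (4 chars), then update time (next 10 chars).
step1 = sorted(records, key=lambda r: (r[0:4], r[4:14]))

# Step 2: key on customer number only, keeping the last record for each customer.
latest = {}
for r in step1:
    latest[r[0:4]] = r  # a later record overwrites an earlier one
step2 = [latest[cust] for cust in sorted(latest)]

for r in step2:
    print(r)
# Prints:
# AAAA1111111111222222222222
# BBBB3333333333666666666666
```

Each customer ends up with the record carrying the highest update time, which is the most recent phone number.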