Synchronization puzzle.

Alex Tweedly alex at tweedly.net
Fri Jan 21 18:52:34 EST 2005


[ slightly OT - this is a "design / algorithm" question rather than a 
Rev/xtalk one, really ]
[and it's a long, rambling email - so feel free to skip it]


I am in the process of writing my own Address Book application.
That may sound like an odd thing to do :
   aren't there already hundreds of programs to do that already ? 
   isn't there an address book built into the OS and/or major apps ?
   can't you get everything from freeware to professional apps to do that ?
   what's wrong with the software that came with your Palm Pilot (or 
PDA, or Pocket PC, or ....)
   why re-invent the wheel ?

The reason that none of them quite satisfies my needs is simple - I want 
a shared address book, between myself, my wife and various other family 
members. There need to be up to 8 different computers running this 
application - all of them on a network, but never all on the same 
network. There are two separate home networks, plus maybe a server on a 
web-site somewhere so I can use it from anywhere with an Internet 
connection; the 3 laptops will intermittently appear on one or other 
network.

Any person can make updates to the address book, and I want a simple way 
to synchronize those changes on to the other instances of the app. What 
I've been doing until recently was using the Palm Desktop software that 
came with my Palm Pilot on each machine; periodically I would sync the 
Palm Desktop to my Palm Pilot, then re-connect the Pilot to another 
machine and sync there, then to another, then .....   which all worked 
OK, though it's a bit of a drag connecting the Pilot to the serial port 
of each machine in turn. And it just seems crazy when they're connected 
together by 100M Ethernet :-)

But now the two newest machines don't have a traditional serial port, 
only USBs. So I have the choice of spending £29.95 for a cable or tens 
(hundreds) of hours writing software - no competition :-)

I looked at using the Palm database, and simply (??) writing an app to 
sync databases - but the combination of incomplete documentation on the 
format (not to mention its complexity) and concern that I might conflict 
with Palm's synchronization technique has put me off that.  I should 
mention that I never (or very, very rarely) actually use the Palm Pilot 
for anything other than this synch'ing effort. The Palm Pilot is almost 
never carried around; it has, for me, been squeezed out of its niche by 
my laptop one one side and my cell phone with its 100+ phone numbers and 
storage for many SMS messages on the other. (I sometime text myself an 
address or directions if I think I'm likely to need it and won't be 
taking the laptop - or occasionally even use pen+paper !!)

What I'm looking for, then, is a method to do "peer-to-peer" 
synchronization between an undefined number of instances of the address 
database. I've got a design, I've thought it through as much as I can, 
I've done "dry-run"s on paper to check it works, I've implemented it and 
tested it - but I'm still concerned that I've missed something, and that 
it will fail in some circumstances.

Here's the relevant section of my "design document" (i.e. transcription 
of the scribbling on the back of an envelope); I'd be very grateful for 
any comments, corrections, etc.  On the other hand, if there is some 
standard technique or algorithm for doing this, and I just didn't know 
about it (and couldn't find it in a Google search), then a pointer to a 
good description would be also very useful.

---------------------------
Synchronization.

All synchs are user-triggered - i.e. a user on one machine will trigger 
a sync to another machine; the user is available to resolve anyh 
conflicts immediately) or she can choose to postpone that part of the 
synchronization until later. The two machines must be networked together 
at the time of a sync. That sync can use any of:
  - shared file systems   (e.g. laptop to file server)
  - http request / post   (e.g. laptop to web site)
  - TCP based connection (e.g. laptop to laptop while on same network).

(I don't export the filesystems on the laptops, nor do I run http server 
on them, so the plan is that each machine can run a address-book-sync 
server that will accept TCP connections on some odd port, and transfer 
data that way).

The fundamental sync algorithm is the same in each case; the description 
below is written as though the synching machine had complete access to 
the 'other' address files. The optimizations possible to avoid complete 
transfer of the data should be apparent - but in fact for my address 
book, the whole file is probably never going to be above 100K, so it's 
not clear if those optimizations are worth the implementaion effort.

Time synchronization.

It cannot be assumed that the clocks of the different computers are in 
sync, even approximately. It can (and will) be assumed that each 
computer's clock will return times that are continuously increasing.

(Yes, I do know about NTGP, and SNTP, and TSP, and various other time 
sync protocols - but I just know that I can't rely on people running them.)


Database format.

The Address Book consists of a number of records; each record has 3 
"control" fields, and a number of data fields. The sync should use ONLY 
the 3 control fields, except for conflict resolution, and hence should 
be readily extended to cover Calendar, Tasks, and other similar PIM 
database synchronization.

The three control fields for each record are:
  ID  : this is a 'globally' unique ID. It is allocated when the record 
is first created, and never changed thereafter. It comprises the machine 
name (these must be unique between all machines which are to share a 
database), and a numeric ID allocated by the machine which creates the 
record. The numeric portion of the ID will be increasing - but there 
should, if possible, be no dependency on the fact that a later-created 
record has a higher ID number than another record created earlier; they 
just need to be all different.

 last-event: a machine/time stamp identifying when this record was last 
modified, and the machine on which this modification was done.

 status:  not currently defined, but put in here for future growth. Set 
to "OK" for now.

Record creation: when a new record is created, the new entry receives an 
ID consisting of the machine name of the machine in question plus the 
"next" numeric ID, a last-event of the (same) machine name plus current 
millisec timestamp, and status of OK.

Record modification: the ID and status are unchanged, the last-event is 
set to the machine in question plus current timestamp. The complete 
record (control and data fields) is added to the History database PRIOR 
to any changes.

Record deletion: the record is removed from the address database, and 
placed in the History database.

Note that the allocation of next record ID cannot simply check for the 
highest current record (because deleted records are removed): the 
highest numeric ID used must be tracked separately.

Synchronization set-up

When asked to do a sync, the program will fetch the the address and 
history databases from the other machine.

(Note : optimization possible - fetch only control fields !)

Synch main loop.

Sort each address database ('mine' and 'his') in ID order. Reset 
'current' counters to start.

Compare current record IDs:

  mine is before his:
      either mine has been added, or his has been deleted.
      check his history database:
      if it contains 'my' current entry, then it has been deleteed from 
his database:
           ACTION: delete from my database
      else:           it must have been added to mine, so
          ACTION: add this recodrd to his database
      advance my current counter

  his is before mine:  (exact mirror of above)

  same ID.
  compare 'last-events' :
        same :   everything is good, advance both current counters.
        different - but the last-event machine is the same
                 take whichever entry has the later edit time (note - 
from same machine, so they can be compared),
                           and add to the other database
                  advance both counters
         different - and different machines:
                  there is no way to tell which is later - user needs to 
resolve differences, or postpone.
                  then advance both counters.

User resolution of conflicts.

User should be presented with the set of common, and different fields, 
and for each field chose which of the alternates is to be chosen.
BOTH prior versions are added to BOTH history databases  (this may be 
overkill - but seems a good option)
newly resolved entry gets last-event of the source machine plus its 
current timestamp, and is added to both address databases.

User postpones resolution.
Both database retains its own version of the entry, and neither has its 
history updated. (i.e. all other entries are synched - just this one is 
left unchanged).

When both current counters have reached their end, all actions have been 
collected.

[optimization: if the entire database was not fetched ahead of time, 
then any required entries can be collected from the remote system (this 
would be those needing resolution plus those where 'his' entry needs to 
be added to 'my' database.]

Send to the remote system : either the complete updated databse and 
history for him, or a series of add/delete/change records arising from 
the actions collected above.

Done.

Note that even if the write to remote system fails, the updates can 
still be done to the local system without any ill effects (and vice 
versa, though that seems unlikely).

--------------------------------------------
I can't see any failing scenarios. The one part that I haven't worked 
out fully is archiving / cleaning-out the history database. Clearly, it 
can be pruned back such that only the latest event needs to be kept for 
each entry - so the max size of the history database is limited to the 
total size of all entries that have ever existed.

Since this is a 'tool' and not a 'product', that's probably good enough 
for me. I can wait until I *know* that every existing copy is in sync, 
and then manually clear out all the history files. Clearly that wouldn't 
be adequate for a real product - but it's good enough for me. 
('tool' - something the author can use; 'product' - something that 
anyone can use, is easy to use, hard to misuse, fully documented, etc.)  
This is a tool for my own use (though it will naturally be available on 
revonline (or maybe a web site, if I ever organize one for my Rev 
stacks) for anyone who wants to play with it, look at it, modify it for 
their own purposes, etc.)

All comments welcome,
-- Alex.


-- 
No virus found in this outgoing message.
Checked by AVG Anti-Virus.
Version: 7.0.300 / Virus Database: 265.7.1 - Release Date: 19/01/2005



More information about the use-livecode mailing list