testing 'strings' of binary data for equality

Ben Rubinstein benr_mc at cogapp.com
Wed Feb 5 09:42:01 EST 2003


I had to write a utility that would take two parallel trees of files, and
check portions of data from the corresponding files in each to locate
corrupt blocks.  The total amount of data to be compared was around 300MB.
I opened the files for binary reading, read various header records using the
'read ... for <nunits> as <format>' variant in order to locate the blocks,
and then read the actual contents of the blocks using the direct 'read ...
for <nbytes>' form.  Then I simply compared the results of the two reads
using the '=' operator.
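
For concreteness, the shape of the thing was roughly as below -- a sketch
only, since the file names, header layout and block sizes here are invented
for illustration:

```
-- illustrative sketch: paths, header layout and sizes are made up
open file tFileA for binary read
open file tFileB for binary read

-- read a (hypothetical) header field giving the block length
read from file tFileA for 1 uint4
put it into tBlockLen
read from file tFileB for 1 uint4  -- step past the same header in file B

-- read the raw block contents as strings
read from file tFileA for tBlockLen
put it into tBlockA
read from file tFileB for tBlockLen
put it into tBlockB

-- this is the comparison that failed to spot differences
if tBlockA = tBlockB then
  put "blocks match" & cr after tReport
else
  put "corrupt block" & cr after tReport
end if

close file tFileA
close file tFileB
```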

My first attempt ran pretty quickly, but failed to locate any corrupt (ie
different between the two) blocks, although I knew there were differences.
I blushed, slapped myself, and set the 'caseSensitive' property to true
(although thinking that it would be surprising if all the differences
amounted to bytes in the range 65 to 90 being shifted by 32 or vice versa -
this was binary data).  This made no difference.

So then I changed to reading the actual data with the 'read ... for <nbytes>
uint1' variant (so that every byte would be rendered as an integer, with
commas in between).  This worked fine, but not surprisingly was many times
slower.

So my question is, why should '=' have returned true when given two strings
which were actually different?  I know I've run into problems in the past
with strings which just happened to consist of all digits except for a
single character 'e', which were treated as numbers in scientific notation.
But in this case the strings were typically 2-300K long, of binary data, so
they almost certainly contained every value from 0 to 255.
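
That earlier problem, for reference, was of this kind -- a minimal
illustration of the numeric comparison I mean:

```
-- strings that both look like numbers are compared as numbers:
put ("1e2" = "100")  -- true, since both parse as the number 100
```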

Is this a bug or a feature?  If the latter... why?

TIA,
 
  Ben Rubinstein               |  Email: benr_mc at cogapp.com
  Cognitive Applications Ltd   |  Phone: +44 (0)1273-821600
  http://www.cogapp.com        |  Fax  : +44 (0)1273-728866




More information about the metacard mailing list