On 22/08/2016 15:47, Richard Gaskin wrote:
> Alex Tweedly wrote:
> > Would caseSensitive make it faster ?
> In theory yes, since it avoids having to run the internal equivalent 
> of toLower on each thing being compared.
But since these are bytes, not chars, that doesn't apply.

> However in some recent experiments involving pattern matching on text 
> I was unable to measure a difference.  That shouldn't be taken as 
> definitive; there are a lot of distracting things going on in the 
> routine I was testing with.  I haven't yet done a good isolated test 
> of caseSensitive.
> > Re md5 for repeated use - yes, it probably is worth doing.
> The rsync algo offers an md5 option, but by default it compares files 
> based only on mod date and size.  The thinking is that if both of 
> those match, the odds of having a changed file are very low.
> Perhaps an optimal algo in your system would reserve md5 for those 
> cases where size and mod date match, which will eliminate most cases 
> with less CPU time.
Thanks Richard, but this is a very different context. In my case, the 
mod dates will never match; the duplicate files arise because the user 
has imported the same photos from a camera more than once (into 
different folders, or into the same one using auto-renaming), or has 
copied a folder of files to trim out the ones to be copied to another 
machine, or ... any of a number of things, all of which leave the 
copied file with a different mod date from the original.

My original benchmarking was faulty; in fact, taking the md5hash for the 
two files is only 50% more expensive than simply comparing them (higher 
if they are actually different), but that leaves the conclusion 
unchanged - it's not worth the extra complexity. There is an assumption 
underlying this - that in real life (different from my development 
phase), the majority of genuine duplicates will be dealt with (i.e. one 
copy deleted or moved elsewhere) fairly quickly, so the same comparisons 
won't be run repeatedly. The remaining cases of same file size are so 
rare (around 80 in my full 50,000 file set) that pair-wise comparisons 
take only 4 seconds (or 2 seconds if I use an older version of LC), so 
no great impact on the user experience.
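
For anyone wanting to try the same approach, here is a minimal sketch in 
Python (not LiveCode, and not my actual code): group the files by size, 
then do a byte-for-byte comparison only within the rare same-size groups. 
The name find_duplicates is made up for illustration, and it assumes the 
caller has already collected the list of file paths.

```python
import filecmp
import os
from collections import defaultdict
from itertools import combinations


def find_duplicates(paths):
    """Return a list of (path_a, path_b) pairs whose contents are identical."""
    # Group by size: files of different sizes can never be duplicates.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for same_size in by_size.values():
        # Only the same-size groups need a full content comparison.
        for a, b in combinations(same_size, 2):
            # shallow=False forces a comparison of the file contents
            # rather than trusting size and mod date.
            if filecmp.cmp(a, b, shallow=False):
                duplicates.append((a, b))
    return duplicates
```

Mod dates are deliberately ignored here, since in this context they never 
match between a copy and its original.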

(The other parts of the overall workflow - where I would like to gather 
and use the exif data - are more strongly impacted by the performance 
issue - but my desire to use the latest LC8 rather than an obsolete 
version is probably strong enough to override that, and I'll just be 
more patient - even though patient is not my natural state :-)

