The Joy of Removing Features - Part 2: Finding / removing duplicate files / photos.

Alex Tweedly alex at tweedly.net
Thu Aug 18 19:20:40 EDT 2016


Part 2 of a 4-part series on developing simple apps for photo management 
and viewing.

[ previously ... Part 1 described the justification and development of a 
very simple photo viewing app ]

The next issue to deal with is the runaway number of photos, and the
amount of disk space taken up by them. I strongly suspect that is at 
least partly due to my casual (some would say "disorganized") approach 
to managing the photos, and the multiple computers they originated from 
and are kept on (my desktop, laptop, daughter's laptop, back-up disks, 
safe copies on other external drives, USB drives previously used to 
store / transfer folders of photos, etc.)

So the next step is to find and eliminate (or at least reduce) 
duplicated photos. Of course, I could simply Google "remove duplicate 
photos mac" and follow some of the 382,000 resulting links - but where's 
the fun in that :-)

At least some of those apps do, or claim to do, amazing things - find 
different-resolution or different-quality versions of the same photo, 
etc. - but I don't feel a need for those; I just need, initially at 
least, to find the simple, exact duplicates. To give some context, I 
have been using a sample subset of 16,000 out of my approx 55,000 
photos; these are mostly low/medium resolution (i.e. iPhone or old 
digital camera JPEGs, between 200 KB and 1.5 MB each). However, my new 
camera is rather more resource-hungry (JPEGs are 24 MB or so - hence 
the urgency to actually implement some of these ideas that I have been 
kicking around for a long time :-)

I have a variety of schemes in mind to speed up the process, though each 
of them needs to be verified for effectiveness, or indeed necessity.

The basic outline *was*

1. walk through to collect all folder names (i.e. the complete tree(s) 
within the folder(s) specified by the user)

2. visit each folder in turn to collect details of all (relevant) files 
- with optimizations (call them step 2a) for folders/files that haven't 
changed since the info was previously collected (steps 1 and 2 are 
sketched just after this list)

3. partition the files by size, and then reduce the list of files to 
just the potential duplicates (those sharing a size)

4. further reduce by file signature (i.e. a small sample of say 12 bytes 
from pre-specified locations)

5. get the MD5 hash of the remaining files, and look for duplicates

6. present the data to the user (!?)
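
Minus that step 2a optimization, steps 1 and 2 reduce to one short 
recursive function. This is only a sketch - collectFiles is a name I 
made up for illustration - built on "the detailed files", whose first 
item is the URL-encoded file name and whose second is the size in 
bytes:

function collectFiles pFolder
   -- gather one "path <tab> size" line for every file under pFolder
   local tList, tSubFolders, tFile, tSub
   set the defaultFolder to pFolder
   repeat for each line tFile in the detailed files
      put pFolder & "/" & URLDecode(item 1 of tFile) & tab & \
            item 2 of tFile & cr after tList
   end repeat
   put the folders into tSubFolders -- snapshot before recursing
   repeat for each line tSub in tSubFolders
      if char 1 of tSub is "." then next repeat -- skip ".." and hidden
      put collectFiles(pFolder & "/" & tSub) after tList
   end repeat
   return tList
end collectFiles

Calling collectFiles on the top folder(s) then gives one line per 
file, which is all the later steps need.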

However, some simple benchmarking suggested that this was unnecessarily 
complicated - i.e. I can again remove features, even before they have 
been specified or implemented. The task of detecting and avoiding 
redundant work in step 2a is not terribly complicated - though it is 
definitely the most brain-taxing part of the whole problem - and in any 
case, it won't apply the first time the app is used. So that part can 
be delayed at least until I find out how slow the process is - i.e. 
hopefully forever.

The need for MD5 hashes, rather than simply comparing the files in 
full, is also questionable. It turns out that calculating the MD5 hash 
of a file takes roughly 10x as long as comparing that file to another, 
identical one (i.e. the worst case for comparison - comparing to a 
differing file would complete more quickly). So step 5 can also be 
delayed (or avoided) until we determine how often we are likely to be 
matching larger sets of files.
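
To be concrete about what was benchmarked there: the direct comparison 
is just reading both files as binary data and testing equality, while 
the alternative hashes each file with LiveCode's built-in md5Digest(). 
A sketch, with filesIdentical being a hypothetical helper:

function filesIdentical pPathA, pPathB
   -- read both files as raw binary and compare them directly
   local tDataA, tDataB
   put URL ("binfile:" & pPathA) into tDataA
   put URL ("binfile:" & pPathB) into tDataB
   set the caseSensitive to true -- ensure a byte-exact comparison
   return tDataA is tDataB
end filesIdentical

(The step 5 alternative would replace that final comparison with 
md5Digest(tDataA) is md5Digest(tDataB).)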

Similarly, step 4 can be delayed (or avoided) until we see how well 
file size alone works as a partition - and it turns out to do a good 
job.
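
Had it been needed, step 4 would have read just a few bytes at fixed 
offsets rather than whole files - something like this sketch, where 
the offsets are arbitrary examples:

function fileSignature pPath
   -- sample 4 bytes at each of a few pre-specified offsets
   local tSig, tOffset
   open file pPath for binary read
   repeat for each item tOffset in "100,5000,20000"
      read from file pPath at tOffset for 4 chars
      put it after tSig
   end repeat
   close file pPath
   return tSig
end fileSignature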

Of the 16,073 files, there are 14,652 different sizes; of these, 1,400 
sizes have 2 matching files while 10 sizes have 3 files, and the 
remainder have only a single file.
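
Those counts fall out of a simple array keyed on file size - a minimal 
sketch (groupBySize is another made-up name), taking the "path <tab> 
size" list built in step 2:

function groupBySize pFileList
   -- returns an array of size -> list of paths, multi-file sizes only
   local tBySize, tLine, tSize
   set the itemDelimiter to tab
   repeat for each line tLine in pFileList
      put item 1 of tLine & cr after tBySize[item 2 of tLine]
   end repeat
   -- discard the sizes that only a single file has
   repeat for each line tSize in the keys of tBySize
      if the number of lines in tBySize[tSize] < 2 then
         delete variable tBySize[tSize]
      end if
   end repeat
   return tBySize
end groupBySize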

And it turns out that all 1,410 of those are genuine duplicates - i.e. 
there are no cases of files which have the same size without actually 
being identical; file size is therefore a very effective discriminator 
for photo files.

Even better, running this simplified algorithm on my 16,000-photo 
sample takes about 20 seconds on my aging MacBook Pro. So I can indeed 
eliminate all those extra features in steps 2a, 4 and 5.
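
Checking that was just a direct comparison within each size group, 
reusing the hypothetical helpers sketched above:

function verifySizeGroups pBySize
   -- pBySize is the size -> paths array from groupBySize; returns any
   -- same-size file whose contents differ from the first in its group
   local tSize, tFirst, i, tMismatches
   repeat for each line tSize in the keys of pBySize
      put line 1 of pBySize[tSize] into tFirst
      repeat with i = 2 to the number of lines in pBySize[tSize]
         if not filesIdentical(tFirst, line i of pBySize[tSize]) then
            put tSize & tab & line i of pBySize[tSize] & cr \
                  after tMismatches
         end if
      end repeat
   end repeat
   return tMismatches -- empty means every collision is a real duplicate
end verifySizeGroups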

Part 3 of this series will describe what I did for step 6 above - i.e. 
how to present this data to the user, how to make it easy to eliminate 
any duplicates found, and how to make it hard to inadvertently delete 
files you shouldn't.

Part 4 will (probably) describe an app for removing uninteresting photos.

And Part 5 will (perhaps) describe whether or how I found it necessary 
to improve the image viewer app described in part 1. The increase in 
average file size from 0.5 MB to 24 MB means that the time to transition 
from one photo to the next has gone from "feels instant" to "hmmm, feels 
fairly quick". I'll decide from using the app regularly over the next 
week or two whether "fairly quick" is good enough, or whether it's worth 
implementing pre-caching for the adjacent photo(s) to get back the 
"instant" feel.

-- Alex.
P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any 
day now. RSN. Promise :-)
