The Joy of Removing Features - Part 2: Finding / removing duplicate files / photos.
Alex Tweedly
alex at tweedly.net
Thu Aug 18 19:20:40 EDT 2016
Part 2 of a 4-part series on developing simple apps for photo management
and viewing.
[ previously ... Part 1 described the justification and development of a
very simple photo viewing app ]
The next issue to deal with is the runaway number of photos, and the
amount of disk space taken up by them. I strongly suspect that is at
least partly due to my casual (some would say "disorganized") approach
to managing the photos, and the multiple computers they originated from
and are kept on (my desktop, laptop, daughter's laptop, back-up disks,
safe copies on other external drives, USB drives previously used to
store / transfer folders of photos, etc.)
So the next step is to find and eliminate (or at least reduce)
duplicated photos. Of course, I could simply Google "remove duplicate
photos mac" and follow some of the 382,000 resulting links - but where's
the fun in that :-)
At least some of those apps do, or claim to do, amazing things - find
different resolution or different quality versions of the same photo,
etc. - but I don't feel a need to look for those; I just need, initially
at least, to find the simple, exact duplicates. To give some context, I
have been using a sample subset of 16,000 out of my approx 55,000 photos;
these are mostly low/med resolution (i.e. iPhone or old digital camera
JPEGs, between 200KB and 1.5MB each). However, my new camera is rather
more resource-hungry (JPEGs are 24MB or so - hence the urgency to
actually implement some of these ideas that I have been kicking around
for a long time :-)
I have a variety of schemes in mind to speed up the process, though each
of them needs to be verified for effectiveness, or indeed necessity.
The basic outline *was*
1. walk through to collect all folder names (i.e. the complete tree(s)
within the folder(s) specified by the user)
2. visit each folder in turn to collect details of all (relevant) files
(steps 1 and 2 are sketched below, just after this list)
2a. optimize steps 1 and 2 by skipping folders/files that haven't
changed since the info was previously collected
3. partition the files by size, and then reduce the list of files to
the potential duplicates
4. further reduce by file signature (i.e. a small sample of say 12
bytes from pre-specified locations)
5. get the MD5 hash of the remaining files, and look for duplicates
6. present the data to the user (!?)
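As a rough illustration, here's a minimal LiveCode sketch of steps 1
and 2 (the handler name, and the "path tab size" list format, are my
inventions for this post, not lifted from the actual app):

function collectFileSizes pFolder
   local tList, tFolders
   set the defaultFolder to pFolder
   -- "the detailed files" gives one comma-separated line per file;
   -- item 1 is the URL-encoded name, item 2 is the size in bytes
   repeat for each line tFile in the detailed files
      put pFolder & "/" & urlDecode(item 1 of tFile) & tab & \
            item 2 of tFile & cr after tList
   end repeat
   -- snapshot the sub-folder list before recursing, since the
   -- recursive calls change the defaultFolder
   put the folders into tFolders
   repeat for each line tSub in tFolders
      if char 1 of tSub is "." then next repeat -- skip ".." and hidden
      put collectFileSizes(pFolder & "/" & tSub) after tList
   end repeat
   return tList
end collectFileSizes

Calling collectFileSizes() on each user-specified folder gives one
"path tab size" line per file, which is all that step 3 needs.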
However, some simple benchmarking suggested that this was unnecessarily
complicated - i.e. I can again remove features, even before they have
been specified or implemented. The task of detecting and avoiding
redundant work in step 2a is not terribly complicated - but it's
definitely the most brain-taxing part of the whole problem - and in any
case, won't apply the first time the app is used. So that part can be
delayed at least until I find out how slow the process is - i.e.
hopefully forever.
The need for using MD5 hashes, rather than simply comparing the files
in full, is also questionable. It turns out that calculating an MD5
hash of a file takes roughly 10x as long as comparing that file to
another, identical one (i.e. the worst case for comparison - comparing
to a differing file would complete more quickly, since it can stop at
the first differing byte). So step 5 can also be delayed (or avoided)
until we determine how often we are likely to be matching larger sets
of files.
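For the record, the two operations being benchmarked against each
other look roughly like this (a sketch - the handler names are mine):

function filesAreIdentical pPathA, pPathB
   -- direct comparison: read both files and compare byte-for-byte
   -- (caseSensitive must be true, or this is a text comparison)
   set the caseSensitive to true
   return URL ("binfile:" & pPathA) is URL ("binfile:" & pPathB)
end filesAreIdentical

function fileHash pPath
   -- the hash alternative: md5Digest still has to read the whole
   -- file, and then does the hashing work on top of that
   return md5Digest(URL ("binfile:" & pPath))
end fileHash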
Similarly, step 4 can be delayed (or avoided) until we see how well the
file size works as a partition - and it turns out to do a good job.
Of the 16,073 files, there are 14,652 different sizes; of these, 1,400
sizes have 2 matching files each, 10 sizes have 3 files each, and the
remainder have only a single file.
And it turns out that all 1,410 of those are genuine duplicates - i.e.
there are no cases of files which have the same size without actually
being the same; therefore size is a very effective discriminator for
photo files.
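The partition itself is just an array keyed by file size - something
like this sketch (again, the names are mine):

function findCandidates pList
   local tBySize, tSize, tCandidates
   set the itemDelimiter to tab
   repeat for each line tLine in pList
      -- pList holds the "path tab size" lines built earlier;
      -- collect the paths for each distinct size
      put item 1 of tLine & cr after tBySize[item 2 of tLine]
   end repeat
   repeat for each key tSize in tBySize
      -- two or more files of one size form a candidate group, to be
      -- confirmed (or not) by direct byte-for-byte comparison
      if the number of lines in tBySize[tSize] > 1 then
         put tBySize[tSize] & cr after tCandidates -- blank line between groups
      end if
   end repeat
   return tCandidates
end findCandidates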
Even better - running this simplified algorithm on my 16,000-photo
sample takes about 20 seconds on my aging MacBook Pro. So I can indeed
eliminate all those extra features in steps 2a, 4 and 5.
Part 3 of this series will describe what I did for step 6 above - i.e.
how to present this data to the user, how to make it easy to eliminate
any duplicates found, and how to make it hard to inadvertently delete
files you shouldn't.
Part 4 will (probably) describe an app for removing uninteresting photos.
And Part 5 will (perhaps) describe whether or how I found it necessary
to improve the image viewer app described in part 1. The increase in
average file size from 0.5MB to 24MB means that the time to transition
from one photo to the next has gone from "feels instant" to "hmmm, feels
fairly quick". I'll decide from using the app regularly over the next
week or two whether "fairly quick" is good enough, or whether it's worth
implementing pre-caching for the adjacent photo(s) to get back the
"instant" feel.
-- Alex.
P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any
day now. RSN. Promise :-)