The Joy of Removing Features - Part 2: Finding / removing duplicate files / photos.
selander at tkf.att.ne.jp
Thu Aug 18 19:41:43 EDT 2016
Very enlightening. Thanks for taking the time to share this with us.
On 2016/08/19 8:20, Alex Tweedly wrote:
> Part 2 of a 4-part series on developing simple apps for photo management
> and viewing.
> [ previously ... Part 1 described the justification and development of a
> very simple photo viewing app ]
> The next issue to deal with is the run-away number of photos, and the
> amount of disk space taken up by them. I strongly suspect that is at
> least partly due to my casual (some would say "disorganized") approach
> to managing the photos, and the multiple computers they originated from
> and are kept on (my desktop, laptop, daughter's laptop, back-up disks,
> safe copies on other external drives, USB drives previously used to
> store / transfer folders of photos, etc.)
> So the next step is to find and eliminate (or at least reduce)
> duplicated photos. Of course, I could simply Google "remove duplicate
> photos mac" and follow some of the 382,000 resulting links - but where's
> the fun in that :-)
> At least some of those apps do, or claim to do, amazing things - find
> different resolution or different quality versions of the same photo,
> etc. - but I don't feel a need to look for those; I just need, initially
> at least, to find the simple, exact duplicates. To give some context, I
> have been using a sample subset of 16,000 out of my approx 55,000 photos;
> these are mostly low/med resolution (i.e. iPhone or old digital camera
> JPEGs, between 200 KB and 1.5 MB each). However, my new camera is rather
> more resource-hungry (JPEGs are 24 MB or so - hence the urgency to
> actually implement some of these ideas that I have been kicking around
> for a long time :-)
> I have a variety of schemes in mind to speed up the process, though each
> of them needs to be verified for effectiveness, or indeed necessity.
> The basic outline *was*
> 1. walk through to collect all folder names (i.e. the complete tree(s)
> within the folder(s) specified by the user)
> 2. visit each folder in turn to collect details of all (relevant) files
> - with optimizations for folders/files that haven't changed since the
> info was previously collected
> 3. partition the files by size; and then reduce the list of files to the
> potential duplicates
> 4. further reduce by file signature (i.e. a small sample of say 12 bytes
> from pre-specified locations)
> 5. get the MD5 hash of remaining files, and look for duplicates
> 6. present the data to the user (!?)
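
[Editor's sketch, not Alex's code.] Steps 1 to 4 of the outline above could look roughly like this in Python rather than LiveCode. The extension list, the function names and the evenly spaced sampling offsets are all assumptions made for illustration; the original only says "a small sample of say 12 bytes from pre-specified locations".

```python
import os
from collections import defaultdict

# Assumption: which extensions count as "relevant" files.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def collect_files(roots):
    """Steps 1-2: walk each root folder and yield (path, size) per photo."""
    for root in roots:
        for folder, _subfolders, names in os.walk(root):
            for name in names:
                if os.path.splitext(name)[1].lower() in PHOTO_EXTENSIONS:
                    path = os.path.join(folder, name)
                    yield path, os.path.getsize(path)


def partition_by_size(files):
    """Step 3: group paths by size, keeping only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}


def signature(path, size, samples=12):
    """Step 4: read one byte at each of `samples` evenly spaced offsets."""
    out = bytearray()
    with open(path, "rb") as f:
        for i in range(samples):
            f.seek((size * i) // samples)
            out += f.read(1)
    return bytes(out)
```

Calling partition_by_size(collect_files(["/path/to/photos"])) would give a dictionary from file size to the list of candidate duplicates of that size; the path here is only a placeholder.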
> However, some simple benchmarking suggested that this was unnecessarily
> complicated - i.e. I can again remove features, even before they have
> been specified or implemented. The task of detecting and avoiding
> redundant work in step 2a is not terribly complicated - but it's
> definitely the most brain-taxing part of the whole problem - and in any
> case, won't apply to the first time the app is used. So that part can be
> delayed at least until I find out how slow the process is - i.e.
> hopefully forever.
> The need for using MD5 hashes, rather than simply comparing the files
> completely, is also questionable. It turns out that calculating an MD5
> hash of a file takes roughly 10x as long as comparing that file to
> another identical one (i.e. the worst case for comparison - comparing to
> a differing file would complete more quickly). So step 5 can also be
> delayed (or avoided) until we determine how often it is likely we will
> be matching larger sets of files.
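
[Editor's sketch, not Alex's benchmark.] The two approaches being compared above can be written side by side in Python as below. The helper names are hypothetical, and the "roughly 10x" figure comes from Alex's LiveCode measurements; the ratio in another language or on another disk may well differ, so time_call is only there to let you measure it yourself.

```python
import filecmp
import hashlib
import time


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_by_hash(a, b):
    """Compare two files by hashing both of them."""
    return md5_of(a) == md5_of(b)


def same_by_bytes(a, b):
    """Compare two files directly; shallow=False forces a content check."""
    return filecmp.cmp(a, b, shallow=False)


def time_call(fn, *args, repeats=5):
    """Average wall-clock seconds per call of fn(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        filecmp.clear_cache()  # keep filecmp from reusing an earlier answer
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

Hashing still has one advantage that direct comparison lacks: a stored hash can be compared against many files without rereading the original, which is why it may matter once larger sets of same-size files turn up.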
> Similarly, step 4 can be delayed (or avoided) until we see how well the
> file size works as a partition - and it turns out to do a good job.
> Of the 16,073 files, there are 14,652 different sizes; of these, 1,400
> sizes have 2 matching files while 10 sizes have 3 files, and the
> remainder have only a single file.
> And it turns out that all 1,410 of those are genuine duplicates - i.e.
> there are no cases of files which have the same size without actually
> being the same; therefore size is a very effective discriminator for
> photo files.
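
[Editor's sketch, not Alex's code.] One way to check that same-size files really are duplicates, and to produce the kind of tally quoted above, is sketched below in Python. Both function names are hypothetical, and `groups` is assumed to be a dictionary mapping a file size to the list of paths with that size.

```python
import filecmp
from collections import Counter


def confirmed_duplicates(paths):
    """Split one same-size group into clusters of byte-identical files."""
    clusters = []
    for path in paths:
        for cluster in clusters:
            # Compare against one representative of each existing cluster.
            if filecmp.cmp(cluster[0], path, shallow=False):
                cluster.append(path)
                break
        else:
            clusters.append([path])
    # A cluster of one is a file that merely shares its size with others.
    return [cluster for cluster in clusters if len(cluster) > 1]


def tally(groups):
    """Count how many sizes are shared by 2 files, by 3 files, and so on."""
    return Counter(len(paths) for paths in groups.values())
```

If every same-size group comes back from confirmed_duplicates as a single cluster containing all of its files, then size alone was a sufficient discriminator for that collection, which is what Alex found for his sample.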
> Even better - running this simplified algorithm on my 16,000 sample
> takes about 20 seconds on my aging Macbook Pro. So I can indeed
> eliminate all those extra features in steps 2a, 4 and 5.
> Part 3 of this series will describe what I did for step 6 above - i.e.
> how to present this data to the user, how to make it easy to eliminate
> any duplicates found, and how to avoid making it easy to inadvertently
> delete files you shouldn't.
> Part 4 will (probably) describe an app for removing uninteresting photos.
> And Part 5 will (perhaps) describe whether or how I found it necessary
> to improve the image viewer app described in part 1. The increase in
> average file size from 0.5 MB to 24 MB means that the time to transition
> from one photo to the next has gone from "feels instant" to "hmmm, feels
> fairly quick". I'll decide from using the app regularly over the next
> week or two whether "fairly quick" is good enough, or whether it's worth
> implementing pre-caching for the adjacent photo(s) to get back the
> "instant" feel.
> -- Alex.
> P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any
> day now. RSN. Promise :-)
> use-livecode mailing list
> use-livecode at lists.runrev.com