The Joy of Removing Features - Part 2: Finding / removing duplicate files / photos.

Phil Davis revdev at pdslabs.net
Fri Aug 19 00:04:50 EDT 2016


I agree. This peek behind the curtain is a rare and very beneficial 
thing for the community. Thanks Alex!

Phil Davis


On 8/18/16 4:41 PM, Tim Selander wrote:
> Very enlightening. Thanks for taking the time to share this with us.
>
> Tim Selander
> Tokyo, Japan
>
> On 2016/08/19 8:20, Alex Tweedly wrote:
>>
>> Part 2 of a 4-part series on developing simple apps for photo management
>> and viewing.
>>
>> [ previously ... Part 1 described the justification and development of a
>> very simple photo viewing app ]
>>
>> The next issue to deal with is the run-away number of photos, and the
>> amount of disk space taken up by them. I strongly suspect that is at
>> least partly due to my casual (some would say "disorganized") approach
>> to managing the photos, and the multiple computers they originated from
>> and are kept on (my desktop, laptop, daughter's laptop, back-up disks,
>> safe copies on other external drives, USB drives previously used to
>> store / transfer folders of photos, etc.)
>>
>> So the next step is to find and eliminate (or at least reduce)
>> duplicated photos. Of course, I could simply Google "remove duplicate
>> photos mac" and follow some of the 382,000 resulting links - but where's
>> the fun in that :-)
>>
>> At least some of those apps do, or claim to do, amazing things - find
>> different resolution or different quality versions of the same photo,
>> etc. - but I don't feel a need to look for those; I just need, initially
>> at least, to find the simple, exact duplicates.  To give some context, I
>> have been using a sample subset of 16,000 out of my approx 55,000 photos;
>> these are mostly low/med resolution (i.e. iPhone or old digital camera
>> JPEGs, between 200 KB and 1.5 MB each). However, my new camera is rather
>> more resource-hungry (JPEGs are 24 MB or so) - hence the urgency to
>> actually implement some of these ideas that I have been kicking around
>> for a long time :-)
>>
>> I have a variety of schemes in mind to speed up the process, though each
>> of them needs to be verified for effectiveness, or indeed necessity.
>>
>> The basic outline *was*
>>
>> 1. walk through to collect all folder names (i.e. the complete tree(s)
>> within the folder(s) specified by the user)
>>
>> 2. visit each folder in turn to collect details of all (relevant) files
>> 2a. with optimizations for folders/files that haven't changed since the
>> info was previously collected
>>
>> 3. partition the files by size; and then reduce the list of files to the
>> potential duplicates
>>
>> 4. further reduce by file signature (i.e. a small sample of say 12 bytes
>> from pre-specified locations)
>>
>> 5. get the md5hash of remaining files, and look for duplicates
>>
>> 6. present the data to the user (!?)
>>
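Steps 1 to 3 of the outline above can be sketched roughly as follows. This is only an illustration in Python, not the LiveCode the original app is written in, and the names `PHOTO_EXTENSIONS`, `collect_photo_files` and `partition_by_size` are hypothetical. The set of "relevant" extensions is an assumption, since the post does not list them.

```python
import os
from collections import defaultdict

# Assumed set of "relevant" extensions; the post does not say which are used.
PHOTO_EXTENSIONS = {".jpg", ".jpeg"}

def collect_photo_files(roots):
    """Steps 1 and 2: walk each root folder and yield (path, size) per photo."""
    for root in roots:
        for folder, _subfolders, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in PHOTO_EXTENSIONS:
                    path = os.path.join(folder, name)
                    yield path, os.path.getsize(path)

def partition_by_size(files):
    """Step 3: group paths by size, keeping only sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for path, size in files:
        by_size[size].append(path)
    return {size: paths for size, paths in by_size.items() if len(paths) > 1}
```

Calling `partition_by_size(collect_photo_files(["/some/folder"]))` would return only the potential duplicates, keyed by file size.
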
>> However, some simple benchmarking suggested that this was unnecessarily
>> complicated - i.e. I can again remove features, even before they have
>> been specified or implemented. The task of detecting and avoiding
>> redundant work in step 2a is not terribly complicated - but it's
>> definitely the most brain-taxing part of the whole problem - and in any
>> case, won't apply to the first time the app is used. So that part can be
>> delayed at least until I find out how slow the process is - i.e.
>> hopefully forever.
>>
>> The need for using MD5 hashes, rather than simply comparing the files
>> completely, is also questionable. It turns out that calculating an MD5
>> hash of a file takes roughly 10x as long as comparing that file to
>> another identical one (i.e. the worst case for comparison - comparing to
>> a differing file would complete more quickly). So step 5 can also be
>> delayed (or avoided) until we determine how often it is likely we will
>> be matching larger sets of files.
>>
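For illustration, the two approaches weighed above (hashing each file versus comparing two files directly) might look like this in Python. The function names are hypothetical and the sketch makes no claim about relative timings, which will depend on the language and platform.

```python
import filecmp
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_by_hash(path_a, path_b):
    """Step 5 style check: two files match if their MD5 digests match."""
    return md5_of_file(path_a) == md5_of_file(path_b)

def same_by_content(path_a, path_b):
    """Direct check: shallow=False forces a comparison of file contents
    instead of trusting the os.stat() signature alone."""
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Direct comparison is the simpler choice when each size group holds only two or three files. Hashing reads each file once, so it would tend to matter more if large groups of same-sized files had to be matched against each other.
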
>> Similarly, step 4 can be delayed (or avoided) until we see how well the
>> file size works as a partition - and it turns out to do a good job.
>>
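Had step 4 been needed, a file signature of the kind described (a small sample of bytes from pre-specified locations) could be sketched as below. The choice of 12 evenly spaced offsets is an assumption made for the example; the post does not say which locations would have been sampled, and `file_signature` is a hypothetical name.

```python
def file_signature(path, size, sample_count=12):
    """Read one byte from each of sample_count evenly spaced offsets.

    Files of the same size whose signatures differ cannot be identical,
    so the signature is a cheap filter before any full comparison.
    """
    if size == 0:
        return b""
    offsets = [(i * size) // sample_count for i in range(sample_count)]
    sample = bytearray()
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            sample += f.read(1)
    return bytes(sample)
```
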
>> Of the 16,073 files, there are 14,652 different sizes; of these, 1,400
>> sizes have 2 matching files while 10 sizes have 3 files, and the
>> remainder have only a single file.
>>
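A tally like the one above (how many sizes are shared by one, two or three files) can be produced with a short histogram. This is a minimal Python sketch with a hypothetical function name; it takes the list of file sizes as input.

```python
from collections import Counter

def group_size_histogram(sizes):
    """Map 'number of files sharing a size' to 'how many sizes have that count'."""
    files_per_size = Counter(sizes)           # size -> number of files
    return Counter(files_per_size.values())   # count -> number of sizes

# Small made-up example: sizes 10 and 30 are shared, 20 and 40 are unique.
example = group_size_histogram([10, 10, 20, 30, 30, 30, 40])
```

In the made-up example, two sizes have a single file, one size has two files and one size has three.
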
>> And it turns out that all 1,410 of those are genuine duplicates - i.e.
>> there are no cases of files which have the same size without actually
>> being the same; therefore size is a very effective discriminator for
>> photo files.
>>
>> Even better - running this simplified algorithm on my 16,000-photo sample
>> takes about 20 seconds on my aging MacBook Pro. So I can indeed
>> eliminate all those extra features in steps 2a, 4 and 5.
>>
>> Part 3 of this series will describe what I did for step 6 above - i.e.
>> how to present this data to the user, how to make it easy to eliminate
>> any duplicates found, and how to make it hard to inadvertently delete
>> files you shouldn't.
>>
>> Part 4 will (probably) describe an app for removing uninteresting 
>> photos.
>>
>> And Part 5 will (perhaps) describe whether or how I found it necessary
>> to improve the image viewer app described in part 1. The increase in
>> average file size from 0.5 MB to 24 MB means that the time to transition
>> from one photo to the next has gone from "feels instant" to "hmmm, feels
>> fairly quick". I'll decide from using the app regularly over the next
>> week or two whether "fairly quick" is good enough, or whether it's worth
>> implementing pre-caching for the adjacent photo(s) to get back the
>> "instant" feel.
>>
>> -- Alex.
>> P.S. I *will* get these apps cleaned up and onto revOnline. Soon. Any
>> day now. RSN. Promise :-)
>>
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>
>

-- 
Phil Davis
