The Joy of Removing Features - Part 3: Presenting the duplicate files / photos info to the user.
Alex Tweedly
alex at tweedly.net
Wed Sep 7 05:10:52 EDT 2016
Update to Part 2.
Part 2 of this series described a scheme for collecting the info about
duplicate photos. This included some justification for not bothering to
collect or store md5Hash info for each (relevant) file - and it turned
out that my investigative / benchmark code was flawed. There were two
problems - one was a simple bug of mine, which made some operations
appear much faster than they really are, while the other was misleading
timings caused by a significant speed improvement in LC 8 compared to
earlier versions. After correcting both of these, the justification
became much weaker - but the conclusion remained unchanged.
Part 3 - how to present this data to the user.
There are many possibilities, from the simplest (a long list of duplicate
file path names - simple but unusable!) to the complex (coloured
context diffs of the directory structures).
If LC had a built-in "accordion" view or tree view with
expanding/contracting disclosure triangles, I'd probably use that.
However, it doesn't (yet), so I decided to stick with the really simple.
But before describing it, let's talk a little bit about how/why all
these duplicate photos have arisen in the first place.
1. You import a bunch of photos from a camera, choose NOT to delete them
from the camera, and then later re-import the same photos (possibly
along with some additional ones that have been taken in the meantime).
1a. If you do this into the same folder (e.g. same import program, same
settings), you probably get a message asking if you want to "replace,
skip or keep both" (or something like that). Of course, you choose "keep
both" :-) - and now you have a set of photos called e.g. IMGP0021 and
IMGP0021-1, etc. Note you can get this same effect by importing the
photos twice, on different laptops - and subsequently copying / merging
the folders between them.
1b. If you do this into different folders (e.g. once with say Picasa,
once with Lightroom - with their differing default naming schemes), you
might finish up with two folders (say "2015-06-01" and "01 June 2016").
And again these could be exact duplicates, or the latter could contain
additional photos taken later. Or one or other folder could have had
some 'sidecar' files added (e.g. Thumbs.db for some viewing apps, etc).
2. You want to copy some of the photos from a folder to another machine.
First you copy the whole folder, then delete some files you know you
don't need - then you can put this tidied folder onto a USB stick, or
use an FTP app, or .... and in the end, you forget to remove this temp
copy.
3. Almost any other thing you can imagine ...
So I decided to keep the output very simple, and only try to distinguish
two cases (1a and all the rest). See below for the gory details, if you
want :-)
This gave me a rather long, but usable set of output descriptions. It
then took me less than an hour of slightly tedious work (one IDE window,
one terminal/shell window and two Finder windows ...) to eliminate
roughly 20 fully duplicated directories, plus an additional 20-30
partially duplicated ones and a lot of file-name duplicated ones (i.e.
case 1a above) - removing just over 10% of the total number of files and
disk space in use.
Design choices.
If I had been developing an app to market (either for sale, or as a
freebie for people to just *use*), it would have been an easy choice to
spend an extra day programming to provide some built-in display and
deletion options - and so reduce the "tedious one hour" of removing the
files to a "mildly annoying 10 minutes". On the other hand, if
this had been just a tool for myself to find and eliminate these 4-5000
files, I'd have stopped programming a day or two earlier and spent two
tedious hours doing "rm *-2.jpg" etc., and achieved the same end
result. But because I wanted to make the stack available for others,
and I wanted to think about and write about the choices I make along the
way - but not make it anything like a polished app - this middle-path
feels right. But the point is that there is no single *right* answer to
the question of how much development time is justified until you are
clear on your target and purpose for doing the development in the first
place.
Now on to the interesting challenge of finding and dealing with
"non-interesting" photos ...
-- Alex.
P.S. the gory details of the output format I chose in the end.
The first case is handled by reporting
Folder <foldername> has N files, with M self-matches, followed by a
paired list of the matching files
(i.e. M cases of duplicated files (directly) within the same folder out
of N total files).
e.g.
Folder /Users/alextweedly/Dropbox (Personal)/Pictures/2014/2014-06-18
has 38 files, and 18 self matches
IMGP0128-2.JPG IMGP0128.JPG
IMGP0118-2.JPG IMGP0118.JPG
IMGP0129-2.JPG IMGP0129.JPG
.....
IMGP0120-2.JPG IMGP0120.JPG
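For anyone curious, the same-folder scan behind that report boils down
to something like the following. This is a rough sketch only, not the
actual code from the stack - the function name and the size-then-contents
comparison are just my illustration here:

-- rough sketch only (not the stack's actual code): report pairs of files
-- within one folder whose contents are byte-for-byte identical
function findSelfMatches pFolder
   local tLine, tName, tSize, tBySize, tFile, tData, tSeen, tPairs
   set the defaultFolder to pFolder
   set the caseSensitive to true   -- so binary contents compare exactly
   -- "the detailed files" lists name,size,... per file; group the names by size
   repeat for each line tLine in the detailed files
      put URLDecode(item 1 of tLine) into tName
      put item 2 of tLine into tSize
      put tName & return after tBySize[tSize]
   end repeat
   -- only files of the same size can possibly be duplicates
   repeat for each key tSize in tBySize
      if the number of lines of tBySize[tSize] < 2 then next repeat
      put empty into tSeen   -- keyed by file contents, reset per size group
      repeat for each line tFile in tBySize[tSize]
         put URL ("binfile:" & pFolder & "/" & tFile) into tData
         if tSeen[tData] is empty then
            put tFile into tSeen[tData]
         else
            -- a later file with identical contents: pair it with the first one
            put tFile & tab & tSeen[tData] & return after tPairs
         end if
      end repeat
   end repeat
   return tPairs
end findSelfMatches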
All other cases report as
Folders <foldername1> and <foldername2> have N1 and N2 files with NN
matches, followed by a list of matches.
Generally this will result in something like
Folders /Users/alextweedly/Dropbox (Personal)/Pictures/2011/2011-09-13
/Users/alextweedly/Dropbox (Personal)/Pictures/2011/9 Oct 2011
have 9 and 54 files, and 9 matches
IMG_0236.JPG IMG_0236.JPG
IMG_0237.JPG IMG_0237.JPG
....
In this particular example the names all match, and there are 9 matches
for the 9 files in "folder1" - so as a user, you then need to decide whether
(a) you should just delete folder1 and all its contents, or
(b) there are other duplicated folders/files (e.g. 2011-09-12,
2011-09-10, etc.) which between them contain all the other files from
the 54 in "9 Oct 2011", in which case that's the one you should
completely delete.
In fact, that's what I did. I'm sticking with Lightroom rather than the
other photo management apps - so the 2011-09-13 style of folder naming
is the one I'd prefer to use.
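And for completeness, turning a collected match-list into that two-folder
report is just string building - roughly like this (again only a sketch,
with made-up parameter names, not the stack's actual code):

-- rough sketch: format the two-folder report, given the two folder paths,
-- their file counts, and a return-delimited list of "file1 <tab> file2" matches
function crossFolderReport pFolder1, pFolder2, pCount1, pCount2, pMatches
   local tReport
   put "Folders" && pFolder1 & return & pFolder2 & return into tReport
   put "have" && pCount1 && "and" && pCount2 && "files, and" && \
         the number of lines of pMatches && "matches" & return after tReport
   put pMatches & return after tReport
   return tReport
end crossFolderReport

Fed the two 2011 folders above, that would produce the kind of output
shown in the example.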