The Joy of Removing Features - Part 3: Presenting the duplicate files / photos info to the user.

Alex Tweedly alex at tweedly.net
Wed Sep 7 05:10:52 EDT 2016


Update to Part 2.

Part 2 of this series described a scheme for collecting the info about 
duplicate photos. This included some justification for not bothering to 
collect or store md5Hash info for each (relevant) file - and it turned 
out that my investigative / benchmark code was misleading. There were 
two problems - one was a simple bug of mine, which led to much shorter 
apparent times for some operations, while the other was misleading 
timings caused by a significant improvement in LC8 compared to earlier 
versions.

After correcting for both of these, the justification became much 
weaker - but the conclusion remained unchanged.
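(The stack itself is LiveCode, but as a language-neutral illustration of the 
trade-off being weighed, here is a minimal Python sketch of the usual 
size-first duplicate grouping. It also shows why the md5 step can often be 
skipped: the size pre-filter eliminates most candidates before any hashing 
happens. All names here are mine, not from the actual stack.)

```python
import hashlib
from pathlib import Path

def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def group_by_size_then_hash(paths):
    """Group candidate duplicates: same size first, then same md5."""
    by_size = {}
    for p in paths:
        by_size.setdefault(Path(p).stat().st_size, []).append(p)
    dupes = {}
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # a unique size can never be a duplicate
        for p in group:
            dupes.setdefault(file_md5(p), []).append(p)
    # keep only digests shared by two or more files
    return {h: g for h, g in dupes.items() if len(g) > 1}
```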


Part 3 - how to present this data to the user.

There are many possibilities, from the simplest (long list of duplicate 
files path names - simple but unusable !!), to the complex (coloured 
context diffs of the directory structures).

If LC had a built-in "accordion" view or tree view with 
expanding/contracting disclosure triangles, I'd probably use that.

However, it doesn't (yet), so I decided to stick with the really simple. 
But before describing it, let's talk a little bit about how/why all 
these duplicate photos have arisen in the first place.

1. You import a bunch of photos from a camera, choose NOT to delete them 
from the camera, and then later re-import the same photos (possibly 
along with some additional ones that have been taken in the meantime).

1a. if you do this into the same folder (e.g. same import program, same 
settings), you probably get a message asking if you want to "replace, 
skip or keep both" (or something like that). Of course, you choose "keep 
both" :-)    - and now you have a set of photos called e.g. IMGP0021 and 
IMGP0021-1, etc.   Note you can get this same effect by importing the 
photos twice, on different laptops - and subsequently copying / merging 
the folders between them.

1b. if you do this into different folders (e.g. once with say Picasa, 
once with Lightroom - with their differing default naming schemes), you 
might finish up with two folders (say "2015-06-01" and "01 June 2016"). 
And again these could be exact duplicates, or the latter could contain 
additional photos taken later. Or one or other folder could have had 
some 'sidecar' files added (e.g. Thumbs.db for some viewing apps, etc).

2. You want to copy some of the photos from a folder to another machine. 
First you copy the whole folder, then delete some files you know you 
don't need - then you can put this tidied folder onto a USB stick, or 
use an FTP app, or ....  and in the end, you forget to remove this temp 
copy.

3. Almost any other thing you can imagine ...


So I decided to keep the output very simple, and only try to distinguish 
two cases (1a and all the rest).  See below for the gory details, if you 
want :-)

This gave me a rather long, but usable set of output descriptions. It 
then took me less than an hour of slightly tedious work (one IDE window, 
one terminal/shell window and two Finder windows ...) to eliminate 
roughly 20 fully duplicated directories, plus an additional 20-30 
partially duplicated ones and a lot of file-name duplicated ones (i.e. 
case 1a above) - removing just over 10% of the total number of files and 
disk space in use.


Design choices.
If I had been developing an app to market (either for sale, or as a 
freebie for people to just *use*), it would have been an easy choice to 
spend an extra day programming, provide some in-built display and 
deletion options - and reduce the "tedious one hour" to a "mildly 
annoying 10 minutes" of time to remove the files. On the other hand, if 
this had been just a tool for myself to find and eliminate these 4-5000 
files, I'd have stopped programming a day or two earlier and spent two 
tedious hours doing "rm *-2.jpg" etc., and achieved the same end 
result.  But because I wanted to make the stack available for others, 
and I wanted to think about and write about the choices I make along the 
way - but not make it anything like a polished app - this middle-path 
feels right. But the point is that there is no single *right* answer to 
the question of how much development time is justified until you are 
clear on your target and purpose for doing the development in the first 
place.

Now on to the interesting challenge of finding and dealing with 
"non-interesting" photos ...
  -- Alex.

P.S. the gory details of the output format I chose in the end.

The first case is handled by reporting

Folder <foldername> has N files, with M self-matches, followed by a 
paired list of matching files

(i.e. M cases of duplicated files (directly) within the same folder out 
of N total files).

e.g.

Folder /Users/alextweedly/Dropbox (Personal)/Pictures/2014/2014-06-18

has 38 files, and 18 self matches

IMGP0128-2.JPG IMGP0128.JPG

IMGP0118-2.JPG IMGP0118.JPG

IMGP0129-2.JPG IMGP0129.JPG

  .....

IMGP0120-2.JPG IMGP0120.JPG
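(As an illustration only - this is Python, not the stack's LiveCode, and 
I'm matching by content hash where the actual stack may well compare by 
size or name; how "M self matches" is counted, per pair or per file, is 
also a guess - a report in roughly this style could be produced like so:)

```python
import hashlib
from pathlib import Path

def report_self_matches(folder):
    """Report duplicate files inside a single folder, in roughly the
    'Folder ... has N files, and M self matches' style shown above.
    Counts one self-match per duplicated group (an assumption)."""
    files = sorted(p for p in Path(folder).iterdir() if p.is_file())
    by_hash = {}
    for p in files:
        digest = hashlib.md5(p.read_bytes()).hexdigest()
        by_hash.setdefault(digest, []).append(p.name)
    groups = sorted(g for g in by_hash.values() if len(g) > 1)
    lines = [f"Folder {folder}",
             f"has {len(files)} files, and {len(groups)} self matches"]
    lines += [" ".join(sorted(g)) for g in groups]
    print("\n".join(lines))
    return lines
```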


All other cases report as

Folders <foldername1> and <foldername2> have N1 and N2 files with NN 
matches, followed by a list of matches.

Generally this will result in something like

Folders /Users/alextweedly/Dropbox (Personal)/Pictures/2011/2011-09-13 
/Users/alextweedly/Dropbox (Personal)/Pictures/2011/9 Oct 2011

have 9 and 54 files, and 9 matches

IMG_0236.JPG IMG_0236.JPG

IMG_0237.JPG IMG_0237.JPG

....


In this particular example, the names all match, there are 9 matches for 
the 9 files in "folder1" - so as a user, you then need to decide whether

(a) you should just delete folder1 and all its contents

(b) there are other duplicated folders/files (e.g. 2011-09-12, 
2011-09-10, etc.) which between them contain all the other files from 
the 54 in "9 Oct 2011" and that's the one you should completely delete.

In fact, that's what I did. I'm sticking with Lightroom rather than the 
other photo management apps - so the 2011-09-13 style of folder naming 
is the one I'd prefer to use.
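(For completeness, the two-folder report can be sketched the same way - 
again in Python rather than the stack's LiveCode, matching on content 
hashes as an assumption about how the comparison is done:)

```python
import hashlib
from pathlib import Path

def _hashes(folder):
    """Map each file name in a folder to its md5 digest."""
    return {p.name: hashlib.md5(p.read_bytes()).hexdigest()
            for p in Path(folder).iterdir() if p.is_file()}

def report_folder_matches(folder1, folder2):
    """Report content matches between two folders, in roughly the
    'Folders ... have N1 and N2 files, and NN matches' style above."""
    h1, h2 = _hashes(folder1), _hashes(folder2)
    # invert folder2's map so each digest points at a file name there
    by_digest = {d: name for name, d in h2.items()}
    matches = [(n1, by_digest[d])
               for n1, d in sorted(h1.items()) if d in by_digest]
    lines = [f"Folders {folder1} and {folder2}",
             f"have {len(h1)} and {len(h2)} files, and {len(matches)} matches"]
    lines += [f"{n1} {n2}" for n1, n2 in matches]
    print("\n".join(lines))
    return lines
```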







More information about the use-livecode mailing list