sorting words ?

Mark Waddingham mark at livecode.com
Sat Dec 12 07:48:49 EST 2015


On 2015-12-12 02:39, Kay C Lan wrote:
> So the 3 possibilities are:
> 
> 1) The example I gave in my last post. However many tabs and spaces would
> remain the same, they would be ordered, tabs before spaces, segments
> (words) would be placed between them. There would only be one white space
> between each segment (word) so some segments might have tabs between them
> and others spaces. You might end up with multiple tab/spaces at the
> beginning of the output, just as you end up with multiple empty items at
> the beginning of a sorted List if there are multiple empty items. This is
> ugly.

I think this one perhaps rules itself out: it does (as you say) seem 
'ugly', and it also reorders the whitespace into a fixed but arbitrary 
order, whatever you might be trying to achieve. It doesn't preserve the 
structure between parts, which is perhaps what you would want if you 
intend to do 'something' with the whitespace.

> 2) The List is outputted with a single space between each segment. This
> would mean that if there happened to be tabs or multiple spaces between
> certain segments, these would be removed/converted. This is helpful.

This is certainly the simplest option (implementation-wise); however, it 
does 'hide' the fact that you are losing information when you do a 'sort 
words'.

> 3) A straight reshuffle, where the actual segments are reordered whilst
> preserving the white space location:
> 
> [tab][tab]Mark[space][space]Geoff[tab]Kevin[space][tab]Richard[space][space]
> 
> would become:
> 
> [tab][tab]Geoff[space][space]Kevin[tab]Mark[space][tab]Richard[space][space]
> 
> This last case, although less helpful to me, would arguably be the computer
> logical thing to do, all you've done is asked to sort the segments (words);
> the number of characters remains the same, the location of the white space
> has remained the same, the only thing that has changed is the order in
> which the words appear - and that's what you asked for. When you think
> about it, that's really all that sort by line or item does, it leaves the
> CRs and commas in place and just shuffles things about.

This approach occurred to me the other day but I couldn't think of a 
use-case at the time. However, your example of processing text-structured 
tables and such does suggest that it could well be useful in some 
circumstances. It also has the advantage of being the natural extension 
of what sorting by line or item delimiter (should) do - preserving a 
trailing delimiter follows directly from the idea that sort does the 
following:

1) Find the ranges of the things you want to sort in the string.

2) Compute the new order of the substrings (from the ranges) based on 
the requested sort.

3) Rebuild the string replacing the original ranges in the original 
string with the reordered substrings.
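
As a rough illustration of those three steps, here is a minimal sketch in 
Python (not LiveCode, and not how the engine implements sort). The function 
name sort_words_in_place is hypothetical, and it assumes a 'word' is any 
run of non-whitespace characters, which ignores LiveCode's treatment of 
quoted strings and its default case-insensitive comparison.

```python
import re

def sort_words_in_place(text, key=None):
    """Sort whitespace-delimited words, leaving every whitespace run in place."""
    # Step 1: find the ranges of the things to sort (assumed: runs of non-whitespace).
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    # Step 2: compute the new order of the substrings taken from those ranges.
    ordered = sorted((text[a:b] for a, b in spans), key=key)
    # Step 3: rebuild the string, copying everything between ranges unchanged.
    out, pos = [], 0
    for (a, b), word in zip(spans, ordered):
        out.append(text[pos:a])  # original whitespace before this slot
        out.append(word)         # reordered word dropped into the slot
        pos = b
    out.append(text[pos:])       # trailing whitespace, if any
    return "".join(out)

sample = "\t\tMark  Geoff\tKevin \tRichard  "
print(repr(sort_words_in_place(sample)))
```

Because the output is assembled from the input's own ranges, the character 
count and every whitespace run survive; only the words move between slots.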

It also means that whether or not a text chunk is 'sortable' (in the 
sense I described in a previous email) is immaterial as the output 
string is directly derived from the input string. (There would still be 
the caveat that you might break the invariants I previously described 
though for some chunks - such as sentence).

> All LC has to do is pick one, implement it and then publish it. If people
> don't like the choice then they have to roll their own, but they have to
> roll their own now anyway. Whilst I'd think most people needing to sort
> words would like the 'benefits' of option 2, I hate to say it, but I think
> option 3 would be the 'safer' road LC could go down.

It is possible to have your cake and eat it here.

We could have a string sort (as we do now), which does its best to 
preserve original structure (as described in your option 3) - as you say 
this would do *precisely* what you asked, but you would have to be aware 
of some edge cases which might bite you.

In addition we could add explicit chunk splitting and combine 
operations, and a sort which could act on a numerically keyed array. In 
this case, option 2 becomes:

   split tValue by word
   sort tValue
   combine tValue using space

Here you would be able to choose explicitly what delimiter you want to 
use in the output string.
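
To make the split / sort / combine idea concrete, here is a minimal sketch 
in Python rather than in the proposed LiveCode syntax. The function name 
sort_words_normalized is hypothetical, and Python's whitespace split only 
approximates LiveCode's word chunk.

```python
def sort_words_normalized(text, delimiter=" "):
    """Option 2: sort the words and rejoin them with one explicit delimiter."""
    words = text.split()           # 'split by word': whitespace runs are discarded
    words.sort()                   # 'sort': plain ascending text order
    return delimiter.join(words)   # 'combine using <delimiter>'

sample = "\t\tMark  Geoff\tKevin \tRichard  "
print(sort_words_normalized(sample))
print(repr(sort_words_normalized(sample, "\t")))
```

The loss of the original whitespace is now visible in the code: it happens 
at the split step, and the caller states what replaces it at the combine 
step.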

If we made it so that you could do:

   split <expr> by item
   combine <expr> by item

Then 'sort string' for items would actually be:

   split tValue by item
   sort tValue
   combine tValue by item

The subtlety here is that the 'by item' forms would understand trailing 
delimiter rules (which is essentially that if your string-list's last 
item is empty, then you must have a trailing delimiter). Note that you 
can only have 'combine' for strict delimited chunks (which item and line 
are) for the reasons we have been discussing - there isn't an 'obvious' 
choice for the delimiter for things like word.
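
The trailing delimiter rule can be sketched as follows. This is Python, 
not LiveCode, and the helper names split_items, combine_items and 
sort_items are hypothetical; the sketch only encodes the rule as stated 
above (a single trailing delimiter terminates the list, and an empty last 
item requires one).

```python
def split_items(text, delimiter=","):
    """Split a delimited string, treating one trailing delimiter as a terminator."""
    if text == "":
        return []
    parts = text.split(delimiter)
    if parts[-1] == "":
        parts.pop()  # "a,b," is the two items "a" and "b", not three
    return parts

def combine_items(items, delimiter=","):
    """Join items, adding a trailing delimiter only when the last item is empty."""
    text = delimiter.join(items)
    if items and items[-1] == "":
        text += delimiter  # otherwise the empty last item would vanish on re-split
    return text

def sort_items(text, delimiter=","):
    """'sort string' for items expressed as split, sort, combine."""
    return combine_items(sorted(split_items(text, delimiter)), delimiter)

print(sort_items("b,,a"))
print(repr(combine_items(["a", ""])))
```

With these two rules a list survives a combine followed by a split 
unchanged, including lists whose last item is empty.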

Of course, I've just realized that the proposed invariant rules for sort 
don't even hold for item and line - since you can now have 
multi-character item and line delimiters:

   set the itemDelimiter to ",c,"
   get "c,foo,c,bar,c"
   sort items of it
   put it

Here, with the given delimiter, the input string breaks down into:
   [c,foo] [,c,] [bar,c]
Which, when sorted becomes:
   [bar,c] [,c,] [c,foo]
And then recombined gives you:
   bar,c,c,c,foo
Which breaks up as:
   [bar] [,c,] [c] [,c,] [foo]

This breaks the proposed invariant rules: two items went in, but the 
sorted string contains three. Thus, the invariant argument for 
determining what 'sort' should do is essentially useless already - it 
doesn't generalize to the features we've already added. Therefore, I 
think structure preservation (option 3) is definitely winning in terms 
of 'underlying logic'.
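
The multi-character delimiter breakdown can be checked mechanically. The 
sketch below uses Python's str.split and str.join as stand-ins for 
LiveCode's item handling, which is an assumption; it reproduces the same 
pieces for this particular string.

```python
delimiter = ",c,"
original = "c,foo,c,bar,c"

items = original.split(delimiter)         # the two items going into the sort
resorted = delimiter.join(sorted(items))  # sort, then recombine with the delimiter
reparsed = resorted.split(delimiter)      # what the sorted string now breaks into

print(items)
print(resorted)
print(reparsed)
```

The item count changes from two to three because the delimiter overlaps 
with the content of the items once they sit next to each other in a 
different order.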

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



