sorting words ?

Mark Waddingham mark at livecode.com
Fri Dec 11 03:28:12 EST 2015


On 2015-12-11 07:56, Kay C Lan wrote:
> On Thu, Dec 10, 2015 at 4:38 PM, Mark Waddingham <mark at livecode.com> 
> wrote:
> 
>> 
>> The "word" chunk is not loosely implemented - it does precisely what
>> it is meant to do.
>
> Which of course is the reason why the sort container command has no
> problem if you sort by word on the right side of the equation - 'by
> word x of each'

Indeed, but remember that the 'right hand side' is the 'sort key' - it 
allows the parts which are to be sorted to be mapped to something else 
to do the sort. The point at issue here is how to split and then 
recombine the parts which are sorted, not what actually is used to 
perform the sort.

> Well LC's definition of what a word is isn't exactly universally
> accepted, but once you understand it, it's extremely powerful and
> saves a huge amount of effort. As LC has its own definition of what a
> word is then surely it could define exactly how it's going to output
> the final combine. More on this below.

Actually LC's definition of a word is a well-defined concept - it is 
essentially what you might call a 'shell token' as it is the same 
definition that (UNIX) shells use to process arguments:

    ls foo -- list directory foo
    ls "foo bar" -- list directory "foo bar"

Perhaps it shouldn't really have been called a 'word', which is why we 
added a 'segment' synonym for it in LiveCode 7, where we introduced 'trueWord' 
(which is closer to what people might actually consider to be a word, 
albeit still algorithmically defined).
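
To make the 'shell token' comparison concrete, here is a small Python 
sketch using the standard shlex module, which splits a string the way a 
POSIX shell groups its arguments. It is only an illustration of the 
grouping idea: the function name shell_tokens is made up for this 
example, and shlex strips the quotes and handles escapes in ways that 
need not match LiveCode's own rules.

```python
import shlex


def shell_tokens(command_line):
    """Split a command line the way a POSIX shell groups its arguments."""
    return shlex.split(command_line)


if __name__ == "__main__":
    # Whitespace separates tokens...
    print(shell_tokens('ls foo'))        # ['ls', 'foo']
    # ...unless it appears inside double quotes.
    print(shell_tokens('ls "foo bar"'))  # ['ls', 'foo bar']
```

The quoted phrase comes back as one token, which is the property the 
'word' (segment) chunk shares with shell argument parsing.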

> This is not quite true is it:
> 
> put "the,quick,brown," into tVar
> put the number of items in tVar into msg -- 3
> sort items of tVar
> put the number of items in tVar into msg -- 4
> 
> Now I don't wish to discuss why this is, I understand why it is, I'm
> OK as to why it is. As with LC's definition of what a word is, when
> you understand what is happening under the hood you can work around it
> or work it to your advantage. Of note in the above, the number of
> chars has remained the same.

Hehe - perhaps best not to open that particular can of worms. For what 
it's worth, I'd actually class that behavior as an anomaly as it breaks 
the logic of string lists - if there is a trailing delimiter, the 
trailing delimiter should be ignored but preserved: sorting 
"the,quick,brown," should result in "brown,quick,the," and not 
",brown,quick,the". (Indeed, I noticed this particular case hadn't been 
added to BZ - it has now: 
http://quality.livecode.com/show_bug.cgi?id=16588).
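
As a rough illustration of the two outcomes, the Python sketch below 
contrasts a naive split/sort/join with one that ignores but preserves a 
trailing delimiter. Both function names are invented for this example 
and neither is a description of how the engine is implemented; the 
sketch only mirrors the "the,quick,brown," case discussed above.

```python
def sort_items_naive(text, delimiter=","):
    """Sort items by a plain split/sort/join.

    A trailing delimiter produces an empty final item, which sorts first.
    """
    return delimiter.join(sorted(text.split(delimiter)))


def sort_items(text, delimiter=","):
    """Sort items, ignoring but preserving a single trailing delimiter."""
    trailing = text.endswith(delimiter)
    body = text[:-len(delimiter)] if trailing else text
    result = delimiter.join(sorted(body.split(delimiter)))
    return result + delimiter if trailing else result


if __name__ == "__main__":
    print(sort_items_naive("the,quick,brown,"))  # ,brown,quick,the
    print(sort_items("the,quick,brown,"))        # brown,quick,the,
```

In the second form the item count is the same before and after the 
sort, which is the string-list behaviour argued for above.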

> Now if you don't agree, and think it should come out some other way,
> that's OK, all that matters is whatever the output, it is consistent
> and published. LC could convert all tabs to spaces, it could remove
> all instances of multiple whitespace and replace it with a single
> space, I don't care, just as long as whatever it does is consistent
> and published. Just as some people don't think "New York" is one word,
> LC does, it's published that quoted phrases are counted as one word,
> and that's a very powerful thing.

I don't think I do agree with 'trying to do something sensible with the 
whitespace' as I don't really see why that would be useful. If you break 
down a string into a sequence of segments (I'll stop using word since it 
perhaps obfuscates the issue slightly ;)) then what use is the 
whitespace after that? Particularly if it has been reordered in some 
'arbitrary' way. (Here I mean 'arbitrary' in the sense that there are a 
great many choices one could make as to how one might 'do something 
with' the whitespace here and as such any one choice can be seen as 
arbitrary - I don't think there are any particularly logical arguments 
one could make as to why to favour one choice over another beyond 
personal taste and explicit specific use-case).

> So again, as LC can already sort words on the right side of the
> equation - sort xxxxx of tVar by word y of each, it seems only a minor
> step to make it possible on the left side of the equation - sort words
> of tVar. Obviously the sorting mechanism is in place it's just the
> actual presentation that needs a little thought - surely not that
> hard.

Yes - there's nothing particularly 'hard' about making sort act on 
segments, although it has nothing to do with the sort key part (as I've 
said before).

If we break sort down into the steps which are actually taken the 
choices become more clear:

1) split the things you want to sort into a list (numerically keyed 
array)

2) sort the elements of the list (via a sortKey if specified)

3) combine the list back into a string

Clearly (1) is well defined for segments - you can iterate over a string 
using 'segment x of' and construct a list of all the segments within it. 
Similarly (2) is well defined as at this point there is no need to 
ponder segments or any other 'text chunk' structure, since the things 
you want to sort have been neatly listed as separate entities. It is (3) 
which is where there is some freedom of choice.
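
The three steps can be sketched in Python as follows. This is a minimal 
illustration under stated assumptions, not the engine's code: the 
SEGMENT pattern is a simplified approximation of a segment (a 
double-quoted run, or a run of non-whitespace), sort_segments is a 
hypothetical name, and step 3 joins with a single space, which is just 
one of the possible choices.

```python
import re

# Approximation (an assumption): a segment is either a double-quoted run
# or a run of non-whitespace characters.
SEGMENT = re.compile(r'"[^"]*"|\S+')


def sort_segments(text, key=None):
    """Split text into segments, sort them, and recombine with spaces."""
    segments = SEGMENT.findall(text)    # step 1: split into a list
    ordered = sorted(segments, key=key)  # step 2: sort, optionally by key
    return " ".join(ordered)            # step 3: combine into a string


if __name__ == "__main__":
    print(sort_segments('zebra\t"New York"   apple'))
    # "New York" apple zebra
```

Whatever whitespace separated the segments originally is discarded at 
step 1, so only step 3 decides what appears between them afterwards.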

Basically, the choice one makes at (3) doesn't really matter as long as:

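-- count each segment before the sort
repeat for each segment x in tMyWords
   add 1 to tMyWordCount[x]
end repeat

sort segments of tMyWords

-- cancel out the counts using the segments seen after the sort
repeat for each segment y in tMyWords
   subtract 1 from tMyWordCount[y]
end repeat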

Ends up with tMyWordCount being an array where all elements are zero. 
i.e. You can iterate over the segments before the sort, and after the 
sort, and end up seeing exactly the same segments in exactly the same 
multiplicities (just in a different order).
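
The same invariant can be written as a multiset comparison. The Python 
sketch below assumes a simplified segment pattern (a double-quoted run, 
or a run of non-whitespace) and uses invented helper names; it only 
checks the property described above and says nothing about how the 
whitespace should be chosen.

```python
import re
from collections import Counter

# Approximation (an assumption): a segment is either a double-quoted run
# or a run of non-whitespace characters.
SEGMENT = re.compile(r'"[^"]*"|\S+')


def segment_counts(text):
    """Count how many times each segment occurs in text."""
    return Counter(SEGMENT.findall(text))


def preserves_segments(before, after):
    """True if both strings hold the same segments with the same counts."""
    return segment_counts(before) == segment_counts(after)


if __name__ == "__main__":
    before = 'b  a\t"c d" a'
    after = " ".join(sorted(SEGMENT.findall(before)))
    print(preserves_segments(before, after))        # True
    print(preserves_segments(before, 'a b "c d"'))  # False, one "a" lost
```

Any recombination rule that passes this check keeps the segments intact, 
whatever it puts between them.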

In the vein of 'KISS' (keep it simple, stupid) it therefore seems 
sensible to make the simplest choice for how to recombine the string 
after sorting - and I think that is to use a single space.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



