sorting words ?

Mark Waddingham mark at livecode.com
Thu Dec 10 03:38:46 EST 2015


On 2015-12-09 22:27, dunbarx at aol.com wrote:
> The "word" chunk has always been loosely implemented, in that it can
> hardly be considered a truly delimited chunk at all, like an item or a
> line is. Even though it seems to have been included in the "family" of
> chunks.

The "word" chunk is not loosely implemented - it does precisely what it 
is meant to do.

The 'item' and 'line' chunks can be classified as 'delimited' chunks in 
the sense that items in the string are divided up into an alternating 
sequence of part and delimiter:

a,b,c => [a] [,] [b] [,] [c]

In the case of 'word', though, what the 'parts' are and the 'delimiters' 
are is slightly more complex:

the quick    "brown fox" => [the] [ ] [quick] [    ] ["brown fox"]

Indeed, 'word' is perhaps what you might call a tokenized chunk, in that 
it is essentially matching parts and delimiters to distinct patterns:

delimiter pattern is matched to the regular expression [ \n\t]+

part pattern is matched to the regular expression (\"[^\"]*\"|[^ 
\n\t\"]+)

So, the 'inobviousness' I referred to is just the fact that there is not 
a single choice for what the delimiter should be when 'recombining' the 
string.

The 'sort' command can actually be broken down into more fundamental 
operations:

sort chunks of tVariable by func(each)

Is actually:

-- Create a sequence of parts to be sorted
split tVariable by chunk

-- Create a ordered list of indicies from 1 up to the number of parts
repeat with i = 1 to the number of elements in tVariable
   put i into tIndexes[i]
end repeat

-- Sort the indicies by comparing the relevant parts from the original 
variable
-- (note that this is a notional command - you can't currently sort an 
indexed array)
sort tIndexes by func(tVariable[each]) -- this notionally sorts the 
elements of tIndexes

-- Create a sorted sequence of parts based on the constructed indicies
repeat with i = 1 to the number of elements in tVariable
   put tVariable[tIndexes[i]] into tSortedVariable[i]
end repeat

-- Reconstruct a sorted string using the original 'delimiter'
combine tSortedVariable by chunk

Here you see that there are two operations which are obviously defined 
for item and line, but not so obviously defined for word - as there is 
not a unique choice for the delimiter when doing the final combine.

Of course, as Geoff correctly pointed out space works as a suitable 
choice for the delimiter in the final step in the case of word although 
you could choose tab, return, or any combination of them (indeed 
anything matching the regular expression [\n\t ]+.

In the more general case, the key thing is that you have to choose how 
to recombine the output string for the sort such that you have the 
following invariants:

1) the number of chunks in tSortedVariable is the number of chunks in 
tVariable

2) for each tChunk in tVariable, the number of occurrences of tChunk in 
tVariable is the number of occurrences of tChunk in tSorted Variable

Here (1) expresses the fact that you must have same number of chunks 
before and after. (2) expresses the fact that every part before must be 
present after, in exactly the same quantity.

With this in mind, you can then ask the question for any chunk (however 
it is defined) whether it is 'sortable' - a chunk is sortable if there 
exists a choice for delimiter which means (1) and (2) hold.

This is definitely true for item and line (as there is only one choice 
of delimiter). It is true of word if you choose space (or, indeed, any 
pattern matching [\n\t ]+). It is true of character, codepoint and 
codeunit. It is true of trueWord if you choose space. It is not in any 
way obvious for sentence chunks - the rules that define those are not 
simple - and I'm not sure there exists a choice of delimiter which would 
not (at least in some cases) change the set of parts you get in a 
recombined output string (i.e. you can probably construct examples where 
invariant (2) is broken).

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps




More information about the use-livecode mailing list