sorting words ?
Mark Waddingham
mark at livecode.com
Thu Dec 10 03:38:46 EST 2015
On 2015-12-09 22:27, dunbarx at aol.com wrote:
> The "word" chunk has always been loosely implemented, in that it can
> hardly be considered a truly delimited chunk at all, like an item or a
> line is. Even though it seems to have been included in the "family" of
> chunks.
The "word" chunk is not loosely implemented - it does precisely what it
is meant to do.
The 'item' and 'line' chunks can be classified as 'delimited' chunks in
the sense that items in the string are divided up into an alternating
sequence of part and delimiter:
a,b,c => [a] [,] [b] [,] [c]
In the case of 'word', though, what the 'parts' are and the 'delimiters'
are is slightly more complex:
the quick "brown fox" => [the] [ ] [quick] [ ] ["brown fox"]
Indeed, 'word' is perhaps what you might call a tokenized chunk, in that
it is essentially matching parts and delimiters to distinct patterns:
delimiter pattern is matched to the regular expression [ \n\t]+
part pattern is matched to the regular expression (\"[^\"]*\"|[^
\n\t\"]+)
So, the 'inobviousness' I referred to is just the fact that there is not
a single choice for what the delimiter should be when 'recombining' the
string.
The 'sort' command can actually be broken down into more fundamental
operations:
sort chunks of tVariable by func(each)
Is actually:
-- Create a sequence of parts to be sorted
split tVariable by chunk
-- Create a ordered list of indicies from 1 up to the number of parts
repeat with i = 1 to the number of elements in tVariable
put i into tIndexes[i]
end repeat
-- Sort the indicies by comparing the relevant parts from the original
variable
-- (note that this is a notional command - you can't currently sort an
indexed array)
sort tIndexes by func(tVariable[each]) -- this notionally sorts the
elements of tIndexes
-- Create a sorted sequence of parts based on the constructed indicies
repeat with i = 1 to the number of elements in tVariable
put tVariable[tIndexes[i]] into tSortedVariable[i]
end repeat
-- Reconstruct a sorted string using the original 'delimiter'
combine tSortedVariable by chunk
Here you see that there are two operations which are obviously defined
for item and line, but not so obviously defined for word - as there is
not a unique choice for the delimiter when doing the final combine.
Of course, as Geoff correctly pointed out space works as a suitable
choice for the delimiter in the final step in the case of word although
you could choose tab, return, or any combination of them (indeed
anything matching the regular expression [\n\t ]+.
In the more general case, the key thing is that you have to choose how
to recombine the output string for the sort such that you have the
following invariants:
1) the number of chunks in tSortedVariable is the number of chunks in
tVariable
2) for each tChunk in tVariable, the number of occurrences of tChunk in
tVariable is the number of occurrences of tChunk in tSorted Variable
Here (1) expresses the fact that you must have same number of chunks
before and after. (2) expresses the fact that every part before must be
present after, in exactly the same quantity.
With this in mind, you can then ask the question for any chunk (however
it is defined) whether it is 'sortable' - a chunk is sortable if there
exists a choice for delimiter which means (1) and (2) hold.
This is definitely true for item and line (as there is only one choice
of delimiter). It is true of word if you choose space (or, indeed, any
pattern matching [\n\t ]+). It is true of character, codepoint and
codeunit. It is true of trueWord if you choose space. It is not in any
way obvious for sentence chunks - the rules that define those are not
simple - and I'm not sure there exists a choice of delimiter which would
not (at least in some cases) change the set of parts you get in a
recombined output string (i.e. you can probably construct examples where
invariant (2) is broken).
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
More information about the use-livecode
mailing list