sorting words ?
Kay C Lan
lan.kc.macmail at gmail.com
Fri Dec 11 01:56:45 EST 2015
On Thu, Dec 10, 2015 at 4:38 PM, Mark Waddingham <mark at livecode.com> wrote:
>
> The "word" chunk is not loosely implemented - it does precisely what it is
> meant to do.
>
Which of course is the reason why the sort container command has no
problem if you sort by word on the right side of the equation - 'by word x
of each'.
>
> Here you see that there are two operations which are obviously defined for
> item and line, but not so obviously defined for word - as there is not a
> unique choice for the delimiter when doing the final combine.
>
Well, LC's definition of what a word is isn't exactly universally accepted,
but once you understand it, it's extremely powerful and saves a huge amount
of effort. As LC has its own definition of what a word is, surely it
could also define exactly how it's going to output the final combine. More on
this below.
>
> In the more general case, the key thing is that you have to choose how to
> recombine the output string for the sort such that you have the following
> invariants:
>
> 1) the number of chunks in tSortedVariable is the number of chunks in
> tVariable
>
This is not quite true, is it?
put "the,quick,brown," into tVar
put the number of items in tVar into msg -- 3
sort items of tVar
put the number of items in tVar into msg -- 4
Now I don't wish to discuss why this is; I understand why it is and I'm OK
with it. As with LC's definition of what a word is, once you
understand what is happening under the hood you can work around it or work
it to your advantage. Of note in the above: the number of chars has
remained the same.
> With this in mind, you can then ask the question for any chunk (however it
> is defined) whether it is 'sortable' - a chunk is sortable if there exists
> a choice for delimiter which means (1) and (2) hold.
>
> This is definitely true for item and line (as there is only one choice of
> delimiter). It is true of word if you choose space (or, indeed, any pattern
> matching [\n\t ]+). It is true of character... and I'm not sure there
> exists a choice of [word] delimiter which would not (at least in some
> cases) change the set of parts you get in a recombined output string (i.e.
> you can probably construct examples where invariant (2) is broken).
>
As with the item example above, it's possible to 'confuse' the invariant
rule, so the fact that LC may sort words and output them in a manner some
people don't agree with is irrelevant. IMO sort by word would work like
this:
the quick "brown fox" jumped over --contains spaces and tabs
the[space]quick[tab]"brown[tab]fox"[space][tab]jumped[tab]over[tab]
when sorted would come out like this:
jumped over the quick "brown fox"
[tab][tab]jumped[tab]over[tab]the[space]quick[space]"brown[tab]fox"
In this case LC would follow the universally accepted sort order in which
tabs precede spaces. It keeps the number of chars exactly the same, just as LC
already does. It sorts empties (although there really isn't such a thing as
an empty word) to the beginning and then hands out the rest of
the delimiters singly, one between each pair of words, so there are tabs
between some words, just as there were in the original, and spaces between
others. No trailing delimiters, as is currently the case for lines/items.
Now if you don't agree, and think it should come out some other way, that's
OK. All that matters is that whatever the output is, it is consistent and
published. LC could convert all tabs to spaces, or it could replace every
run of multiple whitespace with a single space; I
don't care, just as long as whatever it does is consistent and published.
Some people don't think "New York" is one word, but LC does: it's
published that quoted phrases are counted as one word, and that's a very
powerful thing.
So again, as LC can already sort words on the right side of the equation -
sort xxxxx of tVar by word y of each - it seems only a minor step to make
it possible on the left side of the equation - sort words of tVar.
Obviously the sorting mechanism is in place; it's just the actual
presentation that needs a little thought - surely not that hard.
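Until something like that exists, the effect can be approximated with the
chunk features LC already has. The following is only a minimal sketch,
assuming it is acceptable to rejoin the words with single spaces (one of the
consistent rules mentioned above) and that no word contains a return.
sortedWords is a hypothetical handler name, not a built-in:

```livecode
-- Hypothetical helper (not part of LC): returns the words of pText sorted
-- in ascending text order and rejoined with single spaces.
-- Assumption: tabs and runs of whitespace between words are NOT preserved.
-- Assumption: no word (including a quoted phrase) contains a return.
function sortedWords pText
   local tList, tWord
   -- put each word, as LC itself defines a word, on its own line
   repeat for each word tWord in pText
      put tWord & cr after tList
   end repeat
   delete the last char of tList -- drop the trailing return
   -- lines have a single, unambiguous delimiter, so they can be sorted
   sort lines of tList
   -- recombine with the chosen word delimiter
   replace cr with space in tList
   return tList
end sortedWords
```

It could be called as put sortedWords(tVar) into tVar. Unlike the scheme
described above, this version does not keep the number of chars the same,
because every delimiter is normalised to one space.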
Here's a script for displaying how LC already does word sorting (long lines
are continued with a trailing backslash):
on mouseUp
   -- Part 1: item counts and char counts before and after sorting items
   put "the,quick,brown,fox," & space & ",jumped,over,the," & tab & \
         ",lazy,dog," into tVar
   put "Chars = " & the number of chars of tVar & ", Items = " & \
         the number of items of tVar & cr into msg
   sort items of tVar
   put "Chars = " & the number of chars of tVar & ", Items = " & \
         the number of items of tVar & cr after msg
   put tVar & cr after msg
   answer "Ready for Part 2?"
   -- Part 2: build nine lines that mix spaces, tabs and quoted phrases
   put "9 the quick brown fox jumped over the lazy dog" \
         into line 1 of tVar --spaces
   put "8" & tab & "the" & tab & "quick" & tab & "brown" & tab & "fox" & \
         tab & "jumped" & tab & "over" & tab & "the" & tab & "lazy" & tab & \
         "dog" into line 2 of tVar --tabs
   put "7 the" & tab & "quick brown" & tab & "fox jumped" & tab & \
         "over the" & tab & "lazy dog" into line 3 of tVar --spaces and tabs
   put "6 the" & tab & quote & "slick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 4 of tVar --quotes, tabs, spaces
   put "5 the" & tab & quote & "quick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 5 of tVar --quotes, tabs, spaces
   put "4 the" & tab & quote & " slick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 6 of tVar --quotes, tabs, spaces
   put "3 the" & tab & quote & " quick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 7 of tVar --quotes, tabs, spaces
   put "2 the" & tab & quote & tab & "slick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 8 of tVar --quotes, tabs, spaces
   put "1 the" & tab & quote & tab & "quick brown" & tab & "fox" & quote & \
         " jumped" & tab & "over the" & tab & "lazy dog" \
         into line 9 of tVar --quotes, tabs, spaces
   put tVar & cr & cr into msg
   put tVar into tVar2
   put "This is sorted by word 3" & cr after msg
   sort lines of tVar by word 3 of each
   put tVar & cr & cr after msg
   put "This is sorted by char 3 of word 3" & cr after msg
   sort lines of tVar2 by char 3 of word 3 of each
   put tVar2 & cr & cr after msg
end mouseUp