problem with counting words

Richard Gaskin ambassador at fourthworld.com
Mon Oct 13 11:03:24 EDT 2014


Good post, Kay.  Each of the examples you provided is among the reasons 
I like xTalk.

But even though they demonstrate useful features of the language, 
neither is dependent on xTalk's trait of counting quoted text as a 
single word when using the word chunk type.

Perhaps I should preface this by noting that I very much enjoy xTalk in 
general and LiveCode in particular, a love that's only grown in my 27 
years with this family of languages.

But all programming languages have historical anomalies, and xTalk is 
no exception.  Programming languages are, by nature, somewhat funky, 
attempting to communicate the richness of human thought to a machine too 
stupid to count past 1.  All of them require trade-offs.

In the first example you provided, the list of names, none of them 
includes quoted text.  And even with the broader support of treating 
words as white-space delimited (breaking from the English rule of 
usually not including punctuation), as you noted at least one of the 
examples there will fail (sorry, Mr. Van Damme).

Many other languages also provide means of dealing with multi-character 
white space (sed, awk, and Python come to mind), and none of them, not 
even xTalk, will reliably sort by last name unless we separate the first 
and last names more explicitly, such as in separate fields or with a tab 
character, as is commonly done in any language where a last-name sort is 
important, even in LiveCode.
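
A minimal sketch of the tab-separated approach, offered as an 
illustration rather than a recommendation.  The handler, the variable 
name tNames and the three sample lines are made up for the example; the 
only assumption about the data is that a tab sits between the given 
name(s) and the surname:

```livecode
on mouseUp
   -- Hypothetical sample data: given name(s), a tab, then the surname.
   put "Jean-Claude" & tab & "Van Damme" & return & \
         "Peter" & tab & "O'Toole" & return & \
         "Camilla" & tab & "Parker-Bowles" into tNames

   -- Treat each line as tab-delimited items rather than words,
   -- so a surname containing a space stays in one piece.
   set the itemDelimiter to tab

   -- Sort on the second item of each line, i.e. the surname.
   sort lines of tNames by item 2 of each

   -- Show the sorted list in the Message Box.
   put tNames
end mouseUp
```

Because the sort key is an item rather than a word, "Van Damme" is 
compared as a whole surname instead of just "Damme".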

In the second example in which a multi-word value is used as an object 
identifier, once again we're not asked to parse that using xTalk's 
"word" chunk type, but instead get to rely on the engine's expression 
evaluator, which works very much like JavaScript's and others' in which 
literal strings can be used as object identifiers.  Useful as it is, 
it's neither unique to xTalk nor necessarily dependent on how we use the 
"word" chunk type.

Object identifiers *can* become dependent on the word chunk type if you 
need to parse them yourself, as others have noted along with many other 
good examples to justify the HyperTalk team's implementation (though we 
might ask why we need to do this so often, such as why we don't have 
objectType or ownerStack functions).

No matter how useful the current implementation is, the choice still 
requires justification.  Even if that justification is sound, favoring a 
certain utility, it's still a trade-off, the downside being a 
redefinition of the word "word" from its more common definition in 
natural language.

Larry's initial confusion is far from rare.  xTalk's reliance on a 
unique definition of "word" that differs from its use in natural 
language is something we all had to learn.  We may accept it, we may 
like it, we may even prefer it, but it's by no means intuitive to the 
native English speaker.
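
To make the difference concrete, here is a small sketch.  The handler 
and the sample text are invented for illustration, and the trueWord 
line assumes LiveCode 7.0 or later:

```livecode
on mouseUp
   -- Hypothetical sample text: She said "see you soon"
   put "She said" && quote & "see you soon" & quote into tText

   -- The classic "word" chunk treats the quoted run as a single word,
   -- so this should count three words.
   put the number of words of tText into tWordCount

   -- "trueWord" (assumed LiveCode 7.0+) follows natural-language word
   -- boundaries, so the quoted words are counted individually.
   put the number of trueWords of tText into tTrueWordCount

   -- Show both counts side by side in the Message Box.
   put tWordCount && tTrueWordCount
end mouseUp
```

A native English speaker would expect the second count, not the first.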

xTalk was born several years before the first Unicode standard was 
published, so it couldn't have taken advantage of the vast pool of 
collective knowledge embodied in the Unicode spec, nor was there the 
luxury of having the computational horsepower needed to use such a spec 
efficiently.

Today the LiveCode team has at last corrected this with the introduction 
of the "trueWord" chunk type, though I have to shrug my shoulders with 
an acknowledging chuckle in sharing Larry's initial observation that if 
xTalk were being designed today, with its ostensible emphasis on 
"English-like" syntax, the order is backwards:

If we didn't have 27 years of code dependent on xTalk's unique 
redefinition of "word", to support the claim of "English-like" it might 
be more intuitive to have "word" act as "trueWord" does, and have some 
other token do what "word" currently does in xTalk's unique redefinition.

But that's not the world we live in.  Like every other language, 
LiveCode is a product of its unique history.  Useful as its conventions 
are, they will from time to time require us to learn new ways of doing 
things.

This is just one of many reasons I generally don't use the phrase 
"English-like" when giving talks on LiveCode.  Our favorite language 
brings to the world's programming choices a uniquely valuable blend of 
features, but while it's certainly more readable than most it isn't 
particularly "English-like", nor does it really even try all that hard 
to be.

And that's a good thing.

Natural language is really tough stuff to parse, full of its own even 
longer and more nuanced history, and intended for a very different 
audience (the cognitive complexity of the human mind rather than the 
logical simplicity of computers).

I think most of us (except Geoff Canyon who has a rare mind for this 
sort of stuff <g>) would agree that we're all glad this isn't a valid 
statement in xTalk, even though it's a perfectly valid sentence in English:

"Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."

<http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo>

:)



Kay C Lan wrote:

> On Mon, Oct 13, 2014 at 7:45 AM, Richard Gaskin wrote:
>>
>> I hear ya', but like so many other oddities in the language this one came
>> from Apple,
>>
>
> Sheer brilliance! One of the first analogies of HyperCard was that it was
> an electronic rolodex. Here is a list of names:
>
> Abu Musab    Al-Zarqawi
> Camilla Parker-Bowles
> Catherine    Zeta-Jones
> Claude Levi-Strauss
> D'Arcy    Corrigan
> Daniel  Day-Lewis
> David    Ben-Gurion
> Dodi Al-Fayed
> Florence    Griffith-Joyner
> Gilbert  O'Sullivan
> Gloria    Macapagal-Arroyo
> Jean-Claude Van Damme
> Jimmy    O'Dea
> Justine  Henin-Hardenne
> Kareem    Abdul-Jabbar
> Karim Abdul-Jabbar
> Kristin    Scott-Thomas
> Maddox  Jolie-Pitt
> Michael    O'Leary
> Olivia Newton-John
> Peter    O'Toole
> Sinéad O'Connor
> Tim    Brooke-Taylor
> Ralph Twistleton-Wykham-Fiennes
>
> So let's say you want to sort these by surname - a kind of rolodex thing to
> do.
>
> sort lines of myListOfNames by word -1 of each
>
> will result in only one mistake
>
> sort lines of myListOfNames by trueWord -1 of each --if you are on LC7.0
>
> will result in basically the same messed up result most other programming
> languages will give you. Put it in a word processor and see how you go.
>
> Please feel free to try and write your own function that is more successful
> and more efficient than the beautiful one liner Bill Atkinson gave us. Even
> if you had wordDel it wouldn't help much. I can't imagine the amount of
> hours that have been wasted, especially on genealogical websites, trying to
> fathom why double barrelled names never sort correctly. This is also
> compounded by the certain fact that some people will put a space between
> the last given name and the Surname, some a tab, and some will 'format' the
> data by placing multiple spaces in between names so that things 'line up
> nicely' - and are then confused as to why it only looks that way on their
> screen and not on someone else's. One of the reasons double barrelled names
> have picked up the '-' is to help computers recognise them as a single word.
>
> Also;
>
> put myVariable into fld Not A Variable
>
> doesn't work
>
> put myVariable into fld "Not A Variable"
>
> does. The ability to recognise words in quotes as a single entity is
> extremely important. Yes, we don't typically think of such as a single
> word, but when we understand that computers don't think like us, and we do
> understand why things are the way they are, such oddities can be
> manipulated in many powerful ways to our own advantage. It is also helpful
> when we understand such things that we don't go around replacing one
> character willy nilly with another character. ~ [tilde] for instance is one
> character I'd never use as it has a special meaning in many computer
> languages; as does / \ < > . * and many others. If we had some text that
> contained both straight and curly quotes and replaced the straight quotes
> with curly quotes so we could get a word count, and then changed the curly
> quotes back to straight quotes, the final text is not the same as it started
> - and this could cause problems. Today your function might work perfectly
> for today's problem, but next month, or next year, when you start expanding
> your LC skills and try working with SQL databases, or Servers and network
> connections, every now and then someone will report a bug that your app
> does something strange. You may never be able to track it down because it
> just happens that once every million DB calls a random user happens to use
> data that contains a character that you never use yourself and thought no
> one else would. I have a particular liking to numToChar(127) myself.
>
> Yep, no other programming language might define a word like LC defines a
> word, but I for one am EXTREMELY thankful for that.


-- 
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  Ambassador at FourthWorld.com                http://www.FourthWorld.com



