problem with counting words

Mon Oct 13 13:07:41 EDT 2014

Hi Richard,
in a word...
"I really enjoyed reading your post and I learned a lot!"

Larry
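
The joke above works because of the "word" chunk rule discussed in the quoted message: text wrapped in double quotes counts as one word. A minimal sketch of that rule, written in Python rather than LiveCode so it can be run anywhere. The regular expression and both function names are assumptions made for illustration, not the engine's actual parser.

```python
import re

# Approximation (an assumption, not LiveCode's real parser): a "word" is
# either a run of text enclosed in double quotes or a run of non-whitespace.
XTALK_WORD = re.compile(r'"[^"]*"|\S+')

def xtalk_word_count(text):
    """Count words roughly the way the xTalk "word" chunk is described here."""
    return len(XTALK_WORD.findall(text))

def plain_word_count(text):
    """Count whitespace-delimited words, ignoring any quoting."""
    return len(text.split())

sentence = 'I really enjoyed reading your post and I learned a lot!'
quoted = '"' + sentence + '"'

print(xtalk_word_count(sentence))  # 11
print(xtalk_word_count(quoted))    # 1
print(plain_word_count(quoted))    # 11
```

Under this approximation the same sentence is eleven words without quotes and a single word with them.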

----- Original Message ----- 
From: "Richard Gaskin" <ambassador at fourthworld.com>
To: <use-livecode at lists.runrev.com>
Sent: Monday, October 13, 2014 9:03 AM
Subject: Re: problem with counting words

> Good post, Kay.  Each of the examples you provided is among the reasons I 
> like xTalk.
>
> But even though they demonstrate useful features of the language, neither 
> is dependent on xTalk's trait of counting quoted text as a single word 
> when using the word chunk type.
>
> Perhaps I should preface this by noting that I very much enjoy xTalk in 
> general and LiveCode in particular, a love that's only grown in my 27 
> years with this family of languages.
>
> But all programming languages have historical anomalies, and xTalk is not 
> the world's only exception to this.  Programming languages are, by nature, 
> somewhat funky, attempting to communicate the richness of human thought to 
> a machine too stupid to count past 1.  All of them require trade-offs.
>
> In the first example you provided, the list of names, none of them 
> includes quoted text.  And even with the broader support of treating words 
> as white-space delimited (breaking from the English rule of usually not 
> including punctuation), as you noted at least one of the examples there 
> will fail (sorry, Mr. Van Damme).
>
> Many other languages also provide means of dealing with multi-character 
> white space (sed, awk, and Python come to mind), and none of them, not 
> even xTalk, will reliably sort by last name unless we separate the first 
> and last more explicitly, such as in separate fields or with a tab character, 
> as is commonly done in any language where a last-name sort is important, 
> even in LiveCode.
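
A small sketch of the explicit-separator approach described above, in Python since it is one of the languages mentioned. The sample data and the surname helper are hypothetical; the only assumption is that each line holds the given name(s), one tab, then the surname.

```python
# Hypothetical sample data: given name(s) and surname separated by one tab.
names = [
    "Jean-Claude\tVan Damme",
    "Peter\tO'Toole",
    "Catherine\tZeta-Jones",
    "Daniel\tDay-Lewis",
]

def surname(line):
    """Return the text after the first tab, i.e. the explicit surname field."""
    given, _, family = line.partition("\t")
    return family

# Sort case-insensitively on the surname field only.
by_surname = sorted(names, key=lambda line: surname(line).lower())

for line in by_surname:
    print(line.replace("\t", " "))
# Daniel Day-Lewis
# Peter O'Toole
# Jean-Claude Van Damme
# Catherine Zeta-Jones
```

Because the surname is its own field, a multi-word surname such as "Van Damme" sorts under V without any guessing about word boundaries.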
>
> In the second example in which a multi-word value is used as an object 
> identifier, once again we're not asked to parse that using xTalk's "word" 
> chunk type, but instead get to rely on the engine's expression evaluator, 
> which works very much like JavaScript's and others' in which literal 
> strings can be used as object identifiers.  Useful as it is, it's neither 
> unique to xTalk nor necessarily dependent on how we use the "word" chunk 
> type.
>
> Object identifiers *can* become dependent on the word chunk type if you 
> need to parse them yourself, as others have noted along with many other 
> good examples to justify the HyperTalk team's implementation (though we 
> might ask why we need to do this so often, such as why we don't have 
> objectType or ownerStack functions).
>
> No matter how useful the current implementation is, the choice still 
> requires justification.  Even if that justification is sound, favoring a 
> certain utility, it's still a trade-off, the downside being a redefinition 
> of the word "word" from its more common definition in natural language.
>
> Larry's initial confusion is far from rare.  xTalk's reliance on a unique 
> definition of "word" that differs from its use in natural language is 
> something we all had to learn.  We may accept it, we may like it, we may 
> even prefer it, but it's by no means intuitive to the native English 
> speaker.
>
> xTalk was born more than a decade before Unicode was invented, so it 
> couldn't have taken advantage of the vast pool of collective knowledge 
> embodied in the Unicode spec, nor was there the luxury of having the 
> computational horsepower needed to use such a spec efficiently.
>
> Today the LiveCode team has at last corrected this with the introduction 
> of the "trueWord" token type, though I have to shrug my shoulders with an 
> acknowledging chuckle in sharing Larry's initial observation that if xTalk 
> were being designed today, with its ostensible emphasis on "English-like" 
> syntax, the order is backwards:
>
> If we didn't have 27 years of code dependent on xTalk's unique 
> redefinition of "word", to support the claim of "English-like" it might be 
> more intuitive to have "word" act as "trueWord" does, and have some other 
> token do what "word" currently does in xTalk's unique redefinition.
>
> But that's not the world we live in.  Like every other language, LiveCode 
> is a product of its unique history.  Useful as its conventions are, they 
> will from time to time require us to learn new ways of doing things.
>
> This is just one of many reasons I generally don't use the phrase 
> "English-like" when giving talks on LiveCode.  Our favorite language 
> brings to the world's programming choices a uniquely valuable blend of 
> features, but while it's certainly more readable than most it isn't 
> particularly "English-like", nor does it really even try all that hard to 
> be.
>
> And that's a good thing.
>
> Natural language is really tough stuff to parse, full of its own even 
> longer and more nuanced history, and intended for a very different 
> audience (the cognitive complexity of the human mind rather than the 
> logical simplicity of computers).
>
> I think most of us (except Geoff Canyon who has a rare mind for this sort 
> of stuff <g>) would agree that we're all glad this isn't a valid statement 
> in xTalk, even though it's a perfectly valid sentence in English:
>
> "Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo."
>
> <http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo>
>
> :)
>
>
>
> Kay C Lan wrote:
>
>> On Mon, Oct 13, 2014 at 7:45 AM, Richard Gaskin wrote:
>>>
>>> I hear ya', but like so many other oddities in the language this one
>>> came from Apple,
>>>
>>
>> Sheer brilliance! One of the first analogies of HyperCard was that it was
>> an electronic rolodex. Here is a list of names:
>>
>> Abu Musab    Al-Zarqawi
>> Camilla Parker-Bowles
>> Catherine    Zeta-Jones
>> Claude Levi-Strauss
>> D'Arcy    Corrigan
>> Daniel  Day-Lewis
>> David    Ben-Gurion
>> Dodi Al-Fayed
>> Florence    Griffith-Joyner
>> Gilbert  O'Sullivan
>> Gloria    Macapagal-Arroyo
>> Jean-Claude Van Damme
>> Jimmy    O'Dea
>> Justine  Henin-Hardenne
>> Kareem    Abdul-Jabbar
>> Karim Abdul-Jabbar
>> Kristin    Scott-Thomas
>> Maddox  Jolie-Pitt
>> Michael    O'Leary
>> Olivia Newton-John
>> Peter    O'Toole
>> Sinéad O'Connor
>> Tim    Brooke-Taylor
>> Ralph Twistleton-Wykham-Fiennes
>>
>> So let's say you want to sort these by surname - a kind of rolodex thing
>> to do.
>>
>> sort lines of myListOfNames by word -1 of each
>>
>> will result in only one mistake
>>
>> sort lines of myListOfnames by trueword -1 of each --if you are on LC7.0
>>
>> will result in basically the same messed up result most other programming
>> languages will give you. Put it in a word processor and see how you go.
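
For comparison, a rough Python sketch of sorting on the last whitespace-delimited token. It is not LiveCode and not the engine's word chunk; the four names are taken from the list above, and last_token is a hypothetical helper. Hyphens and apostrophes stay inside the token, and runs of spaces collapse, so only the multi-word surname goes wrong.

```python
names = [
    "Catherine    Zeta-Jones",
    "Jean-Claude Van Damme",
    "Daniel  Day-Lewis",
    "Peter    O'Toole",
]

def last_token(line):
    """Return the final whitespace-delimited token of the line."""
    # str.split() with no argument treats any run of whitespace as one separator.
    return line.split()[-1]

by_last_token = sorted(names, key=last_token)

for line in by_last_token:
    print(line)
# Jean-Claude Van Damme   (filed under "Damme", the one mistake)
# Daniel  Day-Lewis
# Peter    O'Toole
# Catherine    Zeta-Jones
```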
>>
>> Please feel free to try and write your own function that is more successful
>> and more efficient than the beautiful one liner Bill Atkinson gave us. Even
>> if you had wordDel it wouldn't help much. I can't imagine the number of
>> hours that have been wasted, especially on genealogical websites, trying to
>> fathom why double barrelled names never sort correctly. This is also
>> compounded by the certain fact that some people will put a space between
>> the last given name and the Surname, some a tab, and some will 'format' the
>> data by placing multiple spaces in between names so that things 'line up
>> nicely' - and are then confused as to why it only looks that way on their
>> screen and not on someone else's. One of the reasons double barrelled names
>> have picked up the '-' is to help computers recognise them as a single word.
>>
>> Also;
>>
>> put myVariable into fld Not A Variable
>>
>> doesn't work
>>
>> put myVariable into fld "Not A Variable"
>>
>> does. The ability to recognise words in quotes as a single entity is
>> extremely important. Yes, we don't typically think of such as a single
>> word, but when we understand that computers don't think like us, and we do
>> understand why things are the way they are, such oddities can be
>> manipulated in many powerful ways to our own advantage. It is also helpful
>> when we understand such things that we don't go around replacing one
>> character willy nilly with another character. ~ [tilde] for instance is one
>> character I'd never use as it has a special meaning in many computer
>> languages; as do / \ < > . * and many others. If we had some text that
>> contained both straight and curly quotes and replaced the straight quotes
>> with curly quotes so we could get a word count, and then changed the curly
>> quotes back to straight quotes, the final text is not the same as it started
>> - and this could cause problems. Today your function might work perfectly
>> for today's problem, but next month, or next year, when you start expanding
>> your LC skills and try working with SQL databases, or Servers and network
>> connections, every now and then someone will report a bug that your app
>> does something strange. You may never be able to track it down because it
>> just happens that once every million DB calls a random user happens to use
>> data that contains a character that you never use yourself and thought no
>> one else would. I have a particular liking for numToChar(127) myself.
>>
>> Yep, no other programming language might define a word like LC defines a
>> word, but I for one am EXTREMELY thankful for that.
>
>
> -- 
>  Richard Gaskin
>  Fourth World Systems
>  Software Design and Development for the Desktop, Mobile, and the Web
>  ____________________________________________________________________
>  Ambassador at FourthWorld.com                http://www.FourthWorld.com