char as word boundary

Mark Waddingham mark at livecode.com
Mon Jun 13 12:13:37 EDT 2022


Hi Jean-Jacques,

On 2022-06-03 14:56, Jean-Jacques Wagner via use-livecode wrote:
> Hi,
> Version 6.7    word boundary are char number 09,10,11,12,13,32
> version 9.67  word boundary are char number 09,10,11,12,13,32,202
> 
> Hypercard and livecode 6.7:  the number of chars (numtochar(32)&
> numtochar(202)&numtochar(32)& numtochar(202)&numtochar(32)) = 2
> livecode 9.67                      :   the number of chars
> (numtochar(32)& numtochar(202)&numtochar(32)&
> numtochar(202)&numtochar(32)) = 0
> 
> Is it a change or a bug considering now numtochar(202) as word
> boundary, as it is with numtochar(32)

This is something we will need to consider - please do file a bug about 
it at quality.livecode.com (so you can track any further discussion 
about it).

I can see how this change occurred, and it is perhaps more a 
'side-effect of implementation' than an intended change.

Prior to 7.0 - the word chunk used the C library 'ctype' isspace 
function - which returns true if a character is 'whitespace'. However, 
the engine *also* tweaked the C library character tables to make it so 
that NBSP (202 on MacRoman - something else on Windows/Linux - 160 
maybe?) was *not* a space character. This was primarily a very dirty 
hack (which was done before my time!) to allow non-breaking spaces to 
prevent word breaks in fields (I strongly suspect the effect on the word 
chunk was never considered!).
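
As a rough illustration of the pre-7.0 behaviour described above, here 
is a small Python sketch (an analogy, not LiveCode engine code). It 
assumes that splitting a byte string on ASCII whitespace only is a fair 
stand-in for 'isspace with NBSP patched out'. The helper name 
count_words_legacy is hypothetical, and the test string is the one from 
the quoted question, read as a word count.

```python
# Sketch only: Python's bytes.split() with no argument separates on ASCII
# whitespace (9, 10, 11, 12, 13, 32), so NBSP bytes stay inside words.

MACROMAN_NBSP = 202  # NBSP in MacRoman, as stated in the message
LATIN1_NBSP = 160    # NBSP in ISO 8859-1 / Windows-1252

def count_words_legacy(data: bytes) -> int:
    """Count runs of bytes separated by ASCII whitespace (hypothetical helper)."""
    return len(data.split())

# space, NBSP, space, NBSP, space: the string from the quoted question
sample = bytes([32, MACROMAN_NBSP, 32, MACROMAN_NBSP, 32])

print(count_words_legacy(sample))  # 2

# Both legacy code points map to the same Unicode character, U+00A0.
print(bytes([MACROMAN_NBSP]).decode("mac_roman") == "\u00a0")  # True
print(bytes([LATIN1_NBSP]).decode("latin-1") == "\u00a0")      # True
```

With NBSP excluded from the separator set, each NBSP forms a 'word' of 
its own, which matches the count of 2 reported for 6.7.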

When we moved to Unicode - we changed the word-breaking detection in 
fields to use a simplified version of the Unicode algorithm and Unicode 
character properties (NBSP, unsurprisingly, has the no-break property!). 
Similarly, we changed the word chunk to use the Unicode 'whitespace' 
property. In the Unicode world - being whitespace and being non-breaking 
are two separate properties... Hence the difference in behavior since 
7.0.
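
The separation of the two properties can be seen from Python's standard 
library (again an analogy, not the LiveCode implementation). Python's 
str.split() uses Unicode whitespace, so it is assumed here to behave 
like the post-7.0 word chunk. The '<noBreak>' decomposition tag is used 
only as a visible stand-in for the no-break property, since the stdlib 
does not expose the line-breaking class itself.

```python
import unicodedata

NBSP = "\u00a0"

# space, NBSP, space, NBSP, space as a Unicode string
sample = " " + NBSP + " " + NBSP + " "

# Whitespace property: NBSP counts as whitespace, so it separates words.
print(NBSP.isspace())       # True
print(len(sample.split()))  # 0

# Non-breaking is recorded separately: NBSP is a space separator (Zs)
# whose compatibility decomposition is a plain space tagged <noBreak>.
print(unicodedata.category(NBSP))       # Zs
print(unicodedata.decomposition(NBSP))  # <noBreak> 0020
```

Treating every Unicode whitespace character as a separator leaves no 
words in the test string, which matches the count of 0 reported for the 
9.x engine.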

The reason this is 'of interest' is that the word chunk has had quite a 
hefty performance regression since 7.0 due to the switch to Unicode, so 
re-looking at what it should *actually* do (taking into account what 
would be most useful in the widest possible circumstances) is definitely 
on the cards.

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps