Finding invisible/non printable characters in a string

Curry Kenworthy curry at pair.com
Mon May 10 13:23:48 EDT 2021


David:

 > Would I be right in thinking if codepoint count > the number of chars
 > in a text string, then it probably contains invisible characters?

Not necessarily; there are other possibilities, including...

Paul:

 > There are characters that consist of more than one codepoint -
 > composite versions of characters for accents. See
 > https://unicode-table.com/en/blocks/combining-diacritical-marks/

Yes! An example would be Hindi: नमस्ते
which, per LC, has 6 codepoints and 4 chars.
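
To make the codepoint-vs-char gap concrete outside of LC, here is a minimal
Python sketch (not LiveCode script). It assumes that counting the code points
which are not combining marks is a fair stand-in for "chars"; that is only an
approximation of real grapheme segmentation and is not LC's own algorithm.
The function names are made up for illustration.

```python
import unicodedata

def count_code_points(text):
    # A Python str is a sequence of Unicode code points.
    return len(text)

def count_base_characters(text):
    # Approximation: skip anything whose general category is a mark
    # (Mn, Mc, Me), so each base character is counted once.
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

# The Hindi word from this message, spelled out as escapes.
word = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(count_code_points(word), count_base_characters(word))  # 6 4
```

The difference between the two numbers is the number of combining marks, so
a mismatch by itself says nothing about invisible characters.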

Another fun accent example is using a "Zalgo" generator:

H̴̱̞͔̺̣̀ĩ̴̱̣̉̀͛̂ͅ ̵͆̒ͅt̷̹̖͖̍͠ḩ̸̛̤̃̄̑̾͝e̵̻̤͙̐̽r̵̙̩̀͂̚̕e̶̗͓̲̞͍̻̎͋͘͠͠

68 codepoints vs 8 chars! Zalgo heaps random accents onto characters.
But as you can see, many languages and notation systems have modifiers.

Typically such accents combine with another character, but thanks to 
the Magic of Bugs you can also see them breaking free and "doing their 
own thing" sometimes, as if they were separate characters, by pasting 
some Thai or Myanmar text into an LC field and resizing that field:

https://quality.livecode.com/show_bug.cgi?id=22373

On the possibility of invisible characters, there is also an LC bug 
which inserts one or more nulls after pasted text on Windows:

https://quality.livecode.com/show_bug.cgi?id=22172

But the nulls count as characters, so the codepoint count still matches
the char count and that comparison would not reveal them.
I thought there was another category of Unicode values affected besides
the combining modifiers, but if so, it's eluding me at the moment. :)
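
Since comparing counts misses things like those nulls, a more direct approach
is to inspect each code point. The Python sketch below (again, not LiveCode
script) flags anything in the Unicode "control" (Cc) or "format" (Cf)
categories, apart from tab and line endings. Treating exactly those two
categories as "invisible" is my assumption for the example; other candidates,
such as unusual space characters, are not covered. The name find_invisible is
made up for illustration.

```python
import unicodedata

def find_invisible(text):
    """Return (index, 'U+XXXX', category) for control/format code points."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Cc = control (e.g. null), Cf = format (e.g. zero width space).
        # Tab, LF and CR are ordinary whitespace, so leave them alone.
        if cat in ("Cc", "Cf") and ch not in "\t\n\r":
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits

# A null and a zero width space hidden in otherwise plain text.
sample = "abc\x00def\u200bghi"
print(find_invisible(sample))
# [(3, 'U+0000', 'Cc'), (7, 'U+200B', 'Cf')]
```

Combining accents are deliberately not flagged here, since they are visible
once attached to a base character.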

Best wishes,

Curry Kenworthy

Custom Software Development
"Better Methods, Better Results"
LiveCode Training and Consulting
http://livecodeconsulting.com/



