Finding invisible/non printable characters in a string
Curry Kenworthy
curry at pair.com
Mon May 10 13:23:48 EDT 2021
David:
> Would I be right in thinking if codepoint count > the number of chars
> in a text string, then it probably contains invisible characters?
Not necessarily; there are other possibilities, including the following.
Paul:
> There are characters that consist of more than one codepoint -
> composite versions of characters for accents. See
> https://unicode-table.com/en/blocks/combining-diacritical-marks/
Yes! An example would be the Hindi word नमस्ते,
which per LC has 6 codepoints and 4 chars.
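To see where the extra codepoints come from outside LC, here is a minimal Python sketch (not LiveCode). The function names describe_codepoints and count_marks are just illustrative. It lists each codepoint with its Unicode category and name, and counts the combining marks using the standard unicodedata module. Counting marks is only a rough stand-in for what LC reports as chars; real grapheme segmentation follows more rules than this.

```python
import unicodedata

def describe_codepoints(text):
    """Return (codepoint, category, name) for each codepoint in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

def count_marks(text):
    """Count codepoints whose general category is a mark (Mn, Mc, Me)."""
    return sum(1 for ch in text if unicodedata.category(ch).startswith("M"))

# The Hindi word from the example, written with escapes so the codepoints are explicit.
namaste = "\u0928\u092e\u0938\u094d\u0924\u0947"

for row in describe_codepoints(namaste):
    print(*row)

print(len(namaste), "codepoints,", count_marks(namaste), "combining marks")
# prints "6 codepoints, 2 combining marks"
```

The two marks are the virama (U+094D) and the vowel sign (U+0947); the other four codepoints are base letters.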
Another fun accent example is using a "Zalgo" generator:
H̴̱̞͔̺̣̀ĩ̴̱̣̉̀͛̂ͅ ̵͆̒ͅt̷̹̖͖̍͠ḩ̸̛̤̃̄̑̾͝e̵̻̤͙̐̽r̵̙̩̀͂̚̕e̶̗͓̲̞͍̻̎͋͘͠͠
68 codepoints vs 8 chars! Zalgo heaps random accents onto characters.
But as you can see, many languages and notation systems have modifiers.
Typically such accents combine with another character, but thanks to
the Magic of Bugs you can also see them breaking free and "doing their
own thing" sometimes, as if they were separate characters, by pasting
some Thai or Myanmar text into an LC field and resizing that field:
https://quality.livecode.com/show_bug.cgi?id=22373
On the possibility of invisible characters, there is also an LC bug
which inserts one or more nulls after pasted text on Windows:
https://quality.livecode.com/show_bug.cgi?id=22172
But the nulls count as characters, so the codepoint count still matches
the char count, and comparing the two will not reveal them.
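Because nulls like these leave the two counts equal, scanning the string directly is one way to find them. Below is a rough Python sketch (again not LiveCode), with a hypothetical helper named find_invisible. It assumes that "invisible" means the Unicode control, format, and similar categories plus the line and paragraph separators. That is a simplification: it will miss things such as the no-break space, and it deliberately ignores tab, newline, and carriage return.

```python
import unicodedata

# Whitespace controls that are usually intentional and not worth flagging.
ALLOWED = {"\t", "\n", "\r"}

def find_invisible(text):
    """Return (index, codepoint, category) for likely invisible characters."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # "C*" covers control (Cc), format (Cf), surrogate, private use, unassigned;
        # Zl and Zp are the Unicode line and paragraph separators.
        if ch not in ALLOWED and (cat.startswith("C") or cat in ("Zl", "Zp")):
            hits.append((i, f"U+{ord(ch):04X}", cat))
    return hits

# A null and a zero-width space hidden in otherwise plain text.
sample = "abc\x00\u200bdef\n"
print(find_invisible(sample))
# prints [(3, 'U+0000', 'Cc'), (4, 'U+200B', 'Cf')]
```

Reporting the index and codepoint, rather than just a yes/no, makes it easier to see what was actually pasted in.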
I thought there was another category of Unicode values affected besides
the combining modifiers, but if so, it's eluding me at the moment. :)
Best wishes,
Curry Kenworthy
Custom Software Development
"Better Methods, Better Results"
LiveCode Training and Consulting
http://livecodeconsulting.com/