Finding invisible/non-printable characters in a string

Curry Kenworthy curry at pair.com
Tue May 11 15:34:33 EDT 2021


David:

 > I set out to learn some Livecode, but instead learned about Zalgo.

Patience, dear chap! :)

Indeed you learned a relevant fact about Zalgo, and also Hindi, to 
demonstrate the answer to your first assumption and first question:

 > Would I be right in thinking if codepoint count > the number of chars
 > in a text string, then it probably contains invisible characters?

I'll try to remember that, for you, answering simply "No" might be 
preferable to "No, and here's the reason why, along with some proof."

(But I may still need to mention the reason for the benefit of others.)
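
For anyone who wants to see the reason in action, here is a minimal 
sketch. It is in Python rather than LC, purely for illustration, and the 
function name codepoint_report is hypothetical. It assumes that counting 
characters in the Unicode "M" (mark) categories is a fair stand-in for 
the extra codepoints that Zalgo and Hindi text carry. Those marks are 
perfectly visible, yet they push the codepoint count above the number of 
characters you see.

```python
import unicodedata

def codepoint_report(text):
    """Return (codepoints, combining_marks, control_or_format) for text."""
    # Combining marks (categories Mn, Mc, Me) attach to a base character.
    marks = sum(1 for ch in text if unicodedata.category(ch).startswith("M"))
    # Control (Cc) and format (Cf) characters are the usual invisible ones.
    hidden = sum(1 for ch in text if unicodedata.category(ch) in ("Cc", "Cf"))
    return len(text), marks, hidden

zalgo_e = "e\u0301\u0323"   # e + combining acute + combining dot below
hindi_ki = "\u0915\u093f"   # Devanagari KA + vowel sign I

print(codepoint_report(zalgo_e))    # (3, 2, 0)
print(codepoint_report(hindi_ki))   # (2, 1, 0)
```

Both strings have more codepoints than displayed characters, and neither 
contains a single control or format character.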

 > and that invisible characters have to go in the ‘maybe when I have
 > some time’ drawer, with datagrids, arrays and regex.

I would have the same response ("negative") for this new assumption.
And if I ended it there, just "no" with no reason, it might be strange!

So here's why: this list still exists, you and I are here. Many people.
Answering your first question did not preclude answering your second.
I don't see any reason why we might not be able to squeeze that in....

Once again, I'm following in Paul's accurate footsteps:

 > I think the best way is to scan the codepoints looking for
 > codePointToNum values that are 0-31 (exclude tab and cr/lfs
 > if you like) and 127 (DEL). There may be some others in the
 > 128-255 range that are not printable.
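
Paul's scan might look something like the sketch below. Again this is 
Python for illustration only, not LC, and the names find_control_chars 
and ALLOWED are hypothetical. Python's ord() plays the role that 
codePointToNum plays in the quote. The 128-255 range is deliberately 
left out here, since which of those count as non-printable depends on 
your text.

```python
# Codepoints to tolerate: tab (9), line feed (10), carriage return (13).
ALLOWED = {9, 10, 13}

def find_control_chars(text, allowed=ALLOWED):
    """Return (index, codepoint) pairs for chars in 0-31 or 127 (DEL)."""
    hits = []
    for index, ch in enumerate(text):
        n = ord(ch)
        if (n < 32 or n == 127) and n not in allowed:
            hits.append((index, n))
    return hits

# NUL at index 1 and DEL at index 5 are reported; the tab is allowed.
print(find_control_chars("a\x00b\tc\x7f"))   # [(1, 0), (5, 127)]
```

Pass an empty set as allowed if you also want tabs and line endings 
reported.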

To build on that answer, usually I would aim to use regex. (The 
exception: if you know the type of text and exactly which invisible 
chars to expect in your particular app, and there are only 1 or 2 of 
them, I'd use a normal replace.)

More specifically, regex supports named Unicode categories, such as 
control characters and format characters. I would use those whenever 
possible. That way most of the hard work is already done for you, rather 
than painstakingly building up a list of possible chars and ranges from 
scratch for both ANSI and Unicode. But you should expect to tweak an 
expression for your purposes!
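
To make the category idea concrete, here is a rough sketch. In a 
PCRE-style engine the expression would be along the lines of 
[\p{Cc}\p{Cf}] (Cc = control, Cf = format); whether your LC version 
accepts that syntax is an assumption to verify. Python's standard re 
module has no \p{...} support, so this illustration swaps in 
unicodedata.category, which reports the same named categories. The name 
strip_invisible and its parameters are hypothetical.

```python
import unicodedata

def strip_invisible(text, categories=("Cc", "Cf"), keep="\t\n\r"):
    """Drop chars whose Unicode category is listed, except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) not in categories
    )

# Zero width space (Cf) and bell (Cc) are removed.
# No-break space (Zs) and the newline (kept explicitly) survive.
sample = "zero\u200bwidth\u00a0and\x07bell\n"
print(repr(strip_invisible(sample)))   # 'zerowidth\xa0andbell\n'
```

The tweaking shows up quickly: the zero width joiner (U+200D) is also a 
format character, and some scripts and emoji sequences rely on it, so 
you may want it in keep rather than stripped.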

I'll stop there for now. Too much info, or not enough?

I have to admit that I love languages, scripts, formats, Unicode, and 
working with LC texts and fields! Always a fascinating subject. Thanks 
for bringing it up.

Back to work, take care everyone....

Best wishes,

Curry Kenworthy

Custom Software Development
"Better Methods, Better Results"
LiveCode Training and Consulting
http://livecodeconsulting.com/



