Unicode and chunk expressions
Dar Scott
dsc at swcp.com
Tue May 17 17:52:48 EDT 2005
On May 17, 2005, at 3:06 PM, Dar Scott wrote:
> You can convert to UTF8 and then work with the chunk expression for
> line and item and (maybe word).
I forgot to say not char.
In UTF8, all characters in the ASCII range, including the chunking
syntax characters (comma, space, tab, return), are single bytes with
the high bit zero. Every byte of every other character has the high
bit one. That means a byte inside a multi-byte character can never
look like a syntax character.
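Here is a rough sketch of how you might check that for yourself. It
assumes a char in script is one byte, and the name countHighBytes is
just made up for illustration:

```livecode
-- Hypothetical helper: counts the bytes in pUTF8 that have the high bit set.
-- Assumes one char in script is one byte.
function countHighBytes pUTF8
   local tCount
   put 0 into tCount
   repeat for each char tByte in pUTF8
      -- bytes 128..255 have the high bit set, so they are part of a
      -- multi-byte character and cannot equal comma, space, tab or return
      if charToNum(tByte) >= 128 then add 1 to tCount
   end repeat
   return tCount
end countHighBytes
```

Pure ASCII text gives 0. Anything above that is the number of bytes
that belong to non-ASCII characters.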
I made this little handler:
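on mouseUp
   -- the field's text as UTF16
   get the unicodeText of field "field"
   -- convert UTF16 to UTF8
   put uniDecode(it,"UTF8") into utf8Text
   -- put the UTF8 bytes as hex digits into h
   get binaryDecode("H*",utf8Text,h)
   -- show the UTF8 text and its hex in the message box
   put utf8Text & lf & h
end mouseUp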
I put this into field "field":
a 3, b 21, ü
I clicked the button and got this:
a 3, b 21, ü
6120332c20622032312c20c3bc
Broken up by characters:
61
20
33
2c
20
62
20
32
31
2c
20
c3bc
As you can see, words just look longer because some characters take
two to four bytes. You can spot them by their bytes with the high bit
set (c3bc here).
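The by-character breakdown can be produced by script, too. This is only
a sketch, assuming a char in script is one byte and well-formed UTF8.
The name utf8HexByChar is made up. It relies on the fact that
continuation bytes in UTF8 are always in the range 128 to 191, so any
other byte starts a new character:

```livecode
-- Hypothetical helper: returns the hex of pUTF8, one character per line.
-- Assumes one char in script is one byte and pUTF8 is well-formed UTF8.
function utf8HexByChar pUTF8
   local tResult, tHex, tCode
   repeat for each char tByte in pUTF8
      put charToNum(tByte) into tCode
      -- two hex digits for this one byte
      get binaryDecode("H2", tByte, tHex)
      -- 128..191 is a continuation byte: keep it on the current line
      if tResult is empty or (tCode >= 128 and tCode <= 191) then
         put tHex after tResult
      else
         -- any other byte starts a new character
         put return & tHex after tResult
      end if
   end repeat
   return tResult
end utf8HexByChar
```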
If you get item 3, you get the right text.
You can then convert results back to UTF16 (host order).
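Something like this sketch does the round trip. The field "result" is
an assumption on my part, and so is the variable name utf8Item:

```livecode
on mouseUp
   -- UTF16 from the field, converted to UTF8 for chunking
   put uniDecode(the unicodeText of field "field","UTF8") into utf8Text
   -- item chunks are safe on UTF8; with the example text this is " ü"
   put item 3 of utf8Text into utf8Item
   -- convert the result back to UTF16 and show it
   -- (field "result" is assumed to exist)
   set the unicodeText of field "result" to uniEncode(utf8Item,"UTF8")
end mouseUp
```

Note that the item keeps its leading space, just as it would with plain
ASCII text.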
Unfortunately, a BOM can interfere with this so remove it when you
convert to UTF8.
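A UTF16 BOM (U+FEFF) comes out as the three bytes EF BB BF in UTF8, so
a minimal sketch of removing it, assuming a char in script is one byte,
might look like this (stripUTF8BOM is a made-up name):

```livecode
-- Hypothetical helper: drops a leading UTF8 BOM (bytes EF BB BF), if any.
-- Assumes one char in script is one byte.
function stripUTF8BOM pUTF8
   -- compare the bytes exactly, not case-insensitively
   set the caseSensitive to true
   if char 1 to 3 of pUTF8 is numToChar(239) & numToChar(187) & numToChar(191) then
      delete char 1 to 3 of pUTF8
   end if
   return pUTF8
end stripUTF8BOM
```

Otherwise the BOM ends up glued to the front of line 1, item 1 and
word 1.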
This will work with characters that require surrogates in UTF16, too,
so this is a nice general solution.
If you mostly need to work with chars, then leave it in UTF16 and work
with that, taking two bytes (each byte is a char in your script) at a
time. You can very often assume you are working with characters in the
primary plane (the Basic Multilingual Plane), so you have no
surrogates and every two bytes is one Unicode character.
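As a sketch of that, assuming a char in script is one byte, no BOM at
the front, and no surrogates (uniCharAt is a made-up name):

```livecode
-- Hypothetical helper: returns the pIndex-th character of pUTF16 as UTF16.
-- Assumes one char in script is one byte, no BOM, and no surrogates.
function uniCharAt pUTF16, pIndex
   -- character n occupies bytes 2n-1 and 2n
   return char (2 * pIndex - 1) to (2 * pIndex) of pUTF16
end uniCharAt
```

Under those assumptions, with "a 3, b 21, ü" in the field,
uniCharAt(the unicodeText of field "field", 12) should be the ü, ready
to set as the unicodeText of another field.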
Dar
--
**********************************************
DSC (Dar Scott Consulting & Dar's Lab)
http://www.swcp.com/dsc/
Programming and software
**********************************************