Unicode and chunk expressions

Dar Scott dsc at swcp.com
Tue May 17 17:52:48 EDT 2005


On May 17, 2005, at 3:06 PM, Dar Scott wrote:

> You can convert to UTF8 and then work with the chunk expression for 
> line and item and (maybe word).

I forgot to say not char.

In UTF8, all characters in the ASCII range, including the chunk 
delimiter characters (return, comma, space), have the high bit zero.  
Every byte of every other character has the high bit one.  That means 
a delimiter byte can never show up inside a multibyte character, so 
you can't get any false delimiters.
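
The property can be checked outside Revolution. Below is a minimal Python sketch (Python is used only for illustration, and the function name has_false_ascii_bytes is made up for this example) that looks for any byte below 80 hex inside the encoding of a non-ASCII character:

```python
def has_false_ascii_bytes(text):
    """Return True if any non-ASCII character in text produces
    a UTF-8 byte in the ASCII range (below 0x80)."""
    for ch in text:
        encoded = ch.encode("utf-8")
        # Multibyte sequences are the only place a false delimiter could hide.
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return True
    return False


print(has_false_ascii_bytes("a 3, b 21, ü"))  # False
```

For any valid text this should report False, which is why splitting UTF8 on comma, return, or space is safe.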

I made this little handler:

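on mouseUp
   -- the unicodeText is UTF16 in host byte order
   get the unicodeText of field "field"
   -- uniDecode with "UTF8" converts the UTF16 to UTF8
   put uniDecode(it,"UTF8") into utf8Text
   -- "H*" puts all the bytes into h as hex digits
   get binaryDecode("H*",utf8Text,h)
   put utf8Text & lf & h
end mouseUp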

I put this into field "field":
a 3, b 21, ü

I clicked the button and got this:
a 3, b 21, ü
6120332c20622032312c20c3bc

Broken up by characters:
61
20
33
2c
20
62
20
32
31
2c
20
c3bc

As you can see, words just look longer in bytes because each non-ASCII 
character takes two to four bytes.  You can spot those characters by 
their bytes, which all have the high bit set (80 or above in hex).

If you get item 3, you get the right text (here a space followed by 
the two bytes c3bc).
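
As a rough cross-check of the item 3 claim, here is a small Python sketch, not Revolution code. The helper name utf8_item is hypothetical, and it mimics the item chunk by splitting the UTF8 bytes on the comma byte:

```python
def utf8_item(text, n):
    """Return item n (1-based, comma-delimited) of text as UTF-8 bytes."""
    utf8_bytes = text.encode("utf-8")
    # Splitting on the single byte 0x2c is safe: it never occurs
    # inside a multibyte UTF-8 sequence.
    return utf8_bytes.split(b",")[n - 1]


item3 = utf8_item("a 3, b 21, ü", 3)
print(item3.hex())            # 20c3bc
print(item3.decode("utf-8"))  # a space followed by ü
```

The bytes of the item decode cleanly on their own, so no character has been cut in half.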

You can then convert results back to UTF16 (host order) with 
uniEncode(x,"UTF8").

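Unfortunately, a BOM (FEFF in UTF16, which becomes the three bytes 
efbbbf in UTF8) can interfere with this because it ends up glued to 
the front of the first chunk, so remove it when you convert to UTF8.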
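
To show what removing the BOM amounts to at the byte level, here is a minimal Python sketch. The function name strip_utf8_bom is made up for this example, and codecs.BOM_UTF8 is the standard library constant for the bytes efbbbf:

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (ef bb bf), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = "\ufeffa,b".encode("utf-8")
print(with_bom.split(b",")[0].hex())                  # efbbbf61
print(strip_utf8_bom(with_bom).split(b",")[0].hex())  # 61
```

Without the stripping, item 1 of a BOM-prefixed "a,b" is the four bytes efbbbf61 rather than the single byte 61.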

This will work with characters that require surrogates in UTF16, too, 
so this is a nice general solution.

If you mostly need to work with chars, then leave it in UTF16 and work 
with that, taking two bytes (two chars in your script) at a time.  You 
can very often assume you are working with characters in the primary 
plane (the Basic Multilingual Plane), so there are no surrogates and 
every two bytes is one Unicode character.
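
The two-bytes-at-a-time approach, along with a check of the no-surrogates assumption, might look like the following Python sketch. It is an illustration only: the names utf16_units and has_surrogates are hypothetical, and the byte order defaults to the host order, matching the "host order" note above.

```python
import sys


def utf16_units(data, byteorder=sys.byteorder):
    """Split UTF-16 bytes into a list of 16-bit code units, two bytes at a time."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data), 2)]


def has_surrogates(units):
    """Return True if any code unit lies in the surrogate range D800 to DFFF."""
    return any(0xD800 <= u <= 0xDFFF for u in units)


units = utf16_units("aü".encode("utf-16-le"), "little")
print([hex(u) for u in units])  # ['0x61', '0xfc']
print(has_surrogates(units))    # False
```

If has_surrogates reports True, the text contains characters outside the primary plane, and the one-character-per-two-bytes assumption no longer holds.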

Dar

-- 
**********************************************
     DSC (Dar Scott Consulting & Dar's Lab)
     http://www.swcp.com/dsc/
     Programming and software
**********************************************


