best/fastest way to tell if a field contains unicode text?

Fraser Gordon fraser.gordon at runrev.com
Thu Mar 20 14:10:30 EDT 2014


On 20 Mar 2014, at 17:39, Mark Wieder <mwieder at ahsoftware.net> wrote:

> put uniDecode("hello, bucko")
> 
> converts the text to 敨汬Ɐ戠捵潫.

Thinking about this a bit more, I ought to write something up about how text and binary work in the 7.0 engine and how this relates to the existing ways of doing things.

The short version is that text and binary data are now distinct things, and surprising results can occur when the engine converts between them implicitly. As a rough guide:

unicodeText of ...:   binary data, encoded as UTF-16
text of ...:          text (Unicode, handled transparently)
I/O:                  expects and produces binary data
uniEncode/uniDecode:  accept and produce binary data

When the engine implicitly converts binary data -> Unicode, it treats the binary data as native characters.

When the engine implicitly converts Unicode -> binary data, it converts to native characters and replaces any character that cannot be represented with '?'.
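
As a rough sketch of why the quoted example comes out looking like CJK text: uniDecode wants UTF-16 binary data, so the text is first converted to native bytes, and pairs of those bytes are then read as single UTF-16 code units. The handler below does the same thing explicitly with textEncode/textDecode. It assumes the original string was "hello, bucko" (the Ɐ in the output implies a comma) and picks "UTF-16LE" to pin down the byte order; the handler and variable names are only for illustration.

```livecode
on mouseUp
   local tBytes, tMisread

   -- step 1: roughly what the implicit text -> binary conversion does
   -- (one native byte per character)
   put textEncode("hello, bucko", "native") into tBytes

   -- step 2: read those 12 bytes as 6 UTF-16 code units instead
   put textDecode(tBytes, "UTF-16LE") into tMisread

   -- show the mangled text in the message box
   put tMisread
end mouseUp
```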

The "byte" chunk expression operates on binary data.

The "word", "char", etc. chunk expressions operate on text.
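
To make the byte/char distinction concrete, here is a minimal sketch that counts chars in a piece of text and bytes in its UTF-8 encoding. The sample string and variable names are assumptions made for the example; numToCodepoint(233) is used to build "é" so the script itself stays plain ASCII.

```livecode
on mouseUp
   local tText, tBytes

   -- "caf" followed by e-acute (U+00E9): text, counted in chars
   put "caf" & numToCodepoint(233) into tText

   -- the same content as binary data; e-acute takes two bytes in UTF-8
   put textEncode(tText, "UTF-8") into tBytes

   -- report both counts in the message box
   put "chars in text:" && the number of chars in tText & return & \
         "bytes in UTF-8 data:" && the number of bytes in tBytes
end mouseUp
```

The two counts differ because "char" is counting characters of text while "byte" is counting the encoded binary data.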

To convert from text to binary, use textEncode, e.g. textEncode("Hello, World!", "UTF-8").

To convert from binary to text, use textDecode, e.g. textDecode(url(...), "UTF-8").
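
Putting the two together, a file round trip might look something like the sketch below. The file name and the fields "Source" and "Result" are hypothetical and would need to exist on the card; using a "binfile:" URL here is one way to make sure the data passes through untouched.

```livecode
on mouseUp
   local tPath, tText

   -- hypothetical file location, for illustration only
   put specialFolderPath("documents") & "/unicode-test.txt" into tPath

   -- text -> binary: encode explicitly before writing
   put textEncode(the text of field "Source", "UTF-8") into url ("binfile:" & tPath)

   -- binary -> text: decode explicitly after reading
   put textDecode(url ("binfile:" & tPath), "UTF-8") into tText
   put tText into field "Result"
end mouseUp
```

Encoding and decoding explicitly at the file boundary avoids relying on the implicit native-character conversion described above.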

Hope that helps explain what is going on. I'll write it up a bit more thoroughly so people have a guide to using Unicode (i.e. it is transparent except where it can't be, like dealing with files*).

*except in some cases ;)

Regards,
Fraser



More information about the use-livecode mailing list