best/fastest way to tell if a field contains unicode text?

Ben Rubinstein benr_mc at cogapp.com
Thu Mar 20 14:40:00 EDT 2014


On 20/03/2014 15:37, Geoff Canyon wrote:
> I have a field that has been populated by setting the unicodetext. Some
> lines actually need unicode -- umlauts, enye, etc. -- and others are plain
> ascii.
>
> What's the most efficient way to count how many lines are plain and how
> many actually need unicode?

Could you (when all the uni-7 stuff has settled down and we have proper 
conversions etc) convert text from unicode to UTF8, and also to an 8- or 7-bit 
representation, and compare the number of bytes in these two representations?

If the lengths are the same in both the UTF8 and ISO-8859-1 versions, then 
every character takes a single byte in UTF8.

That means in fact that all the characters are plain 7-bit ASCII: the one-byte 
characters in UTF8 are exactly the ASCII range (code points 0-127), not the 
whole of ISO-8859-1. Characters in the upper half of ISO-8859-1, such as 
umlauts and enye, take two bytes in UTF8, so a line containing them comes out 
longer in UTF8 than in ISO-8859-1.
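
As a rough sketch of that length comparison, here it is in Python rather than 
LiveCode, purely to show the idea. The function name count_plain_lines and the 
sample text are made up for illustration. It compares the UTF8 byte count of 
each line with its character count, which is the length any single-byte 
encoding would give.

```python
def count_plain_lines(text):
    """Return (plain, needs_unicode) line counts for the given text."""
    plain = 0
    needs_unicode = 0
    for line in text.split("\n"):
        # A line is pure ASCII exactly when every character takes one
        # byte in UTF-8, i.e. byte length equals character count.
        if len(line.encode("utf-8")) == len(line):
            plain += 1
        else:
            needs_unicode += 1
    return plain, needs_unicode


# Hypothetical sample: two ASCII lines, two with accented characters.
sample = "hello\nM\u00fcnchen\nma\u00f1ana\nplain again"
print(count_plain_lines(sample))
```

Note that this counts a line as 'plain' only if it is strictly ASCII; a line 
with an umlaut fits in ISO-8859-1 but still lands in the second count.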

Depending on your definition of 'plain', that may suffice.  If you want to 
test for plain ASCII directly, you can convert one more time, to ASCII, and 
compare the actual text of the original and ASCII versions - if they differ 
that should be because some characters that aren't in ASCII have been replaced 
with "?", so it ain't ASCII.  (Unless the textDecode system is cute and eg 
tries to replace 'smart' quotes with plain ones...)
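
The replace-with-"?" comparison can be sketched the same way, again in Python 
and only as an illustration. The helper name survives_encoding is hypothetical, 
and whether LiveCode's conversions substitute "?" in the same way is an 
assumption I have not checked. The text is encoded with unrepresentable 
characters replaced by "?", decoded again, and compared with the original.

```python
def survives_encoding(text, encoding):
    """True if text round-trips through the encoding unchanged."""
    # errors="replace" substitutes "?" for anything the target
    # encoding cannot represent, instead of raising an error.
    round_tripped = text.encode(encoding, errors="replace").decode(encoding)
    return round_tripped == text


print(survives_encoding("cafe", "ascii"))         # plain ASCII text
print(survives_encoding("caf\u00e9", "ascii"))    # e-acute is not ASCII
print(survives_encoding("caf\u00e9", "latin-1"))  # but it is in ISO-8859-1
```

Comparing against the original text, rather than against another converted 
copy, means a literal "?" in the source does not cause a false alarm.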

Ben



