best/fastest way to tell if a field contains unicode text?
Ben Rubinstein
benr_mc at cogapp.com
Thu Mar 20 14:40:00 EDT 2014
On 20/03/2014 15:37, Geoff Canyon wrote:
> I have a field that has been populated by setting the unicodetext. Some
> lines actually need unicode -- umlauts, enye, etc. -- and others are plain
> ascii.
>
> What's the most efficient way to count how many lines are plain and how
> many actually need unicode?
Could you (when all the uni-7 stuff has settled down and we have proper
conversions etc) convert the text from unicode to UTF8, and also to an 8- or
7-bit representation such as ISO-8859-1, and compare the number of bytes in
these two representations?
If the lengths are the same in both the UTF8 and ISO-8859-1 versions, then all
the characters could be represented in a single byte in UTF8.
That in fact means that all the characters are plain 7-bit ASCII: the one-byte
characters in UTF8 correspond exactly to ASCII, and the upper half of
ISO-8859-1 (umlauts, enye, etc.) takes two bytes per character in UTF8.
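
For what it's worth, here is a rough sketch of the byte-length comparison. It
is written in Python purely to illustrate the arithmetic; it is not LiveCode,
and the function names are made up for the example. It assumes that characters
with no ISO-8859-1 equivalent are replaced by a single "?" byte. One thing it
makes visible: only code points below 128 take one byte in UTF8, so "ü" and
"ñ" already make the two lengths differ.

```python
def is_single_byte_utf8(line: str) -> bool:
    """True when every character of the line fits in one UTF-8 byte."""
    utf8 = line.encode("utf-8")
    # Assumption: unrepresentable characters become one "?" byte each,
    # so the 8-bit version always has exactly one byte per character.
    latin1 = line.encode("iso-8859-1", errors="replace")
    return len(utf8) == len(latin1)


def count_plain(lines):
    """Return (plain, needs_unicode) counts for a list of lines."""
    plain = sum(1 for line in lines if is_single_byte_utf8(line))
    return plain, len(lines) - plain
```

Called on the lines of the field, count_plain would give the two totals asked
for in the original question.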
Depending on your definition of 'plain', that may suffice. If you want to
confirm directly that a line is plain ASCII, you can convert one more time, to
ASCII, and compare the actual text of the ISO-8859-1 and ASCII versions - if
they differ that should be because some characters that aren't in ASCII have
been replaced with "?", so it ain't ASCII. (Unless the textDecode system is
cute and e.g. tries to replace 'smart' quotes with plain ones...)
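
A sketch of that "?" check, again in Python rather than LiveCode and with a
made-up function name. It compares the ASCII version against the original text
instead of against the ISO-8859-1 version, on the assumption that a character
missing from both encodings would turn into "?" in both and so compare equal.
Python's encoder does no 'cute' substitution, so smart quotes simply fail the
test; other systems may behave differently.

```python
def survives_ascii(line: str) -> bool:
    """True when converting the line to ASCII and back changes nothing."""
    # Characters outside ASCII are replaced with "?" by errors="replace".
    ascii_bytes = line.encode("ascii", errors="replace")
    return ascii_bytes.decode("ascii") == line
```

A line that already contains a literal "?" still passes, because the
comparison is on the whole text rather than on the presence of "?".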
Ben
More information about the use-livecode
mailing list