Some thoughts on duck typing
Peter W A Wood
peterwawood at gmail.com
Wed Jan 12 18:26:33 EST 2011
On 13 Jan 2011, at 01:55, Jeff Massung wrote:
> - Next, determine text vs. binary. This is usually done by just grabbing the
> first N (where N is ~1000) bytes and look for any that are < 10 or > 127. If
> you find any, it's binary - or unicode.
This is only true if the text is 7-bit encoded, which is very rare these days. (In fact, it isn't entirely true even then: code points 0 to 9 are valid ASCII control characters, though apart from tab (9) they are not often found in files.) The default text encodings on Mac Classic (MacRoman) and Windows (Codepage 1252 in the US and Western Europe) are both 8-bit. The above test would only work if the text contained no accented characters.
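
To make the point concrete, here is a rough Python sketch of the test as I read it from the quoted description. The function name and the sample strings are my own inventions for illustration, not anything from Jeff's code.

```python
def looks_binary_naive(data: bytes, sample_size: int = 1000) -> bool:
    """The quoted heuristic: any byte < 10 or > 127 in the first N bytes means 'binary'."""
    sample = data[:sample_size]
    return any(b < 10 or b > 127 for b in sample)


# Hypothetical samples: the same short word in several encodings.
samples = {
    "plain ASCII": "cafe".encode("ascii"),
    "tab-separated ASCII": "a\tb".encode("ascii"),        # tab is byte 9
    "accented, CP1252": "caf\u00e9".encode("cp1252"),      # e-acute is one byte > 127
    "accented, MacRoman": "caf\u00e9".encode("mac_roman"),
    "accented, UTF-8": "caf\u00e9".encode("utf-8"),
}

for label, data in samples.items():
    print(f"{label}: flagged as binary = {looks_binary_naive(data)}")
```

Only the first sample passes as text. The tab-separated one is flagged because of the "< 10" rule, and every accented one is flagged because of the "> 127" rule, whichever 8-bit encoding or UTF-8 is used.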
> Remember that while UTF8 is not ASCII, it's designed to be indistinguishable
> from ASCII most of the time. I don't have any advice to give you here on how
> to determine if the file is unicode text or not... as I understand it this
> is really a difficult problem to solve. I'm sure Google can help, though.
UTF-8 is designed to be indistinguishable from 7-bit ASCII as long as only characters 0 - 127 are used (they are encoded identically in both). However, byte values in the range 128 - 255 are used very differently by UTF-8, the Windows codepages and MacRoman.
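
A small Python sketch of that difference, using the codec names Python happens to ship with ("utf-8", "cp1252", "mac_roman"). The helper name try_decode is just something I made up for the example.

```python
E_ACUTE = "\u00e9"  # e with acute accent
ENCODINGS = ("utf-8", "cp1252", "mac_roman")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Characters 0 - 127: the bytes are the same in ASCII and UTF-8.
ascii_same = "plain text".encode("ascii") == "plain text".encode("utf-8")
print("ASCII and UTF-8 bytes identical:", ascii_same)

# One accented character, three different byte sequences.
encoded = {enc: E_ACUTE.encode(enc) for enc in ENCODINGS}
for enc, data in encoded.items():
    print(f"{enc}: {data.hex()}")

# One byte above 127, read back under each encoding.
decoded = {enc: try_decode(b"\xe9", enc) for enc in ENCODINGS}
for enc, text in decoded.items():
    print(f"byte e9 read as {enc}: {text!r}")
```

The single byte E9 is the accented character in CP1252, a different character in MacRoman, and not valid UTF-8 at all when it stands alone. Checking whether a file decodes cleanly as UTF-8 is therefore a reasonable first guess at "is this UTF-8?", but it cannot tell CP1252 from MacRoman.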