Distinguishing between ASCII and UTF8
massung at gmail.com
Wed Oct 6 16:29:37 EDT 2010
On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
<ambassador at fourthworld.com> wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is
> no BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?
Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
is that it's indistinguishable from ASCII for byte values 0-127. You may be
able to scan the files and, if they are large enough, try to deduce something
from them to know which they are. For example:
On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
In ASCII there will never be a NULL terminator anywhere (byte 0). There are
likely many 0-byte values in any appreciably large Unicode file. This would
also be true of byte 8 (backspace) and byte 7 (the bell) and probably a few
If the number of bytes that have the high bit (0x80) set is extremely low
(far below 1%), then most likely it's ASCII.
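
The checks above could be sketched roughly as follows in Python. This is only
an illustration of the heuristics, not a reliable detector: the function names
byte_stats and looks_like_ascii are made up for this example, and the 1%
cutoff is an assumed upper bound taken from the rule of thumb above, so tune
it against your own files.

```python
def byte_stats(data: bytes) -> dict:
    """Collect the byte-level counts discussed above for a file's raw bytes."""
    total = len(data)
    # Count bytes with the high bit (0x80) set, i.e. values 128-255.
    high = sum(1 for b in data if b & 0x80)
    return {
        "total": total,
        "crlf_pairs": data.count(b"\r\n"),       # Windows line endings (13, 10)
        "nul_bytes": data.count(b"\x00"),        # byte 0
        "bell_bytes": data.count(b"\x07"),       # byte 7
        "backspace_bytes": data.count(b"\x08"),  # byte 8
        "high_bit_ratio": high / total if total else 0.0,
    }


def looks_like_ascii(data: bytes, max_high_ratio: float = 0.01) -> bool:
    """Guess whether data is plain text with at most a few stray high bytes.

    max_high_ratio is an assumed threshold, not a value from any standard.
    """
    stats = byte_stats(data)
    return (
        stats["nul_bytes"] == 0
        and stats["bell_bytes"] == 0
        and stats["backspace_bytes"] == 0
        and stats["high_bit_ratio"] < max_high_ratio
    )
```

To try it on a file, read the file in binary mode, for example
open(path, "rb").read(), and pass the result to looks_like_ascii. A False
result only means the bytes do not look like plain text by these rules; it
does not tell you which encoding the file actually uses.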