Distinguishing between ASCII and UTF8

Jeff Massung massung at gmail.com
Wed Oct 6 16:29:37 EDT 2010


On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
<ambassador at fourthworld.com> wrote:

> I have an app that needs to auto-detect Unicode and plain text, and render
> them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
> no BOM to let me know if it's Unicode, and some plain text files will
> occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to
> distinguish UTF8 from plain text?
>
>
Sorry, Richard, but I believe you are mostly out of luck here. The idea
behind UTF8 is that it's byte-for-byte identical to ASCII in the 0-127
range, so any difference can only show up in bytes above 127. You may be
able to scan the files, and if they are large enough, try to deduce
something from them to know which they are. For example:

On Windows, "\r\n" (13, 10) should terminate lines. If you see that pair
throughout the file, it could very well be a text file.
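
A minimal sketch of that line-ending check, in Python rather than LiveCode.
The function name is made up for illustration, and the input is assumed to
be the raw bytes of the file. It only counts line endings; deciding how many
"\r\n" pairs are enough is left to the caller.

```python
def line_ending_counts(data: bytes) -> dict:
    """Count CRLF pairs, bare LFs and bare CRs in raw file bytes."""
    crlf = data.count(b"\r\n")
    # Every CRLF pair also contains one LF and one CR, so subtract them
    # to get the endings that appear on their own.
    bare_lf = data.count(b"\n") - crlf
    bare_cr = data.count(b"\r") - crlf
    return {"crlf": crlf, "lf": bare_lf, "cr": bare_cr}
```

A file where "crlf" is high and the other two counts are zero is consistent
with Windows text, but that alone says nothing about ASCII versus UTF8.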

In ASCII there will never be a NULL terminator anywhere (byte 0). There are
likely many 0-byte values in any appreciably large UTF16 file, though UTF8
never produces a 0 byte for anything other than the NULL character itself.
The same goes for byte 8 (backspace), byte 7 (the bell), and probably a few
others.
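
Here is a rough sketch of the control-byte scan in Python. The set of
control bytes treated as normal (tab, LF, form feed, CR) is my assumption,
and the function name is hypothetical; adjust both to suit your files.

```python
# Control bytes that commonly appear in ordinary text files (assumption):
# tab (9), line feed (10), form feed (12), carriage return (13).
ALLOWED_CONTROLS = {9, 10, 12, 13}

def suspicious_control_bytes(data: bytes) -> dict:
    """Map each unexpected control byte (value below 32) to its count."""
    counts = {}
    for b in data:
        if b < 32 and b not in ALLOWED_CONTROLS:
            counts[b] = counts.get(b, 0) + 1
    return counts
```

An empty result suggests plain text or UTF8. Lots of 0 bytes points toward
UTF16 or binary data instead.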

If the number of bytes that have the high bit (0x80) set is extremely low
(well under 1%) then most likely it's ASCII with the occasional high-ASCII
character.
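
A small Python sketch of the high-bit count follows. Both function names are
hypothetical. The second function is not part of the heuristic above: it is
an extra check that relies on Python's strict UTF8 decoder, on the assumption
that stray high-ASCII bytes rarely form valid UTF8 multi-byte sequences.

```python
def high_bit_ratio(data: bytes) -> float:
    """Fraction of bytes with the high bit (0x80) set; 0.0 for empty input."""
    if not data:
        return 0.0
    high = sum(1 for b in data if b & 0x80)
    return high / len(data)

def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF8 (pure ASCII also passes)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False
```

For example, a dagger stored as the single byte 0x86 fails the decode check,
while the same character encoded as UTF8 passes it.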

HTH,

Jeff M.


