Distinguishing between ASCII and UTF8

Peter W A Wood peterwawood at gmail.com
Wed Oct 6 19:32:42 EDT 2010


> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?

These are the "Rules of Thumb" that I have used to try to determine the encoding type of text files. I feel that I achieved more than 90 per cent success, but that may be because most of the files only included true ASCII characters (0-127). The script only tries to distinguish between ASCII, UTF-8, MacRoman and Windows 1252 Codepage (the US default for Windows).

Rules of Thumb, applied in the following order:

1. If the string starts with a BOM, the encoding inferred from the BOM will be returned.

2. If the string contains only characters in the range 0x00 - 0x7F, it is an ASCII string.

3. If the string contains more valid UTF-8 multi-byte characters than it does invalid UTF-8 characters and invalid multi-byte sequences, it is a UTF-8 string.

4. If the string contains characters in the range 0xA0 - 0xFF but none in the range 0x80 - 0x9F, it is an ISO-8859-1 string.

5. If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a MacRoman string.

6. If the string contains carriage returns but no line feeds, it is a MacRoman string.

7. It is a Windows 1252 Codepage string.
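
For anyone who would rather read code than rules, here is a rough Python sketch of the seven rules above, applied in the same order. It is not a translation of my REBOL script: the names guess_encoding and count_utf8 and the returned labels are made up for illustration, and counting each invalid byte individually is just one possible reading of rule 3.

```python
import codecs

# Rule 1 table. UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# The byte values named in rule 5.
MAC_ONLY = {0x81, 0x8D, 0x8F, 0x90, 0x9D}


def count_utf8(data):
    """Return (valid multi-byte sequences, invalid bytes) found in data."""
    valid = invalid = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # plain ASCII byte, not counted either way
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:                        # stray continuation byte or illegal lead byte
            invalid += 1
            i += 1
            continue
        try:
            # A strict decode rejects truncated, overlong and surrogate sequences.
            data[i:i + length].decode("utf-8")
            valid += 1
            i += length
        except UnicodeDecodeError:
            invalid += 1
            i += 1
    return valid, invalid


def guess_encoding(data):
    """Apply the rules of thumb, in order, to a bytes object."""
    for bom, name in BOMS:                                  # rule 1
        if data.startswith(bom):
            return name
    if all(b < 0x80 for b in data):                         # rule 2
        return "ascii"
    valid, invalid = count_utf8(data)                       # rule 3
    if valid > invalid:
        return "utf-8"
    has_a0_ff = any(b >= 0xA0 for b in data)
    has_80_9f = any(0x80 <= b <= 0x9F for b in data)
    if has_a0_ff and not has_80_9f:                         # rule 4
        return "iso-8859-1"
    if any(b in MAC_ONLY for b in data):                    # rule 5
        return "mac-roman"
    if b"\r" in data and b"\n" not in data:                 # rule 6
        return "mac-roman"
    return "cp1252"                                         # rule 7
```

To try it, read the file in binary mode and pass the bytes in, e.g. guess_encoding(open(path, "rb").read()). The result is only a guess, as with the rules themselves.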

The approach I take in the script is to count the different types of characters in the text and then apply the rules of thumb. The script is written in REBOL, so it will probably not even be of help as a guide. However, the documentation includes a table of the differences between UTF-8, Windows 1252 and MacRoman which you may find useful. You can find it at http://www.rebol.org/documentation.r?script=str-enc-utils.r


