Distinguishing between ASCII and UTF8
Bob Sneidar
bobs at twft.com
Thu Oct 7 12:59:25 EDT 2010
Okay, so that raises the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
Just being practical.
Bob
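
A small illustration of where the two stop being the same: for characters 0-127 the bytes are identical, but anything above that range (such as the dagger mentioned in the quoted message) is encoded differently. This is a Python sketch, not something from the thread; the sample string and the use of Windows-1252 as the "high-ASCII" code page are assumptions for illustration.

```python
# Pure ASCII text produces the same bytes under both encodings.
text = "plain text"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes

# The dagger (U+2020) is outside ASCII, so the encodings diverge.
dagger = "\u2020"
dagger_utf8 = dagger.encode("utf-8")      # three bytes in UTF-8
dagger_cp1252 = dagger.encode("cp1252")   # one byte in Windows-1252 (assumed code page)

# Strict ASCII cannot represent the dagger at all.
try:
    dagger.encode("ascii")
    dagger_is_ascii = True
except UnicodeEncodeError:
    dagger_is_ascii = False

print(same, dagger_utf8, dagger_cp1252, dagger_is_ascii)
```

So converting only changes anything when the text contains bytes above 127, which is exactly the case the original question is about.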
On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
> <ambassador at fourthworld.com> wrote:
>
>> I have an app that needs to auto-detect Unicode and plain text, and render
>> them correctly based on that auto-detection.
>>
>> I have the UTF16 stuff working, but with UTF8 I have a problem: there is
>> no BOM to let me know if it's Unicode, and some plain text files will
>> occasionally have high-ASCII values in them (like the dagger symbol).
>>
>> What patterns should I be looking for in the binary data of a file to
>> distinguish UTF8 from plain text?
>>
>>
> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
> is that it's indistinguishable from ASCII (0-127). You may be able to scan
> the files, and if they are large enough, try to deduce something from them
> to know which they are. For example:
>
> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
> text file.
>
> In ASCII text there will normally never be a NUL byte (byte 0) anywhere.
> There are likely many 0-byte values in any appreciably large UTF-16 file
> (UTF-8 text, like ASCII, normally has none). The same goes for byte 8
> (backspace), byte 7 (the bell) and probably a few others.
>
> If the number of bytes that have the high bit (0x80) set is extremely low
> (<<< 1%) then most likely it's ASCII.
>
> HTH,
>
> Jeff M.
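
The heuristics above can be sketched in code. The following Python function is a rough illustration only, under stated assumptions: the name guess_encoding and its return labels are hypothetical, and the strict UTF-8 decode is an additional check that is not suggested in the thread. It relies on the fact that text in a legacy 8-bit code page with stray high bytes is rarely also valid UTF-8, but a short file could still be misclassified.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess how a file's bytes are encoded (hypothetical helper, heuristic only)."""
    # A byte-order mark, when present, settles the question.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No byte with the high bit (0x80) set: plain ASCII, which is also valid UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    # High bytes are present. UTF-8 multi-byte sequences follow strict rules,
    # so a strict decode fails on most legacy 8-bit text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "8-bit"  # some single-byte code page; which one is unknown
    return "utf-8"


print(guess_encoding(b"hello\r\n"))
print(guess_encoding("dagger \u2020".encode("utf-8")))
print(guess_encoding("dagger \u2020".encode("cp1252")))
```

A result of "8-bit" only says the data is not valid UTF-8; choosing between code pages such as Windows-1252 or MacRoman would need further guessing.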