Distinguishing between ASCII and UTF8

Bob Sneidar bobs at twft.com
Thu Oct 7 12:59:25 EDT 2010


Okay, so that raises the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?

Just being practical. 
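
For what it's worth, the results are only the same for the 7-bit range (0-127). Here is a minimal Python sketch of where the two diverge, using the dagger from Richard's message as the test character. The choice of cp1252 and Mac Roman as the "plain text" encodings is an assumption for illustration; the thread does not say which legacy encoding his files use.

```python
# Pure 7-bit text: encoding as ASCII or as UTF-8 yields identical bytes.
plain = "Hello, world"
same_for_ascii = plain.encode("ascii") == plain.encode("utf-8")

# The dagger (U+2020) is outside ASCII, so the encodings disagree.
dagger = "\u2020"
utf8_bytes = dagger.encode("utf-8")          # three bytes: E2 80 A0
cp1252_bytes = dagger.encode("cp1252")       # one byte: 86 (Windows, assumed)
macroman_bytes = dagger.encode("mac_roman")  # one byte: A0 (classic Mac, assumed)

# Strict ASCII cannot represent the dagger at all.
try:
    dagger.encode("ascii")
    ascii_can_encode = True
except UnicodeEncodeError:
    ascii_can_encode = False

print(same_for_ascii, utf8_bytes.hex(), cp1252_bytes.hex(), macroman_bytes.hex())
```

So converting matters only when the text contains bytes above 127, which is exactly the "high-ASCII" case Richard describes.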

Bob


On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:

> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
> <ambassador at fourthworld.com> wrote:
> 
>> I have an app that needs to auto-detect Unicode and plain text, and render
>> them correctly based on that auto-detection.
>> 
>> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
>> no BOM to let me know if it's Unicode, and some plain text files will
>> occasionally have high-ASCII values in them (like the dagger symbol).
>> 
>> What patterns should I be looking for in the binary data of a file to
>> distinguish UTF8 from plain text?
>> 
>> 
> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
> is that it's indistinguishable from ASCII for bytes 0-127. You may be able to
> scan the files and, if they are large enough, try to deduce something from
> them to know which they are. For example:
> 
> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
> text file.
> 
> In plain ASCII text there will never be a NULL (byte 0) anywhere, whereas
> there are likely many 0-byte values in any appreciably large UTF16 file.
> (UTF8 text contains no 0 bytes either, so this test only separates out
> UTF16.) The same goes for byte 8 (backspace), byte 7 (the bell), and probably
> a few other control codes.
> 
> If the proportion of bytes that have the high bit (0x80) set is extremely
> low (well under 1%), then most likely it's ASCII.
> 
> HTH,
> 
> Jeff M.
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution
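
One further heuristic, not proposed in the thread above but commonly used: UTF8 multi-byte sequences follow a strict lead-byte/continuation-byte pattern, so stray high bytes from a legacy encoding almost never form valid UTF8. A rough Python sketch follows; the function name classify_text and its return labels are made up for illustration, and a short sample of legacy text could in principle pass as valid UTF8 by coincidence.

```python
def classify_text(data: bytes) -> str:
    """Guess the encoding family of raw file bytes (hypothetical helper).

    Returns one of 'utf-16', 'utf-8', 'ascii', or 'legacy'.
    """
    # UTF-16 files normally start with a byte order mark.
    if data.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"
    # Some editors write a UTF-8 signature, though it is optional.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    # No byte with the high bit set: plain 7-bit ASCII.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # High bytes present: accept as UTF-8 only if every sequence is well formed.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "legacy"  # e.g. cp1252 or Mac Roman "high-ASCII" text
    return "utf-8"


if __name__ == "__main__":
    print(classify_text(b"plain text\r\n"))
    print(classify_text("dagger \u2020".encode("utf-8")))
    print(classify_text(b"dagger \x86"))
```

Note that this assumes the whole file is passed in; if only the first chunk is read, a multi-byte sequence cut off at the end of the chunk would be misreported as legacy text.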



