Distinguishing between ASCII and UTF8

Richmond Mathewson richmondmathewson at gmail.com
Thu Oct 7 13:14:00 EDT 2010


  On 10/7/10 7:59 PM, Bob Sneidar wrote:
> Okay, so that begs the question: if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
>
> Just being practical.

Some of us grew up in Britain in the 60s and 70s (Oh, how depressing)
and remember the feeling of moving from short trousers to long trousers;
as far as I understand, ASCII and UTF8 are somehow the same without the
place being trashed by the . . . . . (whoops, no politics) . . . those of
you who want to understand my reference should watch "Carry On At Your
Convenience"; a light, easily digestible introduction to the politics of
the early 70s.

> Bob
>
>
> On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
>
>> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
>> <ambassador at fourthworld.com> wrote:
>>
>>> I have an app that needs to auto-detect Unicode and plain text, and render
>>> them correctly based on that auto-detection.
>>>
>>> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is
>>> no BOM to let me know if it's Unicode, and some plain text files will
>>> occasionally have high-ASCII values in them (like the dagger symbol).
>>>
>>> What patterns should I be looking for in the binary data of a file to
>>> distinguish UTF8 from plain text?
>>>
>>>
>> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
>> is that it's indistinguishable from ASCII for byte values 0-127. You may be
>> able to scan the files and, if they are large enough, try to deduce
>> something from them to know which they are. For example:
>>
>> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
>> text file.
>>
>> In ASCII text there will never be a NUL byte (byte 0) anywhere. There are
>> likely many 0-byte values in any appreciably large Unicode (UTF16) file.
>> Plain ASCII text is also unlikely to contain byte 8 (backspace), byte 7
>> (the bell) and probably a few others.
>>
>> If the proportion of bytes that have the high bit (0x80) set is extremely
>> low (<< 1%) then it's most likely ASCII.
>>
>> HTH,
>>
>> Jeff M.
>> _______________________________________________
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-revolution

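The byte-level checks discussed in the quoted messages can be sketched roughly as follows. This is only an illustration in Python, not LiveCode, and the function names (high_bit_fraction, classify) are made up for the example. It goes one step beyond counting high-bit bytes: it assumes that strict UTF8 decoding is a usable test, because a lone high byte such as a MacRoman or Windows-1252 dagger is not a well-formed UTF8 sequence. Legacy 8-bit text can still pass by coincidence, and a file cut off in the middle of a multi-byte character will fail, so treat the result as a guess.

```python
def high_bit_fraction(data: bytes) -> float:
    """Return the fraction of bytes that have the high bit (0x80) set."""
    if not data:
        return 0.0
    return sum(1 for b in data if b & 0x80) / len(data)


def classify(data: bytes) -> str:
    """Rough guess at the encoding: 'ascii', 'utf-8', or 'other-8-bit'.

    Hypothetical helper for illustration; the labels are arbitrary.
    """
    if high_bit_fraction(data) == 0.0:
        # Every byte is in 0-127, so the data is ASCII (and also valid UTF8).
        return "ascii"
    try:
        # Strict decoding raises on any malformed multi-byte sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # High bytes are present but do not form valid UTF8,
        # e.g. plain text with an 8-bit dagger symbol.
        return "other-8-bit"
    return "utf-8"


if __name__ == "__main__":
    print(classify(b"plain text\r\n"))
    print(classify("dagger \u2020".encode("utf-8")))
    print(classify(b"dagger \xa0"))  # 0xA0 is the dagger in MacRoman
```

The high_bit_fraction value on its own corresponds to the "high bit set" heuristic above; the decode step is what separates UTF8 from other 8-bit text once any high bytes are present.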