Distinguishing between ASCII and UTF8
richmondmathewson at gmail.com
Thu Oct 7 13:14:00 EDT 2010
On 10/7/10 7:59 PM, Bob Sneidar wrote:
> Okay, so that begs the question, if there is no difference between UTF8 and ASCII, why make the distinction? I mean, what would be the point of converting from ASCII to UTF8 or vice versa if the results were always the same?
> Just being practical.
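For what it's worth, the two encodings only agree on code points 0-127. Above that, ASCII has nothing to offer and UTF-8 spends two to four bytes per character, so converting is a no-op for pure ASCII text and a real change for anything else. A minimal sketch of that, in Python rather than LiveCode and with made-up sample strings, purely for illustration:

```python
# Pure 7-bit text: the ASCII and UTF-8 encodings are byte-for-byte identical.
plain = "Just being practical."
same_bytes = plain.encode("ascii") == plain.encode("utf-8")

# A character above 127 (the dagger, U+2020) cannot be encoded as ASCII at all,
# while UTF-8 represents it as a multi-byte sequence.
dagger = "\u2020"
utf8_bytes = dagger.encode("utf-8")

try:
    dagger.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes, utf8_bytes.hex(), ascii_ok)
```

So the distinction only starts to matter once a file contains something outside the 0-127 range.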
Some of us grew up in Britain in the 60s and 70s (Oh, how depressing)
and remember the feeling of moving from short trousers to long trousers;
as far as I understand, ASCII and UTF8 are somehow the same without the
place being trashed by the . . . . . (whoops, no politics) . . . those of
you who want to understand my reference should watch "Carry On At Your
Convenience"; a light, easily digestible introduction to the politics of
the early 70s.
> On Oct 6, 2010, at 1:29 PM, Jeff Massung wrote:
>> On Wed, Oct 6, 2010 at 3:23 PM, Richard Gaskin
>> <ambassador at fourthworld.com> wrote:
>>> I have an app that needs to auto-detect Unicode and plain text, and render
>>> them correctly based on that auto-detection.
>>> I have the UTF16 stuff working, but with UTF8 I have a problem: there is
>>> no BOM to let me know if it's Unicode, and some plain text files will
>>> occasionally have high-ASCII values in them (like the dagger symbol).
>>> What patterns should I be looking for in the binary data of a file to
>>> distinguish UTF8 from plain text?
>> Sorry, Richard, but I believe you are out of luck here. The idea behind UTF8
>> is that it's indistinguishable from ASCII (0-127). You may be able to scan
>> the files, and if they are large enough, try and deduce something from them
>> to know which they are. For example:
>> On Windows, "\r\n" (13, 10) should terminate lines. Could very well be a
>> text file.
>> In ASCII there will never be a NULL terminator anywhere (byte 0). There are
>> likely many 0-byte values in any appreciably large Unicode file. This would
>> also be true of byte 8 (backspace) and byte 7 (the bell) and probably a few
>> others.
>> If the number of bytes that have the high bit (0x80) set is extremely low
>> (<<< 1%) then most likely it's ASCII.
>> Jeff M.
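One pattern that can be checked directly: UTF-8 multi-byte sequences have a rigid shape (a lead byte of the form 110xxxxx, 1110xxxx or 11110xxx followed by the right number of 10xxxxxx continuation bytes), and stray "high-ASCII" bytes from a legacy encoding rarely happen to form valid sequences. A strict UTF-8 decode is therefore a reasonable test, though still a heuristic, since a short legacy-encoded file could be valid UTF-8 by coincidence. Note also that zero bytes are typical of UTF-16 rather than UTF-8. The sketch below is in Python rather than LiveCode, and the function names classify_bytes and high_bit_fraction are hypothetical, made up for this illustration:

```python
def classify_bytes(data: bytes) -> str:
    """Guess 'ascii', 'utf-8', or 'other' for the raw bytes of a file."""
    # No byte has the high bit set: plain 7-bit ASCII (which is also valid UTF-8).
    if all(b < 0x80 for b in data):
        return "ascii"
    # A strict decode rejects stray high bytes, truncated sequences,
    # overlong forms, and encoded surrogates.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"  # e.g. MacRoman or Windows-1252 "high ASCII"
    return "utf-8"


def high_bit_fraction(data: bytes) -> float:
    """Fraction of bytes with the high bit (0x80) set, as in the heuristic above."""
    if not data:
        return 0.0
    return sum(1 for b in data if b & 0x80) / len(data)


samples = {
    "plain": b"plain text\r\n",
    "utf8_dagger": "dagger \u2020".encode("utf-8"),
    "legacy_dagger": b"dagger \xa0",  # a lone 0xA0 byte, the dagger in MacRoman
}
for name, raw in samples.items():
    print(name, classify_bytes(raw), round(high_bit_fraction(raw), 3))
```

For files that come back as "other", the high-bit fraction gives a rough idea of how much of the content is outside 7-bit ASCII; picking the right legacy encoding for those bytes still needs a guess or a user setting.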
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription preferences: