Distinguishing between ASCII and UTF8

Dave Cragg dave.cragg at lacscentre.co.uk
Wed Oct 6 18:31:29 EDT 2010


Below is a function translated from a PHP script. It is intended to determine whether the passed-in string "could be" UTF8. I have tested it in a limited way and it seems to work, but maybe someone else can spot the flaws.

If it returns false, the string is not UTF8. If it returns true, the string fits the pattern of UTF8, but it could still be something else, such as random binary data that happens to fit.

If it doesn't work, you could perhaps use it to scare children.

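function couldBeUtf8 pString
   -- Build a regex that accepts only well-formed UTF8 byte sequences
   put "(?is)^([\x09\x0A\x0D\x20-\x7E]" into tRE -- ASCII
   put "|[\xC2-\xDF][\x80-\xBF]" after tRE -- non-overlong 2-byte
   put "|\xE0[\xA0-\xBF][\x80-\xBF]" after tRE -- 3-byte, excluding overlongs
   put "|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" after tRE -- straight 3-byte
   put "|\xED[\x80-\x9F][\x80-\xBF]" after tRE -- 3-byte, excluding surrogates
   put "|\xF0[\x90-\xBF][\x80-\xBF]{2}" after tRE -- planes 1-3
   put "|[\xF1-\xF3][\x80-\xBF]{3}" after tRE -- planes 4-15
   put "|\xF4[\x80-\x8F][\x80-\xBF]{2})*$" after tRE -- plane 16
   return matchText(pString, tRE)
end couldBeUtf8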
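
For anyone who wants to cross-check the pattern outside LiveCode, here is a rough Python sketch of the same byte ranges. It is not part of the original script: the names UTF8_PATTERN, could_be_utf8 and classify are made up for illustration, and the three-way classify helper is only one possible way to separate plain ASCII from "maybe UTF8". Note that the ASCII branch allows only tab, LF, CR and the printable range, so data containing other control bytes (NUL, for example) is rejected.

```python
import re

# Same alternatives as the tRE pattern, written as a bytes regex.
# \A and \Z anchor the match to the whole input.
UTF8_PATTERN = re.compile(
    rb"\A(?:"
    rb"[\x09\x0A\x0D\x20-\x7E]"             # ASCII (tab, LF, CR, printable)
    rb"|[\xC2-\xDF][\x80-\xBF]"             # non-overlong 2-byte
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3-byte, excluding overlongs
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # straight 3-byte
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3-byte, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # planes 1-3
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # planes 4-15
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2}"      # plane 16
    rb")*\Z"
)


def could_be_utf8(data: bytes) -> bool:
    """Return True if every byte of data fits the UTF8 pattern."""
    return UTF8_PATTERN.match(data) is not None


def classify(data: bytes) -> str:
    """Hypothetical helper: split the result into three cases."""
    if not could_be_utf8(data):
        return "not utf8"
    if all(b < 0x80 for b in data):
        return "ascii"       # pure ASCII also fits the UTF8 pattern
    return "maybe utf8"      # fits the pattern, but could be a coincidence


if __name__ == "__main__":
    print(classify(b"plain text"))
    print(classify("dagger \u2020".encode("utf-8")))
    print(classify(b"dagger \x86"))  # lone high byte, as in a cp1252 file
```

A lone high byte, like a dagger saved in a single-byte encoding, fails the pattern because it is not followed by the continuation bytes UTF8 requires. Pure ASCII always passes, so a separate check for bytes above 0x7F is needed to tell "ASCII" apart from "maybe UTF8".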


On 6 Oct 2010, at 21:23, Richard Gaskin wrote:

> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
