Distinguishing between ASCII and UTF8
Dave Cragg
dave.cragg at lacscentre.co.uk
Wed Oct 6 18:31:29 EDT 2010
Richard
Below is a function translated from a PHP script. It is intended to determine whether the passed-in string "could be" UTF8. I have tested it only in a limited way, and it seems to work, but maybe someone else can see the flaws.
If it returns false, the string is not UTF8. If it returns true, the string fits the pattern of UTF8 (plain ASCII text also passes), but it could still be something else, such as random binary data that happens to match.
If it doesn't work, you could perhaps use it to scare children.
function couldBeUtf8 pString
   put "(?is)^([\x09\x0A\x0D\x20-\x7E]" into tRE -- ASCII: tab, LF, CR, printable
   put "|[\xC2-\xDF][\x80-\xBF]" after tRE -- non-overlong 2-byte
   put "|\xE0[\xA0-\xBF][\x80-\xBF]" after tRE -- 3-byte, excluding overlongs
   put "|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}" after tRE -- straight 3-byte
   put "|\xED[\x80-\x9F][\x80-\xBF]" after tRE -- 3-byte, excluding surrogates
   put "|\xF0[\x90-\xBF][\x80-\xBF]{2}" after tRE -- 4-byte, planes 1-3
   put "|[\xF1-\xF3][\x80-\xBF]{3}" after tRE -- 4-byte, planes 4-15
   put "|\xF4[\x80-\x8F][\x80-\xBF]{2})*$" after tRE -- 4-byte, plane 16
   return matchText(pString, tRE)
end couldBeUtf8
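
A minimal sketch of how the function might be used for the auto-detection case, untested and under stated assumptions: the handler name showTextFile and the field name "Display" are made up for illustration, and the file is read as binary so the bytes reach couldBeUtf8 unchanged. It uses uniEncode with "UTF8" to convert to UTF16 for the unicodeText property; newer engine versions may offer other ways to do the conversion.

   on showTextFile pPath
      -- read raw bytes; "binfile:" avoids line-ending translation
      put URL ("binfile:" & pPath) into tData
      if the result is not empty then
         answer "Could not read file:" && the result
         exit showTextFile
      end if
      if couldBeUtf8(tData) then
         -- fits the UTF8 pattern (plain ASCII lands here too, which is harmless)
         set the unicodeText of field "Display" to uniEncode(tData, "UTF8")
      else
         -- not UTF8: treat as plain text in the platform's native encoding
         set the text of field "Display" to tData
      end if
   end showTextFile

A lone high-ASCII byte, such as a dagger in a single-byte encoding, is not a valid UTF8 sequence on its own, so such files should take the else branch.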
Cheers
Dave
On 6 Oct 2010, at 21:23, Richard Gaskin wrote:
> I have an app that needs to auto-detect Unicode and plain text, and render them correctly based on that auto-detection.
>
> I have the UTF16 stuff working, but with UTF8 I have a problem: there is no BOM to let me know if it's Unicode, and some plain text files will occasionally have high-ASCII values in them (like the dagger symbol).
>
> What patterns should I be looking for in the binary data of a file to distinguish UTF8 from plain text?
>
> --
> Richard Gaskin
> Fourth World
> LiveCode training and consulting: http://www.fourthworld.com
> Webzine for LiveCode developers: http://www.LiveCodeJournal.com
> LiveCode Journal blog: http://LiveCodejournal.com/blog.irv
> _______________________________________________
> use-revolution mailing list
> use-revolution at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-revolution