How to determine if a text file is UTF8 ?
Ben Rubinstein
benr_mc at cogapp.com
Wed Oct 30 06:17:01 EDT 2024
Thanks for the handy tip re validating UTF8, which I didn't know.
Can I use this opportunity to make the plea once again to support the basic
encodings on any platform, rather than relying on the hated "native", i.e.
https://quality.livecode.com/show_bug.cgi?id=12205
and bearing in mind the comments of one M*rk W*dd*ingh*m in 2014:
> In any case, I can't argue with the suggestion that the language parameter should enable at least the most common charsets to be leveraged and converted to/from Unicode.
...
> At least for release we are aiming to have the list above working on all platforms. (UTF-8, UTF-16 (BE and LE), UTF-32 (BE and LE), MacRoman, ISO8859-1, Windows-1252, ASCII)
(comments on https://quality.livecode.com/show_bug.cgi?id=3674)
Thanks for listening!
Ben
On 29/10/2024 16:23, Mark Waddingham via use-livecode wrote:
> On 2024-10-29 08:53, jbv via use-livecode wrote:
>> Hi list,
>>
>> How to determine if a text file is UTF8 or just plain ASCII ?
>> In other words, how to know if one should use
>> open file myfile.txt for UTF8 read
>> or
>> open file myfile.txt for read
>
> If it is really plain ASCII then it doesn't matter - UTF8 is a strict superset
> of ASCII.
>
> All ASCII chars are identical - they are codes 0-127 so only use 7-bits (as
> ASCII does).
>
> Any non-ASCII UTF-8 char will start with a byte which has the high bit set, so
> will be in the range 128-255 - non-ASCII UTF-8 chars are always encoded as at
> least two bytes, and all of those bytes have the top bit set.
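That byte-level property is easy to see in any language with a UTF-8 encoder; a quick Python sketch (Python used here purely for illustration, not LiveCode):

```python
# Illustration: every byte of a non-ASCII UTF-8 sequence has its high bit
# set (is >= 0x80), while pure ASCII text stays entirely below 0x80.
encoded = "é".encode("utf-8")   # a two-byte UTF-8 sequence: 0xC3 0xA9
print(len(encoded))                      # 2
print(all(b >= 0x80 for b in encoded))   # True

ascii_bytes = "plain ascii".encode("utf-8")
print(all(b <= 0x7F for b in ascii_bytes))  # True - identical to ASCII
```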
>
> If by 'ASCII' you mean 'some native encoding' like MacRoman or Latin-1
> (Windows-1252), then things are a bit more tricky. Unless the text file has a
> byte-order mark (BOM) at the front (which these days is becoming much less
> common) you can only really tell by guessing.
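When a BOM *is* present, sniffing it is straightforward; a minimal Python sketch (the helper name `bom_encoding` is mine, not anything from the thread):

```python
import codecs

def bom_encoding(raw: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    # Order matters: the UTF-32 BOMs begin with the same bytes as the
    # UTF-16 ones, so they must be checked first.
    for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            return name
    return None
```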
>
> The simplest guess is to see if it 'roundtrips' as UTF-8: if it does, then
> it is almost certainly UTF-8; if it does not, then it is either another
> Unicode encoding (e.g. UTF-16, which is often found on Windows), or some other
> encoding (typically MacRoman on mac and Latin-1 on Windows - but only
> typically, as there are hundreds of region-specific encodings, so generally it
> depends on where the file came from / the locale of the computer it was
> created on - obviously with Unicode this is not really an issue for new stuff,
> it's more a legacy concern).
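The roundtrip guess translates directly to any language with a strict UTF-8 decoder; a Python sketch (the helper name is hypothetical):

```python
def guess_is_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")     # strict decode; raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False

print(guess_is_utf8("café".encode("utf-8")))    # True
print(guess_is_utf8("café".encode("cp1252")))   # False - lone 0xE9 byte
```

Note that plain ASCII passes this test too, which is fine: reading ASCII as UTF-8 is harmless, as explained above.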
>
> So if you are faced with text files that may be either the 'platform native'
> encoding (as LiveCode sees it) or UTF-8 without a BOM:
>
> local tBinText, tText
> put url ("binfile:myfile.txt") into tBinText
> put textDecode(tBinText, "utf-8") into tText
> if textEncode(tText, "utf-8") is not tBinText then
>    -- If tText does not encode back to utf-8 identically, then there are
>    -- invalid utf-8 byte sequences in it, which means it is either a
>    -- corrupted utf-8 file (unlikely) or not utf-8
>    put textDecode(tBinText, "native") into tText
> else
>    -- If the first char is the unicode 'zero width no-break space' then that
>    -- was a BOM, which we don't want (the logic here is that that char makes
>    -- no sense at the start of a file, so it is reserved in that specific
>    -- case to be used as a marker for unicode encoding)
>    if codeunit 1 of tText is numToCodepoint(0xFEFF) then
>       delete codeunit 1 of tText
>    end if
> end if
>
> -- Perform the general EOL conversion the engine would do reading text
> replace crlf with return in tText
> replace numToCodepoint(13) with return in tText
>
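For comparison, the same decision logic as a Python sketch (mine, not from the thread; `cp1252` here stands in for whatever LiveCode treats as the platform-native encoding, so that encoding name is an assumption - on mac it would be MacRoman):

```python
def read_text_guessing(raw: bytes, native: str = "cp1252") -> str:
    """Try UTF-8 first, fall back to a 'native' encoding, strip a leading
    BOM, and normalise line endings - mirroring the LiveCode snippet."""
    try:
        text = raw.decode("utf-8")          # the 'roundtrip' test
        if text.startswith("\ufeff"):       # U+FEFF at the start was a BOM
            text = text[1:]
    except UnicodeDecodeError:
        text = raw.decode(native)           # not valid UTF-8: assume native
    # general EOL conversion, as the engine would do reading text
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For example, `read_text_guessing(b"caf\xe9")` falls back to the native decode and yields `"café"`, while a UTF-8 file with a BOM and CRLF line endings comes back clean.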
> I'd estimate this is probably 99% reliable - it is quite unlikely for a native
> encoded file to *also* be valid UTF-8, as for that to be the case you'd
> need some very strange sequences of non-ASCII characters (which tend to
> be surrounded by ASCII - e.g. accented chars, math symbols, indices, quote
> variants).
>
> Warmest Regards,
>
> Mark.
>