How to determine if a text file is UTF8 ?

Tue Oct 29 12:23:58 EDT 2024

On 2024-10-29 08:53, jbv via use-livecode wrote:
> Hi list,
> 
> How to determine if a text file is UTF8 or just plain ASCII ?
> In other words, how to know if one should use
>   open file myfile.txt for UTF8 read
> or
>   open file myfile.txt for read

If it is really plain ASCII then it doesn't matter - UTF8 is a strict 
superset of ASCII.

All ASCII chars are identical - they are codes 0-127 so only use 7 bits 
(as ASCII does).

Any non-ASCII char encoded as UTF-8 will start with a byte which has the 
high bit set, so will be in the range 128-255 - such chars are always 
encoded as at least two bytes, and all those bytes have the top bit set.
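
To make the byte ranges concrete, here is a minimal sketch in Python 
(not LiveCode, purely to show the bytes). The function name 
is_plain_ascii is just an illustrative choice, not an existing API:

```python
def is_plain_ascii(data: bytes) -> bool:
    """True when every byte is in 0-127, i.e. the top bit is never set."""
    return all(b < 0x80 for b in data)


ascii_bytes = "plain text".encode("utf-8")
accented_bytes = "caf\u00e9".encode("utf-8")  # "cafe" with e-acute

# The e-acute (U+00E9) is encoded as two bytes, both with the top bit set.
high_bytes = [hex(b) for b in accented_bytes if b >= 0x80]
```

For pure ASCII text the UTF-8 bytes and the ASCII bytes are the same, 
which is why the choice of mode does not matter in that case.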

If by 'ASCII' you mean 'some native encoding' like MacRoman or Latin-1 
(Windows 1252), then things are a bit more tricky. Unless the text file 
has a byte-order-mark (BOM) at the front (which these days is becoming 
much less common) you can only really tell by guessing.
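
When a BOM is present, checking for it is cheap. A rough sketch in 
Python (again only as an illustration; sniff_bom is a hypothetical 
helper name, and the codec names returned are Python's, not LiveCode's):

```python
import codecs


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    # The UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE
    # BOM, so the UTF-32 entries have to be checked before UTF-16.
    boms = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None
```

A result of None only means 'no BOM found', so the guessing described 
in the rest of this mail is still needed in that case.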

The simplest guess is to see if it 'roundtrips' as utf-8. If it does, 
then it is almost certainly utf-8; if it does not, then it is either 
another unicode encoding (e.g. UTF-16 - which is often found on 
Windows), or some other encoding. Typically on mac this will be 
MacRoman, and on Windows this will be Latin-1, but that's only 
typically: there are hundreds of region-specific encodings, so generally 
it depends on where the file came from / the locale of the computer it 
was created on. Obviously with unicode this is not really an issue for 
new stuff, it's more of an issue for legacy stuff.
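
The roundtrip idea itself is language independent. As a minimal sketch 
in Python (roundtrips_as_utf8 is a made-up name, and Latin-1 stands in 
here for whatever the 'other' encoding happens to be):

```python
def roundtrips_as_utf8(data: bytes) -> bool:
    """Decode as UTF-8, re-encode, and compare with the original bytes."""
    # Invalid sequences decode to U+FFFD (the replacement character),
    # which re-encodes to different bytes than the ones that were read.
    text = data.decode("utf-8", errors="replace")
    return text.encode("utf-8") == data


utf8_bytes = "na\u00efve".encode("utf-8")      # i-diaeresis as 0xC3 0xAF
latin1_bytes = "na\u00efve".encode("latin-1")  # i-diaeresis as single 0xEF
```

Here utf8_bytes roundtrips and latin1_bytes does not, because the lone 
0xEF byte is followed by an ASCII byte rather than a continuation byte.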

So if you are faced with text files which may be either the 'platform 
native' encoding (as LiveCode sees it) or utf-8 without a BOM, you can do:

   local tBinText, tText
   put url ("binfile:myfile.txt") into tBinText
   put textDecode(tBinText, "utf-8") into tText
   if textEncode(tText, "utf-8") is not tBinText then
      -- If tText does not encode back to utf-8 identically, then
      -- there are invalid utf-8 byte sequences in it, which means
      -- it is either a corrupted utf-8 file (unlikely) or not utf-8
      put textDecode(tBinText, "native") into tText
   else
      -- If the first char is the unicode 'zero width no-break space'
      -- then that was a BOM which we don't want (that char makes no
      -- sense at the start of a file, so it is reserved in that
      -- specific case as a marker for unicode encoding)
      if codeunit 1 of tText is numToCodepoint(0xFEFF) then
         delete codeunit 1 of tText
      end if
   end if

   -- Perform the general EOL conversion the engine would do reading text
   replace crlf with return in tText
   replace numToCodepoint(13) with return in tText

I'd estimate this is probably 99% reliable - for a native encoded file 
to *also* be valid UTF-8 is quite unlikely, as you'd need some very 
strange sequences of non-ascii characters (which tend to always be 
surrounded by ASCII - e.g. accented chars, math symbols, indices, quote 
variants).
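
To show what 'very strange sequences' means in practice, here is a 
small Python illustration. It assumes Latin-1 as the native encoding, 
and is_valid_utf8 is a hypothetical helper name:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True when the bytes decode as strict UTF-8 without errors."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Typical native text: accented letters sitting between ASCII letters.
typical = "r\u00e9sum\u00e9".encode("latin-1")

# Contrived native text: A-tilde directly followed by the copyright
# sign gives the bytes 0xC3 0xA9, which happen to be valid UTF-8.
contrived = "\u00c3\u00a9".encode("latin-1")
```

The typical case fails the check because each high byte is followed by 
an ASCII byte or by the end of the data; only an odd pairing like the 
contrived one slips through, and then decodes as a single e-acute.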

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Build Amazing Things