How to determine if a text file is UTF8 ?
Ben Rubinstein
benr_mc at cogapp.com
Wed Oct 30 06:17:01 EDT 2024
Thanks for the handy tip re validating UTF8, which I didn't know.
Can I use this opportunity to make the plea once again to support the basic
encodings on any platform, rather than relying on the hated "native", i.e.
https://quality.livecode.com/show_bug.cgi?id=12205
and bearing in mind the comments of one M*rk W*dd*ingh*m in 2014:
> In any case, I can't argue with the suggestion that the language parameter should enable at least the most common charsets to be leveraged and converted to/from Unicode.
...
> At least for release we are aiming to have the list above working on all platforms. (UTF-8, UTF-16 (BE and LE), UTF-32 (BE and LE), MacRoman, ISO8859-1, Windows-1252, ASCII)
(comments on https://quality.livecode.com/show_bug.cgi?id=3674)
Thanks for listening!
Ben
On 29/10/2024 16:23, Mark Waddingham via use-livecode wrote:
> On 2024-10-29 08:53, jbv via use-livecode wrote:
>> Hi list,
>>
>> How to determine if a text file is UTF8 or just plain ASCII ?
>> In other words, how to know if one should use
>> open file myfile.txt for UTF8 read
>> or
>> open file myfile.txt for read
>
> If it is really plain ASCII then it doesn't matter - UTF8 is a strict superset
> of ASCII.
>
> All ASCII chars are identical - they are codes 0-127 so only use 7-bits (as
> ASCII does).
>
> Any non-ASCII UTF-8 char will start with a byte which has the high bit set, so
> will be in the range 128-255 - non-ASCII UTF-8 chars are always encoded as at
> least two bytes, and all of those bytes have the top bit set.
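That byte-level property is easy to see in any language with a UTF-8 encoder; a quick Python sketch (Python used here purely for illustration, not LiveCode):

```python
# Illustration: every byte of a non-ASCII UTF-8 sequence has its high bit
# set (is >= 0x80), while pure ASCII text stays entirely below 0x80.
encoded = "é".encode("utf-8")   # a two-byte UTF-8 sequence: 0xC3 0xA9
print(len(encoded))                      # 2
print(all(b >= 0x80 for b in encoded))   # True

ascii_bytes = "plain ascii".encode("utf-8")
print(all(b <= 0x7F for b in ascii_bytes))  # True - identical to ASCII
```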
>
> If by 'ASCII' you mean 'some native encoding' like MacRoman or Latin-1
> (Windows-1252), then things are a bit more tricky. Unless the text file has a
> byte-order mark (BOM) at the front (which these days is becoming much less
> common) you can only really tell by guessing.
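When a BOM *is* present, sniffing it is straightforward; a minimal Python sketch (the helper name `bom_encoding` is mine, not anything from the thread):

```python
import codecs

def bom_encoding(raw: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    # Order matters: the UTF-32 BOMs begin with the same bytes as the
    # UTF-16 ones, so they must be checked first.
    for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            return name
    return None
```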
>
> The simplest guess is to see if it 'roundtrips' as UTF-8: if it does, then
> it is almost certainly UTF-8; if it does not, then it is either another
> Unicode encoding (e.g. UTF-16, which is often found on Windows), or some other
> encoding (typically MacRoman on mac and Latin-1 on Windows - but only
> typically, as there are hundreds of region-specific encodings, so generally it
> depends on where the file came from / the locale of the computer it was
> created on - obviously with Unicode this is not really an issue for new stuff,
> it's more a legacy concern).
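The roundtrip guess translates directly to any language with a strict UTF-8 decoder; a Python sketch (the helper name is hypothetical):

```python
def guess_is_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")     # strict decode; raises on invalid sequences
        return True
    except UnicodeDecodeError:
        return False

print(guess_is_utf8("café".encode("utf-8")))    # True
print(guess_is_utf8("café".encode("cp1252")))   # False - lone 0xE9 byte
```

Note that plain ASCII passes this test too, which is fine: reading ASCII as UTF-8 is harmless, as explained above.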
>
> So if you are faced with text files that may be either the 'platform native'
> encoding (as LiveCode sees it) or UTF-8 without a BOM:
>
> local tBinText, tText
> put url ("binfile:myfile.txt") into tBinText
> put textDecode(tBinText, "utf-8") into tText
> if textEncode(tText, "utf-8") is not tBinText then
>    -- If tText does not encode back to utf-8 identically, then there are
>    -- invalid utf-8 byte sequences in it, which means it is either a
>    -- corrupted utf-8 file (unlikely) or not utf-8
>    put textDecode(tBinText, "native") into tText
> else
>    -- If the first char is the unicode 'zero width no-break space' then that
>    -- was a BOM, which we don't want (the logic here is that that char makes
>    -- no sense at the start of a file, so it is reserved in that specific
>    -- case to be used as a marker for unicode encoding)
>    if codeunit 1 of tText is numToCodepoint(0xFEFF) then
>       delete codeunit 1 of tText
>    end if
> end if
>
> -- Perform the general EOL conversion the engine would do reading text
> replace crlf with return in tText
> replace numToCodepoint(13) with return in tText
>
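For comparison, the same decision logic as a Python sketch (mine, not from the thread; `cp1252` here stands in for whatever LiveCode treats as the platform-native encoding, so that encoding name is an assumption - on mac it would be MacRoman):

```python
def read_text_guessing(raw: bytes, native: str = "cp1252") -> str:
    """Try UTF-8 first, fall back to a 'native' encoding, strip a leading
    BOM, and normalise line endings - mirroring the LiveCode snippet."""
    try:
        text = raw.decode("utf-8")          # the 'roundtrip' test
        if text.startswith("\ufeff"):       # U+FEFF at the start was a BOM
            text = text[1:]
    except UnicodeDecodeError:
        text = raw.decode(native)           # not valid UTF-8: assume native
    # general EOL conversion, as the engine would do reading text
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For example, `read_text_guessing(b"caf\xe9")` falls back to the native decode and yields `"café"`, while a UTF-8 file with a BOM and CRLF line endings comes back clean.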
> I'd estimate this is probably 99% reliable - it is quite unlikely for a native
> encoded file to *also* be valid UTF-8, as for that to be the case you'd
> need some very strange sequences of non-ASCII characters (which tend to
> be surrounded by ASCII - e.g. accented chars, math symbols, indices, quote
> variants).
>
> Warmest Regards,
>
> Mark.
>