How to determine if a text file is UTF8 ?
Bob Sneidar
bobsneidar at iotecdigital.com
Tue Oct 29 11:17:12 EDT 2024
There is a Wikipedia article on this. It turns out it is not straightforward. A UTF-8 file can begin with a Byte Order Mark (BOM), but not all vendors write one. Without a BOM, I do not think you can make the determination with certainty simply by examining the contents of the file.
Byte-order mark
If the Unicode byte-order mark U+FEFF is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
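As a rough illustration of that three-byte check (written in Python rather than LiveCode, purely as a sketch), the following reads the start of a file in binary mode and compares it against 0xEF, 0xBB, 0xBF. The function names are hypothetical and not part of any library. A missing BOM proves nothing, since many UTF-8 files are written without one.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 BOM (0xEF, 0xBB, 0xBF)."""
    # codecs.BOM_UTF8 is the constant b'\xef\xbb\xbf' from the standard library.
    return data.startswith(codecs.BOM_UTF8)


def file_starts_with_utf8_bom(path: str) -> bool:
    """Hypothetical helper: check only the first three bytes of a file."""
    # Binary mode matters here: a text-mode read might decode or drop the BOM.
    with open(path, "rb") as handle:
        return starts_with_utf8_bom(handle.read(3))
```

Calling `file_starts_with_utf8_bom("myfile.txt")` would answer only the question "is there a BOM?", not "is this file UTF-8?".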
The Unicode Standard neither requires nor recommends the use of the BOM for UTF-8, but warns that it may be encountered at the start of a file transcoded from another encoding.[23] While ASCII text encoded using UTF-8 is backward compatible with ASCII, this is not true when Unicode Standard recommendations are ignored and a BOM is added. A BOM can confuse software that isn't prepared for it but can otherwise accept UTF-8, e.g. programming languages that permit non-ASCII bytes in string literals but not at the start of the file. Nevertheless, there was and still is software that always inserts a BOM when writing UTF-8, and refuses to correctly interpret UTF-8 unless the first character is a BOM (or the file only contains ASCII).[24]
https://en.wikipedia.org/wiki/UTF-8#
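When there is no BOM, the most one can do is make an educated guess from the bytes. Below is a minimal sketch of one such heuristic, again in Python for illustration only; the function name and the returned labels are assumptions made up for this example. It relies on two facts: pure ASCII bytes are also valid UTF-8 and decode to the same characters, and text in most legacy 8-bit encodings that contains accented characters usually fails strict UTF-8 decoding. A file that passes is only probably UTF-8, so this is not a proof.

```python
def guess_text_encoding(data: bytes) -> str:
    """Heuristic guess: 'utf-8-bom', 'ascii', 'utf-8', or 'unknown'."""
    # An explicit BOM is the strongest hint available.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-bom"
    # Only bytes below 0x80: plain ASCII, which is also valid UTF-8.
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    # Bytes >= 0x80 that form well-formed UTF-8 sequences: probably UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8; likely some other 8-bit encoding.
        return "unknown"
```

For a file that comes back as "ascii", the choice between a UTF-8 read and a plain read should not change the resulting text, so the decision only matters for the other three cases.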
Bob S
> On Oct 29, 2024, at 1:53 AM, jbv via use-livecode <use-livecode at lists.runrev.com> wrote:
>
> Hi list,
>
> How to determine if a text file is UTF8 or just plain ASCII ?
> In other words, how to know if one should use
> open file myfile.txt for UTF8 read
> or
> open file myfile.txt for read
>
> Thank you.
> jbv