Guessing the encoding of a text file...
mark at livecode.com
Fri Mar 20 13:22:03 EDT 2020
On 2020-03-20 15:34, Paul Dupuis via use-livecode wrote:
> Why did I ask this? Because I am interested in comparing the accuracy
> of our current handler to any other that may be available as, users
> being users, we recently had a user reveal a bug (a misnamed variable)
> in our current function that meant it was missing certain edge cases
> (and this user has hundreds of text files that need this edge case to
> be properly recognized as MacRoman encoding). So that bug has been
> fixed, but I am still interested in comparing any other guessEncoding
> routines to our current one to see if we can do better than we currently do.
Sounds like it uses a similar statistical (perhaps even ML) model to
detect charsets as Mozilla's 'UCD' (as mentioned by someone else in this
thread).
> As always, thanks for reading and responding, Mark. We're actually doing
> what you suggest. We had a set of QA test cases (text files in many
> different line endings and encodings), some intended to fail (such as
> Windows Code Pages we don't support). We're expanding these and doing
> a review on macOS and Windows with our app. Ones that fail, that we
> think shouldn't fail, we will step through the code to see why they
> fail and if our algorithm can be further enhanced. I can't foresee any
> algorithm tweaks we can't code ourselves that we'd need LC or USE-LIST
> assistance for.
My main reason for asking was to see whether it seemed a reasonable
assumption (to me, at least) that any algorithm could determine the
character encoding correctly. e.g. MacRoman and Windows-1252 are very,
very similar, and so telling the difference between them comes with a
reasonably high degree of error.
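To make that ambiguity concrete, here is a small sketch in Python rather than LiveCode Script (purely so it is self-contained via stdlib codecs): both encodings accept the same byte values, so decoding with the wrong one "succeeds" silently and simply yields the wrong characters, with no error to catch.

```python
# Why MacRoman vs Windows-1252 detection is hard: both are single-byte
# encodings in which (almost) every byte value is valid, so a wrongly
# chosen decoder does not fail - it silently produces the wrong text.

data = "café".encode("cp1252")          # b'caf\xe9'

as_cp1252 = data.decode("cp1252")       # 'café' - the intended text
as_macroman = data.decode("mac_roman")  # 'cafÈ' - decodes fine, but wrong

print(as_cp1252, as_macroman)
```

Neither decode raises, so validity alone cannot pick the right encoding; only the plausibility of the resulting characters can.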
> Back around LiveCode 7, Fraser said, in response to some
> correspondence I had with him, that he would consider creating a
> "guessEncoding" to go along with the Unicode Everywhere work and the
> new textEncode/textDecode functions. I do understand the reluctance,
> as a business, to do so, as inevitably there will be some instances
> where it guesses wrong.
I can't recall exactly - but I think Fraser was thinking along the lines
of being able to tell the difference between the utf-8, utf-16, utf-32
and native encodings. That can be done with a high degree of confidence,
and indeed is straightforward enough to code in LiveCode Script. (e.g.
You can be almost 100% sure something is utf-8 if it roundtrips through
a strict utf-8 decode without error.)
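That kind of check can be sketched as follows; this is an illustrative Python version (not LiveCode Script, and not LiveCode's actual implementation), using BOMs to identify the UTF-16/32 variants and a strict UTF-8 decode for UTF-8. Random "native" bytes almost never form valid multi-byte UTF-8 sequences, which is why a successful strict decode is such strong evidence.

```python
import codecs

def guess_unicode_encoding(data: bytes) -> str:
    # Check the longer UTF-32 BOMs first: BOM_UTF32_LE begins with the
    # same two bytes as BOM_UTF16_LE, so order matters here.
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
        (codecs.BOM_UTF8, "utf-8"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8", errors="strict")  # the "roundtrip" test
        return "utf-8"
    except UnicodeDecodeError:
        # Some single-byte native encoding; *which* one is the hard part.
        return "native"
```

Note that pure-ASCII input reports "utf-8", which is harmless since ASCII is a subset of UTF-8 and of the native encodings being discussed.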
As I'm sure you are acutely aware, the difficult problem is telling the
difference between very dense shift-sequence encodings (those which
don't have some redundancy in their encodings to help with validation),
and single-char encodings (e.g. between MacRoman and Latin-1). There is
no algorithm for that per se, just lots of heuristics (based on
statistical models) and potential dictionary lookup to help distinguish
edge cases. Implementing something such as that is no small endeavour...
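A deliberately crude Python sketch of that heuristic approach follows. Real detectors use trained byte-frequency models; the scoring function here is a toy stand-in that rewards letters and common punctuation and penalizes control characters, which is often enough to separate MacRoman from Latin-1 because each maps the other's accented letters into its control or symbol ranges.

```python
# Toy plausibility scoring: decode under each candidate single-byte
# encoding and prefer the result with more "text-like" characters.

def plausibility(data: bytes, encoding: str) -> int:
    score = 0
    for ch in data.decode(encoding, errors="replace"):
        if ch.isalpha() or ch.isspace() or ch in ".,;:'\"!?-()":
            score += 1           # letters, whitespace, common punctuation
        elif ch.isprintable():
            score += 0           # other printable symbols: neutral
        else:
            score -= 2           # control characters: strong penalty
    return score

def pick_encoding(data: bytes, candidates=("mac_roman", "latin-1")) -> str:
    return max(candidates, key=lambda enc: plausibility(data, enc))
```

For example, Latin-1 accented letters encode into bytes that decode to C1 control characters or spacing symbols under the other interpretation, so the correct candidate usually scores higher. Genuinely ambiguous inputs still tie, which is exactly where the dictionary-lookup heuristics mentioned above come in.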
> I am under the, perhaps false, impression that isoToMac and macToIso
> are sort of viewed as functions that may become deprecated and no
> longer updated in the future. However, they are still essential for us
> until I can textDecode(someData,"MacRoman") on a Windows system and
> vice versa.
They've not been deprecated yet so they aren't going anywhere - the
internal functions those wrap are actually used to charset-swap strings
in pre-v7 binary stackfiles (from v7, strings are serialized as utf-8 in
the stackfile).
We probably will deprecate them when we make textDecode/Encode accept
more encodings (as suggested in the enhancement request) - but only
because the latter is a much neater way to do things... I believe the
code you use at the moment gives identical results to what native
textDecode/textEncode support would give, doesn't it?
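For illustration, isoToMac/macToIso amount to decoding bytes under one charset and re-encoding the resulting text under the other, which is exactly the shape a textDecode-then-textEncode pipeline would take. A Python analogue using stdlib codecs (the helper names are mine, not LiveCode's):

```python
# Transcoding analogue of isoToMac / macToIso: decode under the source
# charset, then encode under the target charset.  Characters missing
# from the target are replaced rather than raising.

def iso_to_mac(data: bytes) -> bytes:
    return data.decode("latin-1").encode("mac_roman", errors="replace")

def mac_to_iso(data: bytes) -> bytes:
    return data.decode("mac_roman").encode("latin-1", errors="replace")
```

Characters present in both charsets roundtrip exactly (e.g. Latin-1 'é' at 0xE9 becomes MacRoman 0x8E and back), which is the "identical results" property asked about above.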
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps