Guessing the encoding of a test file...

Mark Waddingham mark at livecode.com
Fri Mar 20 13:22:03 EDT 2020


On 2020-03-20 15:34, Paul Dupuis via use-livecode wrote:
> Why did I ask this? Because I am interested in comparing the accuracy
> of our current handler to any other that may be available as, users
> being users, we recently had a user reveal a bug (misnamed variable)
> in our current function that meant it was missing certain edge cases
> (and this user has hundreds of text files that need this edge case to
> be properly recognized as MacRoman encoding). So that bug has been
> fixed, but I am still interested in comparing any other guessEncoding
> routines to our current one to see if we can do better than we
> currently are.

Perhaps:

https://pypi.org/project/chardet/

It sounds like it uses a statistical (perhaps even an ML) model to 
detect charsets, similar to Mozilla's 'UCD' (as mentioned by someone 
else in this thread).

> As always, thanks for reading and responding Mark. We're actually doing
> what you suggest. We had a set of QA test cases (text files in many
> different line endings and encodings), some intended to fail (such as
> Windows Code Pages we don't support). We're expanding these and doing
> a review on macOS and Windows with our app. Ones that fail, that we
> think shouldn't fail, we will step through the code to see why they
> fail and if our algorithm can be further enhanced. I can't foresee any
> algorithm tweaks we can't code ourselves that we'd need LC or USE-LIST
> assistance for.

My main reason for asking was to see whether it seemed a reasonable 
assumption (to me, at least) that any algorithm would be able to 
determine the char encoding correctly. For example, MacRoman and 
Windows-1252 are very, very similar, so telling the difference would 
come with a reasonably high degree of error.

> Back around LiveCode 7, Fraser said, in response to some
> correspondence I had with him, that he would consider creating a
> "guessEncoding" to go along with the Unicode Everywhere work and the
> new textEncode/textDecode functions. I do understand the reluctance,
> as a business, to do so, as inevitably there will be some instances
> where it guesses wrong.

I can't recall exactly - but I think Fraser was thinking along the lines 
of being able to tell the difference between the utf-8, utf-16, utf-32 
and native encodings. That can be done with a high degree of confidence, 
and indeed is straightforward enough to code in LiveCode Script (e.g. 
you can be almost 100% sure something is utf-8 if it roundtrips 
identically, i.e. decoding it as utf-8 and re-encoding the result gives 
back exactly the same bytes).
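
To illustrate the roundtrip idea, here is a minimal sketch in Python 
rather than LiveCode Script (only because it is easy to run anywhere). 
The function name guess_unicode_encoding and the "native" label are 
made up for this example, and it only recognises utf-16/utf-32 when a 
byte-order mark is present, so treat it as a starting point and not a 
complete detector.

```python
import codecs

# Byte-order marks to look for. UTF-32 must be checked before UTF-16,
# because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_unicode_encoding(data: bytes) -> str:
    """Return "utf-8", "utf-16", "utf-32" or "native" (hypothetical labels)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # Roundtrip test: invalid sequences are replaced with U+FFFD on decode,
    # so re-encoding only reproduces the input if it was valid UTF-8.
    # Note that pure ASCII data also passes, since ASCII is valid UTF-8.
    roundtripped = data.decode("utf-8", errors="replace").encode("utf-8")
    if roundtripped == data:
        return "utf-8"

    # Anything else is assumed to be some single-byte "native" encoding.
    return "native"
```

For example, guess_unicode_encoding("héllo".encode("utf-8")) gives 
"utf-8", while the same text encoded as MacRoman falls through to 
"native".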

As I'm sure you are acutely aware, the difficult problem is telling the 
difference between very dense shift-sequence encodings (those which 
don't have some redundancy in their encodings to help with validation), 
and between single-byte encodings (e.g. between MacRoman and Latin-1). 
There is no algorithm for that per se, just lots of heuristics (based on 
statistical models) and potential dictionary lookup to help distinguish 
edge cases. Implementing something such as that is no small endeavour...
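
To give a flavour of what such a heuristic looks like, here is a 
deliberately crude sketch in Python. It decodes the bytes under each 
candidate encoding and counts how many characters land in a small, 
hand-picked set of "plausible" accented letters and typographic 
punctuation. The set, the function names and the scoring are all 
assumptions made for this example; a real detector would use proper 
frequency statistics per language.

```python
# Characters that commonly appear in Western European text. This set is
# hand-picked for illustration only and is not a real frequency model.
PLAUSIBLE = set("àâäçèéêëîïñôöùûüÀÉÖÜß‘’“”–—…")


def plausibility(data: bytes, encoding: str) -> int:
    """Count decoded characters that fall in the PLAUSIBLE set."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        # Some bytes are undefined in cp1252, which rules that encoding out.
        return -1
    return sum(1 for ch in text if ch in PLAUSIBLE)


def guess_single_byte(data: bytes, candidates=("mac_roman", "cp1252")):
    """Return the best-scoring candidate, or None when the top two tie."""
    scores = {enc: plausibility(data, enc) for enc in candidates}
    ranked = sorted(candidates, key=lambda enc: scores[enc], reverse=True)
    if len(ranked) > 1 and scores[ranked[0]] == scores[ranked[1]]:
        return None  # e.g. pure ASCII: the bytes give no evidence either way
    return ranked[0]
```

With this, "café déjà vu" encoded as MacRoman is guessed as 
"mac_roman" and the same text encoded as Windows-1252 is guessed as 
"cp1252", but it is easy to construct inputs that fool it, which is 
exactly the problem.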

> I am under the, perhaps false, impression that isoToMac and macToIso
> are sort of viewed as functions that may become deprecated and no
> longer updated in the future. However, they are still essential for us
> until I can textDecode(someData,"MacRoman") on a Windows system and
> vice versa.

They've not been deprecated yet so they aren't going anywhere - the 
internal functions those wrap are actually used to charset-swap strings 
in pre-v7 binary stackfiles (from v7, strings are serialized as utf-8 in 
stackfiles).

We probably will deprecate them when we make textDecode/textEncode 
accept more encodings (as suggested in the enhancement request) - but 
only because the latter is a much neater way to do things... I believe 
the code you use at the moment gives results identical to what native 
textDecode/textEncode support would give, doesn't it?

Warmest Regards,

Mark

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



