Guessing the encoding of a test file...

Paul Dupuis paul at researchware.com
Fri Mar 20 11:34:42 EDT 2020


To Sean and Bob,

Thank you for your replies. I may not have been clear enough in my 
original post:

We make and sell an app for macOS and Windows. It is used around the 
world by researchers (not a lot of them, as it is a niche product) on 
their computers. The application allows input of data from text files, 
and those files come from whatever sources the researchers have. It 
would hurt our competitiveness in our market if we forced users to 
convert all their data to one specific text encoding, so we need to try 
to "guess" the encoding of those text files.

There are many published algorithms for doing this, and we had a past 
contractor of ours take a "best practice" algorithm and create an LCS 
"guessEncoding" function. This replaced a previous guessEncoding function 
we had from Richard Gaskin, which, while quite good, did not cover as 
many test cases as the newer, more robust one.
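
For anyone unfamiliar with what such a function does, here is a minimal 
sketch of the general approach in Python. It is not our LCS handler, nor 
Richard's function. The names (guess_encoding, decodes_as, BOMS), the 
order of the checks, and the byte-counting fallback are all assumptions 
made for illustration: look for a byte order mark, try ASCII and strict 
UTF-8, then make a crude choice between MacRoman and Windows-1252.

```python
import codecs

# BOM signatures mapped to Python codec names. UTF-32 comes first because
# its little-endian BOM begins with the same two bytes as the UTF-16 one.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error using the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def guess_encoding(data: bytes) -> str:
    """Return a Python codec name that is a plausible guess for data."""
    # 1. A byte order mark is the strongest evidence available.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name

    # 2. Pure ASCII, then strict UTF-8. Text in a legacy single-byte
    #    encoding rarely happens to be valid UTF-8 by accident.
    if decodes_as(data, "ascii"):
        return "ascii"
    if decodes_as(data, "utf-8"):
        return "utf-8"

    # 3. Assumed heuristic, for illustration only: MacRoman keeps most of
    #    its accented letters in 0x80-0x9F, while Windows-1252 keeps them
    #    in 0xC0-0xFF and uses 0x80-0x9F mainly for punctuation.
    low = sum(1 for b in data if 0x80 <= b <= 0x9F)
    high = sum(1 for b in data if 0xC0 <= b <= 0xFF)
    return "mac_roman" if low > high else "cp1252"


if __name__ == "__main__":
    print(guess_encoding("café".encode("mac_roman")))  # mac_roman
    print(guess_encoding("café".encode("cp1252")))     # cp1252
    print(guess_encoding("café".encode("utf-8")))      # utf-8
```

The last step is where a simple guesser like this one goes wrong most 
often; a production routine would weigh much more evidence before 
settling on a single-byte encoding.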

My main question to the list was: Has anyone out there ALSO written a 
guessEncoding function they might like to share or license?

Why did I ask this? Because I am interested in comparing the accuracy of 
our current handler to any other that may be available. Users being 
users, we recently had a user reveal a bug (a misnamed variable) in our 
current function that meant it was missing certain edge cases (and this 
user has hundreds of text files that need this edge case to be properly 
recognized as MacRoman encoding). That bug has been fixed, but I am 
still interested in comparing any other guessEncoding routines to our 
current one to see if we can do better than we currently are.

To Mark,

As always, thanks for reading and responding, Mark. We're actually doing 
what you suggest. We had a set of QA test cases (text files in many 
different line endings and encodings), some intended to fail (such as 
Windows code pages we don't support). We're expanding these and doing a 
review on macOS and Windows with our app. For the ones that fail that we 
think shouldn't, we will step through the code to see why they fail and 
whether our algorithm can be further enhanced. I can't foresee any 
algorithm tweaks we can't code ourselves, or that we'd need LC or 
USE-LIST assistance for.
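
As a rough picture of that kind of test set, here is a small table-driven 
sketch in Python. It is not our actual QA suite. Case, run_cases, 
strict_utf8_guess and the sample data are all hypothetical names and 
values made up for illustration, and the stand-in guesser only knows 
UTF-8. An expected value of None marks a case that is meant to be 
rejected.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Case(NamedTuple):
    name: str
    data: bytes
    expected: Optional[str]  # None means "the guesser should reject this"


def strict_utf8_guess(data: bytes) -> str:
    """Stand-in guesser (hypothetical): accepts UTF-8, rejects the rest."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("unsupported encoding")
    return "utf-8"


def run_cases(
    guess: Callable[[bytes], str], cases: List[Case]
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (name, expected, got) for every case the guesser gets wrong."""
    failures = []
    for case in cases:
        try:
            got: Optional[str] = guess(case.data)
        except ValueError:
            got = None  # the guesser refused the input
        if got != case.expected:
            failures.append((case.name, case.expected, got))
    return failures


# The same text with each line-ending style, plus one intended failure.
CASES = [
    Case("utf8-lf", "né\n".encode("utf-8"), "utf-8"),
    Case("utf8-crlf", "né\r\n".encode("utf-8"), "utf-8"),
    Case("utf8-cr", "né\r".encode("utf-8"), "utf-8"),
    Case("cp1251-unsupported", "привет".encode("cp1251"), None),
]

if __name__ == "__main__":
    for name, expected, got in run_cases(strict_utf8_guess, CASES):
        print(f"{name}: expected {expected}, got {got}")
```

Keeping the intended failures in the same table as the passing cases 
means a guesser that starts "recognizing" an unsupported code page shows 
up as a failure too.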

Back around LiveCode 7, Fraiser said, in response to some correspondence 
I had with him, that he would consider creating a "guessEncoding" to go 
along with the Unicode Everywhere work and the new textEncode/textDecode 
functions. I do understand the reluctance, as a business, to do so, as 
inevitably there will be some instances where it guesses wrong. Other 
than LC adding a guessEncoding function using some open source library, 
I would say the area where LC could be of the most help would be with 
this enhancement: https://quality.livecode.com/show_bug.cgi?id=22391

I am under the, perhaps false, impression that isoToMac and macToIso are 
sort of viewed as functions that may become deprecated and no longer 
updated in the future. However, they are still essential for us until I 
can textDecode(someData,"MacRoman") on a Windows system and vice versa.





