Guessing the encoding of a test file...
paul at researchware.com
Fri Mar 20 11:34:42 EDT 2020
To Sean and Bob,
Thank you for your replies. I may not have been clear enough in my
We make and sell an app for macOS and Windows. It's used around the
world by researchers (not a lot of them, as it is a niche product) on
their computers. The research application allows input of data from
text files. Those text files come from the various sources those
researchers have. It would negatively impact our competitiveness in
our market if we forced the users to convert their data all to some
specific text encoding, so we need to try to "guess" the encoding of
those text files.
There are many published algorithms for doing this, and we had a past
contractor of ours take a "best practice" algorithm and create an LCS
"guessEncoding" function. This replaced a previous guessEncoding function
we had from Richard Gaskin, which, while quite good, did not cover
as many test cases as the newer, more robust one.
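
For readers who have not seen such a function, here is a minimal sketch of the general shape these algorithms tend to take. It is written in Python purely for illustration and is not our LCS handler or any published algorithm. The name guess_encoding, the helper _lowercase_score, and the "count lowercase accented letters" tie-breaker between MacRoman and Windows-1252 are all assumptions made for this example.

```python
import codecs

# Byte-order marks, UTF-32 first: the UTF-32 LE mark starts with the
# same two bytes as the UTF-16 LE mark, so order matters here.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def _lowercase_score(data: bytes, encoding: str) -> int:
    """Count non-ASCII lowercase letters when data is read as encoding.

    Returns -1 if the bytes are not valid in that encoding at all.
    """
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return -1
    return sum(1 for ch in text if ord(ch) > 127 and ch.islower())


def guess_encoding(data: bytes) -> str:
    """Hypothetical guesser: BOM, then ASCII, then UTF-8, then a heuristic."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Assumed heuristic: accented letters in running text are mostly
    # lowercase, so prefer the 8-bit encoding that yields more of them.
    mac = _lowercase_score(data, "mac_roman")
    win = _lowercase_score(data, "cp1252")
    return "mac_roman" if mac > win else "cp1252"
```

With this sketch, the bytes for "café" saved as MacRoman come back as "mac_roman" and the same word saved as Windows-1252 comes back as "cp1252". A real handler needs many more candidates and a far better scoring step; the point is only the order of the checks.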
My main question to the list was: Has anyone out there ALSO written a
guessEncoding function they might like to share or license?
Why did I ask this? Because I am interested in comparing the accuracy of
our current handler to any other that may be available, as, users being
users, we recently had a user reveal a bug (a misnamed variable) in our
current function that meant it was missing certain edge cases (and this
user has hundreds of text files that need this edge case to be properly
recognized as MacRoman encoding). So that bug has been fixed, but I am
still interested in comparing any other guessEncoding routines to our
current one to see if we can do better than we currently are.
As always, thanks for reading and responding, Mark. We're actually doing
what you suggest. We had a set of QA test cases (text files in many
different line endings and encodings), some intended to fail (such as
Windows code pages we don't support). We're expanding these and doing a
review on macOS and Windows with our app. Ones that fail, that we think
shouldn't fail, we will step through the code to see why they fail and
if our algorithm can be further enhanced. I can't foresee any algorithm
tweaks we can't code ourselves that we'd need LC or USE-LIST assistance for.
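
The kind of review described above could be driven by a small harness along these lines. This is a Python sketch for illustration only, not our actual QA setup: evaluate, utf8_or_cp1252, and the CASES list are hypothetical names, and the in-memory byte strings stand in for real test files.

```python
from typing import Callable, Iterable, List, Tuple

Case = Tuple[str, bytes, str]          # (case name, raw bytes, expected encoding)
Failure = Tuple[str, str, str]         # (case name, expected, actual)


def evaluate(guesser: Callable[[bytes], str],
             cases: Iterable[Case]) -> Tuple[float, List[Failure]]:
    """Run guesser over labelled cases; return (accuracy, failures)."""
    failures: List[Failure] = []
    total = 0
    for name, data, expected in cases:
        total += 1
        actual = guesser(data)
        if actual != expected:
            failures.append((name, expected, actual))
    accuracy = (total - len(failures)) / total if total else 0.0
    return accuracy, failures


def utf8_or_cp1252(data: bytes) -> str:
    """Deliberately naive stand-in guesser, used only to exercise the harness."""
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


# Labelled samples mixing encodings and line endings (assumed examples).
CASES: List[Case] = [
    ("utf8-accent", "café".encode("utf-8"), "utf-8"),
    ("cp1252-accent", "café".encode("cp1252"), "cp1252"),
    ("macroman-accent", "café".encode("mac_roman"), "mac_roman"),
    ("macroman-crlf", "café\r\nnaïve".encode("mac_roman"), "mac_roman"),
]

accuracy, failures = evaluate(utf8_or_cp1252, CASES)
for name, expected, actual in failures:
    print(f"{name}: expected {expected}, got {actual}")
print(f"accuracy: {accuracy:.0%}")
```

Running two different guessers over the same CASES list gives a like-for-like comparison, and the failures list points at exactly which files to step through in the debugger.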
Back around LiveCode 7, Fraiser said, in response to some correspondence
I had with him, that he would consider creating a "guessEncoding" to go
along with the Unicode Everywhere work and the new textEncode/textDecode
functions. I do understand the reluctance, as a business, to do so, as
inevitably there will be some instances where it guesses wrong. Other
than LC adding a guessEncoding function using some open source library,
I would say the area where LC could be the most help would be with this
I am under the, perhaps false, impression that isoToMac and macToIso are
sort of viewed as functions that may become deprecated and no longer
updated in the future. However, they are still essential for us until I
can textDecode(someData,"MacRoman") on a Windows system and vice versa.
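
To make the request concrete: a platform-independent conversion between the two character sets amounts to decoding with one table and re-encoding with the other. The sketch below is Python, for illustration only. The names mac_to_iso and iso_to_mac are hypothetical and are not claimed to reproduce the exact behaviour of the LiveCode functions, and replacing unmappable characters with "?" is an assumption of this example.

```python
def mac_to_iso(data: bytes) -> bytes:
    """Re-encode MacRoman bytes as ISO 8859-1 (Latin-1).

    Characters with no Latin-1 equivalent become "?" (assumed policy).
    """
    return data.decode("mac_roman").encode("latin-1", errors="replace")


def iso_to_mac(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 (Latin-1) bytes as MacRoman.

    Characters with no MacRoman equivalent become "?" (assumed policy).
    """
    return data.decode("latin-1").encode("mac_roman", errors="replace")
```

Because both tables ship with the language rather than with the operating system, the same call gives the same bytes on macOS and on Windows, which is the property we would like from textDecode.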