Guessing the encoding of a test file...
peterwawood at gmail.com
Fri Mar 20 20:49:23 EDT 2020
Paul

I wrote a simple function to guess the encoding of a file, but in Rebol, not LiveCode. I'm not sure how it compares with your current function in terms of accuracy. It is being used by a company which does a lot of text processing (though I don't know if that is a good recommendation or not).

The method I used is explained in the brief documentation - http://www.rebol.org/documentation.r?script=str-enc-utils.r. The rules could be used to create a LiveCode function.

Peter

PS Sorry for top posting, I'm replying from a mobile app.
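For readers who want a concrete picture of what rule-based encoding guessing can look like, here is a minimal Python sketch. It does not reproduce the rules in the Rebol script or in any LiveCode handler mentioned in this thread; the function name `guess_encoding`, the rule order, and the byte-range heuristic for telling MacRoman from Windows-1252 are all assumptions made for illustration. The heuristic relies on two facts: MacRoman places accented letters in 0x80-0x9F where Windows-1252 has mostly punctuation, and Windows-1252 places accented letters in 0xC0-0xFF where MacRoman has mostly symbols.

```python
import codecs

# Byte-order marks, UTF-32 first because the UTF-32 LE mark begins
# with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess a Python codec name for data (illustrative rules only)."""
    # Rule 1: an explicit byte-order mark settles the question.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # Rule 2: pure 7-bit data is ASCII (the empty input lands here too).
    if all(b < 0x80 for b in data):
        return "ascii"

    # Rule 3: text that decodes strictly as UTF-8 is almost certainly UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Rule 4: Python's cp1252 codec leaves five bytes undefined
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D); MacRoman defines all 256.
    try:
        data.decode("cp1252")
    except UnicodeDecodeError:
        return "mac_roman"

    # Rule 5 (crude assumption): compare how many bytes fall where each
    # encoding keeps its accented letters. Ties go to cp1252.
    mac_letters = sum(1 for b in data if 0x80 <= b <= 0x9F)
    win_letters = sum(1 for b in data if 0xC0 <= b <= 0xFF)
    return "mac_roman" if mac_letters > win_letters else "cp1252"
```

A guesser like this will still be wrong on some inputs, for example short files with a single ambiguous high byte, which is why comparing it against a labelled set of test files matters more than the exact rules.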
-------- Original message --------
From: Paul Dupuis via use-livecode <use-livecode at lists.runrev.com>
Date: 20/03/2020 23:35 (GMT+08:00)
To: use-livecode at lists.runrev.com
Cc: Paul Dupuis <paul at researchware.com>
Subject: Re: Guessing the encoding of a test file...

To Sean and Bob,

Thank you for your replies. I may not have been clear enough in my original post:

We make and sell an App for macOS and Windows. It's used around the world by researchers (not a lot of them, as it is a niche product) on their computers. The research application allows input of data from text files. Those text files come from whatever various sources the researchers have. It would negatively impact our competitiveness in our market if we forced the users to convert all their data to some specific text encoding, so we need to try to "guess" the encoding of those text files.

There are many published algorithms for doing this, and we had a past contractor of ours take a "best practice" algorithm and create an LCS "guessEncoding" function. This replaced a previous guessEncoding function we had from Richard Gaskin, which, while quite good, did not cover as many test cases as the newer, more robust one.

My main question to the list was: Has anyone out there ALSO written a guessEncoding function they might like to share or license?

Why did I ask this? Because I am interested in comparing the accuracy of our current handler to any other that may be available. Users being users, we recently had a user reveal a bug (a misnamed variable) in our current function that meant it was missing certain edge cases (and this user has hundreds of text files that need this edge case to be properly recognized as MacRoman encoding). So that bug has been fixed, but I am still interested in comparing any other guessEncoding routines to our current one to see if we can do better than we currently are.

To Mark,

As always, thanks for reading and responding, Mark. We're actually doing what you suggest.
We had a set of QA test cases (text files in many different line endings and encodings), some intended to fail (such as Windows code pages we don't support). We're expanding these and doing a review on macOS and Windows with our app. For the ones that fail but that we think shouldn't, we will step through the code to see why they fail and whether our algorithm can be further enhanced. I can't foresee any algorithm tweaks that we couldn't code ourselves and would need LC or USE-LIST assistance for.

Back around LiveCode 7, Fraiser said, in response to some correspondence I had with him, that he would consider creating a "guessEncoding" to go along with the Unicode Everywhere work and the new textEncode/textDecode functions. I do understand the reluctance, as a business, to do so, as inevitably there will be some instances where it guesses wrong. Other than LC adding a guessEncoding function using some open source library, I would say the area where LC could be the most help would be with this enhancement: https://quality.livecode.com/show_bug.cgi?id=22391

I am under the, perhaps false, impression that isoToMac and macToIso are viewed as functions that may become deprecated and no longer be updated in the future. However, they are still essential for us until I can textDecode(someData,"MacRoman") on a Windows system and vice versa.

_______________________________________________
use-livecode mailing list
use-livecode at lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
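The comparison Paul describes, running several guessEncoding candidates over a labelled set of QA files and seeing which ones miss, could be organised along the following lines. This is only a sketch in Python, not the Researchware test suite: `compare_guessers`, the two stand-in guessers, and the three in-memory test cases are all hypothetical names and data invented for the example. In practice each case would be the bytes of a real file paired with its known encoding.

```python
def compare_guessers(cases, guessers):
    """Score each guesser against labelled cases.

    cases:    list of (label, data_bytes, expected_encoding)
    guessers: dict mapping a guesser name to a function bytes -> str
    Returns {name: (number_correct, total, [(label, expected, got), ...])}.
    """
    report = {}
    for name, guess in guessers.items():
        misses = []
        for label, data, expected in cases:
            got = guess(data)
            if got != expected:
                # Keep the miss so it can be stepped through later.
                misses.append((label, expected, got))
        report[name] = (len(cases) - len(misses), len(cases), misses)
    return report


# Hypothetical stand-in guessers, deliberately simplistic.
def always_utf8(data):
    return "utf-8"


def utf8_else_cp1252(data):
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "cp1252"


# Hypothetical labelled cases built in memory instead of read from disk.
CASES = [
    ("utf-8 accents", "naïve".encode("utf-8"), "utf-8"),
    ("cp1252 accents", "naïve".encode("cp1252"), "cp1252"),
    ("macroman accents", "naïve".encode("mac_roman"), "mac_roman"),
]

REPORT = compare_guessers(
    CASES,
    {"always_utf8": always_utf8, "utf8_else_cp1252": utf8_else_cp1252},
)

for name, (correct, total, misses) in REPORT.items():
    print(f"{name}: {correct}/{total} correct")
    for label, expected, got in misses:
        print(f"  missed {label}: expected {expected}, got {got}")
```

Keeping the list of misses alongside the score means the cases intended to fail, such as unsupported code pages, can be separated from the ones that point to a gap in the algorithm.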