Guessing the encoding of a test file...
ambassador at fourthworld.com
Fri Mar 20 13:44:47 EDT 2020
Paul Dupuis wrote:
> There are many published algorithms for doing this, and we had a past
> contractor of ours take a "best practice" algorithm and create an LCS
> "guessEncoding" function. This replaced a previous guessEncoding
> function we had from Richard Gaskin, which, while quite good, did
> not cover as many test cases as the newer, more robust one.
The algo I wrote for you a decade ago was an amalgam of best efforts
culled throughout this community at the time. It even included a
variant, refined in our testing, of statistical analysis of certain
patterns identified by Peter Haworth for files without explicit declaration.
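For readers unfamiliar with the general approach, here is a minimal sketch of a layered encoding guess. It is written in Python rather than LiveCode Script, and it is not the algorithm discussed in this thread: the function name guess_encoding, the order of the checks, and the single byte-range heuristic are all assumptions chosen for illustration. It looks for an explicit declaration (a byte-order mark) first, then tests whether the bytes are valid UTF-8, and only then falls back to a crude statistical check on the raw bytes.

```python
import codecs

# Byte-order marks, with UTF-32 listed before UTF-16 because the UTF-32 LE
# mark begins with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess Python codec name for the given bytes.

    Hypothetical illustration only; real detectors use far richer statistics.
    """
    # 1. Explicit declaration: a byte-order mark at the start of the data.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Strict UTF-8 decoding: legacy 8-bit text rarely happens to be
    #    valid UTF-8, so a clean decode is strong evidence.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    else:
        return "ascii" if all(b < 0x80 for b in data) else "utf-8"

    # 3. Crude statistical fallback (an assumption of this sketch): bytes in
    #    0x80-0x9F are control codes in Latin-1 but printable punctuation in
    #    CP1252, so their presence suggests CP1252.
    if any(0x80 <= b <= 0x9F for b in data):
        return "cp1252"
    return "latin-1"


if __name__ == "__main__":
    samples = [
        b"plain text",
        "caf\u00e9".encode("utf-8"),
        "\u201cquoted\u201d".encode("cp1252"),
    ]
    for sample in samples:
        print(sample, "->", guess_encoding(sample))
```

A sketch like this cannot tell apart the many single-byte encodings that share the same byte ranges (MacRoman, for example), and that is where pattern statistics gathered from a large sample collection would come in.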
At the time, running the algo through the test collection of some ~200
widely varying sample documents, some of which even mixed different
encodings, we compared our results with those from Apple's TextEdit and
found that our algo correctly identified encoding at least 15% more
often than TextEdit.
Once we bested Apple on that by an appreciable margin, all of us on the
team reviewed the results and determined that we were clearly looking at
a case of diminishing returns in terms of cost-to-further-refine vs
actual percentage of documents in use requiring such refinement.
I would be interested to learn more about the details of the subsequent
refinements over the decade since, but also the ROI proposition for today:
Given that another ten years have passed, with modern encodings now in
wide use, and that older encodings like CP1252 (premiered in Windows 1.0
and popularized in Windows 95) are rarely seen in modern usage (as of
March 2020 Wikipedia notes only 0.4% of web pages using that encoding),
what percentage of the documents your customers need to work with will
benefit from further investment in refining that algo?
Fourth World Systems
Software Design and Development for the Desktop, Mobile, and the Web
Ambassador at FourthWorld.com http://www.FourthWorld.com