Guessing the encoding of a test file...

Richard Gaskin
Fri Mar 20 13:44:47 EDT 2020

Paul Dupuis wrote:

 > There are many published algorithms for doing this and we had a past
 > contractor of ours take a "best practice" algorithm and create an LCS
 > "guessEncoding" function. This replaced a previous guessEncoding
 > function we had from Richard Gaskin, which, while quite good, did
 > not cover as many test cases as the newer, more robust one.

The algo I wrote for you a decade ago was an amalgam of best efforts 
culled throughout this community at the time. It even included a 
variant, refined in our testing, of statistical analysis of certain 
patterns identified by Peter Haworth for files without explicit declaration.
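
For readers who have not seen this kind of function, here is a minimal sketch of the general shape such a guesser often takes. It is written in Python rather than LCS, it is not the algo described above, and the name guess_encoding, the order of the checks, and the CP1252 fallback are all assumptions made for illustration. It checks for an explicit byte-order mark first, then tests whether the bytes are valid UTF-8, and only then falls back to a legacy guess. The statistical pattern analysis mentioned above is not shown.

```python
import codecs

# Byte-order marks, checked longest-first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best-guess codec name for data (hypothetical helper)."""
    # 1. An explicit declaration (BOM) wins outright.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Pure 7-bit data is ASCII, which is also valid UTF-8 and CP1252.
    if all(b < 0x80 for b in data):
        return "ascii"
    # 3. Valid multi-byte UTF-8 is very unlikely to occur by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. Assumed fallback: treat anything else as CP1252. A real guesser
    #    would apply statistical checks here instead of a fixed answer.
    return "cp1252"
```

Step 4 is where the hard cases live: a fixed fallback cannot tell CP1252 from MacRoman or ISO-8859-1, and that is the part where statistical analysis earns its keep.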

At the time, we ran the algo through a test collection of ~200 widely 
varying sample documents, some of which even mixed different encodings. 
Comparing our results with those from Apple's TextEdit, we found that 
our algo correctly identified the encoding at least 15% more often than 
TextEdit.

Once we bested Apple on that by an appreciable margin, all of us on the 
team reviewed the results and determined that we were clearly looking at 
a case of diminishing returns in terms of cost-to-further-refine vs 
actual percentage of documents in use requiring such refinement.

I would be interested to learn more about the details of the subsequent 
refinements over the decade since, but also the ROI proposition for today:

Given that another ten years have passed with modern encodings in wide 
use, and that older encodings like CP1252 (premiered in Windows 1.0 and 
popularized in Windows 95) are rarely seen in modern usage (as of March 
2020 Wikipedia notes only 0.4% of web pages using that encoding), what 
percentage of the documents your customers need to work with will 
benefit from further investment in refining that algo?

  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web

More information about the use-livecode mailing list