Guessing the encoding of a text file...

Quentin Long cubist at aol.com
Sat Mar 21 04:20:17 EDT 2020


I strongly suspect that the desired goal, a nice, robust algorithm which automagically identifies the encoding of *ABSOLUTELY ANY* text document with zero need for human involvement, simply isn't possible, because text encoding is intrinsically arbitrary. See also: the many variations on extended (8-bit) ASCII, the various mutually incompatible versions of EBCDIC, etc., ad nauseam.
Seems to me, therefore, that in the general case, human involvement is an *unavoidable necessity* in determining which encoding an arbitrary text document uses. So the goal of any encoding-ID algorithm should *not* be the impossible task of determining that encoding *without* human involvement; rather, the goal should be to *minimize* that involvement, and to make it as *simple and painless* as practically feasible. So, here goes with some semi-random rambling...

Pretty sure the best, most nearly bulletproof way to ID a document's text encoding is to apply that encoding to the bytes of the document and show the resulting character sequence to a human. If there's more than one possibility for the document's encoding, apply all of the possible encodings and show the human all of the resulting character sequences. A good way to do this might be to put up N different text fields in a window, all controlled by one scrollbar, and have the human click on every field whose content looks good to them. Or maybe the human clicks on every field whose output looks *bad* to them? Whichever way works, as long as there *is* some human judgement in there somewhere.
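Here's a minimal sketch of that decode-and-compare step, in Python rather than LiveCode, just to pin the idea down. The candidate list and the file name are placeholder assumptions; note that some single-byte encodings (ISO-8859-1, for one) will happily decode *any* byte sequence, which is exactly why the human's eyeballs stay in the loop:

    # Decode the raw bytes with every candidate encoding and collect the
    # results so a human can eyeball them side by side.
    CANDIDATES = ["utf-8", "cp1252", "mac-roman", "iso-8859-1", "cp500"]

    def candidate_decodings(raw):
        """Return {encoding: decoded text} for each candidate that decodes cleanly."""
        results = {}
        for enc in CANDIDATES:
            try:
                results[enc] = raw.decode(enc)
            except UnicodeDecodeError:
                pass  # these bytes aren't even valid in this encoding; drop it
        return results

    with open("mystery.txt", "rb") as f:
        raw = f.read()

    for enc, text in candidate_decodings(raw).items():
        print("--- " + enc + " ---")
        print(text[:200])  # a sample is enough for a human to judge

In a GUI, each of those samples would go into one of the N scrolling fields described above.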
Can we assume that once a particular document's text encoding has been identified, *all* documents which came from the same source use that same encoding? If so, that might simplify the continuing workflow: tell the software "This document came from Source X", and the software then uses whichever text encoding it associates with that source. Even if a source has more than one encoding in play, that's still easier to work with than having to sort through an arbitrarily large number of encodings.
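A sketch of that per-source memory, with the caveat that the JSON file and the function names are just assumptions for illustration:

    import json
    from pathlib import Path

    STORE = Path("source_encodings.json")  # assumed storage location

    def load_store():
        return json.loads(STORE.read_text()) if STORE.exists() else {}

    def remember(source, encoding):
        """Record the human-confirmed encoding for a source."""
        store = load_store()
        store[source] = encoding
        STORE.write_text(json.dumps(store, indent=2))

    def encoding_for(source):
        """Return the remembered encoding for this source, or None."""
        return load_store().get(source)

    # remember("Source X", "cp1252")
    # encoding_for("Source X")  => "cp1252"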
Is it possible to tell the software "hey, no character in $ThisSetOfChars will ever appear in this document"? If so, the software should be able to rule out any encoding whose decoded character sequence contains one of the Forbidden Chars.
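Continuing the Python sketch, that filter is a one-liner over the decodings gathered earlier; the forbidden set shown is an arbitrary assumption:

    FORBIDDEN = set("¤¶§")  # whatever chars the user swears will never appear

    def plausible(decodings):
        """Drop any decoding that contains a forbidden character."""
        return {enc: text for enc, text in decodings.items()
                if not (set(text) & FORBIDDEN)}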

Given human error, it may be that the human's input ends up ruling out *every possible* text encoding. Probably a good idea to use something akin to fuzzy logic rather than strict Boolean operations, so that one mistaken judgement can't zero out the whole candidate list.
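A sketch of what that might look like: rather than a hard veto, each strike against a candidate merely lowers a score. The weights here are arbitrary assumptions; the point is only that nothing gets eliminated absolutely:

    def score(text, forbidden=frozenset(), human_rejected=False):
        """Score a candidate decoding; 1.0 is clean, lower is more suspect."""
        s = 1.0
        s -= 0.2 * sum(ch in forbidden for ch in set(text))  # per forbidden char found
        if human_rejected:
            s -= 0.5  # a "looks bad" click hurts, but isn't fatal
        return max(s, 0.0)

Rank the candidates by score; even if every one of them got dinged somewhere, the best-scoring candidate still surfaces for a second look.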



