Guessing the encoding of a text file...
    Mark Waddingham 
    mark at livecode.com
       
    Sun Mar 22 08:41:33 EDT 2020
    
    
  
On 2020-03-21 14:09, Paul Dupuis via use-livecode wrote:
> So far the only person who has read my post and replied with what I
> was looking for was Peter - and although the routine was written in
> Rebol rather than LiveCode, he kindly provided a link to information
> about it.
It might have got lost in amongst other replies but I did suggest:
<https://pypi.org/project/chardet/>
It even comes with a command-line script (chardetect), which would allow 
you to compare your detector with that one.
However, on further digging it appears that chardet does not (as it 
stands) detect MacRoman, which is obviously a key requirement here.
There is a stale PR for MacRoman support, though 
<https://github.com/chardet/chardet/pull/5>, so the method chardet uses 
can evidently be extended to cover it.
From what I have read, the Python one is a reimplementation of 
Mozilla's 'Universal Charset Detector', which is/was pretty much state 
of the art. Reading through the chardet docs 
(https://chardet.readthedocs.io/en/latest/how-it-works.html#single-byte-encodings 
is perhaps the most pertinent), it sounds like its single-byte detectors 
use 2-byte sequences to try to distinguish between encodings.
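
To make the 2-byte-sequence idea concrete, here is a toy sketch in 
Python. It is not chardet's code or its trained models: the training 
sentence, the function names and the scoring rule are all made up for 
illustration. It decodes the bytes under each candidate encoding and 
checks how many adjacent character pairs were seen in a sample of the 
expected language.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent character pairs, ignoring case."""
    text = text.lower()
    return Counter(zip(text, text[1:]))

def score(data, encoding, model):
    """Fraction of adjacent pairs in the decoded text that the model has seen."""
    # Undecodable bytes become U+FFFD, which never appears in the model,
    # so they simply lower the score.
    text = data.decode(encoding, errors="replace")
    pairs = bigram_counts(text)
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    seen = sum(n for pair, n in pairs.items() if pair in model)
    return seen / total

def guess(data, candidates, model):
    """Return the candidate encoding whose decoding scores highest."""
    return max(candidates, key=lambda enc: score(data, enc, model))

# Hypothetical, deliberately tiny training sample; a real detector would
# use pair frequencies gathered from a large corpus per language.
TRAINING = "le café est fermé; la crème brûlée est prête; déjà vu; très élégant"
MODEL = bigram_counts(TRAINING)

sample = "café très prête".encode("mac_roman")
print(guess(sample, ["cp1252", "mac_roman"], MODEL))
```

The wrong decoding turns the accented letters into other characters, 
which produces pairs the sample never contained, so its score drops.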
There is a special case for Latin-1 (1252), which is needed because 
English text looks the same in a large number of encodings - this works 
by looking for curly quotes and other special symbols, by the look of it. 
(The MacRoman addition in the stale PR above is also a Latin-1-like 
special case - which makes sense, as Latin-1 and MacRoman are almost just 
permutations of each other.)
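
A minimal sketch of that kind of special case, in Python. This is not 
chardet's Latin-1 prober nor the code in the PR; the character set and 
the function names are assumptions chosen for illustration. It relies 
on the fact that curly quotes, dashes, the ellipsis and the bullet sit 
at different byte values in Windows-1252 and MacRoman, so only the 
correct decoding yields them.

```python
# Typographic characters common in English text from word processors:
# curly single/double quotes, en/em dashes, ellipsis, bullet.
TYPOGRAPHIC = set("\u2018\u2019\u201c\u201d\u2013\u2014\u2026\u2022")

def typographic_score(data, encoding):
    """Count typographic characters produced by decoding with `encoding`."""
    text = data.decode(encoding, errors="replace")
    return sum(1 for ch in text if ch in TYPOGRAPHIC)

def latin1_or_macroman(data):
    """Return 'cp1252', 'mac_roman', or None when the evidence is tied."""
    cp1252 = typographic_score(data, "cp1252")
    mac = typographic_score(data, "mac_roman")
    if cp1252 > mac:
        return "cp1252"
    if mac > cp1252:
        return "mac_roman"
    return None  # e.g. pure ASCII: both decodings look identical

text = "\u201cHello\u201d \u2013 it\u2019s fine\u2026"
print(latin1_or_macroman(text.encode("cp1252")))
print(latin1_or_macroman(text.encode("mac_roman")))
print(latin1_or_macroman(b"plain ascii"))
```

Text with no such symbols gives a tie, which is exactly the case where 
no heuristic of this kind can help.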
My general feeling is that if you already have a process which works to 
detect the differences between MacRoman and Latin-1, then it is likely 
largely equivalent to any other method that exists (the accepted answer 
here 
<https://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and> 
sounds like it pretty much sums up the situation!). So beyond fixing the 
bug(s) you found recently, you might find that there is nothing more you 
can do.
Warmest Regards,
Mark.
-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
    
    