Guessing the encoding of a test file...

Paul Dupuis paul at researchware.com
Sun Mar 22 11:15:42 EDT 2020


On 3/22/2020 8:41 AM, Mark Waddingham via use-livecode wrote:
> On 2020-03-21 14:09, Paul Dupuis via use-livecode wrote:
>> So far the only person who has read my post and replied with what I
>> was looking for was Peter - and although the routine was written in
>> Rebol rather than LiveCode, he kindly provided a link to information
>> about it.
>
> It might have got lost in amongst other replies but I did suggest:
>
> <https://pypi.org/project/chardet/>

Thanks, Mark.

I apologize. I did miss the reference to Chardet. At one point we looked 
at wrapping a C++ interface to the Mozilla code, but we don't have 
anyone here who has had the time to learn LCB and FFI. I know! We should 
make the time!


>
> It even comes with a command-line script (chardetect) which would 
> allow you to compare your detector with that one.
>
> However, on further digging it appears that this does not (as it 
> stands) detect MacRoman which is obviously a key requirement here.

MacRoman detection is an essential requirement. One of our selling 
points, since many universities these days have people on mixed 
platforms, is that our tool is nearly identical across macOS and Windows 
to facilitate researcher collaboration. We do have people sending 
files created on their Macs to Windows team members and vice versa, so 
we have to detect MacRoman and CP1252 on both platforms.
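
To illustrate why this matters, here is a small Python sketch (Python 
only because chardet is a Python library; it is not our LiveCode code). 
The helper name decode_both is made up for this example. It decodes the 
same bytes under both encodings so the differences are visible.

```python
def decode_both(data: bytes) -> tuple[str, str]:
    """Return (MacRoman reading, CP1252 reading) of the same bytes.

    Hypothetical helper for illustration only. errors="replace" is used
    because CP1252 leaves a few byte values undefined.
    """
    return (
        data.decode("mac_roman", errors="replace"),
        data.decode("cp1252", errors="replace"),
    )


# Byte 0x8E is e-acute in MacRoman but Z-caron in CP1252.
mac_text, win_text = decode_both(b"caf\x8e")

# Bytes 0xD2/0xD3 are curly double quotes in MacRoman but
# O-grave/O-acute in CP1252.
mac_quotes, win_quotes = decode_both(b"\xd2hi\xd3")
```

A file saved on one platform and read with the other platform's legacy 
encoding still decodes without error, which is why the mistake is easy 
to miss and has to be caught by detection.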

>
> There is a stale PR for that though 
> <https://github.com/chardet/chardet/pull/5> so the method used here is 
> obviously possible to extend to that.
>
> From what I have read the Python one is a python reimplementation of 
> Mozilla's 'Universal Charset Detector' which, from what I have read, 
> is/was pretty much state of the art - reading through the chardet docs 
> (https://chardet.readthedocs.io/en/latest/how-it-works.html#single-byte-encodings 
> is perhaps the most pertinent) it sounds like its single-byte 
> detectors use 2-byte sequences to try and distinguish.
>
> There is a special case for Latin-1 (1252) which is needed because 
> English text looks the same in a large number of encodings - this 
> works by looking for curly quotes and other special symbols by the 
> look of it. (The MacRoman addition in the stale PR above, is also a 
> Latin-1 like special-case - which makes sense as Latin-1 and MacRoman 
> are almost just permutations of each other).
>
> My general feeling is that if you already have a process which works 
> to detect the differences between MacRoman and Latin-1, then it is 
> likely largely equivalent to any other means which exists (the 
> accepted answer here 
> <https://stackoverflow.com/questions/4198804/how-to-reliably-guess-the-encoding-between-macroman-cp1252-latin1-utf-8-and> 
> sounds like it pretty much sums up the situation!) so beyond fixing 
> the bug(s) you found recently, you might find that there is nothing 
> more you can do.

And we arrive at the same place! Our review of our code, which failed to 
handle a particular MacRoman detection, and a comparison with other 
encoding guessing algorithms turned up a couple of issues, all fixable 
in our code, and only one was an encoding guessing issue.

In our guessEncoding routine, there was a misspelled variable that was 
preventing the detection of MacRoman from line ending comparisons from 
working properly. I'm not sure how this got past our QA, but - as you 
know - sometimes things do and it did. With that fixed, we are getting 
accurate detection of CP1252, MacRoman, ASCII, UTF8, UTF16 BE/LE, and 
UTF32 BE/LE on our suite of about 30 different test files.
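
For anyone curious what this kind of guessing looks like, below is a 
minimal Python sketch. It is NOT our guessEncoding routine (which is 
LiveCode); the function name, the order of checks and the specific 
heuristics are my assumptions for illustration. It only recognizes 
UTF16/UTF32 when a BOM is present, and the MacRoman versus CP1252 step 
relies on just two hints: bytes that CP1252 leaves undefined, and bare 
CR line endings.

```python
import codecs

# Byte values that CP1252 leaves undefined; seeing one rules CP1252 out.
CP1252_UNDEFINED = frozenset(b"\x81\x8d\x8f\x90\x9d")

# UTF-32 BOMs are tested before UTF-16 because the UTF-32 LE BOM starts
# with the same two bytes as the UTF-16 LE BOM.
BOMS = (
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def guess_encoding(data: bytes) -> str:
    """Guess an encoding name for data (illustrative heuristic only)."""
    # 1. A byte order mark is the strongest signal.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # 2. Pure 7-bit data is ASCII.
    if all(b < 0x80 for b in data):
        return "ascii"
    # 3. Non-ASCII data that decodes as strict UTF-8 is very likely UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # 4. Bytes undefined in CP1252 point to MacRoman.
    if any(b in CP1252_UNDEFINED for b in data):
        return "mac_roman"
    # 5. Line ending hint: bare CR with no bare LF suggests classic Mac.
    crlf = data.count(b"\r\n")
    bare_cr = data.count(b"\r") - crlf
    bare_lf = data.count(b"\n") - crlf
    if bare_cr > 0 and bare_lf == 0:
        return "mac_roman"
    return "cp1252"
```

A real detector would need more evidence than this (for example, 
character frequency checks and BOM-less UTF16), so treat the fallback 
to CP1252 as a guess rather than a result.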

We also ran into an edge case: Mac CR (ASCII 13) line endings in a UTF8 
or UTF16 file needed an adjustment to convert the line endings to 
linefeeds.
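
As a rough illustration of that adjustment (again a Python sketch with 
a made-up function name, not our LiveCode code), the conversion can be 
done on the decoded text so it works the same for any source encoding:

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF and bare CR line endings to linefeeds.

    CRLF is replaced first so that each pair becomes a single linefeed
    rather than two.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


# Example: UTF-8 bytes with classic Mac (CR) line endings.
raw = "first\rsecond\r".encode("utf-8")
cleaned = normalize_line_endings(raw.decode("utf-8"))
```

Doing this after decoding avoids having to look for the CR byte pattern 
separately in UTF16 BE and LE data.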

So at this point our code is detecting the encoding for and reading text 
files into LC with a pretty high rate of accuracy.

For anyone else needing such code, I will try to pull it into a single 
library and somehow make it available. All I will ask is that if anyone 
does use it and improves upon it, they share the improvement back.




