Guess encoding for text file...
Dar Scott Consulting
dsc at swcp.com
Thu Sep 19 12:25:32 EDT 2019
Plain UTF-16 and UTF-32 are not needed in your list. Those are BE unless a leading BOM indicates otherwise, so every such file resolves to either the BE or the LE form. That is, the BE and LE versions are sufficient.
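To make the BOM point concrete, here is a minimal sketch in Python (not LiveCode Script; the logic should port directly). The names BOMS and sniff_bom are made up for illustration. The UTF-32LE BOM has to be tested before the UTF-16LE one because it starts with the same two bytes.

```python
# Hypothetical helper, not from any library: map a leading BOM to a
# BE/LE encoding name. Longer BOMs come first so that FF FE 00 00
# (UTF-32LE) is not mistaken for FF FE (UTF-16LE).
BOMS = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

One caveat: a UTF-16LE file whose first character after the BOM is U+0000 looks identical to a UTF-32LE BOM, so this is a strong hint rather than proof.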
ASCII is a subset of CP1252, MacRoman and UTF-8, so an ASCII file can be classified as UTF-8 if there is no advantage to knowing that it is ASCII. (Printable ASCII is also a subset of ISO-8859-1.)
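As a rough illustration of the subset relationship, the Python sketch below reports ASCII only when every byte is below 128 and otherwise falls back to a strict UTF-8 decode. The function name classify_7bit_or_utf8 is invented for this example, and "other" simply means one of the single-byte encodings still has to be sorted out by other tests.

```python
def classify_7bit_or_utf8(data: bytes) -> str:
    """Hypothetical helper: separate ASCII, valid UTF-8, and everything else."""
    if data.isascii():
        # Pure 7-bit data decodes identically as UTF-8, CP1252 and MacRoman.
        return "ASCII"
    try:
        data.decode("utf-8")  # strict by default: raises on malformed input
        return "UTF-8"
    except UnicodeDecodeError:
        return "other"
```

If the caller does not care about the ASCII label, the first branch can simply return "UTF-8".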
A couple of thoughts on creating a custom function. The special codes 1, 2, 3 and 4 in your ASCII files can be handled in a custom function. You might get a good idea from just the first 128 bytes, or maybe from a few iterations of 32 bytes. You can consider an a priori ordering of likelihood, related to the question of which tests provide the most information in the least time. And if you can't tell the difference, then maybe it doesn't matter.
I considered some methods of adjusting probabilities but the overhead means the test chunks should not be trivial. Also, the probability might be simplified to "maybe" and "nope". (However, if there might be errors in the text or discernment needs to rely on text probabilities, the numbers might be best.) Tests move probabilities from maybe to nope.
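Here is one way the "maybe"/"nope" idea could look, sketched in Python under some assumptions of mine: only the four 8-bit candidates are considered, the chunk size of 32 is arbitrary, and treating C1 control bytes (0x80 to 0x9F) as ruling out ISO-8859-1 is a heuristic rather than a rule. The name eliminate is hypothetical. MacRoman defines all 256 byte values, so nothing here can ever rule it out.

```python
import codecs

# Byte values that have no character assigned in CP1252.
CP1252_UNDEFINED = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

def eliminate(data: bytes, chunk_size: int = 32) -> set:
    """Hypothetical sketch: start with every candidate as 'maybe' and
    move candidates to 'nope' as chunks are examined."""
    maybe = {"UTF-8", "CP1252", "ISO-8859-1", "MacRoman"}
    # An incremental decoder copes with multi-byte sequences that
    # straddle a chunk boundary.
    utf8 = codecs.getincrementaldecoder("utf-8")()
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        if "UTF-8" in maybe:
            try:
                utf8.decode(chunk)
            except UnicodeDecodeError:
                maybe.discard("UTF-8")
        if "CP1252" in maybe and any(b in CP1252_UNDEFINED for b in chunk):
            maybe.discard("CP1252")
        # Assumption: C1 control codes do not occur in real text files.
        if "ISO-8859-1" in maybe and any(0x80 <= b <= 0x9F for b in chunk):
            maybe.discard("ISO-8859-1")
        if len(maybe) <= 1:
            break  # nothing left to eliminate
    if "UTF-8" in maybe:
        try:
            utf8.decode(b"", final=True)  # catches a truncated last sequence
        except UnicodeDecodeError:
            maybe.discard("UTF-8")
    return maybe
```

When more than one candidate survives, the a priori ordering of likelihood would pick the winner.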
One method might do a batch of unsigned 32-bit int decodes and do logic operations on each of those. That can only do partial elimination tests on UTF-8, but detailed tests can be done afterward. I am not sure about performance; it might be that byteToNum() would be much faster.
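A sketch of what I mean by logic operations on 32-bit words, in Python with struct standing in for a batch binary decode. The function name word_tests and the returned keys are made up. The masks are deliberately coarse: they only check that a would-be UTF-32 code point is below 0x200000 and that no byte has its high bit set, so they eliminate candidates but confirm nothing. Trailing bytes beyond a multiple of four are ignored.

```python
import struct

def word_tests(data: bytes) -> dict:
    """Hypothetical sketch: partial elimination tests on big-endian
    unsigned 32-bit words using bit masks only."""
    n = len(data) // 4
    words = struct.unpack(f">{n}I", data[:n * 4])
    seven_bit = True   # no byte in any word has its high bit set
    utf32be = True     # every word could be a UTF-32BE code point
    utf32le = True     # every word could be a UTF-32LE code point
    for w in words:
        if w & 0x80808080:
            seven_bit = False
        if w & 0xFFE00000:   # first byte nonzero or second byte >= 0x20
            utf32be = False
        if w & 0x0000E0FF:   # fourth byte nonzero or third byte >= 0x20
            utf32le = False
    return {
        "seven_bit": seven_bit,
        "utf32be_possible": utf32be,
        "utf32le_possible": utf32le,
    }
```

If seven_bit comes back true, the file is ASCII as far as the examined words go and the detailed UTF-8 test can be skipped.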
I'm guessing that one can get some good probabilities from the first four bytes.
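For files without a BOM, one well-known first-four-bytes heuristic looks at where the zero bytes fall (older JSON specifications used the same trick). The Python sketch below assumes the first two characters of the file are non-NUL ASCII, which will not hold for every file; the name guess_from_first_four is hypothetical.

```python
def guess_from_first_four(head: bytes) -> str:
    """Hypothetical sketch: guess the encoding family from the pattern
    of zero bytes in the first four bytes. Assumes the file starts with
    two non-NUL ASCII characters and has no BOM."""
    if len(head) < 4:
        return "8-bit or unknown"
    zeros = tuple(b == 0 for b in head[:4])
    patterns = {
        (True, True, True, False): "UTF-32BE",   # 00 00 00 xx
        (False, True, True, True): "UTF-32LE",   # xx 00 00 00
        (True, False, True, False): "UTF-16BE",  # 00 xx 00 xx
        (False, True, False, True): "UTF-16LE",  # xx 00 xx 00
    }
    return patterns.get(zeros, "8-bit or unknown")
```

Anything that comes back as "8-bit or unknown" still needs the UTF-8 and single-byte tests.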
So, I agree with Curry. He might not use anything I mentioned, but he can optimize your code for longer files, if you need full checking.
> On Sep 17, 2019, at 2:05 PM, Paul Dupuis via use-livecode <use-livecode at lists.runrev.com> wrote:
> I started this post on the DEV-LIST. Mark Waddingham kindly responded and smartly suggested I should move it to the USE-LIST, so that is what I am doing. I have also pasted Mark's reply below my original post.
> ---------------------- ORIGINAL POST ----------------------------------------
> I have a LiveCode Script (LCS) routine that attempts to follow industry common algorithms for guessing the encoding of a text file.
> Its performance can be slower than I would like.
> This has led me to wonder if a LiveCode Builder (LCB) library may be the route to go. Does anyone know the OSX and/or Windows APIs for guessing a text file's encoding?
> I have done a number of google searches, but I am not a C programmer (not in many decades) and wading through the huge doc sets at MSDN or Apple is daunting.
> I found reference to a windows API:
> BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );
> Which suggests to me that such APIs may exist. Does anyone who is better at finding OS APIs know where to find such APIs? Can you point me to the right online documentation?
> I also found this: https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
> Of course, it would be wonderful if the mothership delivered this. At one point Frasier said he would back around LC7 something.
> It seems an LCB library that uses OS APIs to return best guess for file encoding that match up with the textEncode/Decode functions would be a great addition to LC
> * "ASCII"
> * "UTF-16"
> * "UTF-16BE"
> * "UTF-16LE"
> * "UTF-32"
> * "UTF-32BE"
> * "UTF-32LE"
> * "UTF-8"
> * "CP1252"
> * "ISO-8859-1"
> * "MacRoman"
> and I suppose "Binary" as the default if none of the above can be detected
> ----------------- MARK'S REPLY ----------------------------------------
> On 2019-09-13 16:44, Paul Dupuis wrote:
> > I have a LiveCode Script (LCS) routine that attempts to follow
> > industry common algorithms for guessing the encoding of a text file.
> > Its performance can be slower than I would like.
> If you share your code perhaps we can help speed it up...
> > This has led me to wonder if a LiveCode Builder (LCB) library may be
> > the route to go. Does anyone know the OSX and/or Windows APIs for
> > guessing a text file's encoding?
> > I have done a number of google searches, but I am not a C programmer
> > (not in many decades) and wading through the huge doc sets at MSDN or
> > Apple is daunting.
> > I found reference to a windows API:
> > BOOL IsTextUnicode(
> > const VOID *lpv,
> > int iSize,
> > LPINT lpiResult
> > );
> > Which suggests to me that such APIs may exist. Does anyone who is
> > better at finding OS APIs know where to find such APIs? Can you point
> > me to the right online documentation?
> Libraries certainly exist: Mozilla has a 'universal charset detector library' for example, which appears to use various statistical heuristics to tell between all kinds of encodings.
> The 'IsTextUnicode' API seems to just tell you whether a sequence of bytes is likely to be UTF-16 or not UTF-16; so probably won't be all that helpful if that isn't all you are wanting to distinguish between.
> Do you have a list of encodings you are needing to guess between? That will generally influence how fast (and accurate) you can make such a function (it's almost trivial to detect UTF-8 with a high degree of confidence, UTF-32 I think as well, UTF-16 is somewhat harder, and distinguishing between single-byte and legacy multi-byte charsets is, relatively speaking, very hard).
> Warmest Regards,
> P.S. This might be a better discussion to have on the use-list unless there is a reason not to, it might be of interest to others in that wider group.
> Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
> LiveCode: Everyone can create apps