Guess encoding for text file...
Paul Dupuis
paul at researchware.com
Tue Sep 17 16:05:45 EDT 2019
I started this post on the DEV-LIST. Mark Waddingham kindly responded
and smartly suggested I should move it to the USE-LIST, so that is what
I am doing. I have also pasted Mark's reply below my original post.
---------------------- ORIGINAL POST ----------------------------------------
I have a LiveCode Script (LCS) routine that attempts to follow industry
common algorithms for guessing the encoding of a text file.
Its performance can be slower than I would like.
This has led me to wonder if a LiveCode Builder (LCB) library may be the
route to go. Does anyone know the OSX and/or Windows APIs for guessing a
text file's encoding?
I have done a number of google searches, but I am not a C programmer
(not in many decades) and wading through the huge doc sets at MSDN or
Apple is daunting.
I found a reference to a Windows API:
BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );
Which suggests to me that such APIs may exist. Does anyone who is
better at finding OS APIs know where to find such APIs? Can you point me
to the right online documentation?
I also found this:
https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding
Of course, it would be wonderful if the mothership delivered this. At
one point, back around LC7 or so, Frasier said he would.
https://quality.livecode.com/show_bug.cgi?id=14474
It seems an LCB library that uses OS APIs to return a best guess for a
file's encoding, using names that match up with the textEncode/textDecode
functions, would be a great addition to LC:
* "ASCII"
* "UTF-16"
* "UTF-16BE"
* "UTF-16LE"
* "UTF-32"
* "UTF-32BE"
* "UTF-32LE"
* "UTF-8"
* "CP1252"
* "ISO-8859-1"
* "MacRoman"
and I suppose "Binary" as the default if none of the above can be detected.
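
As an illustration of the cheapest first step in most guessing routines, here is a minimal sketch of byte-order-mark (BOM) sniffing. It is written in Python purely for illustration; it is not LCS or LCB, and the function name guess_from_bom is hypothetical. The names it returns are taken from the list above; which of them should be handed to textDecode for a file that carries a BOM is an assumption left open here.

```python
import codecs

# (BOM bytes, encoding name) pairs. UTF-32 is checked before UTF-16 because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def guess_from_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM returns None and has to fall through to content-based checks. One known ambiguity: UTF-16LE text whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is rare in real text files.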
----------------- MARK'S REPLY ----------------------------------------
On 2019-09-13 16:44, Paul Dupuis wrote:
> I have a LiveCode Script (LCS) routine that attempts to follow
> industry common algorithms for guessing the encoding of a text file.
>
> Its performance can be slower than I would like.
If you share your code perhaps we can help speed it up...
> This has led me to wonder if a LiveCode Builder (LCB) library may be
> the route to go. Does anyone know the OSX and/or Windows APIs for
> guessing a text file's encoding?
>
> I have done a number of google searches, but I am not a C programmer
> (not in many decades) and wading through the huge doc sets at MSDN or
> Apple is daunting.
>
> I found reference to a windows API:
>
> BOOL IsTextUnicode(
> const VOID *lpv,
> int iSize,
> LPINT lpiResult
> );
>
> Which suggests to me that such APIs may exist. Does anyone who is
> better at finding OS APIs know where to find such APIs? Can you point
> me to the right online documentation?
Libraries certainly exist: Mozilla has a 'universal charset detector
library' for example, which appears to use various statistical
heuristics to tell between all kinds of encodings.
The 'IsTextUnicode' API seems to just tell you whether a sequence of
bytes is likely to be UTF-16 or not UTF-16; so probably won't be all
that helpful if that isn't all you are wanting to distinguish between.
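
To give a feel for the kind of test involved, here is a rough sketch of one common heuristic for spotting BOM-less UTF-16: counting where the zero bytes fall. It is written in Python for illustration only, it is not how IsTextUnicode is implemented, and the function name and both thresholds are assumptions. It only works for mostly Latin-script text, where one byte of nearly every 16-bit unit is zero.

```python
def looks_like_utf16(data: bytes, threshold: float = 0.4):
    """Guess "UTF-16LE" or "UTF-16BE" from zero-byte positions, else None.

    Assumes mostly Latin-script text; text in other scripts has few zero
    bytes and will not be recognised by this check.
    """
    if len(data) < 2 or len(data) % 2:
        return None  # UTF-16 data is a whole number of 2-byte units
    units = len(data) // 2
    even_zeros = data[0::2].count(0) / units  # high bytes if big-endian
    odd_zeros = data[1::2].count(0) / units   # high bytes if little-endian
    # Require many zeros in one position and almost none in the other.
    if odd_zeros >= threshold and even_zeros < 0.1:
        return "UTF-16LE"
    if even_zeros >= threshold and odd_zeros < 0.1:
        return "UTF-16BE"
    return None
```

Because the answer depends on the script of the text, a check like this is a hint rather than a proof, which is part of why UTF-16 is harder to detect than UTF-8.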
Do you have a list of encodings you are needing to guess between? That
will generally influence how fast (and accurate) you can make such a
function (it's almost trivial to detect UTF-8 with a high degree of
confidence, UTF-32 I think as well, UTF-16 is somewhat harder, and
distinguishing between single-byte and legacy multi-byte charsets is,
relatively speaking, very hard).
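
The "easy" cases can be sketched as plain validity checks. This is a minimal illustration in Python, not LCS or LCB, and every function name in it is hypothetical. It relies on the fact that text in a legacy single-byte charset which contains non-ASCII characters very rarely happens to form valid UTF-8 sequences, and that arbitrary bytes almost never form valid UTF-32 code points.

```python
def is_ascii(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


def utf32_byte_order(data: bytes):
    """Return "UTF-32LE" or "UTF-32BE" if every 4-byte unit is valid."""
    if not data or len(data) % 4:
        return None
    for name, codec in (("UTF-32LE", "utf-32-le"), ("UTF-32BE", "utf-32-be")):
        try:
            data.decode(codec)
            return name
        except UnicodeDecodeError:
            pass
    return None


def guess_unicode_encoding(data: bytes):
    """Try UTF-32, then ASCII, then UTF-8; None means 'keep guessing'."""
    # UTF-32 goes first: UTF-32 text in the ASCII range is all bytes < 0x80
    # and would otherwise be reported as ASCII.
    return (utf32_byte_order(data)
            or ("ASCII" if is_ascii(data) else None)
            or ("UTF-8" if is_valid_utf8(data) else None))
```

A result of None is where the hard part starts: telling CP1252, ISO-8859-1 and MacRoman apart needs statistical or language-based guessing, since almost any byte sequence is valid in all three.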
Warmest Regards,
Mark.
P.S. This might be a better discussion to have on the use-list, unless
there is a reason not to; it might be of interest to others in that
wider group.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
_______________________________________________
livecode-dev mailing list
livecode-dev at lists.runrev.com
http://lists.runrev.com/mailman/listinfo/livecode-dev