Guess encoding for text file...

Paul Dupuis paul at researchware.com
Tue Sep 17 16:05:45 EDT 2019


I started this post on the DEV-LIST. Mark Waddingham kindly responded 
and smartly suggested I should move it to the USE-LIST, so that is what 
I am doing. I have also pasted Mark's reply below my original post.

---------------------- ORIGINAL POST ----------------------------------

I have a LiveCode Script (LCS) routine that attempts to follow industry 
common algorithms for guessing the encoding of a text file.

Its performance can be slower than I would like.

This has led me to wonder if a LiveCode Builder (LCB) library may be the 
route to go. Does anyone know the OSX and/or Windows APIs for guessing a 
text file's encoding?

I have done a number of Google searches, but I am not a C programmer 
(not in many decades) and wading through the huge doc sets at MSDN or 
Apple is daunting.

I found a reference to a Windows API:

BOOL IsTextUnicode( const VOID *lpv, int iSize, LPINT lpiResult );

Which suggests to me that such APIs may exist. Does anyone who is 
better at finding OS APIs know where to find such APIs? Can you point me 
to the right online documentation?

I also found this: 
https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding

Of course, it would be wonderful if the mothership delivered this. At 
one point, back around LC7 or so, Frasier said he would.

https://quality.livecode.com/show_bug.cgi?id=14474

It seems an LCB library that uses OS APIs to return a best guess for a 
file's encoding, matching the names used by the textEncode/textDecode 
functions, would be a great addition to LC:

  * "ASCII"
  * "UTF-16"
  * "UTF-16BE"
  * "UTF-16LE"
  * "UTF-32"
  * "UTF-32BE"
  * "UTF-32LE"
  * "UTF-8"
  * "CP1252"
  * "ISO-8859-1"
  * "MacRoman"

and I suppose "Binary" as the default if none of the above can be detected
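
As an illustration of the cheapest first step in most such routines, 
here is a minimal Python sketch that checks for a byte order mark (BOM) 
and maps it to the encoding names listed above. The function name 
encoding_from_bom is hypothetical, and this is not the LCS routine 
mentioned in this post. A file without a BOM returns None and would 
still need content-based guessing.

```python
import codecs

# Check the 4-byte BOMs before the 2-byte ones: the UTF-32LE BOM
# (FF FE 00 00) begins with the UTF-16LE BOM (FF FE), so order matters.
# Note the remaining ambiguity: FF FE 00 00 could also be UTF-16LE text
# whose first character is U+0000, which is rare in real text files.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def encoding_from_bom(data: bytes):
    """Return an encoding name if data starts with a known BOM, else None.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed for this check, so it 
costs almost nothing regardless of file size.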

----------------- MARK'S REPLY ----------------------------------------
On 2019-09-13 16:44, Paul Dupuis wrote:
 > I have a LiveCode Script (LCS) routine that attempts to follow
 > industry common algorithms for guessing the encoding of a text file.
 >
 > Its performance can be slower than I would like.

If you share your code perhaps we can help speed it up...

 > This has led me to wonder if a LiveCode Builder (LCB) library may be
 > the route to go. Does anyone know the OSX and/or Windows APIs for
 > guessing a text file's encoding?
 >
 > I have done a number of google searches, but I am not a C programmer
 > (not in many decades) and wading through the huge doc sets at MSDN or
 > Apple is daunting.
 >
 > I found a reference to a Windows API:
 >
 > BOOL IsTextUnicode(
 >   const VOID *lpv,
 >   int        iSize,
 >   LPINT      lpiResult
 > );
 >
 > Which suggests to me that such APIs may exist. Does anyone who is
 > better at finding OS APIs know where to find such APIs? Can you point
 > me to the right online documentation?

Libraries certainly exist: Mozilla has a 'universal charset detector 
library' for example, which appears to use various statistical 
heuristics to tell between all kinds of encodings.

The 'IsTextUnicode' API seems to just tell you whether a sequence of 
bytes is likely to be UTF-16 or not UTF-16; so probably won't be all 
that helpful if that isn't all you are wanting to distinguish between.
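
To give a feel for the kind of statistical test involved, here is a 
rough Python sketch of one common heuristic for BOM-less UTF-16: in 
text that is mostly in the Latin range, one byte of every 16-bit unit 
is zero, and which position holds the zeros reveals the byte order. 
This is an assumption-laden illustration, not a description of what 
IsTextUnicode actually does internally. The function name and both 
thresholds are hypothetical, and the test fails for text such as CJK, 
where few code units contain a zero byte.

```python
def guess_utf16_endianness(data: bytes, threshold: float = 0.4):
    """Guess "UTF-16LE" or "UTF-16BE" from where the zero bytes fall.

    Hypothetical helper; returns None when the pattern is inconclusive.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None

    # Count zero bytes at even and odd offsets, ignoring a trailing odd byte.
    even_zeros = data[0:pairs * 2:2].count(0)
    odd_zeros = data[1:pairs * 2:2].count(0)
    even_ratio = even_zeros / pairs
    odd_ratio = odd_zeros / pairs

    # Little-endian Latin text: low byte first, zero high byte second.
    if odd_ratio >= threshold and even_ratio < 0.1:
        return "UTF-16LE"
    # Big-endian Latin text: zero high byte first, low byte second.
    if even_ratio >= threshold and odd_ratio < 0.1:
        return "UTF-16BE"
    return None
```

Sampling only the first few kilobytes of a file is usually enough for a 
test like this, which helps keep it fast on large files.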

Do you have a list of encodings you are needing to guess between? That 
will generally influence how fast (and accurate) you can make such a 
function (it's almost trivial to detect UTF-8 with a high degree of 
confidence, UTF-32 I think as well, UTF-16 is somewhat harder, and 
distinguishing between single-byte and legacy multi-byte charsets is, 
relatively speaking, very hard).
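
As a hedged illustration of why UTF-8 and UTF-32 are the easy cases, 
here is a small Python sketch for data without a BOM. It relies on 
strict decoding: random or legacy single-byte data very rarely forms 
valid multi-byte UTF-8 sequences, and almost any non-UTF-32 data 
produces 4-byte values above U+10FFFF. The function name guess_bomless 
is hypothetical, and anything it cannot classify (UTF-16, CP1252, 
ISO-8859-1, MacRoman, binary) is simply returned as None.

```python
def guess_bomless(data: bytes):
    """Classify BOM-less data as UTF-32BE/LE, ASCII, or UTF-8, else None.

    Hypothetical helper for illustration only.
    """
    if not data:
        return None

    # UTF-32 first: ASCII-range UTF-32 text is full of NUL bytes, which
    # would otherwise pass as (odd-looking) valid ASCII or UTF-8.
    if len(data) % 4 == 0:
        for codec, label in (("utf-32-be", "UTF-32BE"), ("utf-32-le", "UTF-32LE")):
            try:
                data.decode(codec)  # strict: rejects values above U+10FFFF
            except UnicodeDecodeError:
                continue
            return label

    # NUL bytes at this point suggest UTF-16 or binary data.
    if b"\x00" in data:
        return None

    # Pure 7-bit data is ASCII (and therefore also valid UTF-8).
    if all(byte < 0x80 for byte in data):
        return "ASCII"

    # Strict UTF-8 decoding fails on malformed multi-byte sequences.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return "UTF-8"
```

A None result is where the hard part begins: telling CP1252, 
ISO-8859-1, and MacRoman apart needs statistical or language-based 
heuristics, since nearly every byte sequence is valid in each of them.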

Warmest Regards,

Mark.

P.S. This might be a better discussion to have on the use-list, unless 
there is a reason not to; it might be of interest to others in that 
wider group.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
