Guessing the encoding of a text file...

Sean Cole (Pi) sean at pidigital.co.uk
Thu Mar 19 21:47:47 EDT 2020


You won't want to hear this, but unfortunately on Windows you are out of
luck. Text files do not, of themselves, have the encoding embedded in them,
apart from an optional byte order mark at the start of some Unicode files.
Once written, a file is just a series of bytes, with each character taking
one or more of them. Whether you open it as a binfile or as a straight file,
the bytes are the same. It is the lowest common denominator of storage
formats. Guessing the text encoding is one of those things that has to be
handled either by a human or by AI/ML.
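
As a rough illustration of what can still be checked cheaply on any
platform, here is a minimal Python sketch (not LiveCode, and not the routine
discussed in this thread). It looks for a byte order mark, then tests whether
the bytes are valid ASCII or UTF-8, and otherwise falls back to CP1252. The
function name guess_encoding and the CP1252 fallback are assumptions made for
the example. The returned strings follow the list of names quoted below;
whether textDecode wants the mark stripped first is not something this sketch
settles.

```python
import codecs

# Byte order marks, longest first, so the UTF-32LE mark (FF FE 00 00)
# is not mistaken for the UTF-16LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def guess_encoding(data: bytes) -> str:
    """Return a best guess at the encoding of raw file bytes."""
    # 1. A byte order mark, when present, is the only explicit hint.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 2. Only bytes below 0x80: plain ASCII.
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    # 3. Valid UTF-8 is unlikely to occur by accident in legacy 8-bit text.
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        pass
    # 4. Assumed fallback: a single-byte legacy encoding. Telling CP1252
    #    from MacRoman would need statistical checks not attempted here.
    return "CP1252"
```

This remains a guess, as the rest of this message says. UTF-16 or UTF-32
text without a byte order mark is not recognised (plain English text in
BOM-less UTF-16 even passes the ASCII test, because NUL bytes are valid
ASCII), and any 8-bit file that is not valid UTF-8 is reported as CP1252.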

All the best

Sean Cole
*Pi Digital*


On Thu, 19 Mar 2020 at 23:46, Paul Dupuis via use-livecode <
use-livecode at lists.runrev.com> wrote:

> Users of our application may use text files in whatever encoding their
> local system creates them in. We cannot tell them to only create such
> files with a specific encoding. So, we need to detect the encoding of
> the text file the user selects.
>
> As I mentioned, I have an LC script that implements an encoding-guessing
> algorithm. I am looking for an alternative or better one if someone out
> there happened to have created one they might like to share or license.
>
> Any such routine needs to work on macOS and Windows and return the types
> used by the LC textDecode function.
>
> I already knew about file on OS X, but I need a cross-platform solution.
>
>
> On 3/19/2020 6:15 PM, Pi Digital via use-livecode wrote:
> > On a Mac it’s easy. Use
> > file -I "MyFile.txt"
> >   as a shell command.
> >
> > On Windows it’s near impossible without running a whole bunch of
> > arbitrary tests that may or may not be correct - certainly not accurate.
> >
> > What kind of text were you hoping to see? Were you looking for a
> > particular encoding? If it is grammatical text there are a bunch of runs
> > you can do to see what character sets are used, but even then it’s only a
> > ‘probably’/‘possibly’ response.
> >
> > Sean Cole
> > Pi Digital
> >
> >
> >> On 19 Mar 2020, at 20:31, Paul Dupuis via use-livecode <
> use-livecode at lists.runrev.com> wrote:
> >>
> >> This has come up many times before, but I'll ask once again in case
> something has changed or someone new sees this.
> >>
> >>
> >> Does anyone have a routine that will take a filespec to a text file and
> return the guessed encoding of the text file?
> >>
> >>
> >> First, please don't respond with "you should know the encoding" or "the
> >> users should know the encoding of their files". That is not possible in
> >> the widely uncontrolled real world.
> >>
> >> I do already have a routine to guess file encodings. It was written by
> someone else. There are instances where it should work and does not. I fear
> there may be errors in the algorithm and I do not have the original
> algorithm to check it against. Hence, I am looking for an alternative that
> is either free to use or to be licensed for a modest fee.
> >>
> >> My current routine attempts to return the encoding as a string that can
> be directly passed to textDecode(binaryData,encoding)
> >>
> >> "ASCII"
> >> "UTF-16"
> >> "UTF-16BE"
> >> "UTF-16LE"
> >> "UTF-32"
> >> "UTF-32BE"
> >> "UTF-32LE"
> >> "UTF-8"
> >> "CP1252" *
> >> "MacRoman" *
> >>
> >> * for these last 2, if the file is MacRoman on a Windows system, you
> actually have to textDecode(macToISO(data),"CP1252") and if you have CP1252
> on the Mac, you need to do textDecode(isoToMac(data),"MacRoman"). There is
> an enhancement request to support MacRoman decoding under Windows and vice
> versa at https://quality.livecode.com/show_bug.cgi?id=22391 if you want
> to CC yourself to show interest.
> >>
> >>
> >> _______________________________________________
> >> use-livecode mailing list
> >> use-livecode at lists.runrev.com
> >> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> >> http://lists.runrev.com/mailman/listinfo/use-livecode


