Loading and 'normalising' text from UTF files

Ben Rubinstein benr_mc at cogapp.com
Fri Jun 27 06:15:10 EDT 2008


I suspect this is a frequent requirement, so I'm hoping someone already has a
standard routine for it.

I'm dealing with basically plain text files.  But "basically plain text" means
English with a few things such as smart quotes and possibly a few of the more
common western European accented characters; in other words, characters that
are outside ASCII, but within Mac Roman and Windows Latin 1/ISO-8859-1.

My app is getting these files from various sources, but they've all ended up
on a Mac.  However, some are MacRoman and some are UTF-8.  To date I've been
loading all the files with 'url "file:..."', which of course mangles the
UTF-8 ones.  These particular files are UTF-8 with no BOM.  I can probably
code a routine to deal with this particular case, e.g. by opportunistically
searching for the UTF-8 byte sequence of a smart apostrophe that I happen to
know will appear in all the instances I'm currently dealing with.
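(For what it's worth, the specific hack I have in mind could look something
like the following -- sketched in Python rather than LiveCode, just to show
the byte-level idea.  U+2019, the right single quotation mark, encodes as the
three bytes E2 80 99 in UTF-8, whereas MacRoman uses the single byte D5 for
the same character, so the three-byte sequence is a decent tell.  The function
name is mine, not from any library:)

```python
def looks_like_utf8_smart_quote(path):
    """Crude per-file heuristic: True if the file contains the UTF-8
    encoding of a smart apostrophe (U+2019 -> bytes E2 80 99).
    In MacRoman the same character would be the single byte D5, so
    this sequence strongly suggests UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    return b"\xe2\x80\x99" in data
```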

But it made me wonder if there is a general algorithm^H^H^H^H^H^H^H^H^H
heuristic, I'd guess, for recognising the encoding of a file, and whether
anyone has coded a general "load text file" routine: one that loads a file as
binary, establishes the encoding, and normalises the content, so that it can
be called on files in Mac Roman or Windows Latin 1, UTF-8 or UTF-16 with or
without BOM, etc, returning the same result (or as close as can be) in each
case?
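(The shape of such a routine, as I imagine it, would be roughly the
following -- again a Python sketch rather than LiveCode, and the fallback
choices are assumptions: a BOM check first, then a strict UTF-8 decode as a
validity test, then a guess between the 8-bit encodings, which genuinely
cannot be told apart with certainty:)

```python
import codecs

# BOMs checked longest-first; the "utf-16" codec consumes its own BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

def load_text(path):
    """Load a file as binary, guess its encoding, return unicode text."""
    with open(path, "rb") as f:
        data = f.read()
    # 1. An explicit BOM settles it.
    for bom, enc in BOMS:
        if data.startswith(bom):
            return data.decode(enc)
    # 2. No BOM: valid UTF-8 is very unlikely to occur by accident in
    # MacRoman/Latin-1 text, so a strict decode doubles as a detector.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass
    # 3. Plain 8-bit text: MacRoman vs Latin-1 is ambiguous (every byte is
    # legal in both).  Since these files ended up on a Mac, assume MacRoman.
    return data.decode("mac_roman")
```

Step 3 is the irreducibly heuristic part: any byte sequence decodes "successfully" in both MacRoman and Latin-1, so the best one can do there is a frequency-based guess or a per-source convention.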

If nobody has an actual routine that I can just steal, does anyone have tips
for how to guess the encoding?

TIA,

- Ben




More information about the use-livecode mailing list