Loading and 'normalising' text from UTF files
benr_mc at cogapp.com
Fri Jun 27 05:15:10 CDT 2008
I'm thinking this is a frequent requirement, such that I'm hoping someone may
have a standard routine for it.
I'm dealing with basically plain text files. But "basically plain text" here
means English with a few extras such as smart quotes, and possibly some of the
more common Western European accented characters; in other words, characters
that are outside ASCII, but within MacRoman and Windows Latin 1/ISO-8859-1.
My app gets these files from various sources, but they've all ended up on a
Mac. However, some are MacRoman and some are UTF-8. To date I've been loading
all the files with URL "file:...", which of course mangles the UTF-8 ones.
These particular files are UTF-8 with no BOM. I can probably code a routine to
deal with this particular case, e.g. by opportunistically searching for the
UTF-8 byte sequence for a smart apostrophe that I happen to know will appear
in all the instances I'm currently dealing with.
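For what it's worth, that opportunistic check can be sketched in a few lines (Python here rather than LiveCode, purely for illustration): if the raw bytes contain the UTF-8 sequence for a right single quotation mark (U+2019, bytes E2 80 99), treat the file as UTF-8; otherwise fall back to MacRoman. The function names are my own invention.

```python
# The smart apostrophe (U+2019) encoded as UTF-8.
SMART_APOSTROPHE_UTF8 = b"\xe2\x80\x99"

def sniff_by_apostrophe(raw: bytes) -> str:
    """Guess the encoding of raw file bytes using the smart-apostrophe trick.

    This only works because we happen to know every file contains at
    least one smart apostrophe; it is not a general detector.
    """
    if SMART_APOSTROPHE_UTF8 in raw:
        return "utf-8"
    return "mac_roman"

def load_text(path: str) -> str:
    """Read a file as binary and decode it with the guessed encoding."""
    with open(path, "rb") as f:
        raw = f.read()
    return raw.decode(sniff_by_apostrophe(raw))
```

Obviously this falls over as soon as a UTF-8 file arrives without an apostrophe, which is why a more general heuristic would be nicer.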
But it made me wonder whether there is a general algorithm (or, more
realistically, heuristic) for recognising the encoding of a file, and whether
anyone has coded a general "load text file" routine: one that loads a file as
binary, establishes the encoding, and normalises the content, so that it can
be called on files in MacRoman or Windows Latin 1, UTF-8, or UTF-16 with or
without a BOM, etc., returning the same result (or as close as can be) in
each case.
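The usual heuristic for exactly this mix of encodings goes: honour a BOM if one is present; otherwise try a strict UTF-8 decode, since non-trivial text is rarely valid UTF-8 by accident; and only then fall back to a single-byte encoding such as MacRoman. A minimal sketch under those assumptions (again Python for illustration, and decode_text/load_text_file are names I've made up):

```python
import codecs

def decode_text(raw: bytes, fallback: str = "mac_roman") -> str:
    """Decode raw file bytes, guessing among BOM'd Unicode, UTF-8, and a
    single-byte fallback encoding."""
    # 1. Explicit BOMs are unambiguous, so check them first.
    #    "utf-8-sig" and "utf-16" both strip the BOM for us.
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16"),
                     (codecs.BOM_UTF16_BE, "utf-16")):
        if raw.startswith(bom):
            return raw.decode(enc)
    # 2. No BOM: a strict UTF-8 decode almost never succeeds on text
    #    that was actually written in a single-byte encoding.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Otherwise assume the single-byte fallback (MacRoman here;
        #    swap in "latin-1" for files known to come from Windows).
        return raw.decode(fallback)

def load_text_file(path: str) -> str:
    """Load a text file as binary and normalise it to a Unicode string."""
    with open(path, "rb") as f:
        return decode_text(f.read())
```

The one genuinely ambiguous case is BOM-less single-byte text, where MacRoman and Latin 1 assign different characters to the same bytes; there you need out-of-band knowledge (or frequency statistics) to choose the fallback.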
If nobody has an actual routine that I can just steal, does anyone have tips
for how to guess the encoding?