Detecting Unicode text

Sarah Reichelt sarah.reichelt at gmail.com
Thu Nov 17 22:18:07 EST 2005


On 11/18/05, Trevor DeVore <lists at mangomultimedia.com> wrote:
> On Nov 17, 2005, at 5:37 PM, Sarah Reichelt wrote:
> > If I UniDecode the text, it comes good except for a weird character at
> > the start which I can handle, but is there a neat way to detect the
> > encoding of text before I start? I suppose I can just look for the
> > word "Subject" and if it isn't there, uniDecode and try again, but it
> > seems there should be a way to detect the encoding of the text itself.
> >
> > Does the weird stuff at the start give me any clues? Checking the
> > ASCII codes, the text starts with ASCII 254, ASCII 255, space and then
> > the first character of my text. Perhaps that's my answer, but will
> > they always be 254 & 255 or does that vary with the encoding?
> >
> > Any ideas?
>
> Hi Sarah,
>
> The "weird stuff" at the beginning is the BOM.  This tells
> applications opening the file what kind of UTF file you are dealing
> with.  Now, I'm not sure how to decipher each BOM but perhaps Google
> will know the answer.
>

Thanks Trevor, that told me what to look for and provided the answer.
Here is a quote from chapter 15 of the Unicode 4.0 book
<http://www.unicode.org/versions/Unicode4.0.0/>:

"In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file
or stream explicitly signals the byte order.
The byte sequence <FE₁₆ FF₁₆> may serve as a signature to identify a
file as containing Unicode text. This sequence is exceedingly rare at
the outset of text files using other character encodings, whether
single- or multiple-byte, and therefore not likely to be confused with
real text data."

So I think I can be quite safe if I check whether a file starts with
numToChar(254) & numToChar(255) and UniDecode the text if it does.
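
To answer my own earlier question: the mark does vary with the encoding.
FE FF is big-endian UTF-16, FF FE is little-endian UTF-16, and EF BB BF
is UTF-8. A slightly more general check could look something like the
sketch below. It is only an illustration, written in Python rather than
Revolution script; the names detect_bom and decode_text and the latin-1
fallback are my own assumptions, and UTF-32 marks are not handled.

```python
import codecs

# Byte-order marks this sketch recognises, paired with the codec to use
# once the mark has been stripped. UTF-32 is deliberately left out.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF (254, 255)
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE (255, 254)
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM leads the data."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


def decode_text(data: bytes, fallback: str = "latin-1") -> str:
    """Decode using the BOM if one is present, else the assumed fallback encoding."""
    encoding, skip = detect_bom(data)
    # Drop the BOM bytes so they do not show up as a stray leading character.
    return data[skip:].decode(encoding or fallback)
```

Stripping the mark before decoding is what gets rid of the weird
character at the start; text without a mark is passed through the
fallback encoding unchanged.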

ATB,
Sarah


