Detecting UTF-8 Encoded Files

Klaus Major klaus at major.on-rev.com
Fri Aug 7 10:01:13 EDT 2009


Hi Ken,

I do not see any problem (and wouldn't if there were ;-)
but Mark Waddingham once helped me out with a working function
for determining exactly how a vCard is encoded!

Here it is, including Mark's (very helpful) comments:

# vCards are stored as text files; however, the text encoding used
# varies depending on the program that exported them.
# We use the following heuristic to detect encoding:
# 1) If there is the byte order mark 0xFEFF then we assume UTF-16BE
# 2) If there is the byte order mark 0xFFFE then we assume UTF-16LE
# 3) If the first byte is 0x00 then we assume UTF-16BE (compatibility
#    with Tiger Address Book)
# 4) Otherwise we assume UTF-8
function vcf_convert3format tBinaryVCard
   # First load the vCard as binary data - at this stage we don't know
   # the text encoding of the file, and loading it as text would cause
   # inappropriate line ending conversion.
   # This variable will hold the vCard encoded in MacRoman (the
   # default text encoding Revolution uses on Mac OS X)
   local tNativeVCard
   # These hold the detected encoding name and the final text
   local tTextEncoding, tTextVCard

   # We now do our checks to detect text encoding
   switch
   case charToNum(char 1 of tBinaryVCard) = 0
     put "UTF16BE" into tTextEncoding
     break
   case charToNum(char 1 of tBinaryVCard) = 0xFE and charToNum(char 2 of tBinaryVCard) = 0xFF
     delete char 1 to 2 of tBinaryVCard
     put "UTF16BE" into tTextEncoding
     break
   case charToNum(char 1 of tBinaryVCard) = 0xFF and charToNum(char 2 of tBinaryVCard) = 0xFE
     delete char 1 to 2 of tBinaryVCard
     put "UTF16LE" into tTextEncoding
     break
   default
     put "UTF8" into tTextEncoding
     break
   end switch

   if tTextEncoding begins with "UTF16" then
     # Work out the processor's byte order
     local tHostByteOrder
     if the processor is "x86" then
       put "LE" into tHostByteOrder
     else
       put "BE" into tHostByteOrder
     end if

     # If the byte orders don't match, switch the order of pairs of bytes
     if char -2 to -1 of tTextEncoding <> tHostByteOrder then
       put swapbytes(tBinaryVCard) into tBinaryVCard
     end if

     # Decode the UTF-16 to native
     put uniDecode(tBinaryVCard) into tNativeVCard
   else
     # Use the standard uniDecode/uniEncode pair to decode the UTF-8 encoding
     put uniDecode(uniEncode(tBinaryVCard, "UTF8")) into tNativeVCard
   end if

   # We now need to normalize line endings to make sure all lines
   # terminate in 'return' (numToChar(10)).
   put tNativeVCard into tTextVCard

   # First replace Windows CR-LF style endings
   replace numToChar(13) & numToChar(10) with return in tTextVCard

   # Now replace Mac OS CR style endings
   replace numToChar(13) with return in tTextVCard
   return mac2win(tTextVCard)
end vcf_convert3format
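
Note that swapbytes is called above but its definition is not included in this post. If it is not available in your environment, a minimal sketch of such a helper could look like the following. This is a hypothetical stand-in, not Mark's original, and it assumes an engine like Revolution where one char of binary data is one byte:

```livecode
# Hypothetical helper (not from the original post): swaps each pair of
# bytes, turning UTF-16BE data into UTF-16LE and vice versa.
# Assumes one char equals one byte, as in Revolution.
function swapbytes pData
   local tSwapped, tIndex
   # Walk the data two bytes at a time; a trailing odd byte is dropped
   repeat with tIndex = 1 to (the number of chars of pData) - 1 step 2
      put char (tIndex + 1) of pData & char tIndex of pData after tSwapped
   end repeat
   return tSwapped
end swapbytes
```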

***
Here is my function "mac2win" that we use in our cross-platform project,
where we store EVERYTHING in ISO format!
function mac2win was
   if the platform = "MacOS" then
     return mactoiso(was)
   else
     return was
   end if
end mac2win
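
As the comments in the function say, the vCard has to be read as binary data before it is passed in. A short usage sketch, assuming a button script and a field named "vCardText" (both hypothetical, not part of the original code):

```livecode
on mouseUp
   local tPath, tBinaryVCard
   answer file "Select a vCard:"
   if it is empty then exit mouseUp
   put it into tPath
   # "binfile:" reads the raw bytes without any line ending conversion
   put URL ("binfile:" & tPath) into tBinaryVCard
   # "vCardText" is a hypothetical field used only for this example
   put vcf_convert3format(tBinaryVCard) into field "vCardText"
end mouseUp
```

Reading with "file:" instead of "binfile:" would translate line endings first, which would corrupt UTF-16 data before the detection runs.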

Hope that helps!


Best

Klaus

--
Klaus Major
http://www.major-k.de
klaus at major.on-rev.com



