xml and utf-8 attribute

Mark Waddingham 36degrees at runrev.com
Sun Jan 15 14:50:37 EST 2006


> I'm having a lot of trouble getting the right characters to appear for an
> XML attribute. The file loads correctly and displays the English words with
> no problems but the translation attribute returns garbled characters. The
> XML is encoded as UTF-8. I've tried implementing the various suggestions
> I've found in the archives re uniDecode / htmlText etc but am having no
> luck. The languages I am trying to use are Spanish, French and Polish. Any
> pointers or links to tutorials would be very helpful. This is the typical
> format of the XML file:
> 
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <player>
> <snd path="mp3/4.mp3" title="Spain"  desc="Member Countries of the European
> Union" trans="España" pic="img/es.jpg"/>
> </player>
> 
> This is the Transcript code that pushes the 'trans' attribute into the field:
> put revXMLAttribute(tDocID2,"/player/snd["&sndID&"]","trans") into field
> "trans"
> 
> And this is the typical output for non English characters:
> 
> España

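This is what you should expect to get: the string is valid UTF-8, but it
is being displayed as if it were the native 1-byte encoding, so each
2-byte UTF-8 sequence (such as the one for 'ñ') shows up as two
characters.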

Try
  put revXMLAttribute(tDocID2,"/player/snd["&sndID&"]","trans") \
    into tUTF8Text
  put uniDecode(uniEncode(tUTF8Text, "UTF8")) into field "trans"

Now the explanation...

Your XML file is specified to have the encoding 'UTF-8'. This means that
the XML parser will attempt to interpret the contents of the file as
being valid UTF-8 text - in particular, numeric character references
(such as &#241;), which are taken as Unicode code points, will be
resolved to the appropriate sequence of characters for UTF-8.

revXML doesn't mess with the encoding at all - so what you get back from
the revXML* functions (in your case) will be UTF-8 encoded strings.

Therefore, to convert to the platform native 1-byte encoding (MacRoman
on MacOS, Latin-1 on Windows and ISO8859-1 on Unix/Linux) you first need
to convert to UTF-16:
  put uniEncode(tUTF8Text, "UTF8") into tUTF16Text
and then convert back to the standard 1-byte encoding:
  put uniDecode(tUTF16Text) into tNativeText

(Calling the uniDecode function without a second parameter means that
the target encoding is the native 1-byte one appropriate to the running
platform)
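
If you need this in more than one place, the two steps above can be
wrapped in a small function. This is only a sketch: the name
utf8ToNative is made up for illustration and is not part of revXML, and
it assumes the text really is UTF-8 and that every character exists in
the native encoding (characters that don't will not come through
intact).

```livecode
-- Hypothetical helper (not part of revXML): converts a UTF-8 string,
-- e.g. one returned by revXMLAttribute, to the native 1-byte encoding.
function utf8ToNative pUTF8Text
  local tUTF16Text
  -- Step 1: UTF-8 -> UTF-16
  put uniEncode(pUTF8Text, "UTF8") into tUTF16Text
  -- Step 2: UTF-16 -> native 1-byte encoding (no second parameter)
  return uniDecode(tUTF16Text)
end utf8ToNative
```

It would then be used along the lines of:
  put revXMLAttribute(tDocID2,"/player/snd["&sndID&"]","trans") \
    into tUTF8Text
  put utf8ToNative(tUTF8Text) into field "trans"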

Warmest Regards,

Mark.

------------------------------------------------------------------
 Mark Waddingham ~ 36degrees at runrev.com ~ http://www.runrev.com
       Runtime Revolution ~ User-Centric Development Tools



