XML and Unicode

Dar Scott dsc at swcp.com
Thu Jun 13 11:33:20 EDT 2013


You can often find the encoding of an XML file in the XML declaration.  Look for the encoding attribute:

<?xml version="1.0" encoding="UTF-8"?>

Or, since you say the file is "specific", simply assume the encoding you believe it uses.
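
For illustration, here is a minimal sketch of pulling the encoding out of the declaration.  It is in Python rather than LiveCode, and the names declared_encoding and read_xml_text are made up for this example.  It assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 (not UTF-16) and falls back to UTF-8 when no encoding is declared.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the data, e.g. <?xml version="1.0" encoding="UTF-8"?>
_DECL = re.compile(rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

_UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration, or default if none."""
    # A UTF-8 byte order mark may precede the declaration; skip it.
    if data.startswith(_UTF8_BOM):
        data = data[len(_UTF8_BOM):]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def read_xml_text(data):
    """Decode raw XML bytes using the declared (or default) encoding."""
    encoding = declared_encoding(data)
    if data.startswith(_UTF8_BOM):
        data = data[len(_UTF8_BOM):]
    return data.decode(encoding)
```

The input is the raw bytes of the file, read without any text conversion, so the declaration is examined before anything is decoded.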

Converting to your native encoding is lossy: you might lose characters.  If that is a concern, you can check by converting back and seeing whether you get the same thing.  If you are careful, that might work OK, but I would lean toward processing UTF-8.
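
The round-trip check can be sketched like this, again in Python as an illustration only.  The function names are hypothetical, and cp1252 stands in for "your native encoding"; substitute whatever your platform really uses.  Note that Python raises an error for a character it cannot convert, where other tools might silently substitute something like "?".

```python
def survives(text, encoding="cp1252"):
    """Return True if text converts to the encoding and back unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # At least one character has no representation in the encoding.
        return False


def lost_characters(text, encoding="cp1252"):
    """List the distinct characters the encoding cannot represent,
    in order of first appearance."""
    lost = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in lost:
                lost.append(ch)
    return lost
```

French text such as "19ème et 20ème siècles" survives a trip through cp1252, but a string containing, say, a Greek letter or an arrow does not.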

Whether you read that in with binfile: or file: depends on how you are using the data.  
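
To show why the distinction between raw bytes and interpreted text matters, here is a small Python sketch (not LiveCode).  It assumes the file really is UTF-8 and that the wrong interpretation is cp1252; your platform's native encoding may differ, and the garbage will look different if so.

```python
# What is actually stored in a UTF-8 file: "è" takes two bytes.
utf8_bytes = "19ème et 20ème siècles".encode("utf-8")

# Interpreting those bytes with the wrong single-byte encoding turns
# each two-byte character into two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

# Interpreting them with the encoding the file really uses is correct.
correct = utf8_bytes.decode("utf-8")

# As long as no bytes were dropped, the damage can be undone by
# reversing the wrong step and then decoding properly.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)
print(repaired)
```

Reading the file as raw bytes and decoding explicitly keeps that choice in your hands instead of leaving it to the platform default.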


I recommend against using the native encoding for your machine for HTML if this tool is to be portable or handle a wide range of characters.  Be explicit and specific.  It is easy to get things working on your machine and then have it fail on other people's computers.

Here are a couple of approaches for HTML:

1.  Use ASCII for the HTML file.  Convert characters not in ASCII to HTML character references.  If HTML readability (even for testing) is important then use character entity names for the common ones.  You can display the ASCII directly in fields.  If the source or intermediate strings are UTF-8, display those by converting to Unicode and setting the unicodeText property of the field.

2.  Specify the character encoding of the HTML file in the file itself, like this:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If your file is UTF-8, you might want to consider this approach.  The HTTP server might remove that tag and replace it with a similar HTTP header.  You might not have to worry about special characters this way.
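
Both approaches can be sketched in a few lines of Python (an illustration, not LiveCode).  The names to_ascii_html and utf8_html_page are hypothetical.  The first function only handles non-ASCII characters; it does not escape "&", "<" or ">", which would need separate treatment.

```python
from html.entities import codepoint2name


def to_ascii_html(text, use_names=True):
    """Approach 1: return ASCII-only text, with each non-ASCII character
    replaced by an HTML character reference."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII passes through
        elif use_names and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")   # e.g. &egrave;
        else:
            out.append("&#%d;" % cp)             # numeric reference
    return "".join(out)


def utf8_html_page(title, body):
    """Approach 2: return the bytes of a UTF-8 page that declares its
    own encoding in a meta tag."""
    page = (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "<title>%s</title>\n"
        "</head><body>\n%s\n</body></html>\n"
    ) % (title, body)
    return page.encode("utf-8")
```

With the first approach "siècles" becomes "si&egrave;cles" (or "si&#232;cles" with use_names=False) and the file can be written as plain ASCII.  With the second, the text is left alone and the result must be written as raw bytes so that nothing re-encodes it on the way out.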


If it turns out that working with UTF-8 for intermediate values is the right thing, you can always display a value by converting it to Unicode (LiveCode) and then setting the unicodeText property of the appropriate field to the result.  I don't know how the available XML tools in LiveCode handle UTF-8.

Dar


On Jun 13, 2013, at 6:54 AM, Pascal Lehner wrote:

> Hi everyone,
> 
> I am working on a small tool to convert a specific XML file into several html
> pages.
> The whole thing looks good and should be working fine eventually. However,
> there is one thing I did not figure out yet: Unicode.
> 
> My XML is partially french and has a lot of symbols like this:
>       <sitename>Personnes du monde rural aux 19ème et 20ème
> siècles</sitename>
> 
> however, when I read this file into Livecode and take it apart, I get stuff
> like this in my variables:
> 
>        Personnes du monde rural aux 19ème et 20ème siècles
> Can someone help me change these symbols back to their originals?
> I will need them in an html file in the end. I guess that's
> some Unicode issue but I am somehow lost as to when and how best to change the
> strings.
> 
> Thanks a lot :-)
> 
> Pascal
> 
> --
> 
> Pascal Lehner
> 
> 147/1 St Leonards Street
> Edinburgh
> EH8 9RB
> United Kingdom

---------------------------
Dar Scott
dba 
Dar Scott Consulting
8637 Horacio Place NE
Albuquerque, NM 87111

Lab, home, office phone: +1 505 299 9497
For Skype and fax, please contact.
dsc at swcp.com

Computer programming and tinkering,
usually in supporting those developing in 
LiveCode--typically by making LiveCode 
controls, libraries and externals, and
sometimes by writing associated
microcontroller firmware.  
---------------------------

We can not force our goodwill on anyone, we can only set good examples and hope people wish to emulate us. --Ron Paul








