Passing UTF-8 through variables

Dar Scott dsc at swcp.com
Sun Feb 27 22:57:09 EST 2005


On Feb 27, 2005, at 7:11 PM, Sivakatirswami wrote:

> Ok, I have a text transformation challenge: One member of our team is 
> working on an index for a book that has diacritical fonts, in plain 
> ascii,

There are no diacritical marks in plain ASCII.

> set to "any old font" like Geneva, Arial or Verdana, which are the 
> defaults for her processing environment (a RAD tool built with 
> Revolution) the end result of her work flow prior to importing into 
> InDesign CS  is a very simple XML file... where a single entry looks 
> like this:

Is this using an 8-bit encoding that contains ASCII in the lower half?  
Which?

Or is it a UTF-8 file?

If it is UTF-8, some characters will be represented by multiple bytes.
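
To make that concrete, here is a small sketch in Python (not Revolution). The three sample characters are my own illustrative picks, not taken from your file. It only shows how the byte count grows once you leave ASCII.

```python
# Sketch: how many bytes UTF-8 needs for a few sample characters.
# The samples are illustrative picks: plain "a", a-macron, t-dot-below.
samples = ["a", "\u0101", "\u1e6d"]


def utf8_length(ch: str) -> int:
    """Return the number of bytes UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```

Anything that counts or indexes "chars" as bytes will see the accented letters as two or three separate characters.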


> <indexPara><boldItalicEntry>ºava m‰rga:</boldItalicEntry> youth 
> susceptibility, 394</indexPara>
>
> Now, in Quark Express, if we simply passed this text to a type box, 
> selected it (or set the font in a style sheet, and applied the style 
> sheet) to "MinionD" (a diacritical font) we get all the proper 
> international standards marks: dash over the top of long vowels, dot 
> underneath retroflex consonants etc. very smooth and predictable.
>
> But, not so with Adobe's InDesign CS. When we import the file we are 
> getting weird strings for certain ones...

Does InDesign know what the encoding is for the input file?

> If we set a BBEdit file to UTF-8, and the encoding for the XML file to 
> UTF-8... these strings appear on screen as singular glyphs and a few 
> black squares (meaning BBEdit can't display it).

Looks like InDesign is expecting one encoding and is getting some other 
encoding.

Since BBEdit set to UTF-8 shows a similar problem, I would suspect 
that InDesign is expecting UTF-8 and is getting something else.
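
Here is a rough sketch of that kind of mismatch, in Python rather than Revolution. It ASSUMES the file was written in MacRoman, which is only my guess for illustration; the real source encoding is the open question above. The function names are made up for the example.

```python
# Sketch of an encoding mismatch: 8-bit bytes read by a program expecting UTF-8.
# ASSUMPTION: the source encoding is MacRoman; this is not confirmed in the thread.
raw = "\u00baava m\u2030rga".encode("mac_roman")  # the sample entry as 8-bit bytes


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def read_as_utf8(data: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")


print(raw)                 # the high bytes stand alone, with no UTF-8 structure
print(is_valid_utf8(raw))  # a strict UTF-8 reader rejects them
print(read_as_utf8(raw))   # a lenient reader shows replacement characters
```

A reader that does not reject the bytes has to show something for them, which is where odd glyphs and black squares come from.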

> OK so one of our team here identified those characters which were 
> "bad", i.e. not transforming as expected into the expected characters, 
> and he gave me a small array consisting of 16 lines, as follows (I 
> have no idea how this will show in email)  ... some characters are not 
> even passed to email!
>
...
>
>   # create an array from the conversion file
>   split tConversionArray with cr and tab

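Be careful doing that on multi-byte text.  In two-byte Unicode (UTF-16) 
the bytes for cr and tab can show up as half of a character.  In UTF-8 
the bytes of a multi-byte character are all 128 or above, so cr and tab 
cannot appear inside one, but each key in the array may then be several 
bytes long.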


> I am *way* out of my depth here.. any clues from anyone? What are 
> these multi-byte strings..and how do we make them back into the char 
> (129-255) set? (which is where they appear on the font map for 
> MinionD)

Look at the Revolution uniEncode() and uniDecode() functions.

If you are expecting the indexing application to output UTF-8, you can 
use these functions to convert to that before saving the file.
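
The general shape of that conversion, sketched in Python since I can show it exactly there: decode the bytes from whatever 8-bit encoding they are really in, then encode the text as UTF-8. The MacRoman default and the name to_utf8 are assumptions for the example, not something your tool provides.

```python
# Sketch: convert bytes in a known 8-bit encoding to UTF-8 before saving.
# ASSUMPTION: "mac_roman" stands in for whatever the real source encoding is.
def to_utf8(data: bytes, source_encoding: str = "mac_roman") -> bytes:
    """Decode from the source encoding and re-encode as UTF-8."""
    text = data.decode(source_encoding)
    return text.encode("utf-8")


legacy = b"<boldItalicEntry>\xbcava m\xe4rga:</boldItalicEntry>"
converted = to_utf8(legacy)

# Each high byte is now a valid multi-byte UTF-8 sequence.
print(converted)
print(converted.decode("utf-8"))
```

This only makes the file valid UTF-8. The characters are still the stand-ins that the font draws as diacriticals, so mapping them to real Unicode letters would need a separate table.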

This might help:
    http://www.cs.tut.fi/~jkorpela/chars.html

You need to decide what encoding to use and stick with that when you 
can.


-- 
**********************************************
     DSC (Dar Scott Consulting & Dar's Lab)
     http://www.swcp.com/dsc/
     Programming Services and Software
**********************************************


