Passing UTF-8 through variables
Dar Scott
dsc at swcp.com
Sun Feb 27 22:57:09 EST 2005
On Feb 27, 2005, at 7:11 PM, Sivakatirswami wrote:
> Ok, I have a text transformation challenge: One member of our team is
> working on an index for a book that has diacritical fonts, in plain
> ascii,
There are no diacritical marks in plain ASCII.
> set to "any old font" like Geneva, Arial or Verdana, which are the
> defaults for her processing environment (a RAD tool built with
> Revolution) the end result of her work flow prior to importing into
> InDesign CS is a very simple XML file... where a single entry looks
> like this:
Is this using an 8-bit encoding that contains ASCII in the lower half?
Which?
Or is it a UTF-8 file?
If it is UTF-8, some characters will be represented by multiple bytes.
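A small illustration of that point, sketched in Python rather than Revolution. The sample characters are picked for illustration only and are not taken from the index file; `utf8_length` is a made-up helper name.

```python
def utf8_length(ch):
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative samples (assumed, not from the index file):
# plain "a", a-macron (U+0101), t-with-dot-below (U+1E6D).
samples = ["a", "\u0101", "\u1e6d"]

for ch in samples:
    # Print the character, its UTF-8 byte count, and the bytes in hex.
    print(ch, utf8_length(ch), ch.encode("utf-8").hex())
```

Plain ASCII letters stay at one byte each, which is why a file with no accented characters looks identical in ASCII and UTF-8.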
> <indexPara><boldItalicEntry>ºava mrga:</boldItalicEntry> youth
> susceptibility, 394</indexPara>
>
> Now, in QuarkXPress, if we simply passed this text to a type box,
> selected it (or set the font in a style sheet, and applied the style
> sheet) to "MinionD" (a diacritical font) we get all the proper
> international standards marks: dash over the top of long vowels, dot
> underneath retroflex consonants etc. very smooth and predictable.
>
> But, not so with Adobe's InDesign CS. When we import the file we are
> getting weird strings for certain characters...
Does InDesign know what the encoding is for the input file?
> If we set a BBEdit file to UTF-8, and the encoding for the XML file to
> UTF-8... these strings appear on screen as singular glyphs and a few
> black squares (meaning BBEdit can't display it).
It looks like InDesign is expecting one encoding and is getting some
other encoding.
Since BBEdit set to UTF-8 shows a similar problem, I would suspect
that InDesign is expecting UTF-8 and is getting something else.
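A rough Python sketch of what such a mismatch looks like. Latin-1 is only a stand-in here, since the actual 8-bit encoding of the file is not known, and `misread` is a hypothetical helper.

```python
def misread(text, written_as, read_as):
    """Encode text in one encoding, then decode the bytes as another.

    Bytes that are invalid in the reading encoding become U+FFFD,
    the replacement character that editors often draw as a box.
    """
    return text.encode(written_as).decode(read_as, errors="replace")

# An 8-bit file read as UTF-8: the high byte is not valid UTF-8 by itself.
print(repr(misread("\u00baava", "latin-1", "utf-8")))

# A UTF-8 file read as 8-bit: one character turns into two odd ones.
print(repr(misread("\u0101", "utf-8", "latin-1")))
```

The first case resembles the black squares, the second the "weird strings" of two or three characters where one was expected.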
> OK so one of our team here identified those characters which were
> "bad", i.e. not transforming as expected into the expected characters..
> and he gave me a small array consisting of 16 lines, as follows (I
> have no idea how this will show in email) ... some characters are not
> even passed to email!
>
...
>
> # create an array from the conversion file
> split tConversionArray with cr and tab
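Be careful doing that with multi-byte text. In UTF-8 every byte of a
multi-byte character is 128 or above, so cr and tab bytes never show up
inside one. In a 16-bit encoding such as UTF-16 they can show up as one
of the two bytes of a character, and the split will break there.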
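Whether a delimiter byte can turn up inside a character depends on the encoding, and that can be checked directly. The Python sketch below is illustrative only; the two test characters are chosen because their code points end in 0D and 09, and `contains_delimiter_bytes` is a made-up name.

```python
CR, TAB = 0x0D, 0x09

def contains_delimiter_bytes(text, encoding):
    """True if the encoded bytes of text include a cr or tab byte."""
    data = text.encode(encoding)
    return CR in data or TAB in data

# U+010D (c with caron) and U+0109 (c with circumflex).
# The text itself contains no cr and no tab character.
word = "\u010d\u0109"

print(contains_delimiter_bytes(word, "utf-8"))      # UTF-8 bytes: c4 8d c4 89
print(contains_delimiter_bytes(word, "utf-16-be"))  # UTF-16 bytes: 01 0d 01 09
```

So a byte-wise split on cr and tab is a problem for 16-bit text, while UTF-8 keeps those byte values reserved for the real cr and tab characters.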
> I am *way* out of my depth here.. any clues from anyone? What are
> these multi-byte strings..and how do we map them back to the char
> (129-255) set? (which is where they appear on the font map for
> MinionD)
Look at the Revolution uniEncode() and uniDecode() functions.
If you are expecting the indexing application to output UTF-8, you can
use these functions to convert to that before saving the file.
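The general shape of such a conversion, as a Python sketch and not as Revolution code (the uniEncode()/uniDecode() calls are not shown here). `to_utf8` is a hypothetical helper, and Latin-1 again stands in for whatever 8-bit encoding the file really uses.

```python
def to_utf8(raw, source_encoding):
    """Decode raw bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")

# Assumed input: 8-bit text with one high byte (0xBA) followed by ASCII.
eight_bit = b"\xbaava"

# The high byte becomes a two-byte UTF-8 sequence; ASCII is unchanged.
print(to_utf8(eight_bit, "latin-1").hex())
```

The conversion only gives the right characters if the source encoding named in the call is the one the bytes were actually written in.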
This might help:
http://www.cs.tut.fi/~jkorpela/chars.html
You need to decide what encoding to use and stick with that when you
can.
--
**********************************************
DSC (Dar Scott Consulting & Dar's Lab)
http://www.swcp.com/dsc/
Programming Services and Software
**********************************************