Passing UTF-8 through variables

Sivakatirswami katir at hindu.org
Sun Feb 27 21:11:53 EST 2005


OK, I have a text transformation challenge. One member of our team is 
working on an index for a book that uses diacritical fonts. The text is 
plain ASCII, set to "any old font" like Geneva, Arial or Verdana, which 
are the defaults for her processing environment (a RAD tool built with 
Revolution). The end result of her workflow, prior to importing into 
InDesign CS, is a very simple XML file, where a single entry looks like 
this:

<indexPara><boldItalicEntry>Å∫ava mârga:</boldItalicEntry> youth 
susceptibility, 394</indexPara>

Now, in QuarkXPress, if we simply pass this text to a type box and set 
the font to "MinionD" (a diacritical font), either by selecting the 
text or by applying a style sheet, we get all the proper international 
standard marks: a dash over long vowels, a dot underneath retroflex 
consonants, etc. Very smooth and predictable.

But not so with Adobe's InDesign CS. When we import the file, we get 
weird strings for certain characters. If we set a BBEdit file to UTF-8, 
and the encoding for the XML file to UTF-8, these strings appear on 
screen as single glyphs and a few black squares (meaning BBEdit can't 
display them).

OK, so one of our team here identified the characters which were "bad", 
i.e. not transforming as expected into the expected characters, and he 
gave me a small array consisting of 16 lines, as follows (I have no 
idea how this will show in email; some characters are not even passed 
to email!):

Õ	’
Þ	fi
‰	â
”	î
ž	û
¨	®
ö	˜
–	ñ
 	†
¶	∂
º	∫
ú	˙
§	ß
	Å
¨	Â
ê	Í

So, I wrote the following simple script.

on mouseUp

   # set up paths to source files

   put "/Volumes/Varuna/Books/LWS Pocketbook/lws_pocket_book utf-8.xml" \
         into tSourceTxt
   put "/Volumes/Varuna/Books/LWS Pocketbook/Bad diacriticals UTF-8" \
         into tConversionTxt

   # load the source file and the conversion files

   put url ("binfile:"&tSourceTxt) into tOldFile
   put url ("binfile:"&tConversionTxt) into tConversionArray

   # create an array from the conversion file
   split tConversionArray with cr and tab

   # let's take a look at what we are getting...
   # the "keys" of the array should be the initial char on each line
   # of Bad diacriticals UTF-8
   put the keys of tConversionArray


   repeat for each char x in tOldFile
     if tConversionArray[x] is not empty then
       put tConversionArray[x] into x
     end if
     put x after tOutPut
   end repeat

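   # build the output path by putting "-g2" before the first "."
   # in the source path, then write the result to that new file
   set the itemdel to "."
   put "-g2" after item 1 of tSourceTxt

   put tOutput into url ("binfile:" & tSourceTxt)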

end mouseUp

I get the following strange results for the keys of the array created 
from the UTF-8 file:

¬û
Õ
ö
 

ê
√∫
”
¶
–
§
√û
¨
‰
º

But these are not actually to be found in the source file, and so the 
script makes no changes...
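
A possible explanation, offered as a sketch under assumptions rather 
than a confirmed diagnosis: "binfile:" returns raw bytes, and UTF-8 
stores every character above 127 as two or more bytes. Assuming the 
engine treats each byte as one char, the array keys are multi-byte 
strings, while x in the repeat loop is always a single byte, so 
tConversionArray[x] can never match. The handler name showUtf8Bytes is 
hypothetical. The values 195 and 186 are the UTF-8 bytes of "ú" 
(U+00FA); in Mac OS Roman those two bytes display as "√" and "∫", which 
would account for the "√∫" key above.

```livecode
on showUtf8Bytes
   -- "ú" (U+00FA) is stored in UTF-8 as bytes 195 and 186 (hex C3 BA)
   put numToChar(195) & numToChar(186) into tKey

   -- a byte-oriented engine counts two chars here, not one
   put the number of chars in tKey into tCount

   -- collect the numeric value of each byte in the key
   repeat for each char tByte in tKey
      put charToNum(tByte) & space after tBytes
   end repeat

   answer "chars:" && tCount & cr & "bytes:" && tBytes
end showUtf8Bytes
```

If the dialog reports two chars, a per-char lookup cannot find a key 
such as this one, because the loop only ever sees one byte at a time.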

I am *way* out of my depth here. Any clues from anyone? What are these 
multi-byte strings, and how do we map them back to the char (129-255) 
set (which is where they appear on the font map for MinionD)?
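
One way to sidestep the per-char lookup, sketched here on the 
assumption that the conversion file holds one tab-separated pair per 
line, is to replace whole strings with the replace command instead of 
walking the file one char at a time. The function name 
replaceSequences and its parameter names are hypothetical, and this has 
not been tested against the actual book files.

```livecode
function replaceSequences pText, pTable
   set the itemDelimiter to tab
   repeat for each line tPair in pTable
      put item 1 of tPair into tFind
      -- skip blank or malformed lines: replace needs a search string
      if tFind is empty then next repeat
      -- swap every occurrence of the (possibly multi-byte) key at once
      replace tFind with item 2 of tPair in pText
   end repeat
   return pText
end replaceSequences
```

It could be called with the raw conversion text, before it is split 
into an array, for example:
put replaceSequences(tOldFile, tConversionArray) into tOutput
The pairs are applied in order, so a replacement that also appears as a 
later key would be converted twice. Note also that this leaves the 
output in whatever encoding the two input files use; it does not by 
itself map UTF-8 back to the single-byte 129-255 range.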

TIA

Sivakatirswami
Himalayan Academy Publications
at Kauai's Hindu Monastery
katir at hindu.org

www.HimalayanAcademy.com,
www.HinduismToday.com
www.Gurudeva.org
www.Hindu.org


More information about the use-livecode mailing list