Passing UTF-8 through variables
katir at hindu.org
Sun Feb 27 21:11:53 EST 2005
OK, I have a text transformation challenge. One member of our team is
working on an index for a book that uses diacritical fonts. She works in
plain ASCII, set in "any old font" like Geneva, Arial or Verdana, which
are the defaults for her processing environment (a RAD tool built with
Revolution). The end result of her workflow, prior to importing into
InDesign CS, is a very simple XML file, where a single entry looks like
this:
<indexPara><boldItalicEntry>Å∫ava mârga:</boldItalicEntry> youth
Now, in QuarkXPress, if we simply pass this text to a text box and set
it to "MinionD" (a diacritical font), either by selecting it or by
setting the font in a style sheet and applying the style sheet, we get
all the proper international standard marks: a dash over the top of long
vowels, a dot underneath retroflex consonants, etc. Very smooth and
predictable.
But not so with Adobe's InDesign CS. When we import the file, we are
getting weird strings for certain characters. If we set a BBEdit file to
UTF-8, and the encoding for the XML file to UTF-8, these strings
appear on screen as single glyphs and a few black squares (meaning
BBEdit can't display them).
OK, so one of our team here identified the characters which were "bad",
i.e. not transforming as expected into the expected characters, and he
gave me a small array consisting of 16 lines, as follows (I have no
idea how this will show in email... some characters are not even
passed to email!):
So, I wrote the following simple script.
# set up paths to the source files
put "/Volumes/Varuna/Books/LWS Pocketbook/lws_pocket_book utf-8.xml" into tSourceTxt
put "/Volumes/Varuna/Books/LWS Pocketbook/Bad diacriticals UTF-8" into tConversionTxt
# load the source file and the conversion file
put url ("binfile:" & tSourceTxt) into tOldFile
put url ("binfile:" & tConversionTxt) into tConversionArray
# create an array from the conversion file
split tConversionArray with cr and tab
# let's take a look at what we are getting...
# the "keys" of the array should be the initial char on each line
# of Bad diacriticals UTF-8
put the keys of tConversionArray
# swap each char that has an entry in the conversion array
repeat for each char x in tOldFile
  if tConversionArray[x] is not empty then
    put tConversionArray[x] into x
  end if
  put x after tOutput
end repeat
# save the result under a new name, with "-g2" added before the extension
set the itemdel to "."
put "-g2" after item 1 of tSourceTxt
put tOutput into url ("binfile:" & tSourceTxt)
I get the following strange results for the keys of the array created
from the UTF-8 file:
But these are not actually to be found in the source file, and so the
script makes no changes...
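
One possible explanation, not confirmed here: if the loop hands over the
file one byte at a time (which is what "repeat for each char" is assumed
to do with binfile data) while each key in the array is a sequence of two
or three bytes, then no single byte can ever equal a key. The sketch
below is in Python rather than Revolution, purely to illustrate the
idea. The one-entry table and the function names are hypothetical; the
real 16-line table is not reproduced.

```python
# Hypothetical one-entry table: UTF-8 byte sequence -> single target byte.
# The real conversion table from this post is not shown, so this mapping
# is an assumption for illustration only.
table = {"â".encode("utf-8"): b"\x89"}


def convert_per_byte(data: bytes, table: dict) -> bytes:
    """Look up one byte at a time, the way the loop in the script is assumed to."""
    out = bytearray()
    for i in range(len(data)):
        unit = data[i:i + 1]          # a single byte
        out += table.get(unit, unit)  # never matches a 2- or 3-byte key
    return bytes(out)


def convert_whole_sequences(data: bytes, table: dict) -> bytes:
    """Replace each multi-byte key wherever it occurs as a whole sequence."""
    # Longest keys first, so a short key cannot split a longer one.
    for key in sorted(table, key=len, reverse=True):
        data = data.replace(key, table[key])
    return data


data = "mârga".encode("utf-8")
unchanged = convert_per_byte(data, table)
converted = convert_whole_sequences(data, table)
print(unchanged == data)  # the per-byte pass changes nothing
print(converted)
```

If this is what is happening, replacing each key as a whole string in the
file, instead of testing chars one by one, might be worth trying.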
I am *way* out of my depth here... any clues from anyone? What are these
multi-byte strings, and how do we convert them back to the char (129-255)
set (which is where they appear on the font map for MinionD)?
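
For reference, a small sketch of what the multi-byte strings could be. In
UTF-8, every character above 127 is stored as two or more bytes, so a
single glyph on screen is several bytes in the file. The sketch is in
Python, not Revolution, and it assumes the original single-byte text was
Mac OS Roman (a guess based on the Mac fonts and BBEdit mentioned above);
the function name is hypothetical.

```python
def utf8_to_single_bytes(utf8_bytes: bytes, legacy_encoding: str = "mac_roman") -> bytes:
    """Decode UTF-8 and re-encode so each character is one byte again.

    The legacy encoding is an assumption; Mac OS Roman is only a guess
    at what the original files used.
    """
    return utf8_bytes.decode("utf-8").encode(legacy_encoding)


sample = "Å∫ava mârga"              # from the index entry quoted above
as_utf8 = sample.encode("utf-8")    # what a UTF-8 file would contain
as_single = utf8_to_single_bytes(as_utf8)

print(len(sample), len(as_utf8), len(as_single))
print([hex(b) for b in as_single if b > 127])
```

Under that assumption the three accented characters come back as single
bytes above 128, which is the range the font map for MinionD uses.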
Himalayan Academy Publications
at Kauai's Hindu Monastery
katir at hindu.org