char(4) not replaceable?

Sannyasin Sivakatirswami katir at hindu.org
Thu Apr 22 20:23:35 EDT 2004


That does help, a lot... I was kind of coming to that conclusion after 
doing my homework with the Rev docs, reading all the Unicode entries, 
and testing a number of the built-in Rev Unicode functions. But it 
still didn't get me from A to B.

Here's the challenge.

If the clipboard from InDesign contains a two-byte character, and when I 
paste it into Rev (or BBEdit, for that matter) it appears in Osaka as a 
Japanese character, I think we know we have a two-byte character. Why 
it looks one way in InDesign and another way in Rev... I don't know.

In order to "downsize" that two-byte character to a suitable string of 
0-127 characters (in this context, which is lang:English alpha:Roman, I 
want ALL text to be super dumb and pass painlessly through any and all 
future user agents in any hardware/software context), how do I do that?

e.g. our editors use some odd glyph in InDesign, and our web guy, who is 
repurposing the text for the web, pastes it into my little web pager 
Rev app and sees weird characters... In theory, if I knew what the two 
values were, I would do what I usually do, which is clean the text 
first, in the background:

put numToChar(26) into tStringToReplace
replace tStringToReplace with quote in tIncomingText

so he never sees anything but 1-127 from the start.

So the challenge is: find any way to programmatically identify
a) that an incoming character *is* two-byte, and
b) if it is, what it is, so it can be replaced with its low-ASCII 
range equivalent.
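
One way to approach (a) and (b), sketched in Python rather than Transcript only so that it is easy to run and check outside Rev. It assumes the pasted text has already been decoded into a Unicode string, and the function name `find_non_ascii` is made up for this illustration.

```python
import unicodedata


def find_non_ascii(text):
    """Return (position, character, code point, Unicode name) for every
    character outside the 0-127 range."""
    hits = []
    for pos, ch in enumerate(text):
        if ord(ch) > 127:
            # unicodedata.name() raises ValueError for unnamed characters,
            # so a default is supplied instead.
            hits.append((pos, ch, ord(ch), unicodedata.name(ch, "UNKNOWN")))
    return hits


sample = "Siva\u015b and \u00d1"  # hypothetical pasted text
for pos, ch, code, name in find_non_ascii(sample):
    print(pos, repr(ch), code, name)
```

The list it returns says which characters still need an entry in a replacement table, without having to know them in advance.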

If it could be translated, would it look like char(204,218) or what? 
Then, do you concatenate the two?

put numToChar(204) & numToChar(218) into tStringToReplace
replace tStringToReplace with "Y" in tIncomingText
  ## where this could be some two-byte character: a "Y" with marks above it of some kind
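
To see what "two bytes" means concretely, this Python sketch prints the byte values behind a single character in two common encodings. Which encoding the InDesign clipboard actually delivers is not known here; UTF-16 and UTF-8 are only assumptions, and the Y with an acute accent is just an example character.

```python
ch = "\u00dd"  # LATIN CAPITAL LETTER Y WITH ACUTE, an example only

# UTF-16 (big-endian): this character is stored as exactly two bytes.
utf16_bytes = list(ch.encode("utf-16-be"))

# UTF-8: the same character is also two bytes, but with different values.
utf8_bytes = list(ch.encode("utf-8"))

print(utf16_bytes)  # [0, 221]
print(utf8_bytes)   # [195, 157]
```

So a pair of values such as 204, 218 can only be interpreted once the encoding is known, and replacing the concatenated pair works only while the text is still in that same byte form.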

I know if I actually paste some weird string into the script editor, 
assuming I know for sure what its equivalent is... this does work:

replace "[paste 2-byte char here]" with "sh" in tIncomingText

but I won't always know what the incoming weird character is... Also, 
since examining every single incoming char might slow operations down 
considerably... I might just let the user fix these manually: so I need 
at least for the user to be able to select the two-byte character in a 
Rev field and then have a script that will examine the selected chunk 
and do the necessary replacement. This could work for small articles in 
our magazine, but I'm about to embark on repurposing 1000-page books 
from InDesign to web, so I'd like to get a better handle on this from 
inside Rev.

I already have a matrix for HTML entities that looks like this:

Ä	A
Å	A
Ç	Ch
É	E
Ñ	N

etc. (with every possible >127 character in the fonts in use)

So, if I could identify the two-byte characters I would just extend 
this...
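
A rough sketch of how such a matrix could be extended, written in Python for illustration only. The `OVERRIDES` table copies the five rows of the matrix above; the accent-stripping fallback and the `?` placeholder are assumptions of this sketch, not something Rev provides, and `to_ascii` is a made-up name.

```python
import unicodedata

# House-style overrides, taken from the matrix above; extend as needed.
OVERRIDES = {
    "\u00c4": "A",   # A with diaeresis
    "\u00c5": "A",   # A with ring above
    "\u00c7": "Ch",  # C with cedilla
    "\u00c9": "E",   # E with acute
    "\u00d1": "N",   # N with tilde
}


def to_ascii(text, unknown="?"):
    """Map text to 0-127 only: overrides first, then strip accents,
    then fall back to a visible placeholder."""
    # Compose base letter + combining mark pairs so the overrides can match.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in OVERRIDES:
            out.append(OVERRIDES[ch])
        else:
            # NFKD splits an accented letter into base letter + combining
            # marks; keeping only the ASCII part drops the marks.
            parts = unicodedata.normalize("NFKD", ch)
            base = "".join(c for c in parts if ord(c) < 128)
            out.append(base or unknown)
    return "".join(out)


print(to_ascii("\u00c7akra, \u00d1, \u00dd, \u65e5"))  # Chakra, N, Y, ?
```

The placeholder keeps anything the table does not cover visible, so it can be fixed by hand or added to the matrix later.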

Sannyasin Sivakatirswami
Himalayan Academy Publications
at Kauai's Hindu Monastery
katir at hindu.org

www.HimalayanAcademy.com,
www.HinduismToday.com
www.Gurudeva.org
www.Hindu.org

On Apr 21, 2002, at 1:18 PM, Brian Yennie wrote:

> Sannyasin,
>
> I don't know if this is something you already have a handle on, but 
> the first thing to know about Unicode is that each character is _two_ 
> bytes instead of one, so some of this weird pasting behavior happens 
> because the receiving application treats the two bytes as two 
> consecutive characters.
>
> The reason why, most likely, you think you are getting a valid ASCII 
> number but not seeing a valid ASCII character is because you are 
> actually testing the charToNum() of a two character string- and 
> charToNum() only considers the first character.
>
> For example, charToNum("apple") is the same as charToNum("a"), even 
> though they are obviously different strings to the human eye.
>
> HTH!
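
The point quoted above can be reproduced with a small Python sketch. `char_to_num` is a hypothetical stand-in that mimics the charToNum() behavior as Brian describes it, and Latin-1 stands in for whatever single-byte encoding the receiving application assumes.

```python
def char_to_num(s):
    """Hypothetical stand-in for the charToNum() behavior described above:
    only the first character of the string is considered."""
    return ord(s[0])


# One two-byte character, as raw UTF-16 (big-endian) bytes ...
raw = "\u015b".encode("utf-16-be")  # two bytes: 1 and 91

# ... misread by a receiver that assumes one byte per character.
misread = raw.decode("latin-1")

print(len(misread))                              # 2
print(char_to_num(misread))                      # 1
print(char_to_num("apple") == char_to_num("a"))  # True
```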


