Unicode and Chunks

Dar Scott dsc at swcp.com
Mon Sep 29 12:03:54 EDT 2003


Welcome, Dean!

On Monday, September 29, 2003, at 06:59 AM, Dean Snyder wrote:

> I've been enjoying using Unicode in Revolution 2.1 for the most part.
> The only problem I've encountered so far is that chunk evaluation
> doesn't seem to work correctly with Unicode characters. For example,
> if any byte of a double byte Unicode character is "09" that will
> increment the item count in chunk evaluation if you have set the
> itemDelimiter to "tab", ASCII 09; but, of course, the character is
> not a tab.

At this time it seems that Revolution values are still byte sequences, 
and as long as you are using one-byte characters those are also char 
sequences.  Unicode will be UTF-16 (16-bit chars, with perhaps 
surrogate pairs for anything beyond 16 bits).  Those double-byte 
characters are flattened into a byte sequence in host byte order 
(ick).  So, at this time, chunk expressions are working with bytes, 
and any byte that happens to be 09 looks like a tab.
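
To make the byte-level effect concrete, here is a small sketch in Python 
(not Revolution; Python bytes are only standing in for a Revolution 
value, and the helper name count_items_bytewise is made up for this 
illustration).  It counts tab-delimited items the way a byte-oriented 
scanner would, using U+0109, whose UTF-16 encoding contains a 09 byte:

```python
# Illustration only: Python bytes stand in for a Revolution value.
# U+0109 encodes as bytes 01 09 (big-endian) or 09 01 (little-endian).
text = "a\u0109b\tc"

def count_items_bytewise(s, encoding):
    """Count tab-delimited items by scanning raw bytes for 0x09."""
    data = s.encode(encoding)
    return len(data.split(b"\x09"))

# Character-aware count: one real tab, so two items.
true_items = len(text.split("\t"))

# Byte-wise counts: the 09 byte inside U+0109 is mistaken for a tab.
be_items = count_items_bytewise(text, "utf-16-be")
le_items = count_items_bytewise(text, "utf-16-le")

print(true_items, be_items, le_items)
```

Either byte order gives one item too many here, because the stray 09 
byte is present both ways; only its position changes.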

Here are some ideas:

1  Convert to UTF-8.  Each character is one to four bytes (as of Unicode 
version 4).  This has the cool property that tab or even comma or the 
Revolution line end will never show up inside a multi-byte character, 
since every byte of such a character is 0x80 or above.  This should 
work with split and combine, too.

2  Highly experimental:  Maybe there is an undocumented feature of 
useUnicode that will allow this to work.  You might have to create a 
unicode tab char, 0009.
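
For idea 1, here is a rough sketch of why UTF-8 is safe to split on 
bytes.  It is again Python rather than Revolution, so it shows only the 
encoding property and not the actual Revolution conversion calls.  The 
sample text is an assumption chosen so that two characters (U+0109 and 
U+0900) would contain a 09 byte in UTF-16:

```python
# Illustration only: shows the UTF-8 property, not Revolution's API.
text = "a\u0109b\tc\u0900d"

utf8 = text.encode("utf-8")

# Every byte of a multi-byte UTF-8 sequence is 0x80 or above,
# so the only 0x09 byte in the data is the real tab.
tab_bytes = utf8.count(b"\x09")

# Split on the raw tab byte, then decode each item back to text.
items = [part.decode("utf-8") for part in utf8.split(b"\x09")]

print(tab_bytes, len(items))
```

The byte-wise split gives the same items as a character-aware split, 
which is what you would want from itemDelimiter after converting.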

Dar Scott
unicode sophomore



