Unicode and Chunks
Dar Scott
dsc at swcp.com
Mon Sep 29 12:03:54 EDT 2003
Welcome, Dean!
On Monday, September 29, 2003, at 06:59 AM, Dean Snyder wrote:
> I've been enjoying using Unicode in Revolution 2.1 for the most part. The
> only problem I've encountered so far is that chunk evaluation doesn't
> seem to work correctly with Unicode characters. For example, if any byte
> of a double byte Unicode character is "09" that will increment the item
> count in chunk evaluation if you have set the itemDelimiter to "tab",
> ASCII 09; but, of course, the character is not a tab.
At this time it seems that Revolution values are still byte sequences,
and they are char sequences only as long as you are using one-byte
characters. Unicode will be UTF-16 (16-bit chars, with perhaps something
special for characters that need 32 bits). Those double-byte characters
are flattened into a byte sequence based on host byte ordering (ick).
So, at this time, chunk expressions are working with bytes, and a 09
byte inside a double-byte character looks just like a tab.
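To make the byte-versus-character point concrete, here is a small sketch in Python rather than Transcript. It only illustrates the effect Dean describes and makes no claim about how Revolution stores strings internally. The sample string and the big-endian byte order are assumptions chosen for the illustration.

```python
# Hypothetical sample text: "ĉ" is U+0109, so its UTF-16 bytes are 0x01 and 0x09.
text = "a\u0109b\tc"  # two tab-separated items: "aĉb" and "c"

# Flatten to UTF-16 bytes. Big-endian without a BOM is an assumption made
# for this illustration; a little-endian host would give the same count here.
utf16 = text.encode("utf-16-be")

char_items = text.split("\t")    # split on the tab character
byte_items = utf16.split(b"\t")  # split on every 0x09 byte, as a byte-oriented chunk expression would

print(len(char_items))  # 2
print(len(byte_items))  # 3: the 0x09 inside "ĉ" is counted as a delimiter
```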
Here are some ideas:
1 Convert to UTF-8. Each character is one to four bytes (as of Unicode
version 4). This has the cool property that tab, comma, and the
Revolution line end will never show up inside a multi-byte character,
because every byte of such a character has its high bit set. This
should work with split and combine, too.
2 Highly experimental: Maybe there is an undocumented feature of
useUnicode that will allow this to work. You might have to create a
Unicode tab char, 0009, to use as the delimiter.
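Idea 1 can be sketched the same way. This is Python, not Transcript, and the sample strings are assumptions. It only shows the UTF-8 property itself: splitting the bytes on tab is safe because no byte of a multi-byte character can equal 0x09.

```python
# Same hypothetical sample: two tab-separated items, "aĉb" and "c".
text = "a\u0109b\tc"

utf8 = text.encode("utf-8")
byte_items = utf8.split(b"\t")  # byte-wise split on 0x09

# Each piece is still valid UTF-8 and decodes back to the expected item.
items = [piece.decode("utf-8") for piece in byte_items]
print(items)  # ['aĉb', 'c']

# Every byte of a multi-byte UTF-8 character is 0x80 or above, so it can
# never collide with tab (0x09), comma (0x2C), or line feed (0x0A).
multi = "\u0109\u0900".encode("utf-8")  # one two-byte and one three-byte character
print(all(b >= 0x80 for b in multi))  # True
```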
Dar Scott
unicode sophomore