ambassador at fourthworld.com
Thu Mar 9 16:24:50 EST 2017
Thanks for that background, Mark. I always appreciate your informal
I'm copying only the most relevant parts here - others looking for a
good read will want the full post if they missed it:
Mark Waddingham wrote:
> This approach means that any multi-codepoint character in Unicode
> still maps to a single byte - and any non-updated code which
> manipulates strings as if they are data will still work (albeit with
> some data loss in regards the original Unicode string - which it
> wasn't written to understand anyway).
I'm not sure I follow that, but it almost sounds like no matter what the
encoding each char is mapped to one byte, so a 5-char string like
"hello" will take up 5 bytes - is that right?
Doesn't feel right, but there's so much to both Unicode and how LC
handles it that I've lost my confidence with things like this.
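
One quick way to check that guess is to encode the string and count the
resulting bytes. The handler below is only a sketch: it assumes LiveCode 7
or later, and the three encoding names are just examples. Going by the
encodings themselves, "hello" should come to 5 bytes in UTF-8 and
ISO-8859-1 and at least 10 in UTF-16, so the byte count would depend on
the encoding rather than always matching the character count.

```livecode
on mouseUp
   local tText, tEncoding, tReport
   put "hello" into tText
   repeat for each item tEncoding in "UTF-8,UTF-16,ISO-8859-1"
      -- textEncode returns binary data, so count bytes rather than chars
      put tEncoding & ":" && the number of bytes in textEncode(tText, tEncoding) & return after tReport
   end repeat
   -- show one "encoding: byte count" line per encoding in the message box
   put tReport
end mouseUp
```
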
Your guidance is appreciated, and perhaps it may help if I describe the
use-case at hand:
I have some large files I want to open and read as binary (for speed
mostly; if there's a reason I should be doing that as text let me know),
then I'll work my way through each one looking for substrings, keeping track
of the byte offsets within the data where those can be found.
Once I have my list of byte offsets, I can save that as a sort of index
file, and use "seek" or "read at" to go directly to that portion of the
larger files whenever I need to access that data.
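
As a rough sketch of that plan, the two functions below build the list of
byte offsets and then read a chunk back from one of them. The names
buildByteIndex and readAtByte and their parameters are hypothetical, made
up for illustration. The sketch assumes LiveCode 7 or later, that the
whole file fits in memory, and that the position given to "read ... at"
counts from 1 (worth checking against the Dictionary). The search string
is passed through textEncode first so it is compared as bytes, which is
why an encoding parameter appears here at all.

```livecode
-- Hypothetical helper: returns a return-delimited list of 1-based byte
-- positions at which pNeedle occurs in the file at pPath.
function buildByteIndex pPath, pNeedle, pEncoding
   local tData, tNeedleBytes, tSkip, tFound, tIndex
   -- "binfile:" loads the file as raw bytes, with no text conversion
   put url ("binfile:" & pPath) into tData
   -- convert the search string to the same byte form as the file
   put textEncode(pNeedle, pEncoding) into tNeedleBytes
   put 0 into tSkip
   repeat forever
      -- byteOffset returns a position relative to the bytes skipped
      put byteOffset(tNeedleBytes, tData, tSkip) into tFound
      if tFound is 0 then exit repeat
      -- convert the relative position to an absolute one and record it
      add tFound to tSkip
      put tSkip & return after tIndex
   end repeat
   return tIndex
end buildByteIndex

-- Hypothetical helper: reads pLength bytes starting at byte pStart.
function readAtByte pPath, pStart, pLength
   local tChunk
   open file pPath for binary read
   read from file pPath at pStart for pLength
   put it into tChunk
   close file pPath
   return tChunk
end readAtByte
```

Each line of the returned list could be saved as the index file, and any
one of those positions handed to readAtByte later to pull just that part
of the larger file.
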
The data files may use a variety of encodings, mostly UTF-8 but I can
expect Latin-ISO or perhaps even UTF-16. In short, the encoding may be
known in advance.
But since I'm working with binary data the whole time, the encoding
shouldn't matter, should it?
Earlier you wrote:
   the number of bytes in textEncode(tText, kEncoding)
...which implies that I would need to know the encoding (kEncoding), but
do I really need textEncode for the use-case described here?
Fourth World Systems
Software Design and Development for the Desktop, Mobile, and the Web
Ambassador at FourthWorld.com http://www.FourthWorld.com