byteLen()?

Mark Waddingham mark at livecode.com
Fri Mar 10 04:45:22 EST 2017


On 2017-03-09 22:24, Richard Gaskin via use-livecode wrote:
> I'm not sure I follow that, but it almost sounds like no matter what
> the encoding each char is mapped to one byte, so a 5-char string like
> "hello" will take up 5 bytes - is that right?

In the case of the implicit conversion the engine does between text
and binary data - yes it is. The number of bytes in the generated
data will be the same as the number of chars in the original text.

However that only relates to the implicit 'compatibility' conversion
the engine does. In new code, it is better to make sure the conversion
is explicit by using textEncode / textDecode.
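
As a rough sketch of what an explicit conversion might look like (the
handler name, the variable names and the choice of UTF-8 are all
assumptions made for illustration):

```livecode
-- Hypothetical demo handler: round-trips text through an explicit encoding.
command demoExplicitConversion
   local tText, tData, tBack
   put "hello" into tText
   -- text -> binary data, in a stated encoding
   put textEncode(tText, "UTF-8") into tData
   -- binary data -> text, using the same encoding
   put textDecode(tData, "UTF-8") into tBack
   put tBack
end demoExplicitConversion
```

The encoding is named at both ends, so nothing relies on the implicit
'compatibility' conversion.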

> I have some large files I want to open and read as binary (for speed
> mostly; if there's a reason I should be doing that as text let me
> know), then I'll work my way through it looking for substrings,
> keeping track of the byte offsets within the data where those can be
> found.
> 
> Once I have my list of byte offsets, I can save that as a sort of
> index file, and use "seek" or "read at" to go directly to that portion
> of the larger files whenever I need to access that data.
> 
> The data files may use a variety of encodings, mostly UTF-8 but I can
> expect Latin-ISO or perhaps even UTF-16.  In short, encoding will may
> be known in advance.
> 
> But since I'm working with binary data the whole time, the encoding
> shouldn't matter, should it?

It depends on whether you need to convert a text string into a byte
sequence to search for, and whether you are wanting an exact text match
or a caseless text match.

If the file you are searching is just a text file which you want to
search as binary then you need to know the encoding of said text file
so you can encode the text you are searching for in the same way. For
example, if you are searching for "foó" and encode it as UTF-16 (which
would generate 6 bytes) and the (text) file you are searching is UTF-8
encoded then it won't work. The UTF-8 encoding of "foó" is different
from the UTF-16 encoding.
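
A minimal sketch of that kind of search, assuming the file's encoding is
known and its contents have already been loaded as binary data
(findEncodedText and its parameter names are hypothetical):

```livecode
-- Hypothetical helper: returns the byte offset of pText within pData,
-- or 0 if it is not found. pEncoding must match the file's encoding.
function findEncodedText pText, pData, pEncoding
   local tNeedle
   -- encode the search text the same way the file is encoded
   put textEncode(pText, pEncoding) into tNeedle
   return byteOffset(tNeedle, pData)
end function
```

It might be used along the lines of

   put URL ("binfile:" & tPath) into tData
   put findEncodedText("foó", tData, "UTF-8") into tOffset

where tPath is assumed to hold the path of the file being searched.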

If the file you are searching is some binary file containing text then
things are decidedly more tricky as to do the search accurately you need
to know the exact format of the binary file so you know precisely where
the (encoded) text strings within it sit. This is presuming you are not
happy with 'false positives'.

(A stackfile, for example, contains encoded text and sequences of bytes
which never were and never will be text - however, it is perfectly
possible for the latter to match encoded text, just by chance.)

If you are wanting a caseless match rather than an exact match then you
pretty much have to treat the file as text - you can't do caseless
matching on arbitrary bytes as it makes no sense (as they are just bytes
with no meaning).
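
One way this could be sketched is to decode the data first and search
the resulting text. The helper names below are hypothetical, and the
byte-offset conversion assumes the decoded text re-encodes to exactly
the bytes it came from (no byte order mark, no invalid sequences):

```livecode
-- Hypothetical helper: caseless search of encoded data, treated as text.
-- Returns a char offset into the decoded text (0 if not found).
function findCaseless pText, pData, pEncoding
   local tDecoded
   put textDecode(pData, pEncoding) into tDecoded
   set the caseSensitive to false -- the default, stated for clarity
   return offset(pText, tDecoded)
end function

-- Hypothetical helper: converts a char offset in decoded text into a
-- byte offset, by measuring the encoded length of what precedes it.
function charToByteOffset pDecoded, pCharOffset, pEncoding
   local tPrefix
   put textEncode(char 1 to pCharOffset - 1 of pDecoded, pEncoding) into tPrefix
   return (the number of bytes in tPrefix) + 1
end function
```

Note that findCaseless gives a position in chars, not bytes, which is
why the second step is needed before building an index of byte offsets.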

> Earlier you wrote:
> 
>   the number of bytes in textEncode(tText, kEncoding)
> 
> ...which implies that I would need to know the encoding (kEncoding),
> but do I really need textEncode for the use-case described here?

Strictly speaking that depends on the encoding:

For native encoding - number of bytes = number of codeunits

For UTF-16 - number of bytes = 2 * number of codeunits

For UTF-32 - number of bytes = 4 * number of codepoints

However, UTF-8 is a multibyte encoding based on the codepoints in the
text. A single codepoint can be encoded as 1, 2, 3 or 4 bytes.

The point here being, in order to compute the byte length of a piece of
text encoded as UTF-8 you need to look at each codepoint. Since
textEncode does that, it is a reasonably clear way of working such
things out.
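
Wrapped up as a function, that might look like the following sketch
(encodedByteLength is a hypothetical name, not an engine function):

```livecode
-- Hypothetical helper: the byte length of pText once encoded as pEncoding.
function encodedByteLength pText, pEncoding
   return the number of bytes in textEncode(pText, pEncoding)
end function
```

For "foó" this would be expected to give 4 for "UTF-8" (assuming the ó
is the single precomposed codepoint U+00F3, which takes 2 bytes) and 6
for "UTF-16".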

By the way, here I've mentioned three things - codeunit, codepoint and
char:

   - a codeunit is the smallest element in UTF-16 and represents Unicode
     codepoints 0-65535 (i.e. fits in a 16-bit unsigned int).

   - a codepoint is the natural 'unit' of Unicode - a 21-bit quantity
     which indexes into the Unicode char tables. (UTF-16 encodes the
     21-bit quantity by using 'surrogate' pairs of codeunits - meaning
     that, in that encoding, a codepoint can take 1 or 2 codeunits).

   - a char is a sequence of codepoints which are generally considered to
     be a single (human-processable) character.
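
To see the three notions diverge, a small sketch along these lines may
help (the handler name and the sample text are assumptions made for
illustration):

```livecode
-- Hypothetical demo handler: counts the same text in three different units.
command demoTextUnits
   local tText, tChars, tCodepoints, tCodeunits
   -- "e" followed by U+0301 COMBINING ACUTE ACCENT (decimal 769),
   -- then U+1F600 (decimal 128512), which lies outside the 16-bit range
   put "e" & numToCodepoint(769) & numToCodepoint(128512) into tText
   put the number of chars in tText into tChars
   put the number of codepoints in tText into tCodepoints
   put the number of codeunits in tText into tCodeunits
   put tChars & comma & tCodepoints & comma & tCodeunits
end demoTextUnits
```

Going by the definitions above, the accented "e" should count as one
char made of two codepoints, and the last codepoint should need a
surrogate pair, i.e. two codeunits.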

I'm not sure if the above helps or not - it might be helpful to explain
the problem you are trying to solve more deeply. I still can't quite see
how the byte length of a piece of text (encoded in a particular
encoding) is useful since surely you need the byte sequence to search
for anyway, in which case the number of bytes is the length of that byte
sequence that you already have...

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



