First 1000 characters without loop?

Mark Waddingham mark at livecode.com
Fri Jun 23 04:35:52 EDT 2017


On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:
> Seems murky.  I'd much rather at least have something like a byteLen
> function, which returns the number of bytes for a given string.  With
> that I can maintain byte offsets into a file with good performance and
> no ambiguity.

You do:

   the number of bytes in textEncode(tString, <encoding>)
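
As a minimal sketch, that expression can be wrapped in a script function. The name 'byteLen' is taken from Richard's message; the parameter names and the UTF-8 usage line are assumptions made for illustration, not an engine feature:

```livecode
-- Hypothetical helper: the number of bytes pText occupies once it has
-- been encoded as pEncoding (e.g. "UTF-8", "UTF-16").
function byteLen pText, pEncoding
   return the number of bytes in textEncode(pText, pEncoding)
end byteLen

-- Usage: "caf" followed by e-acute (U+00E9) is 4 chars but 5 bytes in
-- UTF-8, since e-acute encodes to two bytes.
-- put byteLen("caf" & numToCodepoint(233), "UTF-8")
```

The encoding is a required parameter on purpose: the same string gives different byte counts under different encodings, so a one-argument version would have to guess.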

The 'number of bytes in a string' makes no sense as there is no direct 
relationship between bytes and strings. I appreciate why this idea hangs 
around - it used to be true - char and byte were the same concept prior 
to 7.0, but that's only because the concept of 'char' was that of 
ISO8859-1/Latin-1, which can only represent the following written 
languages:

Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch, 
English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, 
Italian, Norwegian, Portuguese, Spanish and Swedish.

If you step outside of that 'area', then it wasn't very much help (see 
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html - for 
the historical encodings covering different sets of written languages).

The question you have to ask is 'how many bytes are in a string after it 
has been encoded in <encoding>' - when a string is written to disk an 
encoding *has* to be chosen. Sometimes the encoding is ASCII, sometimes 
it is UTF-8, sometimes it is UTF-16, sometimes it is something more 
exotic.

For any file format, an encoding of text always has to be defined - so 
you always 'know' the encoding if you know the file format (although in 
some formats the encoding is indicated by bytes prefixing the encoded 
string - e.g. a Byte Order Mark - or by a piece of information in the 
header of the file).

> How do I find a substring in binary data in a way that will tell me
> the number of bytes of the offset?

If you have loaded binary data, and want to find the offset of a 
sequence of bytes within it then use 'byteOffset'.

If your binary data is actually encoded text data, then you need to 
textEncode the 'needle' (the thing you are searching for) first, making 
sure you do so with the encoding which the encoded text data requires:

   -- put the encoded/raw data you want to search into tHaystackData
   put textEncode(tNeedleText, <encoding of target data>) into tNeedleData
   put byteOffset(tNeedleData, tHaystackData) into tOffset

However, it is important to note that this only allows an exact match - 
you can't do caseless searches like this (or searches where you want 
'e-acute' to match both 'e-acute' and 'e,combining-acute').

In the case of wanting to do caseless searches, then you need to do 
something like this:

    put textDecode(tHaystackData, <encoding of data>) into tHaystackText
    put offset(tNeedleText, tHaystackText) into tNeedleOffset
    put the number of bytes in textEncode(char 1 to tNeedleOffset of \
          tHaystackText, <encoding of data>) into tNeedleByteOffset

i.e. the operation you want to perform is 'offset of <needle> in <data> 
when using encoding <encoding>', which might make a useful engine 
addition - feel free to file an enhancement request, although the above 
snippet should work in script with the operations we currently have. 
(Similarly, your 'byteLen' function is actually 'length of string in 
encoding <encoding>' - that also might be a useful engine addition, but 
can also be done in script now, as outlined above.)
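
A sketch of 'offset of <needle> in <data> when using encoding <encoding>' as a script function follows. The handler name 'encodedOffset' and its parameter names are hypothetical. Unlike the three-line snippet, this variant counts the bytes before the match and adds one, so the result points at the first byte of the match even when the first matched character encodes to several bytes; it also returns 0 when there is no match. Both are choices made for this sketch:

```livecode
-- Hypothetical helper: 1-based byte position at which pNeedleText starts
-- inside pHaystackData (text encoded as pEncoding), or 0 if not found.
function encodedOffset pNeedleText, pHaystackData, pEncoding
   local tHaystackText, tCharOffset, tBytesBefore
   put textDecode(pHaystackData, pEncoding) into tHaystackText
   -- offset() follows the caseSensitive and formSensitive properties,
   -- which are both false by default, so the comparison is caseless
   put offset(pNeedleText, tHaystackText) into tCharOffset
   if tCharOffset is 0 then return 0
   if tCharOffset is 1 then return 1
   -- bytes taken up by everything before the match
   put the number of bytes in textEncode(char 1 to tCharOffset - 1 of \
         tHaystackText, pEncoding) into tBytesBefore
   return tBytesBefore + 1
end encodedOffset
```

For an exact, byte-for-byte match the byteOffset approach shown earlier avoids decoding the whole haystack, so this is only worth using when a caseless or normalization-insensitive match is needed.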

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



