First 1000 characters without loop?
Mark Waddingham
mark at livecode.com
Fri Jun 23 04:35:52 EDT 2017
On 2017-06-23 03:19, Richard Gaskin via use-livecode wrote:
> Seems murky. I'd much rather at least have something like a byteLen
> function, which returns the number of bytes for a given string. With
> that I can maintain byte offsets into a file with good performance and
> no ambiguity.
You do:
the number of bytes in textEncode(tString, <encoding>)
The 'number of bytes in a string' makes no sense as there is no direct
relationship between bytes and strings. I appreciate why this idea hangs
around - it used to be true - char and byte were the same concept prior
to 7.0, but that's only because the concept of 'char' was that of
ISO8859-1/Latin-1, which can only represent the following written
languages:
Western Europe and Americas: Afrikaans, Basque, Catalan, Danish, Dutch,
English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish,
Italian, Norwegian, Portuguese, Spanish and Swedish.
If you step outside of that 'area', then it wasn't very much help (see
https://www.terena.org/activities/multiling/ml-docs/iso-8859.html - for
the historical encodings covering different sets of written languages).
The question you have to ask is 'how many bytes are in a string after it
has been encoded in <encoding>' - when a string is written to disk an
encoding *has* to be chosen. Sometimes the encoding is ASCII, sometimes
it is UTF-8, sometimes it is UTF-16, sometimes it is something more
exotic.
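As a minimal sketch of the point above (the handler name byteLen is
hypothetical, it is not an engine function), a script-level helper can
simply wrap the textEncode expression. The same string then reports a
different byte count depending on the encoding chosen:

```livecode
-- Hypothetical helper: the number of bytes pString occupies once it has
-- been encoded as pEncoding (e.g. "ISO-8859-1" or "UTF-8").
function byteLen pString, pEncoding
   return the number of bytes in textEncode(pString, pEncoding)
end byteLen

on mouseUp
   local tString
   -- "caf" followed by e-acute (U+00E9): 4 characters
   put "caf" & numToCodepoint(233) into tString
   put byteLen(tString, "ISO-8859-1") & comma & byteLen(tString, "UTF-8")
end mouseUp
```

With these inputs the result should be 4,5 since e-acute is one byte in
Latin-1 and two bytes in UTF-8.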
For any file format, an encoding of text always has to be defined - so
you always 'know' the encoding if you know the file format (although for
some formats the encoding is indicated by bytes prefixing the encoded
text - e.g. a Byte Order Mark - or by a piece of information in the
header of the encoded file).
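For the Byte Order Mark case, a rough sketch of sniffing the encoding
from the first bytes of loaded data might look like the following. The
handler name is hypothetical, and it only recognises the UTF-8 and
UTF-16 marks; note that a UTF-32LE mark starts with the same two bytes
as UTF-16LE, so this sketch would misreport such data:

```livecode
-- Hypothetical helper: returns an encoding name if pData starts with a
-- UTF-8 or UTF-16 Byte Order Mark, or empty if no such mark is found.
function encodingFromBOM pData
   local tCount, tB1, tB2
   put the number of bytes in pData into tCount
   if tCount < 2 then return empty
   put byteToNum(byte 1 of pData) into tB1
   put byteToNum(byte 2 of pData) into tB2
   if tCount >= 3 then
      -- UTF-8 mark is EF BB BF
      if tB1 is 239 and tB2 is 187 then
         if byteToNum(byte 3 of pData) is 191 then return "UTF-8"
      end if
   end if
   -- UTF-16 marks are FE FF (big endian) and FF FE (little endian)
   if tB1 is 254 and tB2 is 255 then return "UTF-16BE"
   if tB1 is 255 and tB2 is 254 then return "UTF-16LE"
   return empty
end encodingFromBOM
```

If a mark is found, its bytes need to be skipped before the remainder
is passed to textDecode or searched with byteOffset.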
> How do I find a substring in binary data in a way that will tell me
> the number of bytes of the offset?
If you have loaded binary data, and want to find the offset of a
sequence of bytes within it then use 'byteOffset'.
If your binary data is actually encoded text data, then you need to
textEncode the 'needle' (the thing you are searching for) first, making
sure you do so with the encoding which the encoded text data requires:
-- put the encoded/raw data you want to search into tHaystackData
put textEncode(tNeedleText, <encoding of target data>) into tNeedleData
put byteOffset(tNeedleData, tHaystackData) into tOffset
However, it is important to note that this only allows an exact match -
you can't do caseless searches like this (or searches where you want
'e-acute' to match both 'e-acute' and 'e,combining-acute').
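The exact-match steps above can be wrapped up as a small function. This
is only a sketch: the handler name is hypothetical, and the example data
is built with numToCodepoint so that the byte and character positions
visibly differ:

```livecode
-- Hypothetical helper: 1-based byte offset of pNeedleText within
-- pHaystackData, where pHaystackData holds text encoded as pEncoding.
-- Exact (byte-for-byte) matching only; returns 0 if there is no match.
function encodedByteOffset pNeedleText, pHaystackData, pEncoding
   return byteOffset(textEncode(pNeedleText, pEncoding), pHaystackData)
end encodedByteOffset

on mouseUp
   local tText, tData
   -- "na" + i-diaeresis + "ve caf" + e-acute
   put "na" & numToCodepoint(239) & "ve caf" & numToCodepoint(233) into tText
   put textEncode(tText, "UTF-8") into tData
   put encodedByteOffset("caf", tData, "UTF-8") & comma & offset("caf", tText)
end mouseUp
```

Here the result should be 8,7: the i-diaeresis takes two bytes in UTF-8,
so the byte offset is one greater than the character offset.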
In the case of wanting to do caseless searches, then you need to do
something like this:
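put textDecode(tHaystackData, <encoding of data>) into tHaystackText
put offset(tNeedleText, tHaystackText) into tNeedleOffset
-- count the bytes before the match, then add 1
-- (only meaningful if tNeedleOffset is not 0, i.e. a match was found)
put the number of bytes in textEncode(char 1 to tNeedleOffset - 1 of \
tHaystackText, <encoding of data>) into tNeedleByteOffset
add 1 to tNeedleByteOffset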
i.e. the operation you are wanting to perform is 'offset of <needle> in
<data> when using encoding <encoding>', which might make a useful engine
addition - feel free to file an enhancement request, although the above
snippet should work in script with the operations we currently have.
(Similarly, your 'byteLen' function is actually 'length of string in
encoding <encoding>' - that also might be a useful engine addition, but
can also be done in script now, as outlined above.)
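As a sketch of what 'offset of <needle> in <data> when using encoding
<encoding>' could look like in script (the handler name is hypothetical),
the caseless steps can be combined with a guard for the no-match case.
It assumes that re-encoding the decoded text reproduces the original
bytes, which may not hold if the data starts with a Byte Order Mark or
contains invalid sequences:

```livecode
-- Hypothetical helper: 1-based byte offset of the first caseless match
-- of pNeedleText in pHaystackData (text encoded as pEncoding).
-- Returns 0 if there is no match.
function caselessByteOffset pNeedleText, pHaystackData, pEncoding
   local tHaystackText, tCharOffset, tByteOffset
   put textDecode(pHaystackData, pEncoding) into tHaystackText
   set the caseSensitive to false -- the default, made explicit here
   put offset(pNeedleText, tHaystackText) into tCharOffset
   if tCharOffset is 0 then return 0
   -- bytes occupied by everything before the match, plus 1
   put the number of bytes in textEncode(char 1 to tCharOffset - 1 of \
         tHaystackText, pEncoding) into tByteOffset
   return tByteOffset + 1
end caselessByteOffset
```

For UTF-8 data holding "na" & numToCodepoint(239) & "ve CAF" &
numToCodepoint(233), searching for "caf" this way should return 8,
whereas an exact byteOffset search for the encoded "caf" finds nothing.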
Warmest Regards,
Mark.
--
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps