Translate metadata to field content

Fri Feb 21 03:51:41 EST 2020

On 2020-02-21 00:29, J. Landman Gay via use-livecode wrote:
> So glad you chimed in, Mark. This is pretty impressive. I'll need to
> use the "for each element" structure because my tags are not unique,
> but it still is much faster. When clicking a tag at the top of the
> document that links to the last anchor at the bottom of the text, I
> get a timing of about 25ms. If I omit the timing for loading the
> htmltext and the selection of the text at the end of the handler it
> brings the timing to almost 0. The test text is long, but not nearly
> as long as Bernd's sample.

Glad I could help - although to be fair, all I did was optimize what
Bernd (and Richard) had already proposed.

One thing I did notice through testing was that the actual styled content
makes a great deal of difference to performance. I also tried against the
DataGrid behavior (replicated several times), and then also against some
styled 'Lorem Ipsum' (https://loripsum.net/) of about the same length
(around 8MB of htmlText, with the anchor being searched for on the last
word). The difference is that the DG has many more style runs
(unsurprisingly) and almost all are single words. So timings need to be
taken against a representative sample of the data you are actually
working with.

> I need to select the entire range of text covered by the metadata
> span, not just a single word. I've got that working, but since we're
> on a roll here, I wonder if there's a more optimal way to do it.

I did wonder if that would be the case...

> I'm using chars instead of codepoints because when I tried it, they
> both gave the same number. Should I change that?

Both characters and codepoints run the risk of requiring a linear scan of
the string to calculate the length - strictly speaking this will occur if
the engine is not sure whether characters / codepoints have a 1-1 map to
codeunits (for example if your string has Unicode chars and it hasn't
analysed it). Therefore you should definitely use codeunits.
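
As a rough illustration of how the three counts can differ (a sketch only:
the handler name and sample string are made up, and the counts in the comment
assume LiveCode 7 or later, where a codeunit is a UTF-16 unit):

```livecode
command CompareChunkCounts
   local tText
   -- "a" followed by U+1F600 (decimal 128512), a codepoint outside the
   -- Basic Multilingual Plane, so it needs two UTF-16 codeunits
   put "a" & numToCodepoint(128512) into tText

   -- Should show 2,2,3 in the message box: chars, codepoints, codeunits
   put the number of chars in tText, \
         the number of codepoints in tText, \
         the number of codeunits in tText
end CompareChunkCounts
```

Only the codeunit count is guaranteed not to need a scan of the string.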

> Also, I had to add 3 to tStartChar to get the right starting point but
> I can't figure out why. Otherwise it selects the last character before
> the metadata span as the starting point.

Was the anchor in the third paragraph by any chance?

The styledText representation makes the paragraph separator (return char)
implicit (as it is in the field object internally) - so you need to bump
the tTotalChars by one before the final end repeat to account for that
(as the codeunit ranges the field uses *include* the implicit return char).
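
To make the arithmetic concrete, here is a minimal sketch of a single-match
search that returns a range relative to the whole field. The function name
and variable names are my own invention rather than your script, and it
assumes the usual styledText layout (numerically keyed paragraphs, each with
a "runs" array whose elements have "text" and "metadata" keys):

```livecode
function FindAnchorInField pStyledText, pAnchor
   local tTotal, tRun
   put 0 into tTotal
   repeat with tLineIndex = 1 to the number of elements in pStyledText
      repeat with tRunIndex = 1 to the number of elements in pStyledText[tLineIndex]["runs"]
         put pStyledText[tLineIndex]["runs"][tRunIndex] into tRun
         if tRun["metadata"] is pAnchor then
            -- start,finish as codeunit offsets from the start of the field
            return (tTotal + 1) & comma & \
                  (tTotal + the number of codeunits in tRun["text"])
         end if
         add the number of codeunits in tRun["text"] to tTotal
      end repeat
      -- Account for the implicit return char that ends this paragraph
      add 1 to tTotal
   end repeat
   return empty
end FindAnchorInField
```

Without the 'add 1' the result drifts backwards by one codeunit for every
paragraph that precedes the match.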

So I couldn't help but fettle with this a little more. You mention that
your 'anchors' are not unique in a document. This raises the question of
what happens if there is more than one match...

This handler finds all occurrences of a given anchor in the text. As we
are searching for all of them, it can use 'repeat for each key' iteration
in both loops:

function FindAllAnchors pStyledText, pAnchor
    /* Return-delimited list of results, each line is of the form:
    *     start,finish,line
    * Each of these corresponds to a chunk of the form:
    *      CODEUNIT start TO finish OF LINE line OF field
    */
    local tResults

    /* Iterate over the lines of the text in arbitrary order - the order
    * doesn't matter as we keep the reference to the line any match is in. */
    repeat for each key tLineIndex in pStyledText
       /* Fetch the runs in the line, so we don't have to keep looking it up */
       local tRuns
       put pStyledText[tLineIndex]["runs"] into tRuns

       /* Iterate over the runs in arbitrary order - assuming that the number
       * of potentially matching runs is minuscule compared to the number of
       * non-matching runs, it is faster to iterate in hash-order. */
       repeat for each key tRunIndex in tRuns
          /* If we find a match, work out its offset in the line */
          if tRuns[tRunIndex]["metadata"] is pAnchor then
             /* Calculate the number of codeunits before this run */
             local tCodeunitCount
             put 0 into tCodeunitCount
             repeat with tPreviousRunIndex = 1 to tRunIndex - 1
                add the number of codeunits in tRuns[tPreviousRunIndex]["text"] to tCodeunitCount
             end repeat

             /* Append the result to the results list. */
             put tCodeunitCount + 1, \
                   tCodeunitCount + the number of codeunits in tRuns[tRunIndex]["text"], \
                   tLineIndex & \
                   return after tResults
          end if
       end repeat
    end repeat

    /* We want the results sorted first by line index, then by starting
    * codeunit within the line (so we get a top-to-bottom, left-to-right
    * order). As the 'sort' command is stable, we can do this by first
    * sorting by the secondary factor (codeunit start), then sorting again
    * by the primary factor (line index). */
    sort lines of tResults ascending numeric by item 1 of each
    sort lines of tResults ascending numeric by item 3 of each

    /* Return the set of results. */
    return tResults
end FindAllAnchors
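
A minimal sketch of how the result might be used to select the first match.
The handler name, the field name "Content" and the anchor "chapter-1" are
placeholders for illustration, not anything from your stack:

```livecode
command ShowFirstAnchor
   local tMatches, tFirst
   put FindAllAnchors(the styledText of field "Content", "chapter-1") into tMatches
   if tMatches is empty then exit ShowFirstAnchor

   -- Each line of the result is start,finish,line
   put line 1 of tMatches into tFirst
   select codeunit (item 1 of tFirst) to (item 2 of tFirst) \
         of line (item 3 of tFirst) of field "Content"
end ShowFirstAnchor
```

Because the offsets are relative to the line, there is no implicit return
char to account for here.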

Testing this on 8MB of styled Lorem Ipsum text, with the same anchor at:
   word 1
   word 1000
   the middle word
   word -1000
   word -1

this handler takes slightly less time than searching for a single anchor
at word -1 of the field using 'repeat with' loops.
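
If you want to repeat the comparison on your own data, something along these
lines would do. It is only a sketch: the handler name, field name and anchor
are placeholders, and it times the styledText fetch separately from the
search, since the fetch can dominate:

```livecode
command TimeAnchorSearch
   local tStart, tStyledText, tResults, tFetchTime, tSearchTime

   put the milliseconds into tStart
   put the styledText of field "Content" into tStyledText
   put the milliseconds - tStart into tFetchTime

   put the milliseconds into tStart
   put FindAllAnchors(tStyledText, "chapter-1") into tResults
   put the milliseconds - tStart into tSearchTime

   put "fetch:" && tFetchTime && "ms, search:" && tSearchTime && "ms," && \
         the number of lines in tResults && "match(es)"
end TimeAnchorSearch
```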

Whether this is helpful or not depends on whether you need to 'do
something' when there is more than one matching anchor in a document :)

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps