Translate metadata to field content

Fri Feb 21 15:10:11 EST 2020

On 2/21/20 2:51 AM, Mark Waddingham via use-livecode wrote:
> Both characters and codepoints run the risk of requiring a linear scan of
> the string to calculate the length - strictly speaking this will occur if
> the engine is not sure whether character / codepoint have a 1-1 map to
> codeunits (for example if your string has Unicode chars and it hasn't
> analysed it). Therefore you should definitely use codeunits.

Right now the text is all Roman, but I'll use codeunits anyway to make it future-proof.

> The styledText representation makes the paragraph separator (return char)
> implicit (as it is in the field object internally) - so you need to bump
> the tTotalChars by one before the final end repeat to account for that (as the
> codeunit ranges the field uses *include* the implicit return char)

I should have noticed that; it seems so obvious now. There was no ellipsis in the variable 
watcher, which there would have been if a return character was there.

> 
> So I couldn't help but fettle with this a little more. You mention that your
> 'anchors' are not unique in a document. This raises the question of what
> happens if there is more than one match...
> 
> This handler finds all occurrences of a given anchor in the text. As we are
> searching for all of them, it can use repeat for each key iteration in both
> loops:
> 
> function FindAllAnchors pStyledText, pAnchor
>     /* Return-delimited list of results, each line is of the form:
>     *     start,finish,line
>     * Each of these corresponds to a chunk of the form:
>     *      CODEUNIT start TO finish OF LINE line OF field
>     */
>     local tResults
> 
>     /* Iterate over the lines of the text in arbitrary order - the order doesn't
>     * matter as we keep the reference to the line any match is in. */
>     repeat for each key tLineIndex in pStyledText
>        /* Fetch the runs in the line, so we don't have to keep looking it up */
>        local tRuns
>        put pStyledText[tLineIndex]["runs"] into tRuns
> 
>        /* Iterate over the runs in arbitrary order - assuming that the number
>        * of potentially matching runs is minuscule compared to the number of
>        * non-matching runs, it is faster to iterate in hash-order. */
>        repeat for each key tRunIndex in tRuns
>           /* If we find a match, work out its offset in the line */
>           if tRuns[tRunIndex]["metadata"] is pAnchor then
>              /* Calculate the number of codeunits before this run */
>              local tCodeunitCount
>              put 0 into tCodeunitCount
>              repeat with tPreviousRunIndex = 1 to tRunIndex - 1
>                 add the number of codeunits in tRuns[tPreviousRunIndex]["text"] to tCodeunitCount
>              end repeat
> 
>              /* Append the result to the results list. */
>              put tCodeunitCount + 1, \
>                    tCodeunitCount + the number of codeunits in tRuns[tRunIndex]["text"], \
>                    tLineIndex & \
>                    return after tResults
>           end if
>        end repeat
>     end repeat
> 
>     /* We want the results sorted first by line index, then by starting codeunit
>     * within the line (so we get a top-to-bottom, left-to-right order). As the
>     * 'sort' command is stable, we can do this by first sorting by the secondary
>     * factor (codeunit start), then sorting again by the primary factor (line
>     * index). */
>     sort lines of tResults ascending numeric by item 1 of each
>     sort lines of tResults ascending numeric by item 3 of each
> 
>     /* Return the set of results. */
>     return tResults
> end FindAllAnchors
> 
> Testing this on 8MB of styled Lorem Ipsum text, with the same anchor at:
>    word 1
>    word 1000
>    the middle word
>    word -1000
>    word -1
> 
> Then this handler takes slightly less time than searching for a single anchor
> at word -1 of the field using 'repeat with' loops.

Fantastic, it got the timing down to about 6ms give or take, not counting loading the 
styledtext or selecting it after.

> 
> Whether this is helpful or not depends if you need to 'do something' when there
> is more than one matching anchor in a document :)

All I require is to scroll to the correct position in the text and briefly hilite the metadata 
span to draw the user's attention to the found text. I can compare the results returned from 
your function to find the earliest and latest numbered instances and work out the hiliting from 
there. That's possible because the duplicate metadata instances are all grouped together rather 
than scattered around.
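
A rough sketch of that comparison, assuming the list comes back sorted exactly as FindAllAnchors returns it (one "start,finish,line" entry per line, top to bottom) and that every match belongs to the same grouped span. The handler name AnchorExtent is only a placeholder of mine, not anything from the engine:

```livecode
function AnchorExtent pResults
   /* pResults is the sorted, return-delimited list from FindAllAnchors.
   *  Each line has the form: start,finish,line */
   local tFirst, tLast

   if pResults is empty then
      return empty
   end if

   /* The list is sorted, so the first line is the earliest match and the
   *  last line is the latest one. */
   put line 1 of pResults into tFirst
   put line -1 of pResults into tLast

   /* Result: startCodeunit,startLine,endCodeunit,endLine */
   return item 1 of tFirst & comma & item 3 of tFirst & comma & \
         item 2 of tLast & comma & item 3 of tLast
end AnchorExtent
```

When both line numbers are the same, the hilite should just be the chunk form described in the handler's comment (codeunit start to finish of that line of the field). A span that crosses lines still needs the two ends converted to a single range, which this sketch doesn't attempt.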

The only reason I have more than one instance is because there are href links inside the 
metadata spans, and LC translates that into separate metadata spans if there is more than one 
link, or there's a line break. If it would honor the entire span regardless of those, then each 
metadata tag would be unique. Some of my metadata needs to span more than one line, and/or 
contain multiple inner links.

That's also why, in my initial attempt using counters, I could exit the loop as soon as I found 
a non-match after locating the initial one. When going sequentially through the text, there 
won't be any more duplicates once the metadata changes.
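
For comparison, the sequential version looks roughly like the following. This is a reconstruction for illustration rather than my exact script: the handler and variable names are made up, it assumes the same array layout as the quoted handler (metadata stored directly on each run), and it includes the extra codeunit per line for the implicit return.

```livecode
function FindFirstAnchorSpan pStyledText, pAnchor
   local tLineIndex, tRunIndex, tRuns, tLength
   local tTotal, tStart, tEnd

   put 0 into tTotal   -- codeunits counted so far in the whole field
   put 0 into tStart   -- 0 means "no match found yet"
   put 0 into tEnd

   repeat with tLineIndex = 1 to the number of elements in pStyledText
      put pStyledText[tLineIndex]["runs"] into tRuns

      repeat with tRunIndex = 1 to the number of elements in tRuns
         put the number of codeunits in tRuns[tRunIndex]["text"] into tLength

         if tRuns[tRunIndex]["metadata"] is pAnchor then
            /* First matching run sets the start; every match extends the end */
            if tStart is 0 then
               put tTotal + 1 into tStart
            end if
            put tTotal + tLength into tEnd
         else if tStart is not 0 then
            /* Metadata changed after a match, so the grouped span is done */
            return tStart & comma & tEnd
         end if

         add tLength to tTotal
      end repeat

      /* Account for the implicit return at the end of each paragraph */
      add 1 to tTotal
   end repeat

   if tStart is 0 then
      return empty
   end if
   return tStart & comma & tEnd
end FindFirstAnchorSpan
```

The return value is a "start,finish" pair counted in codeunits from the beginning of the field, so a span that crosses a line break comes back as one range.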

-- 
Jacqueline Landman Gay         |     jacque at hyperactivesw.com
HyperActive Software           |     http://www.hyperactivesw.com