Jane Austen's peculiarity

James Hale james at thehales.id.au
Tue Aug 11 10:31:42 EDT 2015


Of course I couldn't resist a tinker. I too am into text manipulation/searching and wondered how I would go about this.
I looked at the repeat loops and realised they would run much faster if they were inverted, as I am sure the list of verbs is shorter than the number of lines of text being searched.
I also wanted to use a "repeat for each" construct, as this is usually orders of magnitude faster.
But this meant I needed the line number of each hit, and adding a counter seemed counterproductive.
So I settled on using lineoffset.

Here was my go...

on mouseUp
   put empty into fld "COOKED"
   put empty into fld "STARTT"
   put empty into fld "STOPT"
   put empty into lCooked1
   put "started : " & the long time into fld "STARTT"
   put the milliseconds into st
   put fld "TEKST" into TEKST
   put fld "WERBS" into WERBS
   
   repeat for each line KWERBS in WERBS
      put "was " & KWERBS into FRAZE
      put "were " & KWERBS into FRAZE2
      -- loffesta/loffestb hold the number of lines of TEKST to skip on the next search
      put 0 into loffesta
      put 0 into loffestb
      
      put 1 into lcounta
      put 1 into lcountb
      repeat while lcounta <> 0
         -- lineoffset counts from the first unskipped line and returns 0 when nothing is found
         put lineoffset(FRAZE,TEKST,loffesta) into lcounta
         if lcounta = 0 then
            exit repeat
         end if
         put lcounta + loffesta into thelinea
         put thelinea & " : " & line thelinea of TEKST & cr after lCooked1
         -- skip up to and including this hit (the absolute line number, not the relative count)
         put thelinea into loffesta
      end repeat
      
      repeat while lcountb <> 0
         put lineoffset(FRAZE2,TEKST,loffestb) into lcountb
         if lcountb = 0 then
            exit repeat
         end if
         put lcountb + loffestb into thelineb
         put thelineb & " : " & line thelineb of TEKST & cr after lCooked1
         put thelineb into loffestb
      end repeat
   end repeat
   put the number of lines of lCooked1 & " found"
   put lCooked1 into fld "COOKED"
   put "finished : " & the long time into fld "STOPT"
   put the milliseconds into nd
   put nd - st into fld "TIMET"
end mouseUp
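
The two inner loops differ only in the phrase they search for, so they could be folded into one function. What follows is only a sketch and has not been run against Richmond's stack; the function name linesContaining and its variable names are made up for illustration. It relies on the third parameter of lineoffset being the number of lines to skip, with the return value counted from the first unskipped line.

function linesContaining pPhrase, pText
   local tSkip, tFound, tLineNo, tResult
   put 0 into tSkip
   repeat forever
      -- relative line number of the next hit, or 0 if there is none
      put lineoffset(pPhrase, pText, tSkip) into tFound
      if tFound = 0 then exit repeat
      -- convert the relative count into an absolute line number
      put tSkip + tFound into tLineNo
      put tLineNo & " : " & line tLineNo of pText & cr after tResult
      -- resume the search on the line after this hit
      put tLineNo into tSkip
   end repeat
   return tResult
end linesContaining

Inside the "repeat for each" loop it would be called once per phrase, for example:

   put linesContaining("was " & KWERBS, TEKST) after lCooked1
   put linesContaining("were " & KWERBS, TEKST) after lCooked1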


I haven't tried returning to the original repeat order to see if that is faster, but running the above on Richmond's sample stack for the "WAS/WERE" case delivered a result of three lines:

2663 : officers, who in comparison with the stranger, were become "stupid,
731 : was returned in due form. Miss Bennet's pleasing manners grew on the
4116 : were returned, and to lament over his absence from the Netherfield ball.

in 89 msec on my Mac running LiveCode 7.1 DP1.

I was then going to look at colourising the found chunks when I realised that the supplied text has line breaks within each paragraph.
This means none of the proposed solutions (including Richmond's own) will find the desired phrase if it falls across one of these line breaks.
For my solution using lineoffset this is a dead end WHILE these line breaks within a paragraph remain.
For the other solutions a simple expedient is to increase the number of FRAZEs to four...

put "was " &  KWERBS into FRAZE
put "was" & cr  &  KWERBS into FRAZE2
put "were " & KWERBS into FRAZE3
put "were"  & cr & KWERBS into FRAZE4

Each of the two extra FRAZEs spans two "lines", so they are not valid arguments for a lineoffset function, but the other solutions could use them.

Or so I thought.
However, given the unpredictability of the formatting of the text, this was much too simplistic a solution.
It breaks down where paragraphs are indented using spaces, because the line break is then followed by padding rather than by the verb!
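
For the solutions that do not depend on lineoffset, one possible way around the padding problem is a regular expression that accepts any run of whitespace (spaces, tabs or line breaks) between the two words. This is a sketch only, not something tried on the sample stack; the function name firstPhraseHit is hypothetical. It uses matchChunk, which reports only the first match, and it assumes the verb contains no characters that are special in a regular expression.

function firstPhraseHit pVerb, pText
   local tPattern, tStart, tEnd
   -- (?i) ignores case; \s+ allows spaces, tabs or line breaks between the words
   -- the outer parentheses make matchChunk report the position of the whole phrase
   put "(?i)(\b(?:was|were)\s+" & pVerb & "\b)" into tPattern
   if matchChunk(pText, tPattern, tStart, tEnd) then
      -- first and last character positions of the phrase
      return tStart & comma & tEnd
   end if
   return empty
end firstPhraseHit

If the result is not empty, char (item 1 of it) to (item 2 of it) of the text is the phrase as it appears, line break and padding included.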

So, to keep the formatting as read in is problematic without knowing the formatting used.
But if the focus is the actual text, then perhaps the "fancy" formatting is not important.

Processing the text BEFORE searching so as to remove embedded line breaks and space padding allows my original code to work fine.

Inserting the following before the REPEATs does the trick (at least with the example text):

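   -- protect every line break with a marker that does not occur in the text
   replace return with "^&*" in TEKST
   -- collapse each run of whitespace into a single space
   put "\s+" into lmultispace
   put replacetext(TEKST,lmultispace," ") into TEKST
   -- a doubled marker was a blank line between paragraphs: keep it as a line break
   replace "^&*^&*" with return in TEKST
   -- a single marker was a line break inside a paragraph: join the lines
   replace "^&*" with " " in TEKST
   -- restore the blank line between paragraphs
   replace return with return & return in TEKST

The only downside is that the time to execute went from 89 msec to 616 msec.

Your mileage may vary.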

NOTE: My method does not identify multiple instances of the FRAZE within a single line, however once it is found in a line it would be simple to see if it occurred again.
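
As a sketch of that second check (untested here, and the function name countInLine is made up for illustration), the offset function can be called repeatedly on the found line, skipping past each match in turn. Like lineoffset, offset returns a position counted from the first unskipped character.

function countInLine pPhrase, pLine
   local tSkip, tFound, tCount
   put 0 into tSkip
   put 0 into tCount
   repeat forever
      -- position of the next match relative to the skipped characters, or 0
      put offset(pPhrase, pLine, tSkip) into tFound
      if tFound = 0 then exit repeat
      add 1 to tCount
      -- skip to the last character of this match before searching again
      put tSkip + tFound + length(pPhrase) - 1 into tSkip
   end repeat
   return tCount
end countInLine

For example, countInLine(FRAZE, line thelinea of TEKST) would give the number of times FRAZE occurs in a line that lineoffset has already found.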

Thanks for the diversion, Richmond.

James


