Jane Austen's peculiarity
james at thehales.id.au
Tue Aug 11 16:31:42 CEST 2015
Of course I couldn't resist a tinker. I too am into text manipulation/searching and wondered how I would go about this.
I looked at the repeat loops and realised they would run much faster if they were inverted, as I am sure the list of verbs would be far shorter than the text being searched.
I also wanted to use a "repeat for each" construct as this is usually orders of magnitude faster.
But this meant I needed the line count, and adding a counter seemed counterproductive.
So I settled on using the lineoffset.
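For comparison, here is a minimal sketch of the counter-based alternative that was not used. The handler name numberLines and the variables tLineNo, tLine and tOut are hypothetical, chosen only for illustration. It shows why a "repeat for each" loop needs a hand-kept counter if the line number is wanted:

```livecode
function numberLines pText
   -- hypothetical helper: prefixes every line of pText with its line number
   put empty into tOut
   put 0 into tLineNo
   repeat for each line tLine in pText
      -- "repeat for each" exposes no index, so the line number is counted by hand
      add 1 to tLineNo
      put tLineNo & " : " & tLine & cr after tOut
   end repeat
   delete last char of tOut -- drop the trailing return
   return tOut
end numberLines
```

Using lineoffset instead hands back the line position directly, so no such counter is needed.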
Here was my go...
put empty into fld "COOKED"
put empty into fld "STARTT"
put empty into fld "STOPT"
put empty into lCooked1
put "started : " & the long time into fld "STARTT"
put the milliseconds into st
put fld "TEKST" into TEKST
put fld "WERBS" into WERBS
put 0 into acounter
put the number of lines of TEKST into numlines
repeat for each line KWERBS in WERBS
   put "was " & KWERBS into FRAZE
   put "were " & KWERBS into FRAZE2
   -- loffesta/loffestb hold the number of lines already searched (lines to skip)
   put 0 into loffesta
   put 0 into loffestb
   put 1 into lcounta
   put 1 into lcountb
   repeat while lcounta <> 0
      -- lineoffset returns the position relative to the skipped lines, or 0 if not found
      put lineoffset(FRAZE,TEKST,loffesta) into lcounta
      if lcounta <> 0 then
         put lcounta + loffesta into thelinea
         put thelinea & " : " & line thelinea of TEKST & cr after lCooked1
         -- resume the next search after the line just found
         put thelinea into loffesta
      end if
   end repeat
   repeat while lcountb <> 0
      put lineoffset(FRAZE2,TEKST,loffestb) into lcountb
      if lcountb <> 0 then
         put lcountb + loffestb into thelineb
         put thelineb & " : " & line thelineb of TEKST & cr after lCooked1
         put thelineb into loffestb
      end if
   end repeat
end repeat
put the number of lines of lCooked1 & " found"
put lCooked1 into fld "COOKED"
put "finished : " & the long time into fld "STOPT"
put the milliseconds into nd
put nd - st into fld "TIMET"
I haven't tried returning to the original repeat order to see if that was faster, but running the above on Richmond's sample stack for the "WAS/WERE" case delivered a result of three lines:
2663 : officers, who in comparison with the stranger, were become "stupid,
731 : was returned in due form. Miss Bennet's pleasing manners grew on the
4116 : were returned, and to lament over his absence from the Netherfield ball.
in 89 msec on my Mac running LC7.1Dp1
I was then going to examine colourising the found chunks when I realised that the supplied text had line breaks within each paragraph.
This means none of the proposed solutions (including Richmond's own) will find the desired phrase if it falls across one of these line breaks.
For my solution using lineoffset this is a dead end WHILE these line breaks within a paragraph remain.
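The problem can be reproduced with a small sketch. The two-line sample text and the variable names tSample, tBefore and tAfter are made up for illustration and are not taken from the sample stack:

```livecode
on mouseUp
   -- hypothetical sample: the phrase "was returned" straddles the line break
   put "the letter was" & cr & "returned in due form" into tSample
   -- no single line contains the whole phrase, so lineOffset finds nothing
   put lineOffset("was returned", tSample) into tBefore
   -- join the two lines and search again
   replace cr with space in tSample
   put lineOffset("was returned", tSample) into tAfter
   put tBefore && tAfter -- shows "0 1" in the message box
end mouseUp
```

The first search fails because the phrase as typed, with a space, never occurs in the text; it only appears once the line break is replaced by a space.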
For the other solutions a simple expedient is to increase the number of FRAZEs to four...
put "was " & KWERBS into FRAZE
put "was" & cr & KWERBS into FRAZE2
put "were " & KWERBS into FRAZE3
put "were" & cr & KWERBS into FRAZE4
This addition makes the extra FRAZEs span two "lines" each, and thus not valid arguments for the lineoffset function.
Or so I thought.
However given the unpredictability of the formatting of the text this was a much too simplistic solution.
This solution breaks down where paragraphs are indented using spaces!
So keeping the formatting as read in is problematic without knowing what formatting was used.
But if the focus is the actual text, then perhaps the "fancy" formatting is not important.
Processing the text BEFORE searching so as to remove embedded line breaks and space padding allows my original code to work fine.
Inserting the following before the REPEATs does the trick (at least with the example text):
replace return with "^&*" in TEKST
put "\s+" into lmultispace
put replacetext (TEKST,lmultispace," ") into TEKST
replace "^&*^&*" with return in TEKST
replace "^&*" with " " in TEKST
replace return with return & return in TEKST
The only downside is that the time to execute went from 89 msec to 616 msec.
Your mileage may vary.
NOTE: My method does not identify multiple instances of the FRAZE within a single line, however once it is found in a line it would be simple to see if it occurred again.
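One way that check could be sketched is with the offset function and its charsToSkip parameter. The function name countInLine and its variables are hypothetical and not part of the code above:

```livecode
function countInLine pPhrase, pLine
   -- hypothetical helper: counts non-overlapping occurrences of pPhrase in pLine
   put 0 into tCount
   put 0 into tSkip
   repeat forever
      -- offset returns the position relative to the skipped chars, or 0 if not found
      put offset(pPhrase, pLine, tSkip) into tPos
      if tPos = 0 then exit repeat
      add 1 to tCount
      -- skip past the end of the match just found
      add tPos + length(pPhrase) - 1 to tSkip
   end repeat
   return tCount
end countInLine
```

Called as countInLine(FRAZE, line thelinea of TEKST), it would return how many times the phrase occurs in a line that lineoffset has already located; for example, countInLine("was returned", "was returned and was returned") gives 2.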
Thanks for the diversion Richmond.