Jane Austen's peculiarity

Peter M. Brigham pmbrig at gmail.com
Sat Aug 8 22:32:38 EDT 2015

On Aug 8, 2015, at 6:41 PM, Richard Gaskin wrote:

> Richmond wrote:
>> function findWere pText
>>   -- returns a comma-delim list of all the line offsets matching "were *ed"
>>   --    or "were" && <a word in your preterite list>.
>>   put fld "WERBS" into pretList
>>   put wordOffsets("were", pText, true) into offList
> Unless you're using a custom build, wouldn't that be "wordOffset" (singular)?

I included the utility functions wordOffsets() and offsets() in one of my previous posts. I probably should have repeated them. I use them a lot -- there are many contexts in which they are useful.

function wordOffsets str, pContainer, matchWhole
   -- returns a comma-delimited list of all the wordOffsets of str in pContainer
   -- if matchWhole = true then only whole words are located
   --    else will find word matches everywhere str is part of a word in pContainer
   --    note that in LC words will include adjacent punctuation,
   --       so using matchWhole = true may exclude too many "words"
   -- duplicates are stripped out
   --    eg wordOffsets("co","the common coconut") = 2,3   not   2,3,3
   -- note: to get the last wordOffset of a string in a container (often useful)
   --    use "item -1 of wordOffsets(...)"
   -- by Peter M. Brigham, pmbrig at gmail.com — freeware
   -- requires offsets()
   if matchWhole = empty then put false into matchWhole
   put offsets(str,pContainer) into offList
   if offList = 0 then return 0
   repeat for each item i in offList
      put the number of words of (char 1 to i of pContainer) into wdNbr
      if matchWhole then
         if word wdNbr of pContainer <> str then next repeat
      end if
      put 1 into A[wdNbr]
      -- using an array avoids duplicates
   end repeat
   put the keys of A into wordList
   sort lines of wordList ascending numeric
   replace cr with comma in wordList
   return wordList
end wordOffsets

function offsets str, pContainer
   -- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
   --     ie, overlapping offsets are not counted
   -- note: to get the last occurrence of a string in a container (often useful)
   --     use "item -1 of offsets(...)"
   -- by Peter M. Brigham, pmbrig at gmail.com — freeware
   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
      -- startPoint is the number of chars to skip; offset() returns a position relative to it
      put offset(str,pContainer,startPoint) into thisOffset
      if thisOffset = 0 then exit repeat
      add thisOffset to startPoint
      put startPoint & comma after offsetList
      add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets
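
For anyone who wants to check the two functions quickly, here is a small test handler. It is only a sketch: it assumes offsets() and wordOffsets() are somewhere in the message path, and the sample strings, the variable names and the choice of a mouseUp handler are mine, just for illustration. The expected values in the comments follow from the notes in the function headers.

```livecode
on mouseUp
   -- overlapping matches are skipped
   put offsets("xx","xxxxxx") into tOffsets -- 1,3,5
   -- partial matches, duplicates stripped
   put wordOffsets("co","the common coconut") into tPartial -- 2,3
   -- no third parameter: matchWhole defaults to false
   put wordOffsets("were","They were, we were") into tAny -- 2,4
   -- matchWhole = true: word 2 is "were," (with the comma), so it is excluded
   put wordOffsets("were","They were, we were",true) into tWhole -- 4
   -- show the four results in the message box, one per line
   put tOffsets & cr & tPartial & cr & tAny & cr & tWhole
end mouseUp
```

The last two calls show the punctuation caveat mentioned in the wordOffsets comments: with matchWhole = true, a word with a trailing comma no longer matches.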

> Also, if you're using v7 you might consider "trueWordOffset", which accounts for quote characters and omits punctuation that characterize the historic definition of "word" in xTalks.
> The Unicode libraries in v7 make many natural-language parsing tasks much simpler - there's even a new "sentence" chunk type.

Yes, with newer versions the engine now does stuff that required scripted functions in earlier LC versions. I'm still not using later versions because my work stacks don't run in them properly, so I have all these utility functions in my library.

-- Peter

Peter M. Brigham
pmbrig at gmail.com
