Jane Austen's peculiarity

Peter M. Brigham pmbrig at gmail.com
Sat Aug 8 14:18:19 EDT 2015


On Aug 8, 2015, at 1:56 PM, Richmond wrote:

> On 08/08/15 20:48, Peter M. Brigham wrote:
>> On Aug 8, 2015, at 12:42 PM, Richmond wrote:
>> 
>>> Jane Austen [amongst others] uses an interesting type of grammatical construction of this sort:
>>> 
>>> After breakfast, the girls walked to Meryton to inquire if Mr. Wickham
>>> _were returned_, and to lament over his absence from the Netherfield ball.
>>> 
>>> Pride and Prejudice.
>>> 
>>> I would like to analyse a million word corpus that I have been granted access to for this type of construction.
>>> 
>>> However, I don't want to find examples of only 'were returned', but all examples of
>>> 
>>> were + infinitive / preterite / past participle
>>> 
>>> and, presumably for that I shall have to use wildcards . . .
>>> 
>>> OR ???
>> I'll leave it to those who speak Regex to suggest a wildcard solution. Here's another one (not tested) that will catch past participles ending in "ed".
> 
> Looks good; however, I am really looking for ALL preterites; such as 'become', so your 'ed' trap won't catch that.
> 
> I am wondering about using a listField of all the preterites that I am looking for.

If you do that, then just make the repeat loop as follows:
   repeat for each item w in offList
      put word w+1 of pText into testWord
      if testWord ends with "ed" then
         put w & comma after outList
      else if testWord is among the words of fld "preteritesList" then
         put w & comma after outList
      end if
   end repeat

This will be faster if you put the preteritesList field into a variable before the repeat loop, since it's significantly faster for the engine to access the contents of a variable compared with the contents of a field.
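
For illustration, here is a rough sketch (not tested) of how findWere might look with the field read into a variable once, before the loop. The variable name pretList and the early "return empty" when nothing is found are my own assumptions; the field name "preteritesList" is just the one used above, and wordOffsets() is the function quoted below.

```livecode
function findWere pText
   -- returns a comma-delimited list of the word offsets of "were" where the
   -- next word ends in "ed" or appears in fld "preteritesList"
   -- pretList is an assumed local name: the field is read once, outside the loop
   put fld "preteritesList" into pretList
   put wordOffsets("were", pText, true) into offList
   -- assumption: wordOffsets() returns 0 when there are no matches
   if offList = 0 then return empty
   repeat for each item w in offList
      put word w+1 of pText into testWord
      if testWord ends with "ed" or testWord is among the words of pretList then
         put w & comma after outList
      end if
   end repeat
   return item 1 to -1 of outList -- delete trailing comma
end findWere
```

Keep in mind that LC words include adjacent punctuation, so a testWord such as "returned," would match neither test unless the punctuation is stripped first.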

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig


>> Not sure how this will scale with large texts:
>> 
>> function findWere pText
>>    -- returns a comma-delim list of all the word offsets matching "were *ed"
>>    put wordOffsets("were", pText, true) into offList
>>    repeat for each item w in offList
>>       put word w+1 of pText into testWord
>>       if testWord ends with "ed" then put w & comma after outList
>>    end repeat
>>    return item 1 to -1 of outList
>> end findWere
>> 
>> function wordOffsets str, pContainer, matchWhole
>>    -- returns a comma-delimited list of all the wordOffsets of str in pContainer
>>    -- if matchWhole = true then only whole words are located
>>    --    else will find word matches everywhere str is part of a word in pContainer
>>    --    note that in LC words will include adjacent punctuation,
>>    --       so using matchWhole = true may exclude too many "words"
>>    -- duplicates are stripped out
>>    --    eg wordOffsets("co","the common coconut") = 2,3   not   2,3,3
>>    -- note: to get the last wordOffset of a string in a container (often useful)
>>    --    use "item -1 of wordOffsets(...)"
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>    -- requires offsets()
>>      
>>    if matchWhole = empty then put false into matchWhole
>>    put offsets(str,pContainer) into offList
>>    if offList = 0 then return 0
>>    repeat for each item i in offList
>>       put the number of words of (char 1 to i of pContainer) into wdNbr
>>       if matchWhole then
>>          if word wdNbr of pContainer <> str then next repeat
>>       end if
>>       put 1 into A[wdNbr]
>>       -- using an array avoids duplicates
>>    end repeat
>>    put the keys of A into wordList
>>    sort lines of wordList ascending numeric
>>    replace cr with comma in wordList
>>    return wordList
>> end wordOffsets
>> 
>> function offsets str, pContainer
>>    -- returns a comma-delimited list of all the offsets of str in pContainer
>>    -- returns 0 if not found
>>    -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
>>    --     ie, overlapping offsets are not counted
>>    -- note: to get the last occurrence of a string in a container (often useful)
>>    --     use "item -1 of offsets(...)"
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>     
>>    if str is not in pContainer then return 0
>>    put 0 into startPoint
>>    repeat
>>       put offset(str,pContainer,startPoint) into thisOffset
>>       if thisOffset = 0 then exit repeat
>>       add thisOffset to startPoint
>>       put startPoint & comma after offsetList
>>       add length(str)-1 to startPoint
>>    end repeat
>>    return item 1 to -1 of offsetList -- delete trailing comma
>> end offsets
>> 
>> P.S. I love Jane Austen. One of my favorite books of all time is "Pride and Prejudice." It's so beautifully constructed.
> 
> 
> Glad to hear that another programmer doesn't spend all their time in front of a computer screen!




