Jane Austen's peculiarity
Peter M. Brigham
pmbrig at gmail.com
Sat Aug 8 14:18:19 EDT 2015
On Aug 8, 2015, at 1:56 PM, Richmond wrote:
> On 08/08/15 20:48, Peter M. Brigham wrote:
>> On Aug 8, 2015, at 12:42 PM, Richmond wrote:
>>
>>> Jane Austen [amongst others] uses an interesting type of grammatical construction of this sort:
>>>
>>> After breakfast, the girls walked to Meryton to inquire if Mr. Wickham
>>> _were returned_, and to lament over his absence from the Netherfield ball.
>>>
>>> Pride and Prejudice.
>>>
>>> I would like to analyse a million word corpus that I have been granted access to for this type of construction.
>>>
>>> However, I don't want to find examples of only 'were returned', but all examples of
>>>
>>> were + infinitive / preterite / past participle
>>>
>>> and, presumably for that I shall have to use wildcards . . .
>>>
>>> OR ???
>> I'll leave it to those who speak Regex to suggest a wildcard solution. Here's another one (not tested) that will catch past participles ending in "ed".
>
> Looks good; however, I am really looking for ALL preterites; such as 'become', so your 'ed' trap won't catch that.
>
> I am wondering about using a listField of all the preterites that I am looking for.
If you do that, then just make the repeat loop as follows:
repeat for each item w in offList
   put word w+1 of pText into testWord
   if testWord ends with "ed" then
      put w & comma after outList
   else if testWord is among the words of fld "preteritesList" then
      put w & comma after outList
   end if
end repeat
This will be faster if you put the preteritesList field into a variable before the repeat loop, since it's significantly faster for the engine to access the contents of a variable compared with the contents of a field.
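Here is a rough sketch (not tested) of what that might look like with the field read once up front. The handler name findWereForms, the variable names, and the field "corpus" are just placeholders I made up for illustration; wordOffsets() and offsets() are the functions quoted further down.

```livecode
on mouseUp
   -- fld "corpus" is a hypothetical field holding the text to search
   put fld "preteritesList" into tPreterites -- read the field only once
   put findWereForms(fld "corpus", tPreterites) into tHits
   answer tHits
end mouseUp

function findWereForms pText, pPreterites
   -- returns a comma-delimited list of the word offsets of "were"
   -- where the next word ends in "ed" or is listed in pPreterites
   -- requires wordOffsets() and offsets()
   put wordOffsets("were", pText, true) into offList
   if offList = 0 then return empty -- "were" does not occur at all
   repeat for each item w in offList
      put word w+1 of pText into testWord
      -- note: a word chunk includes adjacent punctuation, so "returned,"
      -- will not match here unless the punctuation is stripped first
      if testWord ends with "ed" or testWord is among the words of pPreterites then
         put w & comma after outList
      end if
   end repeat
   if outList is not empty then delete char -1 of outList -- trailing comma
   return outList
end findWereForms
```

Passing the list in as a parameter means the function never touches the field itself, so the same function works whether the preterites come from a field, a file, or a custom property.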
-- Peter
Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig
>> Not sure how this will scale with large texts:
>>
>> function findWere pText
>>    -- returns a comma-delim list of all the word offsets matching "were *ed"
>>    put wordOffsets("were", pText, true) into offList
>>    if offList = 0 then return empty -- "were" not found
>>    repeat for each item w in offList
>>       put word w+1 of pText into testWord
>>       if testWord ends with "ed" then put w & comma after outList
>>    end repeat
>>    return item 1 to -1 of outList
>> end findWere
>>
>> function wordOffsets str, pContainer, matchWhole
>>    -- returns a comma-delimited list of all the wordOffsets of str in pContainer
>>    -- if matchWhole = true then only whole words are located
>>    -- else will find word matches everywhere str is part of a word in pContainer
>>    -- note that in LC words will include adjacent punctuation,
>>    -- so using matchWhole = true may exclude too many "words"
>>    -- duplicates are stripped out
>>    -- eg wordOffsets("co","the common coconut") = 2,3 not 2,3,3
>>    -- note: to get the last wordOffset of a string in a container (often useful)
>>    -- use "item -1 of wordOffsets(...)"
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>    -- requires offsets()
>>
>>    if matchWhole = empty then put false into matchWhole
>>    put offsets(str,pContainer) into offList
>>    if offList = 0 then return 0
>>    repeat for each item i in offList
>>       put the number of words of (char 1 to i of pContainer) into wdNbr
>>       if matchWhole then
>>          if word wdNbr of pContainer <> str then next repeat
>>       end if
>>       put 1 into A[wdNbr]
>>       -- using an array avoids duplicates
>>    end repeat
>>    put the keys of A into wordList
>>    sort lines of wordList ascending numeric
>>    replace cr with comma in wordList
>>    return wordList
>> end wordOffsets
>>
>> function offsets str, pContainer
>>    -- returns a comma-delimited list of all the offsets of str in pContainer
>>    -- returns 0 if not found
>>    -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
>>    -- ie, overlapping offsets are not counted
>>    -- note: to get the last occurrence of a string in a container (often useful)
>>    -- use "item -1 of offsets(...)"
>>    -- by Peter M. Brigham, pmbrig at gmail.com — freeware
>>
>>    if str is not in pContainer then return 0
>>    put 0 into startPoint
>>    repeat
>>       put offset(str,pContainer,startPoint) into thisOffset
>>       if thisOffset = 0 then exit repeat
>>       add thisOffset to startPoint
>>       put startPoint & comma after offsetList
>>       add length(str)-1 to startPoint
>>    end repeat
>>    return item 1 to -1 of offsetList -- delete trailing comma
>> end offsets
>>
>> P.S. I love Jane Austen. One of my favorite books of all time is "Pride and Prejudice." It's so beautifully constructed.
>
>
> Glad to hear that another programmer doesn't spend all their time in front of a computer screen!