Finding common words and phrases in a large block of text?

Terry Judd terry.judd at unimelb.edu.au
Thu Oct 25 17:07:07 EDT 2018


On 26/10/2018 4:27 am, "use-livecode on behalf of Tom Glod via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:

    Hi Terry, glad you found a solution.....
    
    I have a similar challenge.
    
    I did a word count, but would love to recognize the same phrases.  Did you
    just compare chunks? ... hash them? (probably redundant?)
    
    Are there any more hints you can drop about this?
    
    Thanks,
    
    Tom

Hi Tom - I've just done something like the code below. It accepts a block of text and the maximum 'phrase' length as input and returns an array keyed by phrase length; each element is a tab-delimited list of word runs (so not necessarily sensible phrases) and their counts, sorted by descending count. I think it will be good enough for my purposes.

function getWordAndPhraseCounts pText, pMaxPhraseLength
   -- tally every run of 1..pMaxPhraseLength consecutive words, sentence by sentence
   put empty into tA1
   repeat for each sentence tSentence in pText
      put the number of words in tSentence into tMax
      repeat with i = 1 to pMaxPhraseLength
         repeat with j = 1 to (tMax-i+1)
            put word j to j+i-1 of tSentence into tPhrase
            -- collapse tabs and line breaks inside the run so that each phrase
            -- stays on one line and contains no tab (needed for the lists below)
            put replaceText(tPhrase, "\s+", space) into tPhrase
            add 1 to tA1[i][tPhrase]
         end repeat
      end repeat
   end repeat
   -- for each phrase length, build a "phrase<tab>count" list sorted by descending count
   put empty into tA2
   set the itemDel to tab
   repeat for each line tLength in the keys of tA1
      put empty into tList
      repeat for each line tPhrase in the keys of tA1[tLength]
         put tPhrase&tab& tA1[tLength][tPhrase]&cr after tList
      end repeat
      delete last char of tList
      sort lines of tList descending numeric by item 2 of each
      put tList into tA2[tLength]
   end repeat
   return tA2
end getWordAndPhraseCounts
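
A minimal sketch of how the function might be called, assuming the text sits in a field named "Abstracts" and the output goes to a field named "Results" (both field names are hypothetical, not part of the code above):

on mouseUp
   -- "Abstracts" and "Results" are hypothetical field names used for illustration
   put field "Abstracts" into tText
   -- count word runs of 1, 2 and 3 words
   put getWordAndPhraseCounts(tText, 3) into tCounts
   -- tCounts[2] lists the two-word runs, one "phrase<tab>count" per line,
   -- most frequent first; show the top ten
   put line 1 to 10 of tCounts[2] into field "Results"
end mouseUp

Because the word chunk keeps attached punctuation, "text," and "text" are counted separately here; stripping punctuation before counting would be one way to merge them.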

    
    On Thu, Oct 25, 2018 at 4:27 AM Terry Judd via use-livecode <
    use-livecode at lists.runrev.com> wrote:
    
    > OK - was easier than I thought. I have something that works fast enough by
    > iterating through runs of words in each sentence in a block of text,
    > incrementing counts into an array and then sorting the contents of that
    > array by phrase length and frequency.
    >
    > Terry...
    >
    > On 25/10/2018 4:56 pm, "use-livecode on behalf of Terry Judd via
    > use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of
    > use-livecode at lists.runrev.com> wrote:
    >
    >     Hi – I’m looking to analyse some large blocks of text (journal
    > abstracts from key educational technology journals over a several-year
    > period) to find common words and phrases. Finding common words should be
    > easy enough but I’m not clear on what approach to take for finding common
    > phrases (iterating through the text capturing overlapping word runs of
    > various lengths?). Any ideas on how best to proceed?
    >
    >     TIA,
    >
    >     Terry...
    >     _______________________________________________
    >     use-livecode mailing list
    >     use-livecode at lists.runrev.com
    >     Please visit this url to subscribe, unsubscribe and manage your
    > subscription preferences:
    >     http://lists.runrev.com/mailman/listinfo/use-livecode
    


