Finding common words and phrases in a large block of text?
Terry Judd
terry.judd at unimelb.edu.au
Thu Oct 25 17:07:07 EDT 2018
On 26/10/2018 4:27 am, "use-livecode on behalf of Tom Glod via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:
Hi Terry, glad you found a solution.....
I have a similar challenge.
I did a word count, but would love to recognize the same phrases. Did you
just compare chunks? ... hash them? (probably redundant?)
Are there any more hints you can drop about this?
Thanks,
Tom
Hi Tom - I've just done something like the code below. It accepts a block of text and a maximum 'phrase' length as input and returns an array keyed by run length (1 up to that maximum). Each element holds a tab-delimited list of word runs and their counts, sorted by descending count. The runs are not necessarily sensible phrases, but I think it will be good enough for my purposes.
function getWordAndPhraseCounts pText, pMaxPhraseLength
   -- tA1[runLength][phrase] holds the raw count for each word run
   put empty into tA1
   set the itemDel to tab
   repeat for each sentence tSentence in pText
      put the number of words in tSentence into tMax
      repeat with i = 1 to pMaxPhraseLength
         -- slide a window of i words across the sentence
         repeat with j = 1 to (tMax-i+1)
            put word j to j+i-1 of tSentence into tPhrase
            add 1 to tA1[i][tPhrase]
         end repeat
      end repeat
   end repeat
   -- tA2[runLength] holds "phrase<tab>count" lines, most frequent first
   put empty into tA2
   repeat for each line tLength in the keys of tA1
      put empty into tList
      repeat for each line tPhrase in the keys of tA1[tLength]
         put tPhrase&tab& tA1[tLength][tPhrase]&cr after tList
      end repeat
      delete last char of tList
      sort lines of tList descending numeric by item 2 of each
      put tList into tA2[tLength]
   end repeat
   return tA2
end getWordAndPhraseCounts
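
A minimal usage sketch for the function above, assuming a stack with fields named "Abstracts" and "Results" (both names are hypothetical and not part of the original post). It asks for runs of up to three words and lists the ten most frequent runs of each length:

```livecode
-- Hypothetical button script; field names "Abstracts" and "Results" are assumptions
on mouseUp
   local tCounts, tReport, tLength
   put getWordAndPhraseCounts(field "Abstracts", 3) into tCounts
   repeat with tLength = 1 to 3
      put "Top runs of" && tLength && "word(s):" & cr after tReport
      -- each element is already sorted, so the first lines are the most frequent
      put line 1 to 10 of tCounts[tLength] & cr & cr after tReport
   end repeat
   put tReport into field "Results"
end mouseUp
```

Because each array element is a plain tab-delimited list, it can also be dropped straight into a field or written to a file without further processing.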
On Thu, Oct 25, 2018 at 4:27 AM Terry Judd via use-livecode <
use-livecode at lists.runrev.com> wrote:
> OK - was easier than I thought. I have something that works fast enough by
> iterating through runs of words in each sentence in a block of text,
> incrementing counts into an array and then sorting the contents of that
> array by phrase length and frequency.
>
> Terry...
>
> On 25/10/2018 4:56 pm, "use-livecode on behalf of Terry Judd via
> use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of
> use-livecode at lists.runrev.com> wrote:
>
> Hi – I’m looking to analyse some large blocks of text (journal
> abstracts from key educational technology journals over a several-year
> period) to find common words and phrases. Finding common words should be
> easy enough but I’m not clear on what approach to take for finding common
> phrases (iterating through the text capturing overlapping word runs of
> various lengths?). Any ideas on how best to proceed?
>
> TIA,
>
> Terry...
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode