Searching for a word when it's more than one word
Keith Clarke
keith.clarke at me.com
Sat Sep 1 04:32:24 EDT 2018
Very interesting, Steve. Your use case is actually very close to what I’m trying to achieve, which is to identify keywords and phrases within a corpus of text - think prioritised ‘tag cloud’ metadata.
My original plan (as a non-programmer) was to identify the most popular unique words within the corpus and then go back in to find the words either side and check their popularity, etc.
However, from what I’ve learned here, my current pseudo-logic is:
1. Parse the whole source into 1, 2, 3 and 4 trueWord chunks (ideally in one pass but I’m still struggling with my array learning curve, so probably via lists & fields so I can see my workings)
2. Remove lines containing noise words or any punctuation that would, by definition, terminate the keyword/phrase
3. Count & deduplicate the remaining lines
4. Sense-check against a ‘current keywords’ list (which appears to resonate with your town names problem?)
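To make the four steps above concrete, here is a rough sketch of the logic in Python rather than LiveCode, purely as an illustration. The noise-word list, the punctuation set and the function names are all my own assumptions, and it splits on punctuation before chunking rather than filtering afterwards, which should give the same result.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real one would be much longer.
NOISE_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def candidate_phrases(text, max_words=4):
    """Count every 1..max_words word chunk that contains no noise word."""
    counts = Counter()
    # Punctuation terminates a phrase, so cut the text into
    # punctuation-free segments before building chunks (step 2).
    for segment in re.split(r'[.,;:!?()\[\]"\n]+', text.lower()):
        words = re.findall(r"[a-z0-9'’-]+", segment)
        for n in range(1, max_words + 1):          # step 1: 1-4 word chunks
            for i in range(len(words) - n + 1):
                chunk = words[i:i + n]
                if any(w in NOISE_WORDS for w in chunk):
                    continue                        # step 2: drop noisy chunks
                counts[" ".join(chunk)] += 1        # step 3: count and dedupe
    return counts

def flag_known(counts, current_keywords):
    """Step 4: mark each phrase that already appears in the keyword list."""
    known = {k.lower() for k in current_keywords}
    return [(phrase, n, phrase in known) for phrase, n in counts.most_common()]
```

Calling candidate_phrases on the corpus and passing the result to flag_known gives a list of (phrase, count, already-known) tuples, most frequent first. Singular/plural and spelling variants are not merged here.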
From the unique-word results I’ve found, I also note issues around singular/plural, synonyms, alternative spellings, etc. - which speak to ‘fuzzy logic’ or, dare one mention, NLP (as in Natural Language Processing) capabilities.
I wonder if anyone has experimented with LiveCode accessing / using any libraries for this kind of language processing - probably another Pandora’s box containing infinity + 1 cans of worms! :-)
Back to basics, I’ll share my workings as I blunder forward and would welcome any insights the community experts have to offer.
Best,
Keith
> On 1 Sep 2018, at 05:48, Stephen MacLean via use-livecode <use-livecode at lists.runrev.com> wrote:
>
> Hi All,
>
> First, followed Keith Clarke’s thread and got a lot out of it, thank you all. That’s gone into my code snippets!
>
> Now I know, the title is not technically true, if it’s 2 words, they are distinct and different. Maybe it’s because I’ve been banging my head against this and some other things too long and need to step back, but I’m having issues getting this all to work reliably.
>
> I’m searching for town names in various text from a list of towns. Most names are one word, easy to find and count. Some names are 2 or 3 words, like East Hartford or West Palm Beach. Those clash with distinct towns like Hartford and Palm Beach. Others have their names inside other town names, like Chester inside Colchester.
>
> “is among the words of” or “is among the trueWords of” works great to find single words, but only works on single words and doesn’t consider “Chester’s” to be “Chester” (strictly speaking, it isn’t).
>
> “is in” works great for finding multiple words like “East Hartford” and "West Palm Beach", finds “Chester” in “Chester’s” but also finds “chester” in “Colchester”.
>
> At this point, I’ve been using different methods for single word towns vs multi-word towns and while generally effective, trying to accommodate for these and other oddities has made it a complete mess of code.
>
> If someone has done something similar, or can point me in the right direction, it would be greatly appreciated.
>
> TIA,
>
> Steve MacLean
>
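On the town-name matching in the quoted message, one way to get the multi-word reach of “is in” without the “chester” in “Colchester” problem is to require a word boundary on each side of the name and to check the longest names first. The sketch below is Python, not LiveCode, and rests on my own assumptions: the count_towns name is hypothetical, a trailing ’s is treated as part of the match, and a longer name “uses up” the text it matches, so East Hartford does not also count as Hartford.

```python
import re

def count_towns(text, towns):
    """Count whole-word occurrences of each town, longest names first."""
    counts = {}
    remaining = text
    for town in sorted(towns, key=len, reverse=True):
        # (?<!\w) and (?!\w) demand a non-word character (or the edge of
        # the text) on both sides, so "Chester" cannot match inside
        # "Colchester". The optional group absorbs a possessive 's.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(town) + r"(?:['’]s)?(?!\w)",
            re.IGNORECASE,
        )
        counts[town] = len(pattern.findall(remaining))
        # Blank out what was matched so a shorter name such as "Hartford"
        # is not counted again inside "East Hartford".
        remaining = pattern.sub(" | ", remaining)
    return counts
```

With this, a single routine handles both single-word and multi-word towns. Whether a mention of East Hartford should also count towards Hartford is a policy choice; removing the blanking line would count both.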