Searching for a word when it's more than one word

Richmond Mathewson richmondmathewson at gmail.com
Sat Sep 1 06:05:54 EDT 2018


Obviously, when considering names of places such as Colchester,
Rochester and Chester, one has to search for the longer names first and
exclude them from later searches.

Richmond.

On 1/9/2018 12:59 pm, Mark Waddingham via use-livecode wrote:
> On 2018-09-01 06:48, Stephen MacLean via use-livecode wrote:
>> Hi All,
>>
>> First, followed Keith Clarke’s thread and got a lot out of it, thank
>> you all. That’s gone into my code snippets!
>>
>> Now I know, the title is not technically true, if it’s 2 words, they
>> are distinct and different. Maybe it’s because I’ve been banging my
>> head against this and some other things too long and need to step
>> back, but I’m having issues getting this all to work reliably.
>>
>> I’m searching for town names in various text from a list of towns.
>> Most names are one word, easy to find and count. Some names are 2 or 3
>> words, like East Hartford or West Palm Beach. Those go against
>> distinct towns like Hartford and Palm Beach. Others have their names
>> inside of other town names like Colchester and Chester.
>
> So the problem you are trying to solve sounds like this:
>
> Given a source text TEXT, and a list of multi-word phrases PHRASES, 
> find the longest elements of PHRASES which occur in TEXT when reading 
> from left to right.
>
> One way to do this is to preprocess the source TEXT and PHRASES, and 
> then iterate over it with back-tracking attempting to match each 
> phrase in the list.
>
> Preprocessing can be done like this:
>
>   // pText is arbitrary language text, where it is presumed 'trueWord'
>   // will extract the words we can match against those in PHRASES
>   command preprocessText pText, @rWords
>     local tWords
>     repeat for each trueWord tWord in pText
>       -- normalize word variants - e.g. turn Chester's into Chester
>       if tWord ends with "'s" then
>         put char 1 to -3 of tWord into tWord
>       else if ... then
>         ...
>       else if ... then
>         ...
>       end if
>       put tWord into tWords[the number of elements in tWords + 1]
>     end repeat
>     put tWords into rWords
>   end preprocessText
>
> This gives a sequence of words, in order - where word variants have 
> been normalized to the 'root' word (the general operation here is 
> called 'stemming' - in your case as you are dealing with fragments of 
> proper nouns - 's / s suffixes are probably good enough).
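The normalization step can be sketched in Python for illustration - the possessive-'s rule mirrors the handler above, while the word-splitting regex is an assumption standing in for LiveCode's `trueWord` chunk:

```python
import re

def preprocess_text(text):
    """Split text into words and normalize variants (e.g. Chester's -> Chester)."""
    words = []
    # Assumed stand-in for 'trueWord': runs of letters and apostrophes.
    for word in re.findall(r"[A-Za-z']+", text):
        # Strip a possessive 's, mirroring the handler above.
        if word.endswith("'s"):
            word = word[:-2]
        words.append(word)
    return words

print(preprocess_text("Chester's market is near West Palm Beach."))
# → ['Chester', 'market', 'is', 'near', 'West', 'Palm', 'Beach']
```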
>
> The processing for PHRASES is needed to ensure that they all follow a 
> consistent form:
>
>   // pPhrases is presumed to be a return-delimited list of phrases
>   command preprocessPhrases pPhrases, @rPhrases
>     -- We accumulate phrases as the keys of tPhrasesA to eliminate
>     -- duplicates
>     local tPhrasesA
>     put empty into tPhrasesA
>
>     repeat for each line tPhrase in pPhrases
>       local tNormalizedPhrase
>       put empty into tNormalizedPhrase
>       repeat for each trueWord tWord in tPhrase
>         put tWord & space after tNormalizedPhrase
>       end repeat
>       delete the last char of tNormalizedPhrase
>       put true into tPhrasesA[tNormalizedPhrase]
>     end repeat
>
>     put the keys of tPhrasesA into rPhrases
>   end preprocessPhrases
>
> This produces a return-delimited list of phrases, where the individual 
> words in each phrase are separated by a *single* space with all 
> punctuation stripped, and no phrase appears twice.
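The same phrase normalization can be sketched in Python (the word-splitting regex is again an assumption, and a dict stands in for tPhrasesA to de-duplicate while keeping insertion order):

```python
import re

def preprocess_phrases(phrases):
    """Normalize each phrase to single-space-separated words, dropping duplicates."""
    seen = {}
    for phrase in phrases.splitlines():
        # Collapse whitespace and strip punctuation via the assumed word regex.
        words = re.findall(r"[A-Za-z']+", phrase)
        if words:
            seen[" ".join(words)] = True
    return list(seen)

print(preprocess_phrases("West  Palm Beach\nPalm Beach\nWest Palm Beach"))
# → ['West Palm Beach', 'Palm Beach']
```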
>
> With this pre-processing in place (note that the PHRASES pre-processing
> only needs to be done once for any given set of PHRASES), a naive
> search algorithm would be:
>
>   // pText should be a sequence array of words to search (we use an
>   // array here because we need fast random access)
>   // pPhrases should be a line-delimited string-list of multi-word
>   // phrases to find
>   // rMatches will be a string-list of phrases which have been found
>   command searchTextForPhrases pText, pPhrases, @rMatches
>     local tMatchesA
>     put empty into tMatchesA
>
>     -- Our phrases are single-space delimited, so set the item delimiter
>     set the itemDelimiter to space
>
>     -- Loop through pText; by default we bump tIndex by one each time,
>     -- however, if a match is found, then we can skip the words
>     -- constituting the matched phrase.
>     local tIndex
>     put 1 into tIndex
>     repeat until pText[tIndex] is empty
>       -- Store the current longest match we have found starting at tIndex
>       local tCurrentMatch
>       put empty into tCurrentMatch
>
>       -- Check each phrase in turn for a match.
>       repeat for each line tPhrase in pPhrases
>         -- Assume a match succeeds until it doesn't
>         local tPhraseMatched
>         put true into tPhraseMatched
>
>         -- Iterate through the items (words) in each phrase; if the
>         -- sequence of words in the phrase is not the same as the
>         -- sequence of words in the text starting at tIndex, then
>         -- tPhraseMatched will be false on exit of the loop.
>         local tSubIndex
>         put tIndex into tSubIndex
>         repeat for each item tWord in tPhrase
>           -- Failure to match the word at tSubIndex is failure to
>           -- match the phrase
>           if pText[tSubIndex] is not tWord then
>             put false into tPhraseMatched
>             exit repeat
>           end if
>
>           -- The current word of the phrase matches, so move to the next
>           add 1 to tSubIndex
>         end repeat
>
>         -- We are only interested in the longest match at any point,
>         -- so only update the current match if it is longer.
>         if tPhraseMatched and the number of items in tPhrase > the number of items in tCurrentMatch then
>           put tPhrase into tCurrentMatch
>         end if
>       end repeat
>
>       -- If a match was found, then we have used up those words in pText,
>       -- otherwise we start the search again at the next word in pText.
>       if tCurrentMatch is not empty then
>         add the number of items in tCurrentMatch to tIndex
>         put true into tMatchesA[tCurrentMatch]
>       else
>         add 1 to tIndex
>       end if
>     end repeat
>
>     -- At this point, the matched phrases are simply the keys of tMatchesA
>     put the keys of tMatchesA into rMatches
>   end searchTextForPhrases
>
> Complexity-wise, the above requires roughly N*M steps - where N is the
> number of words in pText, and M is the total number of words in pPhrases.
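For comparison, the naive backtracking search translates to Python roughly as follows - the function names and the sample town list are illustrative only, with the text as a list of normalized words and the phrases normalized as above:

```python
def search_text_for_phrases(words, phrases):
    """Find the longest phrase match at each position, consuming matched words."""
    matches = {}
    split_phrases = [p.split(" ") for p in phrases]
    i = 0
    while i < len(words):
        current = None
        for phrase in split_phrases:
            # A phrase matches if its words equal the text words starting at i.
            if words[i:i + len(phrase)] == phrase:
                # Keep only the longest match starting here.
                if current is None or len(phrase) > len(current):
                    current = phrase
        if current is not None:
            matches[" ".join(current)] = True
            i += len(current)  # skip the words constituting the match
        else:
            i += 1
    return list(matches)

text = "I drove from East Hartford past Hartford to West Palm Beach".split()
towns = ["East Hartford", "Hartford", "West Palm Beach", "Palm Beach"]
print(search_text_for_phrases(text, towns))
# → ['East Hartford', 'Hartford', 'West Palm Beach']
```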
>
> An immediate improvement can be made by sorting pPhrases descending by
> the number of words - this eliminates the need to check phrase match
> length, since the first match will always be the longest, meaning that
> once you have matched, you don't need to keep iterating through the
> phrases.
>
>     ...
>     put the keys of tPhrasesA into rPhrases
>     sort lines of rPhrases numeric descending by the number of words in each
>   end preprocessPhrases
>
>     ...
>     -- Phrases are sorted longest first, so the first match found is
>     -- guaranteed to be the longest - no need to compare lengths.
>     if tPhraseMatched then
>       put tPhrase into tCurrentMatch
>       exit repeat
>     end if
>   end repeat
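The sorted variant can be sketched in Python like this (names are illustrative) - the split phrases are sorted longest-first, and the first match at each position is taken:

```python
def search_sorted(words, phrases):
    """With phrases sorted longest-first, the first match found is the longest."""
    split_phrases = sorted((p.split(" ") for p in phrases), key=len, reverse=True)
    matches = {}
    i = 0
    while i < len(words):
        for phrase in split_phrases:
            if words[i:i + len(phrase)] == phrase:
                matches[" ".join(phrase)] = True
                i += len(phrase)  # consume the matched words
                break
        else:
            i += 1  # no phrase starts here; move one word on
    return list(matches)

text = "I drove from East Hartford past Hartford to West Palm Beach".split()
towns = ["East Hartford", "Hartford", "West Palm Beach", "Palm Beach"]
print(search_sorted(text, towns))
# → ['East Hartford', 'Hartford', 'West Palm Beach']
```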
>
> There's a lot more that can be done here to make the above a great
> deal more efficient algorithmically. Indeed, the best you can achieve
> is probably N*K steps for a source text containing N words - where K
> is the maximum difference in length between any two phrases which
> share a common prefix.
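One way to approach that bound (an assumption on my part - the post does not name a technique) is to build a prefix tree over the phrase words, so that each text position only scans as deep as the phrases sharing the current prefix allow:

```python
def build_trie(phrases):
    """Map each phrase's word sequence into nested dicts; a None key marks a phrase end."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split(" "):
            node = node.setdefault(word, {})
        node[None] = True
    return root

def search_with_trie(words, trie):
    """Longest-match search: walk the trie from each position, remember the deepest end."""
    matches, i = {}, 0
    while i < len(words):
        node, j, best = trie, i, None
        # Follow the trie as far as the text allows.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                best = j  # deepest complete phrase seen so far
        if best is not None:
            matches[" ".join(words[i:best])] = True
            i = best  # consume the matched words
        else:
            i += 1
    return list(matches)

trie = build_trie(["East Hartford", "Hartford", "West Palm Beach", "Palm Beach"])
text = "I drove from East Hartford past Hartford to West Palm Beach".split()
print(search_with_trie(text, trie))
# → ['East Hartford', 'Hartford', 'West Palm Beach']
```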
>
> Warmest Regards,
>
> Mark.
>



