Searching for a word when it's more than one word

Mark Waddingham mark at livecode.com
Sat Sep 1 05:59:04 EDT 2018


On 2018-09-01 06:48, Stephen MacLean via use-livecode wrote:
> Hi All,
> 
> First, followed Keith Clarke’s thread and got a lot out of it, thank
> you all. That’s gone into my code snippets!
> 
> Now I know, the title is not technically true, if it’s 2 words, they
> are distinct and different. Maybe it’s because I’ve been banging my
> head against this and some other things too long and need to step
> back, but I’m having issues getting this all to work reliably.
> 
> I’m searching for town names in various text from a list of towns .
> Most names are one word, easy to find and count. Some names are 2 or 3
> words, like East Hartford or West Palm Beach. Those go against
> distinct towns like Hartford and Palm Beach. Others have their names
> inside of other town names like Colchester and Chester.

So the problem you are trying to solve sounds like this:

Given a source text TEXT, and a list of multi-word phrases PHRASES, find 
the longest elements of PHRASES which occur in TEXT when reading from 
left to right.

One way to do this is to preprocess the source TEXT and PHRASES, and 
then iterate over it with back-tracking attempting to match each phrase 
in the list.

Preprocessing can be done like this:

   // pText is arbitrary language text, where it is presumed 'trueWord'
   // will extract the words we can match against those in PHRASES
   command preprocessText pText, @rWords
     local tWords
     repeat for each trueWord tWord in pText
       -- normalize word variants - e.g. turn Chester's into Chester
       if tWord ends with "'s" then
         put char 1 to -3 of tWord into tWord
       else if ... then
         ...
       else if ... then
         ...
       end if
       put tWord into tWords[the number of elements in tWords + 1]
     end repeat
     put tWords into rWords
   end preprocessText

This gives a sequence of words, in order, where word variants have been 
normalized to the 'root' word (the general operation here is called 
'stemming'; in your case, as you are dealing with fragments of proper 
nouns, stripping 's / s suffixes is probably good enough).
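If it helps to see that normalization step outside of LiveCode, here is a 
minimal sketch in Python (the `normalize_word` name and the punctuation 
set are my own illustration, not part of the script above):

```python
def normalize_word(word):
    """Strip surrounding punctuation and a trailing possessive so that
    word variants (e.g. "Chester's") match their root ("Chester")."""
    word = word.strip(".,;:!?\"'()")  # rough stand-in for 'trueWord'
    if word.endswith("'s"):
        word = word[:-2]
    return word

print(normalize_word("Chester's,"))  # Chester
```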

The processing for PHRASES is needed to ensure that they all follow a 
consistent form:

   // pPhrases is presumed to be a return-delimited list of phrases
   command preprocessPhrases pPhrases, @rPhrases
     -- We accumulate phrases as the keys of tPhrasesA to eliminate
     -- duplicates
     local tPhrasesA
     put empty into tPhrasesA

     repeat for each line tLine in pPhrases
       local tPhrase
       put empty into tPhrase
       repeat for each trueWord tWord in tLine
         put tWord & space after tPhrase
       end repeat
       delete the last char of tPhrase
       put true into tPhrasesA[tPhrase]
     end repeat

     put the keys of tPhrasesA into rPhrases
   end preprocessPhrases

This produces a return-delimited list of phrases, where the individual 
words in each phrase are separated by a *single* space with all 
punctuation stripped, and no phrase appears twice.

With this pre-processing done (note that the PHRASES pre-processing only 
needs to be done once for any given set of PHRASES to match), a naive 
search algorithm would be:

   // pText should be a sequence array of words to search (we use an
   // array here because we need fast random access)
   // pPhrases should be a line-delimited string-list of multi-word
   // phrases to find
   // rMatches will be a string-list of phrases which have been found
   command searchTextForPhrases pText, pPhrases, @rMatches
     local tMatchesA
     put empty into tMatchesA

     -- Our phrases are single-space delimited, so set the item delimiter
     set the itemDelimiter to space

     -- Loop through pText; by default we bump tIndex by one each time,
     -- however, if a match is found, then we can skip the words
     -- constituting the matched phrase.
     local tIndex
     put 1 into tIndex
     repeat until pText[tIndex] is empty
       -- Store the current longest match we have found starting at tIndex
       local tCurrentMatch
       put empty into tCurrentMatch

       -- Check each phrase in turn for a match.
       repeat for each line tPhrase in pPhrases
         -- Assume a match succeeds until it doesn't
         local tPhraseMatched
         put true into tPhraseMatched

         -- Iterate through the items (words) in each phrase; if the
         -- sequence of words in the phrase is not the same as the
         -- sequence of words in the text starting at tIndex, then
         -- tPhraseMatched will be false on exit of the loop.
         local tSubIndex
         put tIndex into tSubIndex
         repeat for each item tWord in tPhrase
           -- Failure to match the word at tSubIndex is failure to match
           -- the phrase
           if pText[tSubIndex] is not tWord then
             put false into tPhraseMatched
             exit repeat
           end if

           -- The current word of the phrase matches, so move to the next
           add 1 to tSubIndex
         end repeat

         -- We are only interested in the longest match at any point, so
         -- only update the current match if it is longer.
         if tPhraseMatched and the number of items in tPhrase > the number of items in tCurrentMatch then
           put tPhrase into tCurrentMatch
         end if
       end repeat

       -- If a match was found, then we have used up those words in pText,
       -- otherwise we start the search again at the next word in pText.
       if tCurrentMatch is not empty then
         add the number of items in tCurrentMatch to tIndex
         put true into tMatchesA[tCurrentMatch]
       else
         add 1 to tIndex
       end if
     end repeat

     -- At this point, the matched phrases are simply the keys of tMatchesA
     put the keys of tMatchesA into rMatches
   end searchTextForPhrases

Complexity-wise, the above requires roughly N*M steps - where N is the 
number of words in pText and M is the total number of words in pPhrases.
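For readers more comfortable outside LiveCode, the same naive 
longest-match scan can be sketched in Python (function and variable names 
here are mine; it assumes the text and phrases have already been 
normalized into word lists):

```python
def search_phrases(words, phrases):
    """Naive longest-match scan, roughly N*M steps as described above.

    words   - list of words from the text, in order
    phrases - list of phrases, each itself a list of words
    """
    matches = set()
    i = 0
    while i < len(words):
        best = None
        for phrase in phrases:
            # A phrase matches if the slice of text at i equals it; keep
            # only the longest match starting at this position.
            if words[i:i + len(phrase)] == phrase and \
                    (best is None or len(phrase) > len(best)):
                best = phrase
        if best is not None:
            matches.add(" ".join(best))
            i += len(best)  # the matched words are used up
        else:
            i += 1
    return matches

text = "I drove from West Palm Beach up past Colchester and Chester".split()
towns = [t.split() for t in ["Hartford", "East Hartford", "Palm Beach",
                             "West Palm Beach", "Colchester", "Chester"]]
print(search_phrases(text, towns))
# finds West Palm Beach, Colchester and Chester (not Palm Beach or Hartford)
```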

An immediate improvement can be made by sorting pPhrases descending by 
the number of items - this eliminates the need to check phrase match 
length: the first match will always be the longest, meaning that once you 
have matched, you don't need to keep iterating through the phrases.

     ...
     put the keys of tPhrasesA into rPhrases
     set the itemDelimiter to space
     sort lines of rPhrases numeric descending by the number of items in each
   end preprocessPhrases

     ...
     -- Phrases are sorted longest-first, so the first match is
     -- guaranteed to be the longest - take it and stop looking.
     if tPhraseMatched then
       put tPhrase into tCurrentMatch
       exit repeat
     end if
   end repeat
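In Python terms, the same optimization is a sort by length plus taking 
the first hit (again an illustrative sketch with my own names, not part 
of the original script):

```python
def search_sorted(words, phrases):
    """Longest-match scan with phrases pre-sorted longest-first, so the
    first phrase that matches at a position must be the longest one."""
    phrases = sorted(phrases, key=len, reverse=True)
    matches = set()
    i = 0
    while i < len(words):
        for phrase in phrases:
            if words[i:i + len(phrase)] == phrase:
                matches.add(" ".join(phrase))
                i += len(phrase)
                break  # first match is the longest - stop looking
        else:
            i += 1  # no phrase starts here, move on one word
    return matches
```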

There's a lot more that can be done here to make the above a great deal 
more efficient algorithmically. Indeed, the best you can probably achieve 
is around N*K steps for a source text containing N words - where K is the 
maximum difference in length between any two phrases which share a common 
prefix.
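That common-prefix observation suggests going one step further and 
building a word-level trie (prefix tree) of the phrases, so a shared 
prefix is only ever walked once per text position. A sketch of that idea 
in Python (my own illustration, not from the post; `None` is used as the 
end-of-phrase marker):

```python
def build_trie(phrases):
    """Build a word-level trie; a None key marks a complete phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase:
            node = node.setdefault(word, {})
        node[None] = " ".join(phrase)
    return root

def search_with_trie(words, root):
    matches = set()
    i = 0
    while i < len(words):
        node, j, best, best_len = root, i, None, 0
        # Walk the trie as far as the text allows, remembering the
        # deepest complete phrase seen on the way down.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if None in node:
                best, best_len = node[None], j - i
        if best is not None:
            matches.add(best)
            i += best_len  # consume the matched words
        else:
            i += 1
    return matches
```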

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps
