Searching for a word when it's more than one word

Mark Waddingham mark at livecode.com
Sat Sep 1 06:29:04 EDT 2018


On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
> Obviously, when considering names of places such as Colchester,
> Rochester and Chester one has
> to search for the longer names first and exclude them from later 
> searches.

The 'substring' problem (i.e. Chester being 'in' Rochester) isn't 
relevant in the above algorithm because we are 'tokenising' input and 
phrases - essentially changing the alphabet.

i.e. "Rochester Chester Colchester" is turned into ABC, and we match A, 
B or C as atomic units.
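
As a rough illustration of matching on whole-word tokens, here is a minimal sketch in Python (not LiveCode). The names process_text and find_phrases are hypothetical, and process_text is only a guess at what a simple 'processText' might do: lowercase the text and split it into words, dropping punctuation.

```python
import re

def process_text(text):
    # Assumed behaviour: lowercase, split into word tokens, drop punctuation.
    return re.findall(r"\w+", text.lower())

def find_phrases(text, phrases):
    # Return the known phrases found in text, comparing whole tokens only.
    tokens = process_text(text)
    # Each known phrase becomes a tuple of word tokens.
    known = {tuple(process_text(p)): p for p in phrases}
    longest = max((len(k) for k in known), default=0)
    found = []
    for i in range(len(tokens)):
        for n in range(1, longest + 1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in known:
                found.append(known[key])
    return found

towns = ["Chester", "Rochester", "Colchester"]
print(find_phrases("We drove from Rochester to Colchester", towns))
# prints ['Rochester', 'Colchester']
```

Because each word is compared as a unit, 'Chester' is never reported for text that only mentions Rochester or Colchester, so no longest-first ordering is needed.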

I should perhaps point out that the 'processText' operation probably 
needs to be a little better in practice - to at least include a 'stop' 
token for punctuation. For example:

   "The man walked starting from East Hartford, West Hartford could be 
seen in the distance."

In the case where 'Hartford West' and 'Hartford' are the 'known' towns 
(and not 'East Hartford') - the proposed tokenization would result in:

   The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance

Which means you'd get "Hartford West" and "Hartford" - when you should 
only get "Hartford" (assuming you care about the linguistic structure of 
the text, at least).
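
A minimal Python sketch of the 'stop' token idea follows. It is an assumption about how such a step could look, not the actual 'processText' handler: the "<stop>" marker and the function names are hypothetical, and every run of punctuation is treated as a stop.

```python
import re

STOP = "<stop>"  # hypothetical marker emitted for any run of punctuation

def process_text(text):
    # Tokenise into lowercase words, emitting STOP wherever punctuation occurs.
    tokens = []
    for m in re.finditer(r"(\w+)|([^\w\s]+)", text):
        tokens.append(m.group(1).lower() if m.group(1) else STOP)
    return tokens

def find_phrases(text, phrases):
    # Match known phrases as sequences of whole tokens.
    tokens = process_text(text)
    known = {tuple(process_text(p)): p for p in phrases}
    longest = max((len(k) for k in known), default=0)
    found = []
    for i in range(len(tokens)):
        for n in range(1, longest + 1):
            key = tuple(tokens[i:i + n])
            if len(key) == n and key in known:
                found.append(known[key])
    return found

sentence = ("The man walked starting from East Hartford, West Hartford "
            "could be seen in the distance.")
print(find_phrases(sentence, ["Hartford West", "Hartford"]))
# prints ['Hartford', 'Hartford']
```

The comma becomes a token sitting between 'Hartford' and 'West', so the two-word phrase 'Hartford West' can no longer match across it.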

Indeed, this means that when preprocessing the text you can vastly 
reduce the number of words to search - any sequence of words which 
aren't in any phrase (or any important punctuation) can be replaced by 
"*", say. So the above would become:

   *,Hartford,*,West,Hartford,*

The "*" tokens stop a multi-word phrase from matching across them.
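
The reduction step can be sketched in Python as below. This is an illustration under assumptions rather than a definitive implementation: reduce_tokens is a hypothetical name, all punctuation is treated as 'important', and a word is kept only if it appears in at least one known phrase. With 'Hartford West' and 'Hartford' as the only known phrases, 'East' belongs to no phrase, so it folds into the leading "*".

```python
import re

def reduce_tokens(text, phrases):
    # Words that occur in at least one known phrase (compared case-insensitively).
    vocab = {w for p in phrases for w in re.findall(r"\w+", p.lower())}
    reduced = []
    for m in re.finditer(r"(\w+)|([^\w\s]+)", text):
        word = m.group(1)
        if word is not None and word.lower() in vocab:
            reduced.append(word)
        elif not reduced or reduced[-1] != "*":
            # Collapse any run of irrelevant words and punctuation into one "*".
            reduced.append("*")
    return reduced

sentence = ("The man walked starting from East Hartford, West Hartford "
            "could be seen in the distance.")
print(",".join(reduce_tokens(sentence, ["Hartford West", "Hartford"])))
# prints *,Hartford,*,West,Hartford,*
```

Phrase matching then only has to walk the reduced list, which for long texts with few known phrases is much shorter than the full word list.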

Warmest Regards,

Mark.

-- 
Mark Waddingham ~ mark at livecode.com ~ http://www.livecode.com/
LiveCode: Everyone can create apps



