Searching for a word when it's more than one word

Richmond Mathewson richmondmathewson at gmail.com
Sat Sep 1 06:35:43 EDT 2018


That's because you lot tend to use a silver teaspoon while I tend to use 
a great big shovel:

https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0

Richmond.

On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
> On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
>> Obviously, when considering names of places such as Colchester,
>> Rochester and Chester, one has to search for the longer names first
>> and exclude them from later searches.
>
> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't 
> relevant in the above algorithm because we are 'tokenising' input and 
> phrases - essentially changing the alphabet.
>
> i.e. "Rochester Chester Colchester" is turned into ABC, and we match 
> A, B or C as atomic units.
>
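A rough sketch of the tokenising idea, in Python rather than LiveCode, and not the code from the earlier message. The names `tokenise` and `find_phrases` are made up for illustration, and splitting on whitespace is an assumption. Because whole tokens are compared, "Chester" can only match the token "chester", never the tail of "rochester":

```python
def tokenise(text):
    # Assumption: words are separated by whitespace and case is ignored.
    return text.lower().split()

def find_phrases(text, phrases):
    """Return (token_index, phrase) pairs for phrases matched as whole tokens."""
    tokens = tokenise(text)
    # Try longer phrases first so the longest one wins at any position;
    # empty phrases are dropped so the scan always advances.
    wanted = sorted((p for p in map(tokenise, phrases) if p),
                    key=len, reverse=True)
    found = []
    i = 0
    while i < len(tokens):
        for p in wanted:
            if tokens[i:i + len(p)] == p:
                found.append((i, " ".join(p)))
                i += len(p)  # skip past the matched phrase
                break
        else:
            i += 1  # no phrase starts at this token
    return found

print(find_phrases("Rochester Chester Colchester", ["Chester"]))
# [(1, 'chester')]
```

Only the middle token is reported; "Rochester" and "Colchester" are different tokens altogether, so no ordering or exclusion of longer names is needed for them.
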
> I should perhaps point out that the 'processText' operation probably 
> needs to be a little better in practice - to at least include a 'stop' 
> token for punctuation. For example:
>
>   "The man walked starting from East Hartford, West Hartford could be 
> seen in the distance."
>
> In the case where 'Hartford West' and 'Hartford' are the 'known' towns 
> (and not 'East Hartford') - the proposed tokenization would result in:
>
> The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance
>
> Which means you'd get "Hartford West" and "Hartford" - when you should 
> only get "Hartford" (assuming you care about the linguistic structure 
> of the text, at least).
>
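To make the punctuation problem concrete, here is a small Python sketch (again not LiveCode, and not the original 'processText'). The stop token "|" and the helper names are assumptions chosen for illustration; any run of characters that is neither a word character nor whitespace is treated as punctuation, which is a simplification (it would also split "don't").

```python
import re

STOP = "|"  # hypothetical stop token; it can never equal a real word

def tokenise(text):
    # With one capture group, findall returns the word for word matches
    # and "" for punctuation runs; the latter become STOP tokens.
    return [w or STOP for w in re.findall(r"(\w+)|[^\w\s]+", text.lower())]

def matches(tokens, phrase):
    # Token positions where the phrase's words occur consecutively.
    p = phrase.lower().split()
    return [i for i in range(len(tokens) - len(p) + 1)
            if tokens[i:i + len(p)] == p]

text = "starting from East Hartford, West Hartford could be seen"

# Naive tokenisation simply throws the comma away.
naive = text.lower().replace(",", " ").split()

print(matches(naive, "Hartford West"))           # [3]  (false match)
print(matches(tokenise(text), "Hartford West"))  # []
print(matches(tokenise(text), "Hartford"))       # [3, 6]
```

With the stop token in place, "Hartford West" can no longer match across the comma, while both occurrences of "Hartford" are still found.
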
> Indeed, this means that when preprocessing the text you can vastly 
> reduce the number of words to search - any sequence of words which 
> aren't in any phrase (or important punctuation) can be replaced by 
> "*", say. So the above would become:
>
>   *,East,Hartford,*,West,Hartford,*
>
> The "*" tokens stop a multi-word phrase from matching across them.
>
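The "*" reduction could be sketched like this in Python (not LiveCode; `compress` is a made-up name, and treating every punctuation run as a blocker is an assumption). Any token that does not occur in a known phrase, punctuation included, is folded into a single "*":

```python
import re

def compress(text, phrases):
    # Vocabulary: every word that occurs in at least one known phrase.
    vocab = {w for p in phrases for w in p.lower().split()}
    out = []
    # findall yields the word for word matches and "" for punctuation runs.
    for word in re.findall(r"(\w+)|[^\w\s]+", text.lower()):
        if word in vocab:
            out.append(word)
        elif not out or out[-1] != "*":
            out.append("*")  # collapse a run of irrelevant tokens into one "*"
    return out

text = ("The man walked starting from East Hartford, West Hartford "
        "could be seen in the distance.")
print(",".join(compress(text, ["Hartford West", "Hartford"])))
# *,hartford,*,west,hartford,*
```

With only "Hartford West" and "Hartford" as known phrases, this sketch also folds "East" into the leading "*", so its output is slightly shorter than the line quoted above. The comma between "Hartford" and "West" becomes a "*", which is what prevents "Hartford West" from matching.
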
> Warmest Regards,
>
> Mark.
>



