Searching for a word when it's more than one word
Richmond Mathewson
richmondmathewson at gmail.com
Sat Sep 1 06:35:43 EDT 2018
That's because you lot tend to use a silver teaspoon while I tend to use
a great big shovel:
https://www.dropbox.com/s/00t8oftb1ydm8ni/Text%20analyzer%20X.livecode.zip?dl=0
Richmond.
On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
> On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
>> Obviously, when considering names of places such as Colchester,
>> Rochester and Chester one has
>> to search for the longer names first and exclude them from later
>> searches.
>
> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't
> relevant in the above algorithm because we are 'tokenising' input and
> phrases - essentially changing the alphabet.
>
> i.e. "Rochester Chester Colchester" is turned into ABC, and we match
> A, B or C as atomic units.
>
> I should perhaps point out that the 'processText' operation probably
> needs to be a little better in practice - to at least include a 'stop'
> token for punctuation. For example:
>
> "The man walked starting from East Hartford, West Hartford could be
> seen in the distance."
>
> In the case where 'Hartford West' and 'Hartford' are the 'known' towns
> (and not 'East Hartford') - the proposed tokenization would result in:
>
> The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance
>
> Which means you'd get "Hartford West" and "Hartford" - when you should
> only get "Hartford" (assuming you care about the linguistic structure
> of the text, at least).
>
> Indeed, this means that when preprocessing the text you can vastly
> reduce the number of words to search - any sequence of words that
> appear in no phrase (or any important punctuation) can be replaced
> by "*", say. So the above would become:
>
> *,East,Hartford,*,West,Hartford,*
>
> The "*" tokens block matching multi-word phrases.
>
> Warmest Regards,
>
> Mark.
>
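
The tokenising idea quoted above can be sketched roughly as follows. This
is only an illustration in Python, not the 'processText' handler from the
thread: the names tokenize, compress and find_phrases are made up for the
example, and the choice to treat every punctuation mark as a stop token is
an assumption. Note that this sketch also collapses "East", since it
appears in no known phrase.

```python
import re

STOP = "*"  # stop token: stands in for punctuation and runs of irrelevant words


def tokenize(text):
    """Split text into words, keeping each punctuation mark as its own token."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)


def compress(tokens, phrases):
    """Replace each run of tokens that occur in no phrase with a single STOP."""
    vocab = {word.lower() for phrase in phrases for word in phrase.split()}
    out = []
    for tok in tokens:
        if tok.lower() in vocab:
            out.append(tok)
        elif not out or out[-1] != STOP:
            out.append(STOP)
    return out


def find_phrases(tokens, phrases):
    """Match phrases as sequences of whole tokens, longest phrase first."""
    names = {tuple(p.lower().split()): p for p in phrases}
    keys = sorted(names, key=len, reverse=True)
    lowered = [t.lower() for t in tokens]
    found, i = [], 0
    while i < len(lowered):
        for key in keys:
            if tuple(lowered[i:i + len(key)]) == key:
                found.append(names[key])
                i += len(key)
                break
        else:
            i += 1  # no phrase starts at this token
    return found


text = ("The man walked starting from East Hartford, West Hartford "
        "could be seen in the distance.")
towns = ["Hartford West", "Hartford"]

compressed = compress(tokenize(text), towns)
print(",".join(compressed))
print(find_phrases(compressed, towns))
```

Because whole tokens are compared, "Chester" can never match inside
"Rochester", and the "*" left behind by the comma stops "Hartford West"
from matching across the clause boundary.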