Searching for a word when it's more than one word

J. Landman Gay jacque at hyperactivesw.com
Sat Sep 1 11:39:37 EDT 2018


There is a town in Texas called West, made infamous a few years ago by a 
giant explosion. I don't think you can make assumptions about names of places.

Mark's suggestion to check for words ending in "s" will fail on many towns, 
though apostrophe-s may be safe.
--
Jacqueline Landman Gay | jacque at hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
On September 1, 2018 5:49:30 AM Richmond Mathewson via use-livecode 
<use-livecode at lists.runrev.com> wrote:

> I can see that the "problem", which my stack does not address, is with 2
> or 3 part place names:
>
> The Rochester/Chester problem is easily dealt with.
>
> While it should be relatively easy to have a subroutine to deal with
> words such as "West" (after all, there are no places just called "West"),
> places like a town my parents once lived in called "Haselbury Plucknett"
> would cause problems.
>
> AND, places such as "Ruyton of the Eleven Towns"
> (https://en.wikipedia.org/wiki/Ruyton-XI-Towns)
> would really throw a spanner in the works.
>
> Come to think of things . . .
>
> Unless anyone's code can cope with "Ruyton of the Eleven Towns" it won't
> stand up: we could even go further and call
> this the "Ruyton of the Eleven Towns Test".
>
> More muffled background noises.
>
> Richmond.
>
> On 1/9/2018 1:29 pm, Mark Waddingham via use-livecode wrote:
>> On 2018-09-01 12:05, Richmond Mathewson via use-livecode wrote:
>>> Obviously, when considering names of places such as Colchester,
>>> Rochester and Chester one has
>>> to search for the longer names first and exclude them from later
>>> searches.
>>
>> The 'substring' problem (i.e. Chester being 'in' Rochester) isn't
>> relevant in the above algorithm because we are 'tokenising' input and
>> phrases - essentially changing the alphabet.
>>
>> i.e. "Rochester Chester Colchester" is turned into ABC, and we match
>> A, B or C as atomic units.
>>
>> I should perhaps point out that the 'processText' operation probably
>> needs to be a little better in practice - to at least include a 'stop'
>> token for punctuation. For example:
>>
>> "The man walked starting from East Hartford, West Hartford could be
>> seen in the distance."
>>
>> In the case where 'Hartford West' and 'Hartford' are the 'known' towns
>> (and not 'East Hartford') - the proposed tokenization would result in:
>>
>> The,man,walked,starting,from,East,Hartford,West,Hartford,could,be,seen,in,the,distance
>>
>> Which means you'd get "Hartford West" and "Hartford" - when you should
>> only get "Hartford" (assuming you care about the linguistic structure
>> of the text, at least).
>>
>> Indeed, the above means that in preprocessing the text you can
>> vastly reduce the number of words to search - any sequence of words
>> which aren't in any phrase (or important punctuation) can be
>> replaced by "*", say. So the above would become:
>>
>> *,East,Hartford,*,West,Hartford,*
>>
>> The "*" tokens stop multi-word phrases from matching across them.
>>
>> Warmest Regards,
>>
>> Mark.
>
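
For anyone who wants to experiment with the tokenising idea quoted above, here is a rough sketch. It is in Python rather than LiveCode, and it is not Mark's 'processText' handler: the names tokenize, find_places and STOP are invented for illustration, and matching is assumed to be case-sensitive. It treats each word as an atomic token (so "Chester" never matches inside "Rochester"), and collapses punctuation and any words that are in no known place name into a single "*" that a multi-word name cannot match across.

```python
import re

STOP = "*"  # stands in for punctuation and for words that are in no known phrase


def tokenize(text, vocabulary):
    """Split text into word tokens, collapsing everything else into single STOP tokens."""
    tokens = []
    # \w+ picks out words; any other non-space character is treated as punctuation.
    for piece in re.findall(r"\w+|[^\w\s]", text):
        token = piece if piece in vocabulary else STOP
        # Never emit two STOP tokens in a row.
        if token == STOP and tokens and tokens[-1] == STOP:
            continue
        tokens.append(token)
    return tokens


def find_places(text, places):
    """Return the known place names that occur in text as whole-token sequences."""
    phrases = [tuple(p.split()) for p in places]
    vocabulary = {word for phrase in phrases for word in phrase}
    tokens = tokenize(text, vocabulary)
    # Longer names are tried first so they win over shorter ones at the same position.
    by_length = sorted(phrases, key=len, reverse=True)
    found = []
    i = 0
    while i < len(tokens):
        match = None
        for phrase in by_length:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                match = phrase
                break
        if match:
            found.append(" ".join(match))
            i += len(match)  # skip past the tokens consumed by this name
        else:
            i += 1
    return found
```

With 'Hartford West' and 'Hartford' as the known towns, the Hartford sentence tokenises to *,Hartford,*,West,Hartford,* in this sketch ("East" collapses into the first "*" as well, since it is in no known name), and only "Hartford" is reported, once for each occurrence. A name like "Ruyton of the Eleven Towns" is matched as a run of five tokens.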






