Best way to carry out boolean searches on large text body.

Michael Kann mikekann at yahoo.com
Thu Mar 4 22:25:28 EST 2010


James, I doubt this is the "best" way. But it will work and might give you some ideas. What we think of as a "paragraph", RunRev thinks of as a "line".
-------------------------------------
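put "apple" into word_uno  // placeholder search terms, swap in your own
put "pear" into word_dos
put fld 1 into v  // raw text
put empty into h  // start clean so fld 2 is empty when nothing matches
repeat for each line k in v
  if (word_uno is in k) AND (word_dos is in k) then  // AND condition
  -- if (word_uno is in k) OR (word_dos is in k) then  // OR condition
     put k & cr after h
  end if
end repeat
delete last char of h  // drop the trailing return
put h into fld 2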
---------------------------

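Note that "is in" matches substrings, so "cat" also hits "category". To make word_uno and word_dos match only real words you have to put spaces before and after them, and also handle words at the start or end of the line or next to punctuation. But the template gives you the idea.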
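
One way to get whole-word matching without padding the search terms with spaces is the "is among the words of" operator. The following is only a sketch: the function name and parameter names are made up for illustration, and it assumes that swapping all punctuation for spaces is acceptable for your text (so "don't" is treated as "don" and "t").

```livecode
-- Hypothetical helper: returns the paragraphs (lines) of pText that
-- contain both pWordA and pWordB as whole words.
function paragraphsWithBothWords pText, pWordA, pWordB
   local tHits, tClean
   repeat for each line tPara in pText
      -- swap punctuation for spaces so "cat," still counts as the word "cat"
      put replaceText(tPara, "[[:punct:]]", space) into tClean
      if (pWordA is among the words of tClean) and (pWordB is among the words of tClean) then
         put tPara & cr after tHits  -- keep the original, unstripped paragraph
      end if
   end repeat
   delete last char of tHits  -- drop the trailing return
   return tHits
end paragraphsWithBothWords
```

You would call it with something like: put paragraphsWithBothWords(fld 1, "apple", "pear") into fld 2. For the OR condition, change "and" to "or" in the if line.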

You can also use the "filter" command. For the AND condition, filter twice, once with each word. For the OR condition, filter a separate copy of the text for each word, then combine the results.
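
Here is a rough sketch of the filter approach. The function and parameter names are invented for the example. Two things to watch: combining the two OR results naively would list a paragraph twice if it contains both words, so the sketch removes those from the second list with "filter ... without"; and filter treats *, ? and [ ] in the search word as wildcards, so this assumes plain words. Like "is in", the pattern matches substrings, and case handling may differ from "is in", so test it on your own text.

```livecode
-- Hypothetical helper: pMode is "AND" or "OR".
function filterBoolean pText, pWordA, pWordB, pMode
   local tA, tB
   put pText into tA
   filter tA with ("*" & pWordA & "*")  -- paragraphs containing word A
   if pMode is "AND" then
      filter tA with ("*" & pWordB & "*")  -- ...that also contain word B
      return tA
   end if
   -- OR: paragraphs with B but not A, so nothing is listed twice
   put pText into tB
   filter tB with ("*" & pWordB & "*")
   filter tB without ("*" & pWordA & "*")
   if tA is empty then return tB
   if tB is empty then return tA
   return tA & cr & tB
end filterBoolean
```

Note that the OR result is no longer in document order (all the word A paragraphs come first). If order matters, the repeat loop is the simpler choice.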


The two techniques above will succeed in "returning paragraphs where the hits are located".

I'm not sure exactly what the following means:

"(clickable to give section containing paragraph desired.)"


I have some ideas about "proximity" too, if these are the kinds of suggestions you are interested in. It's really easier than you might think. I think there was also a recent thread on indexing strategies, if you want to go that route.
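
As a starting point, here is one possible reading of "word a within x words of word b", checked one paragraph at a time. It is only a sketch: the function name and parameters are made up, punctuation is not stripped (so "cat," will not match "cat"), and it compares every position of one word against every position of the other, which is fine for a paragraph but not tuned for speed.

```livecode
-- Hypothetical helper: true if pWordA and pWordB occur within
-- pMaxGap words of each other somewhere in pPara.
function wordsWithin pPara, pWordA, pWordB, pMaxGap
   local tPosA, tPosB, tIndex
   put 0 into tIndex
   repeat for each word tWord in pPara
      add 1 to tIndex
      -- remember the word number of every hit for each search word
      if tWord is pWordA then put tIndex & comma after tPosA
      if tWord is pWordB then put tIndex & comma after tPosB
   end repeat
   repeat for each item tA in tPosA
      repeat for each item tB in tPosB
         if abs(tA - tB) <= pMaxGap then return true
      end repeat
   end repeat
   return false
end wordsWithin
```

Inside the repeat loop over paragraphs you could then test: if wordsWithin(k, "apple", "pear", 5) then put k & cr after h. If both search words are the same, a single occurrence counts as a hit, so guard against that if it matters.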

--- On Thu, 3/4/10, James Hale <james at thehales.id.au> wrote:

> From: James Hale <james at thehales.id.au>
> Subject: Best way to carry out boolean searches on large text body.
> To: use-revolution at lists.runrev.com
> Date: Thursday, March 4, 2010, 8:40 PM
> I would like the opinion of list
> members on the best way to approach the following task:
> Being able to perform boolean and proximity searches on a
> large body of text (say 400,000 words) returning paragraphs
> where the hits are located (clickable to give section
> containing paragraph desired.)
> 
> Of course by boolean I mean "find where 'word a' AND
> 'word b' occur in the same paragraph". Same with OR.
> By proximity I mean "find where 'word a' is within x words
> of 'word b'".
> 
> In thinking about this for a while and considering how to
> store the text and ways to search it I have come to the
> conclusion that a database of words contained in the text is
> the way to go. By this I mean effectively indexing every
> word and its position in the text and then using database
> operations on this index file to produce a list of hits. The
> hits giving me either chunk expressions to display the
> relevant text blocks or record ID if I also store paragraphs
> as individual records in a further database file. 
> 
> For example with the boolean AND, get a selection of 'word
> a' hits and then a selection of 'word b' hits and find where
> they intersect based on the paragraph numbers. This would
> result in a list of paragraphs only containing both words.
> 
> Do members think this an overkill?
> Has anybody else looked at this?
> 
> Any comments would be appreciated.
> 
> 
> 
> James