Best way to carry out boolean searches on large text body.

Sarah Reichelt sarah.reichelt at gmail.com
Thu Mar 4 22:10:27 EST 2010


On Fri, Mar 5, 2010 at 12:40 PM, James Hale <james at thehales.id.au> wrote:
> I would like the opinion of list members on the best way to approach the following task: performing boolean and proximity searches on a large body of text (say 400,000 words) and returning the paragraphs where the hits are located (each clickable to show the section containing that paragraph).
>
> Of course by boolean I mean "find where 'word a' AND 'word b' occur in the same paragraph". Same with OR.
> By proximity I mean "find where 'word a' is within x words of 'word b'".
>
> Having thought about this for a while, and having considered how to store the text and ways to search it, I have come to the conclusion that a database of the words contained in the text is the way to go. By this I mean effectively indexing every word and its position in the text, and then using database operations on this index file to produce a list of hits. The hits would give me either chunk expressions to display the relevant text blocks, or record IDs if I also store the paragraphs as individual records in a further database file.
>
> For example, with the boolean AND, get a selection of 'word a' hits and then a selection of 'word b' hits, and find where they intersect based on the paragraph numbers. This would result in a list of only those paragraphs containing both words.
>
> Do members think this is overkill?
> Has anybody else looked at this?


Have you tried using the filter command? Remembering that a paragraph
is really just a line, this might work well for finding paragraphs
where the two words occur.

e.g.
filter tData with "*" & wordA & "*"
filter tData with "*" & wordB & "*"

would leave you only with lines containing both wordA and wordB.
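
For the OR case, chaining two filters won't work, because the first
filter throws away lines that only contain the second word. A rough,
untested sketch using a repeat loop instead (tEither is just a name
made up for this example; tData, wordA and wordB are as above):

```livecode
-- Collect every line (paragraph) that contains wordA OR wordB.
-- tEither is a hypothetical variable name used for illustration.
put empty into tEither
repeat for each line tLine in tData
   if (tLine contains wordA) or (tLine contains wordB) then
      put tLine & return after tEither
   end if
end repeat
-- remove the trailing return, if anything was found
if tEither is not empty then delete the last char of tEither
```

Note that both the filter patterns and "contains" match substrings, so
a search for "cat" will also hit "category". If that matters, the hits
would need a second pass that compares whole words.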

Proximity is a different type of search, but once you had narrowed the
data to just those lines with both words, a repeat loop over the
remaining lines would be easy to write and would have far less text to
scan.
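
Something along these lines might do for the proximity test. It is an
untested sketch, not a finished solution: wordsAreNear, pMaxGap and
tHits are names invented for the example, and the 5 is an arbitrary
distance.

```livecode
-- Hypothetical helper: returns true if pWordA occurs within pMaxGap
-- words of pWordB somewhere in pLine.
function wordsAreNear pLine, pWordA, pWordB, pMaxGap
   local tPosA, tPosB, tIndex
   put 0 into tIndex
   -- record the word number of every occurrence of each search word
   repeat for each word tWord in pLine
      add 1 to tIndex
      if tWord is pWordA then put tIndex & comma after tPosA
      if tWord is pWordB then put tIndex & comma after tPosB
   end repeat
   -- compare every position of word A with every position of word B
   repeat for each item tA in tPosA
      repeat for each item tB in tPosB
         if abs(tA - tB) <= pMaxGap then return true
      end repeat
   end repeat
   return false
end wordsAreNear
```

It would be used on the already filtered data:

```livecode
filter tData with "*" & wordA & "*"
filter tData with "*" & wordB & "*"
put empty into tHits
repeat for each line tLine in tData
   if wordsAreNear(tLine, wordA, wordB, 5) then
      put tLine & return after tHits
   end if
end repeat
```

One caveat: word chunks are split on whitespace, so punctuation stays
attached ("cat," is not the same word as "cat") and a quoted phrase
counts as a single word. Stripping punctuation from each line before
the comparison would get around that.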

Cheers,
Sarah