Best way to carry out boolean searches on large text body.

James Hale james at thehales.id.au
Thu Mar 4 21:40:50 EST 2010


I would like list members' opinions on the best way to approach the following task: performing boolean and proximity searches on a large body of text (say 400,000 words) and returning the paragraphs where the hits are located, each clickable to display the section containing that paragraph.

By boolean I mean "find where 'word a' AND 'word b' occur in the same paragraph", and likewise for OR.
By proximity I mean "find where 'word a' is within x words of 'word b'".

Having thought about this for a while, and about how to store and search the text, I have concluded that a database of the words contained in the text is the way to go. By this I mean effectively indexing every word and its position in the text, then using database operations on this index file to produce a list of hits. The hits would give me either chunk expressions to display the relevant text blocks, or record IDs if I also store the paragraphs as individual records in a further database file.
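To make the idea concrete, here is a minimal sketch of such a word index. It is written in Python purely for illustration and uses an in-memory dictionary rather than a real database file. The function name build_index, the blank-line paragraph rule, and the simple word pattern are all assumptions of the sketch, not part of any existing library.

```python
import re
from collections import defaultdict

def build_index(text):
    """Return (paragraphs, index).

    index maps each lowercased word to a list of
    (paragraph_no, word_pos) pairs, both counted from 0.
    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    index = defaultdict(list)
    for para_no, para in enumerate(paragraphs):
        # Assumption: a "word" is a run of letters, digits or apostrophes.
        words = re.findall(r"[a-z0-9']+", para.lower())
        for word_pos, word in enumerate(words):
            index[word].append((para_no, word_pos))
    return paragraphs, index

sample = "The cat sat on the mat.\n\nA dog chased the cat.\n\nThe dog slept."
paragraphs, index = build_index(sample)
```

Each entry records which paragraph a word is in and where it sits within that paragraph, which is the information both the boolean and the proximity searches need.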

For example, for a boolean AND I would get a selection of 'word a' hits and a selection of 'word b' hits, then find where they intersect based on paragraph number. This would yield a list of only those paragraphs containing both words.
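The intersection approach might look roughly like the sketch below, again in Python for illustration only. The small hand-written index and the function names are hypothetical. The proximity search assumes that both words must fall in the same paragraph, which the description above does not actually specify.

```python
# Hypothetical index: word -> list of (paragraph_no, word_pos) pairs.
index = {
    "cat": [(0, 1), (1, 4)],
    "dog": [(1, 1), (2, 1)],
    "mat": [(0, 5)],
}

def paragraphs_with(index, word):
    """Set of paragraph numbers in which the word occurs."""
    return {para for para, _ in index.get(word.lower(), [])}

def search_and(index, a, b):
    """Paragraphs containing both words (set intersection)."""
    return sorted(paragraphs_with(index, a) & paragraphs_with(index, b))

def search_or(index, a, b):
    """Paragraphs containing either word (set union)."""
    return sorted(paragraphs_with(index, a) | paragraphs_with(index, b))

def search_near(index, a, b, x):
    """Paragraphs where a occurs within x words of b.

    Assumption: the two words must be in the same paragraph.
    """
    hits = set()
    b_hits = index.get(b.lower(), [])
    for para_a, pos_a in index.get(a.lower(), []):
        for para_b, pos_b in b_hits:
            if para_a == para_b and abs(pos_a - pos_b) <= x:
                hits.add(para_a)
    return sorted(hits)
```

The nested loop in search_near is the simplest thing that works. Because the hit lists come out of the index already ordered by paragraph and position, a merge-style scan would likely scale better on a 400,000-word text.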

Do members think this is overkill?
Has anybody else looked at this?

Any comments would be appreciated.



James







More information about the use-livecode mailing list