word counts - what is going on?

James Hale james at thehales.id.au
Tue Aug 14 23:39:15 EDT 2012


Well,

lots of suggestions and attempts at humour. Nice.

The problem with using the word chunk boils down to quoted text: a quoted block is treated as a single word, so I cannot select a word inside it with the word chunk command in order to, say, hilite it.
Certainly I could replace the quotes with curly quotes etc., but as the source text is open (i.e. not within my control) I have no idea whether that would cause some unforeseen problem with the text presentation itself.

As mentioned, I decided to process the text by character and fully control word boundaries myself. Doing this resulted in the following timings:

0.022096+7.497033 secs for 488872 words

The downside is that I actually ended up with some 500326 'words' in my array.
The extra words are the components of quoted strings, plus a number of dotted strings that were broken up (e.g. web addresses).
The upshot is that the time penalty was only about 5 secs extra.
A good result, all things considered (this process only needs to take place once).
The script provides an array entry for each word holding the word itself, its line number within the text, its character position from the start of the text, its character position from the start of the line, and the length of the word.
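
For anyone wanting to see the shape of the idea, here is a minimal sketch of the per-character pass, written in Python rather than LiveCode and not the actual script. It assumes a word is any run of letters or digits, which is why quoted strings and dotted strings come apart into their components. The names WordEntry and index_words are made up for this illustration, and positions are 1-based to match LiveCode chunk numbering.

```python
from typing import NamedTuple


class WordEntry(NamedTuple):
    word: str
    line_no: int   # 1-based line number within the text
    text_pos: int  # 1-based character position from the start of the text
    line_pos: int  # 1-based character position from the start of the line
    length: int    # number of characters in the word


def index_words(text: str) -> list[WordEntry]:
    """Scan the text one character at a time and record every word found.

    Assumption: a word is a maximal run of alphanumeric characters, so
    quotes, dots and other punctuation always end the current word.
    """
    entries = []
    line_no = 1
    line_start = 0      # 0-based offset of the first character of the line
    word_start = None   # 0-based offset of the word in progress, if any

    # The extra newline acts as a sentinel so the final word is flushed.
    for i, ch in enumerate(text + "\n"):
        if ch.isalnum():
            if word_start is None:
                word_start = i
            continue
        if word_start is not None:
            word = text[word_start:i]
            entries.append(WordEntry(word, line_no, word_start + 1,
                                     word_start - line_start + 1, len(word)))
            word_start = None
        if ch == "\n":
            line_no += 1
            line_start = i + 1
    return entries


sample = 'He said "the text" here.\nSee www.example.com'
for entry in index_words(sample):
    print(entry)
```

With this rule the quoted block yields "the" and "text" as separate entries and the web address yields three entries, which is the same kind of inflation in the word count described above.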

On 15/08/2012, at 2:37 AM, Michael Kann <mikekann at yahoo.com> wrote:

> Can you give us the skinny on what you are trying to do? What do you want your output to look like?
> Mike

The purpose: this is an application that will read an ebook (currently epub), display it, and allow searching and annotation (with hierarchical tagging) for the purpose of studying texts.

This current issue was concerned with enabling boolean and proximity searches on the text.

Boolean searches can be done with straight Livecode scripting without much trouble, although once there are 3 or 4 terms the search can slow down a bit. However, apart from speed issues, I wanted to display the number of hits for each term, as well as the number of hits for the boolean combination, as the terms are entered into the search block.
For example:

Search Term     Hits     Hits Boolean
"text"            45
                                   27
"book"           123
 
So this tells me there were 45 hits for "text", 123 hits for "book" and 27 hits where "text" and "book" appear within the same paragraph (line).
I think the best way to do this is to use SQL to do the joins and counts, which I am assuming should be fairly quick (I could be wrong here, but I hope not).
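
As a rough sketch of what those joins and counts might look like, here is a small Python example using the standard sqlite3 module. The table name words, its columns, and the helper names term_hits and and_hits are all assumptions made for this illustration, not the app's schema. It also assumes a boolean hit means one line (paragraph) that contains both terms, and that words are stored in lower case so a plain equality test matches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT, line_no INTEGER, text_pos INTEGER)")
conn.execute("CREATE INDEX idx_words_word ON words (word)")

# Tiny hand-made sample: (word, line number, character position).
conn.executemany(
    "INSERT INTO words VALUES (?, ?, ?)",
    [
        ("text", 1, 5), ("book", 1, 20),
        ("text", 2, 40),
        ("book", 3, 60), ("text", 3, 70), ("text", 3, 80),
    ],
)


def term_hits(conn, term):
    """Number of occurrences of a single term."""
    sql = "SELECT COUNT(*) FROM words WHERE word = ?"
    return conn.execute(sql, (term,)).fetchone()[0]


def and_hits(conn, term_a, term_b):
    """Number of distinct lines that contain both terms (self-join on line)."""
    sql = """
        SELECT COUNT(DISTINCT a.line_no)
        FROM words AS a
        JOIN words AS b ON a.line_no = b.line_no
        WHERE a.word = ? AND b.word = ?
    """
    return conn.execute(sql, (term_a, term_b)).fetchone()[0]


print(term_hits(conn, "text"), term_hits(conn, "book"),
      and_hits(conn, "text", "book"))
```

Counting distinct lines rather than every matching pair keeps a line with several occurrences of a term from being counted more than once.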
The character positions provide the proximity detail and also make it easy to show the hits in context, for example:

     …the text was later supplied in book form to anyone th…..
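
A minimal Python sketch of how the stored positions could drive both the context display and a proximity test follows. The function names context and within, the default width of 20 characters, and measuring proximity in characters are all assumptions for illustration; positions are taken to be 1-based, as in the array entries described earlier.

```python
def context(text, text_pos, length, width=20):
    """Return the hit with up to `width` characters either side of it.

    text_pos is the 1-based position of the word from the start of the
    text and length is the word's length.
    """
    start = max(0, text_pos - 1 - width)
    end = min(len(text), text_pos - 1 + length + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


def within(pos_a, pos_b, max_chars):
    """True when two hits start no more than max_chars characters apart."""
    return abs(pos_a - pos_b) <= max_chars


passage = "Originally the text was later supplied in book form to anyone who asked."
text_pos = passage.index("text") + 1   # convert to a 1-based position
book_pos = passage.index("book") + 1

print(context(passage, text_pos, len("text"), width=10))
print(within(text_pos, book_pos, 30))
```

A proximity limit counted in words instead of characters would need the word's sequence number stored in each entry as well.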

I also realised that the full-text search (FTS) module in sqlite could do all this, but I couldn't guarantee it would be available, as not all installations of sqlite have this module compiled in, and I didn't want to go down the road of compiling and supplying it myself. My app is initially for Mac but will also be compiled for Windows once I get a working beta. I also plan to accept other text formats such as .txt, .html, .rtf and perhaps markdown, but it is early days yet.

Thanks again to everyone who has made suggestions.

James

james at thehales.id.au

Tel: +61 3 9386 2516    
Fax: +61 3 9386 1387






