word counts - what is going on?

Michael Kann mikekann at yahoo.com
Wed Aug 15 09:19:28 EDT 2012


Sounds like a good project. In case you haven't come across it yourself, I'll just mention one of my all-time favorite scripts. It outputs a frequency list of the words in a text file. Something like:

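on mouseUp
   -- Assumption: the text to analyse sits in a field named "source"
   put field "source" into fileContent

   -- Count each word; an empty array element counts as 0, so "add 1" works
   repeat for each word w in fileContent
      add 1 to wordCount[w]
   end repeat

   -- The keys come back one per line; sort them alphabetically
   put the keys of wordCount into keyWords
   sort lines of keyWords

   -- Build "word <tab> count" lines for display
   repeat for each line l in keyWords
      put l & tab & wordCount[l] & return after displayResult
   end repeat

   put displayResult into field "result"
end mouseUp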

I think it came from Richard Gaskin, or perhaps the Almighty Himself (Scott Raney).

Good luck, 

--- On Tue, 8/14/12, James Hale <james at thehales.id.au> wrote:

From: James Hale <james at thehales.id.au>
Subject: Re: word counts - what is going on?
To: use-livecode at lists.runrev.com
Date: Tuesday, August 14, 2012, 10:39 PM


Lots of suggestions and attempts at humour. Nice.

The problem with the word chunk is that quoted text is treated as a single word, so I can't address an individual word inside a quoted block with a word chunk expression, say to hilite it.
Certainly I could replace the quotes with curly quotes etc., but as the source text is open (i.e. not within my control) I have no idea whether that would cause some unforeseen problem with the text presentation itself.

As mentioned, I decided to process the text character by character and control the word boundaries myself. Doing this resulted in the following timing:

0.022096+7.497033 secs for 488872 words

The downside is that I actually ended up with some 500326 'words' in my array.
The extra words are the components of quoted strings, plus a number of dotted strings that were broken up (e.g. web addresses).
The upshot is that the time penalty was only about 5 secs.
A good result, all things considered, since this process only needs to take place once.
For each word, the script provides an array entry with the word itself, its line number within the text, its character position from the start of the text, its character position from the start of the line, and the length of the word.
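
For anyone following along, here is a rough sketch of that kind of character-by-character pass. It is written in Python rather than LiveCode, purely as an illustration, and it is not the actual script. The function name index_words, the dictionary keys, and the rule that a word is a run of letters, digits or apostrophes are all assumptions made for the example.

```python
def index_words(text):
    """Scan text one character at a time and record every word found.

    Assumption: a "word" is a run of letters, digits or apostrophes;
    anything else (quotes, dots, spaces, newlines) is a boundary.
    """
    entries = []
    line_no = 1        # 1-based line number of the current character
    line_start = 0     # 0-based offset where the current line begins
    word_start = None  # 0-based offset where the current word began

    # A newline is appended so the final word is always flushed.
    for pos, ch in enumerate(text + "\n"):
        if ch.isalnum() or ch == "'":
            if word_start is None:
                word_start = pos
            continue

        # Boundary character: close the word in progress, if any.
        if word_start is not None:
            entries.append({
                "word": text[word_start:pos],
                "line": line_no,
                "char_in_text": word_start + 1,               # 1-based
                "char_in_line": word_start - line_start + 1,  # 1-based
                "length": pos - word_start,
            })
            word_start = None

        if ch == "\n":
            line_no += 1
            line_start = pos + 1

    return entries


if __name__ == "__main__":
    for entry in index_words('He said "hello world"\nsee www.example.com'):
        print(entry)
```

With this boundary rule the quoted string yields "hello" and "world" as separate entries, and the web address is split into "www", "example" and "com", which is the same kind of inflation in the word count described above.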

On 15/08/2012, at 2:37 AM, Michael Kann <mikekann at yahoo.com> wrote:

> Can you give us the skinny on what you are trying to do? What do you want your output to look like?
> Mike

The purpose: this is an application that will read an ebook (currently epub), display it, and allow searching and annotation (with hierarchical tagging) for the purpose of studying texts.

The current issue concerns enabling boolean and proximity searches on the text.

Boolean searches can be done with straight LiveCode scripting without much trouble, although once there are 3 or 4 terms the search can slow down a bit. Apart from the speed issue, however, I wanted to display the number of hits for each term, as well as the number of hits for the boolean combination, as the terms are entered into the search block.
For example:

Search Term    Hits    Hits Boolean
"text"           45
"book"          123              27
So this tells me there were 45 hits for "text", 123 hits for "book" and 27 hits where "text" and "book" appear within the same paragraph (line).
I am thinking the best way to do this is to use SQL joins and counts, which I assume should be fairly quick (I could be wrong here, but I hope not).
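
As a minimal sketch of the join-and-count idea, again in Python with its bundled sqlite3 module rather than LiveCode: the table name words, its columns, and the helper names build_index, term_hits and and_hits are all made up for the example. It also assumes that a boolean "hit" means one line (paragraph) containing both terms.

```python
import sqlite3


def build_index(lines):
    """Load one row per word occurrence into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, line INTEGER)")
    rows = []
    for line_no, line in enumerate(lines, start=1):
        for raw in line.lower().split():
            # Crude cleanup for the example: strip common punctuation.
            rows.append((raw.strip('.,;:!?"'), line_no))
    conn.executemany("INSERT INTO words VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX words_by_word ON words (word, line)")
    return conn


def term_hits(conn, term):
    """Number of occurrences of a single term."""
    sql = "SELECT COUNT(*) FROM words WHERE word = ?"
    return conn.execute(sql, (term,)).fetchone()[0]


def and_hits(conn, term_a, term_b):
    """Number of distinct lines that contain both terms (self-join on line)."""
    sql = """
        SELECT COUNT(DISTINCT a.line)
        FROM words AS a
        JOIN words AS b ON b.line = a.line
        WHERE a.word = ? AND b.word = ?
    """
    return conn.execute(sql, (term_a, term_b)).fetchone()[0]


if __name__ == "__main__":
    conn = build_index([
        "The text was later supplied in book form",
        "A book is a book",
        "Plain text only",
    ])
    print(term_hits(conn, "text"), term_hits(conn, "book"), and_hits(conn, "text", "book"))
```

The index on (word, line) is there so that both the per-term counts and the self-join can be answered from the index; whether that is fast enough on half a million rows would need measuring.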
The character positions provide the proximity detail and also make it easy to show the hits in context, for example:

     …the text was later supplied in book form to anyone th…..
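
A small sketch of how stored character positions could drive both the context display and a proximity test, in Python for illustration only. The names context and is_near, the width parameter, and the choice to measure proximity in characters are assumptions, not part of the app described here.

```python
def context(text, char_pos, length, width=20):
    """Return the hit plus up to `width` characters on either side.

    char_pos is the 1-based character position of the hit in the text,
    length is the length of the matched word.
    """
    start = max(0, char_pos - 1 - width)
    end = min(len(text), char_pos - 1 + length + width)
    snippet = text[start:end].replace("\n", " ")
    prefix = "…" if start > 0 else ""
    suffix = "…" if end < len(text) else ""
    return prefix + snippet + suffix


def is_near(pos_a, pos_b, max_gap):
    """True when two hits start within max_gap characters of each other."""
    return abs(pos_a - pos_b) <= max_gap


if __name__ == "__main__":
    sample = "abcdefghij text klmnopqrst"
    print(context(sample, 12, 4, width=3))
    print(is_near(12, 40, max_gap=30))
```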

I also realised that the FTS (full-text search) module in SQLite could do all this, but I can't guarantee that it will be available: not all installations of SQLite have this module compiled in, and I don't want to go down the road of compiling and supplying it myself. My app is initially for Mac, but will also be compiled for Windows once I have a working beta. I also plan to support other text formats such as .txt, .html, .rtf and perhaps Markdown, but it's early days yet.
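
One way to find out at run time whether a given SQLite build includes full-text search is simply to try creating a virtual table and see whether it fails. The sketch below does that from Python, so it only tells you about the SQLite library Python is linked against. The function name has_fts and the table name probe are made up, and the same CREATE VIRTUAL TABLE statement would have to be sent through whatever database library the app itself uses.

```python
import sqlite3


def has_fts(module="fts4"):
    """Return True if this SQLite build can create a table using `module`."""
    conn = sqlite3.connect(":memory:")
    try:
        # Fails with "no such module" when the extension is not compiled in.
        conn.execute(f"CREATE VIRTUAL TABLE probe USING {module}(body)")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("FTS4 available:", has_fts("fts4"))
```

A probe like this would let the app use full-text search where it exists and fall back to the plain word table elsewhere.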

Thanks again to everyone who has made suggestions.


james at thehales.id.au

Tel: +61 3 9386 2516    
Fax: +61 3 9386 1387
