SpellCheck (re-inventing the wheel)

Alex Tweedly alex at tweedly.net
Thu Jan 27 20:14:02 EST 2005


Roger.E.Eller at sealedair.com wrote:

>Thank you to Andre, Jonathan, Alex, and Jim!!!
>
>Your great suggestions and word lists have provided me the necessary 
>ingredients to achieve my goal. I will probably post further questions for 
>optimizing the speed of word comparisons. If you have ideas or a script 
>that works well, please post it if you don't mind. I appreciate your help!
>  
>
Roger,
this was a very useful trigger for me. I've had a half-completed project 
to write some word-game programs that got left behind a couple of months 
ago.

One of the things I needed then was a spellchecker; I looked briefly at 
the Mozilla spellchecker (but didn't like their dictionary - seemed to 
have a lot of junk in it for my purposes). That was when I found the 
OpenOffice dictionaries, and looked at them enough to figure it would 
take some work (and in particular, expanding their downloadable 
dictionary to a simple word list would take a C compiler - which set off 
my allergy to using C :-)

I looked at it again, and decided I could dirty my hands for 5 minutes, 
so I downloaded the MySpell package and compiled the unmunch program, 
which converts from dict+affix to a simple word list.

Given that word list (162K words, 1.75Mbytes), I tried the simple brute 
force method, namely

>   put the millisecs into tStart
>   put tWords into field "inField"
>   put 0 into t
>   repeat for each word w in tWords
>     add 1 to t
>     replace "." with empty in w  -- probably more of these should be done
>     replace "," with empty in w
>     replace "!" with empty in w
>     if w is not among the words of gWords then
>       set the textstyle of word t of field "inField" to "bold"
>     end if
>   end repeat


This took on average 8 millisecs per word in tWords. Perfectly adequate 
for small input "documents".
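The three "replace" lines above only cover a few punctuation marks. A 
minimal sketch of a more general helper follows; the name 
stripPunctuation and the exact character list are assumptions for 
illustration, not part of the posted stack. Apostrophes are deliberately 
left alone so that words like "don't" survive.

```livecode
-- hypothetical helper, not part of the original stack
function stripPunctuation pWord
  -- characters to strip; extend this list as needed
  repeat for each char c in ".,!?;:()" & quote
    replace c with empty in pWord
  end repeat
  return pWord
end stripPunctuation
```

Inside the loop it would be used as "put stripPunctuation(w) into w" in 
place of the individual replace lines.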

Then I tried a slightly more complex way. First, the setup:

>   put url ("file:" & tFile) into gWords
>   repeat for each word w in gWords
>     put 1 into  gArray[w]
>   end repeat

and then

>   put tWords into field "inField"
>   put 0 into t
>   repeat for each word w in tWords
>     add 1 to t
>     replace "." with empty in w
>     replace "," with empty in w
>     replace "!" with empty in w
>     if gArray[w] <> 1 then
>       set the textstyle of word t of field "inField" to "bold"
>     end if
>   end repeat

This took 2 millisecs for 50 words, so would be reasonable for even 
large-ish documents.
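
The difference is presumably because "is among the words of" has to scan 
the 162K-word list for every input word, while the array lookup goes 
straight to the key. A minimal sketch of the array check, separated from 
the field styling so the lookup can be timed on its own, is below. The 
handler names misspelledWords and timeCheck are assumptions for 
illustration; gArray is the array built in the setup step above.

```livecode
-- hypothetical handlers; gArray is filled by the setup loop shown earlier
global gArray

-- returns the words of pText that are not keys of gArray, one per line
function misspelledWords pText
  local tResult
  repeat for each word w in pText
    if gArray[w] <> 1 then
      put w & return after tResult
    end if
  end repeat
  return tResult
end misspelledWords

-- times the lookup alone and reports the elapsed time in the message box
on timeCheck
  local tStart, tBad
  put the millisecs into tStart
  put misspelledWords(field "inField") into tBad
  put (the millisecs - tStart) && "ms" into msg
end timeCheck
```

Punctuation would still need to be stripped from each word before the 
lookup, as in the loops above.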

I tried to put this sample stack onto RevOnline - but I'm having some 
problem connecting to the server, so you can find it instead at
   www.tweedly.net/RunRev/SpellCheck.rev
   www.tweedly.net/RunRev/allwords.dic
(remember the dic is 1.75 Mbytes - don't download it unless you really want it!)

-- Alex.

