SpellCheck (re-inventing the wheel)
Alex Tweedly
alex at tweedly.net
Thu Jan 27 20:14:02 EST 2005
Roger.E.Eller at sealedair.com wrote:
>Thank you to Andre, Jonathan, Alex, and Jim!!!
>
>Your great suggestions and word lists have provided me the necessary
>ingredients to achieve my goal. I will probably post further questions for
>optimizing the speed of word comparisons. If you have ideas or a script
>that works well, please post it if you don't mind. I appreciate your help!
>
>
Roger,
this was a very useful trigger for me. I've had a half-completed project
to write some word-game programs that got left behind a couple of months
ago.
One of the things I needed then was a spellchecker; I looked briefly at
the Mozilla spellchecker (but didn't like their dictionary - seemed to
have a lot of junk in it for my purposes). That was when I found the
OpenOffice dictionaries, and looked at them enough to figure it would
take some work (and in particular, expanding their downloadable
dictionary to a simple word list would take a C compiler - which set off
my allergy to using C :-)
I looked at it again, and decided I could dirty my hands for 5 minutes, so
I downloaded the MySpell package and compiled the unmunch program to
convert from dict+affix to a simple word list.
Given that word list (162K words, 1.75Mbytes), I tried the simple brute
force method, namely
> put the millisecs into tStart
> put tWords into field "inField"
> put 0 into t
> repeat for each word w in tWords
>   add 1 to t
>   replace "." with empty in w -- probably more of these should be done
>   replace "," with empty in w
>   replace "!" with empty in w
>   if w is not among the words of gWords then
>     set the textstyle of word t of field "inField" to "bold"
>   end if
> end repeat
This took on average 8 millisecs per word in tWords. Perfectly adequate
for small input "documents".
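The comment in the loop above notes that more punctuation should probably be stripped. One way to generalise the three "replace" lines is a small helper function; this is only a sketch, and both the name stripPunctuation and the set of characters it removes are assumptions, not part of the original stack.

```livecode
-- Hypothetical helper (not in the original stack): removes every
-- occurrence of a few punctuation characters from a single word.
function stripPunctuation pWord
   local tPunct
   -- the character set below is an assumption; extend it as needed
   put ".,!?;:()" & quote into tPunct
   repeat for each char c in tPunct
      replace c with empty in pWord
   end repeat
   return pWord
end stripPunctuation
```

Inside the loop it could be called as "put stripPunctuation(w) into tClean", with tClean (also a hypothetical name) then tested in place of w. Like the original "replace" lines, it removes the characters wherever they appear in the word, not just at the ends.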
Then I tried a slightly more complex way:
setup
> put url ("file:" & tFile) into gWords
> repeat for each word w in gWords
>   put 1 into gArray[w]
> end repeat
and then
> put tWords into field "inField"
> put 0 into t
> repeat for each word w in tWords
>   add 1 to t
>   replace "." with empty in w
>   replace "," with empty in w
>   replace "!" with empty in w
>   if gArray[w] <> 1 then
>     set the textstyle of word t of field "inField" to "bold"
>   end if
> end repeat
This took 2 millisecs for 50 words (about 0.04 millisecs per word, roughly
200 times faster than the brute-force version), so would be reasonable for
even large-ish documents.
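The array lookup can also be separated from the field styling, so that the lookup can be timed on its own. The following is a minimal sketch under stated assumptions: the function name unknownWordNumbers, the mouseUp handler and the variables tClean, tBad and tStart used this way are hypothetical, and gArray is assumed to have been filled by the setup step already.

```livecode
global gArray -- assumed to be filled by the setup loop shown earlier

-- Hypothetical function: returns a comma-separated list of the word
-- numbers in pText whose (punctuation-stripped) word is not a key in gArray.
function unknownWordNumbers pText
   local tCount, tClean, tResult
   put 0 into tCount
   repeat for each word w in pText
      add 1 to tCount
      put w into tClean -- work on a copy rather than the loop variable
      replace "." with empty in tClean
      replace "," with empty in tClean
      replace "!" with empty in tClean
      if gArray[tClean] <> 1 then
         put tCount & comma after tResult
      end if
   end repeat
   if tResult is not empty then delete the last char of tResult
   return tResult
end unknownWordNumbers

-- Hypothetical usage: time the lookup, then bold the unknown words.
on mouseUp
   local tStart, tBad
   put the millisecs into tStart
   put unknownWordNumbers(field "inField") into tBad
   put the millisecs - tStart && "millisecs for lookup" -- shown in the message box
   repeat for each item i in tBad
      set the textStyle of word i of field "inField" to "bold"
   end repeat
end mouseUp
```

Keeping the styling out of the timed section means the figure reflects only the array lookups; whether that changes the measured time noticeably would need to be checked on the real stack.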
I tried to put this sample stack onto RevOnline - but I'm having some
problem connecting to the server, so you can find it instead at
www.tweedly.net/RunRev/SpellCheck.rev
www.tweedly.net/RunRev/allwords.dic
(remember the dic is 1.75M - don't download it unless you really want it!)
-- Alex.