convert to lower ascii 128?

Ben Rubinstein benr at cogapp.com
Thu Feb 5 14:45:55 EST 2009


 From the dept of "Please don't let my wife or boss know that I've been 
wasting time on this"....

Chipp Walters wrote:
> Interesting note: I found the following results:
> function altAsciiScrub2 pText
>    put replacetext(pText,"[" & numToChar(129) & "-" & numToChar(255) & "]","") into pText
>    put replacetext(pText,"[" & numToChar(1) & "-" & numToChar(31) & "]","") into pText
>    return pText
> end altAsciiScrub2
> 
> executed in 19 ticks on my QuadCore Vista 64 machine
> 
> function altAsciiScrub1 pText
>    repeat for each char c in pText
>       get charToNum(c)
>       if it > 128 or it < 32 then
>          next repeat
>       end if
>       put c after t
>    end repeat
>    return t
> end altAsciiScrub1
> 
> executed in 17 ticks on my QuadCore Vista 64 machine
> 
> repeat for each is really fast.


Looking at this I couldn't help wondering "does it make a difference if there 
are a lot of high-code characters to delete?" and "regex is a bit slower, but 
does the setup of regex pay off if the string is long enough?"  (Chipp didn't 
specify what kind of input he wanted to work on, or tested with.)

And I also wondered, "what other ways could we trade off setup over a long 
repeat"?

(BTW strictly speaking ASCII is 0-127; both the routines above are allowing 
128.  I only mention because it took me a while to figure out why my routines 
sometimes returned different results to the above two; the difference depended 
on whether there was a character with code 128 in the test string.)
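
To make the boundary concrete, here is a minimal sketch of the repeat-for-each 
routine with a strict cut-off, so that code 128 is dropped as well.  The name 
asciiScrubStrict is made up for illustration; it keeps codes 32 to 127, the 
same range the routines below retain.

    function asciiScrubStrict pText
       put empty into t
       repeat for each char c in pText
          get charToNum(c)
          -- keep only codes 32..127; 128 and above are dropped
          if it >= 32 and it <= 127 then
             put c after t
          end if
       end repeat
       return t
    end asciiScrubStrict

Given a string containing numToChar(128), this should return a shorter result 
than altAsciiScrub1, which lets that character through.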

So I implemented another couple of options: one taking Devin's suggestion of 
calling replace with the characters you don't want: thus only calling 
numtochar/chartonum a fixed number of times, regardless of the length of the 
string.

    function asciiScrub3 pText
       set the caseSensitive to true
       -- delete any characters below space
       repeat with i = 0 to 31
          replace numtochar(i) with empty in pText
       end repeat
       -- delete any characters above ASCII
       repeat with i = 128 to 255
          replace numtochar(i) with empty in pText
       end repeat
       return pText
    end asciiScrub3

As you'd expect, this is much slower than the above approaches for a short 
string; but it is slightly faster than either of them for a large string - 
more so if there are a lot of high-code characters in the string.

In my work, I'm often dealing with non-ASCII characters; but it rarely 
suffices to delete them, I generally need to convert them to something else 
(either the relevant ASCII character, or to another character set).  The 
built-in functions in Rev are often frustratingly just off the mark for this, 
so I tend to just run the characters through an array, blessing each time I do 
so how fast repeat for each is, and how fast arrays are.  So I naturally 
wondered how that approach would work even when I just wanted to delete the 
characters above 127:


    function asciiScrub4 pText
       set the caseSensitive to true
       -- set up array to map characters we want to retain to themselves
       put empty into a
       repeat with i = 32 to 127
          get numtochar(i)
          put it into a[it]
       end repeat
       -- filter the string through the array set up above
       put empty into t
       repeat for each char c in pText
          put a[c] after t
       end repeat
       return t
    end asciiScrub4

Again, as you'd expect, the setup time costs on a short string.  But on large 
blocks of text, this turned out to be the fastest method - not quite twice as 
fast as the original two, but getting there.
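
The same array can convert characters instead of just deleting them, which is 
the case described above.  The following is only a sketch: the name 
asciiConvert and the three replacements are illustrative, and the character 
codes assume the text is in ISO-8859-1 / Windows-1252 (on a Mac the native 
codes would differ).

    function asciiConvert pText
       set the caseSensitive to true
       -- printable ASCII maps to itself, as in asciiScrub4
       put empty into tMap
       repeat with i = 32 to 127
          get numtochar(i)
          put it into tMap[it]
       end repeat
       -- illustrative replacements; codes assume ISO-8859-1 / Windows-1252
       put "e" into tMap[numtochar(233)]    -- e acute
       put "u" into tMap[numtochar(252)]    -- u umlaut
       put space into tMap[numtochar(160)]  -- non-breaking space
       -- filter the string through the array; unmapped characters vanish
       put empty into t
       repeat for each char c in pText
          put tMap[c] after t
       end repeat
       return t
    end asciiConvert

Any character without an entry in the array yields empty, so unmapped 
high-code characters are still deleted, exactly as in asciiScrub4.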

As for composition, it makes less of a difference than I thought: all 
functions are slightly faster if the source string contains more high-code 
characters (i.e. if the output string is shorter), and there doesn't seem to 
be a very significant difference in how the routines are affected by this.
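
For anyone who wants to repeat this kind of comparison, here is a rough 
sketch of a test harness.  It is not the code used for the timings above; 
buildTestText and timeScrub are hypothetical names, and the length and 
percentage passed in are arbitrary.  It uses the milliseconds rather than the 
ticks only because the resolution is finer.

    -- build pLength random chars, roughly pHighPercent% of them high-code
    function buildTestText pLength, pHighPercent
       put empty into tText
       repeat pLength times
          if random(100) <= pHighPercent then
             put numtochar(128 + random(127)) after tText  -- codes 129..255
          else
             put numtochar(31 + random(96)) after tText    -- codes 32..127
          end if
       end repeat
       return tText
    end buildTestText

    on timeScrub
       put buildTestText(100000, 20) into tText
       put the milliseconds into tStart
       get asciiScrub4(tText)
       put the milliseconds - tStart into tElapsed
       put "asciiScrub4:" && tElapsed && "ms"
    end timeScrub

Swapping in the other function names, and varying the two arguments to 
buildTestText, gives a way to compare the routines on short versus long 
strings and on different proportions of high-code characters.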

Mostly what I demonstrated was that all the approaches are so fast that 
especially on a short string it's hard to get any significance in the relative 
timings - you'd have to be doing a vast amount of processing to justify 
spending any time doing better than whatever the first approach you came up 
with was.  And that sometimes I'll do almost anything to avoid the work I 
should be getting on with...

- Ben



