Not Shy...
Richard Gaskin
ambassador at fourthworld.com
Tue Jun 8 17:34:03 EDT 2004
Bob Nelson wrote:
> ...so let's dive in with both feet.
>
> HyperCard was my friend - and remained my friend until the advent of OS X
> and a new machine that won't boot into OS 9.x any more. Sad, since I used
> it for all sorts of cool tricks, especially the massaging of copious amounts
> of data that needed a good "cleaning" before dropping it into MySQL or
> FileMaker.
>
> A new project came along and prompted me to go hunting. One of the old
> HyperCard sites recommended Revolution or SuperCard, so I've been demo'ing
> Revolution for a couple of days to see how the package operates compared to
> other options - including RealBASIC.
>
> So I've got a little script that does a moderately simple thing: Grab a web
> page, bring it back, strip useless data out of it (right now, I've just got
> it stripping out the extra returns and leading spaces per line) and the next
> step will be to kill the HTML on the page so I can mine the data...
>
> My layout and code are fairly simple:
>
> Two fields and two buttons so I can work through the example - the first
> field is the 'holder' of the remote URL which has been retrieved
> (Imported_Raw) and the second field will be the restructured output when I'm
> done.
>
> Here's the code, for those who want to dive deeper...
>
> on mouseUp
>   put 0 into i
>   repeat forever
>     add 1 to i
>     if char 1 of line i of field "Imported_Raw" is numToChar(13) then
>       delete line i of field "Imported_Raw"
>       put "Ate one return at line " & i & " of " & \
>           the number of lines of field "Imported_Raw" & " total lines."
>       subtract 1 from i
>     end if
>     repeat while char 1 of line i of field "Imported_Raw" = " "
>       delete char 1 of line i of field "Imported_Raw"
>       put "Ate one space at line " & i & " of " & \
>           the number of lines of field "Imported_Raw" & " total lines."
>     end repeat
>     if line i of field "Imported_Raw" is the last line of field "Imported_Raw" then
>       exit repeat
>     end if
>   end repeat
>
> end mouseUp
>
>
> Here's what I noticed about execution:
>
> 1. Importing the URL is awesome - a great feature that makes my life soooo
> much easier for this project! And fast, too!
> 2. The page I grabbed consisted of 140,000 lines of code. After removing
> extra line feeds, the number of lines is around 80,000.
> 3. This script runs VERY slow, compared to relatively the same script in
> HyperCard running under 9.2.1 -- as an example, Revolution has been running
> this script for more than 18 hours and still hasn't finished processing.
> (And that's running on a Dual 2 GHz, 4 Gb RAM, OS X most current version
> with all updates.) Under HC, the similar script executed in about an hour -
> running on an iMac G3/233 with 1 Gb and OS 9.2.1 -- any comments regarding
> execution speed?
> 4. I don't see any mechanisms for determining progress of the operation --
> although I may have certainly missed something. Are there any progress
> bars, etc., that one can use in Revolution?
> 5. Looking through all the examples I can find, as well as documentation, I
> noted that there aren't many examples related to text manipulation - and
> importing/exporting text, etc., in/out of your stack. I'm sure I missed
> something on this front, as I'm sure people would be doing this all the
> time... Can anyone point me in a direction?
I think there may be issues with the original code. For example, the exit
test compares the text of line i with the text of the last line, so if the
last line of the file is empty then any empty line will cause it to exit
prematurely.
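One way around that, as a rough sketch only, is to compare line positions
rather than line contents. The skeleton below keeps the shape of the
original loop and leaves the per-line cleanup as a placeholder comment; it
assumes the field is still named "Imported_Raw":

```livecode
on mouseUp
  put 0 into i
  repeat forever
    add 1 to i
    -- ...per-line cleanup from the original script goes here...
    --
    -- Compare positions, not text: stop once i reaches the last line.
    -- The line count is re-read each time because the cleanup may
    -- delete lines as it goes.
    if i >= the number of lines of field "Imported_Raw" then exit repeat
  end repeat
end mouseUp
```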
I've revised the handler below, with comments to help describe the
admittedly liberal rewrite. The fixed value of 20 I use there with the mod
operator to throttle progress bar updates is arbitrary -- ideally you
should divide the number of lines by the number of useful scrollbar
increments and use the result as the mod value.
I also added a simple timing mechanism (the references to milliseconds
at the top and bottom) so you can see how fast it is and compare it with
similar additions to your existing script.
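For the existing script, that could look something like the sketch below.
The variable name tStart and the "ms elapsed" label are just placeholders,
and the body of the original handler is left as a comment:

```livecode
on mouseUp
  -- Record the start time in milliseconds:
  put the milliseconds into tStart
  --
  -- ...existing loop from the original script goes here, unchanged...
  --
  -- Show the elapsed time in the message box:
  put (the milliseconds - tStart) && "ms elapsed"
end mouseUp
```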
Even as it is, the handler below should be a few orders of magnitude
faster than what you have above. But if you raise the mod value for the
scrollbar update even higher you should see it gain another big speed
increase.
--
Richard Gaskin
Fourth World Media Corporation
___________________________________________________
Rev tools and more: http://www.fourthworld.com/rev
------------------------------------------------------------
on mouseUp
  -- Get initial timing:
  put the milliseconds into s
  --
  -- Always much faster to work in a variable than field data:
  put fld "Imported_Raw" into tData
  --
  -- Since we'll use the number of lines often, let's get it only once:
  put the number of lines of tData into tNumLines
  --
  -- Progress indicator -
  -- Add a scrollbar object, set the style to "progress":
  set the endValue of scrollbar 1 to tNumLines
  put 0 into i
  -- The "repeat for each" construct is often two or three orders of
  -- magnitude faster than any other form, since it parses the chunk
  -- referenced in it as it goes while keeping a pointer into the data.
  -- In order to maintain its place in the data it must treat the data
  -- as read-only, so we'll copy the data into another var for output:
  repeat for each line tLine in tData
    add 1 to i
    -- Update our progress bar
    -- Since the time it takes the OS to redraw the scrollbar can cut
    -- into our total processing time significantly, rather than
    -- updating it in each iteration we'll update it just every 20
    -- lines:
    if i mod 20 = 0 then set the thumbposition of scrollbar 1 to i
    --
    -- Using the constant "cr" is much faster than calling the numToChar
    -- function, which adds up a lot in a repeat, so we could use:
    --   if char 1 of tLine is cr then
    -- ...instead of:
    --   if char 1 of tLine is numToChar(13)
    --
    -- But since we're already parsing by lines that's done for us, all
    -- we need to do is see if the line is empty:
    --   if tLine is empty then
    --     put "Ate one return at line " & i & " of " & tNumLines & \
    --         " total lines."
    --     next repeat
    --   end if
    --
    -- Unless you really need to know how many spaces are removed,
    -- you can do that and this part too:
    --   repeat while char 1 of tLine = " "
    --     delete char 1 of tLine
    --     put "Ate one space at line " & i & " of " & tNumLines & \
    --         " total lines."
    --   end repeat
    --
    -- ...in just two lines:
    get word 1 to (the number of words of tLine) of tLine
    if it is empty then next repeat
    --
    -- Now we just copy the trimmed text to an output var:
    put it & cr after tOutputData
    --
  end repeat
  -- Show completed progress in case your data isn't evenly divisible
  -- by 20:
  set the thumbposition of scrollbar 1 to tNumLines
  --
  put tOutputData into fld "Processed_Data"
  --
  -- Display elapsed time:
  put the milliseconds - s
end mouseUp
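
The mod value mentioned earlier could be derived from the data size rather
than hard-coded. A minimal sketch, assuming roughly 100 visible increments
is enough for the progress bar; updateInterval and pSteps are hypothetical
names, not part of the handler above:

```livecode
-- Hypothetical helper: how many lines to process between redraws.
-- pNumLines is the total number of lines; pSteps is the assumed
-- number of useful scrollbar increments (e.g. 100).
function updateInterval pNumLines, pSteps
  -- "div" is integer division; max() keeps the result at 1 or more
  -- so that small inputs never yield "mod 0".
  return max(1, pNumLines div pSteps)
end function
```

In the handler you would call it once before the loop, for example
put updateInterval(tNumLines, 100) into tStep, and then test
i mod tStep = 0 in place of i mod 20 = 0. With 140,000 lines and 100
steps, that works out to one redraw every 1,400 lines.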