Making Revolution faster with really big arrays

Dennis Brown see3d at writeme.com
Tue Apr 12 17:30:01 EDT 2005


Thanks Frank,

Actually I have a 64-bit G5 machine with 3.5GB of RAM that can handle 
that much data, and I could add a couple more gig if I needed to.  It 
crashes before I get even 1GB into RAM (I can monitor the number of 
free pages of RAM).  I tried processing it the way you suggest.  
However, at the speed it was going, it would have taken 4 or 5 days to 
get the first pass of my data processed.  That is because if you 
specify a line or item chunk in a big array, Rev counts separators from 
the beginning to find the spot you want each time, even if you just 
want the next line.  That means that on average you scan the array 
thousands of times more than the single pass that repeat for each 
takes.  The way I wrote it, it only required about two hours for the 
initial pass, and about two minutes for single passes through one data 
item in the array.  However, now I need to process more than one data 
item at a time, and that means I can use repeat for each on only one 
item and will have to use chunk expressions for the others.  That will 
slow me back down to many days per pass, and I have hundreds of passes 
to do -- not very interactive!  See you in a few years...
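
To make the difference concrete, here is a rough sketch of the two loop 
styles.  The names tData, tLine, tTotal and i are made up for the 
example, and it assumes tData holds one number per line; it is not my 
actual code.

-- slow: each "line i of tData" counts separators from the start
put 0 into tTotal
repeat with i = 1 to the number of lines of tData
   add line i of tData to tTotal
end repeat

-- fast: repeat for each walks tData once, front to back
put 0 into tTotal
repeat for each line tLine in tData
   add tLine to tTotal
end repeat

Both loops end with the same total.  The work in the first grows 
roughly with the square of the number of lines, while the work in the 
second grows roughly with the number of lines.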

Dennis


On Apr 12, 2005, at 5:04 PM, Frank D. Engel, Jr. wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Rev's arrays are associative.  When using an array with an index like 
> [X, Y, Z], you are really saying, make a string whose contents are X, 
> Y, and Z separated by commas, then use that as the index for the 
> array.  These array indexes take up memory, along with your data.  In 
> fact, depending on what type of data you are trying to process, they 
> likely take up more.  Even without the overhead of the structures used 
> to represent the arrays, your array will likely take up well over 2GB 
> of RAM.  On a 32-bit system, you are normally limited to either 2GB or 
> 3GB of memory per process (almost always 2GB, but some Windows 
> versions -- mostly server versions -- can be configured for 3GB per 
> process), and that limit has to cover all of your data PLUS 
> Revolution PLUS your stack(s) PLUS some code used by the runtime 
> libraries from the OS ... you get the idea.
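>
> A small sketch of what the comma-separated index means in practice 
> (tArray, tKeys and tIndex are made-up names for illustration):
>
> put "some value" into tArray[1,2,3]
> put the keys of tArray into tKeys
> -- tKeys holds the single key "1,2,3", stored as a string
> put "1,2,3" into tIndex
> put tArray[tIndex] into y
> -- y holds "some value", the same element as tArray[1,2,3]
>
> Every element carries a key string like this in addition to its data.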
>
> You'll never be able to fit that entire array into memory *as an 
> array* in Rev.
>
> Have you considered loading it into a single string and parsing the 
> data inline while managing it in your code?
>
> Try something like:
>
> put URL "file:/path/to/MyFile.txt" into x
>
> Then parse the data from x:
>
> put word 1 of item 2 of line 6 of x into y
>
> And so on...
>
>
> On Apr 12, 2005, at 4:36 PM, Dennis Brown wrote:
>
>> Hi all,
>>
>> I just joined this list.  What a great resource for sharing ideas and 
>> getting help.
>>
>> I am actively writing a bunch of Transcript code to sequentially 
>> process some very large arrays.  I had to figure out how to handle a 
>> gig of data.  At first I tried to load the file data into a data 
>> array[X,Y,Z], but it takes a while to load and process for random 
>> access, and it takes a lot of extra space for the structure.  I also 
>> could never get all the data loaded in without crashing Revolution 
>> and my whole system (yes, I have plenty of extra RAM).
>>
>> The scheme I ended up with is based on the fact that the only fast 
>> way I could find to process a large amount of data is with the repeat 
>> for each control structure.  I broke my data into a bunch of 10,000 
>> line by 2500 item arrays.  Each one holds a single data item (in this 
>> case it relates to stock market data).  That way I can process a 
>> single data item in one sequential pass through the array (usually 
>> building another array in the process).  I was impressed at how fast 
>> it goes for these 40MB files.  However, this technique only covers a 
>> subset of the type of operations I need to do.  The problem is that 
>> you can only specify a single item at a time to work with the repeat 
>> for each.  In many cases, I need to have two or more data items 
>> available for the calculations.  I have to pull a few rabbits out of 
>> my hat and jump through a lot of hoops to do this and still go faster 
>> than a snail.  That is a crying shame.  I believe (but don't know for 
>> sure) that all the primitive operations are in the runtime to make it 
>> possible to do this in a simple way if we could just access them from 
>> the compiler. So I came up with an idea for a proposed language 
>> extension.  I put the idea in Bugzilla yesterday, then today, I 
>> thought I should ask others if they liked the idea, had a better 
>> idea, or could help me work around not having this feature in the 
>> meantime, since I doubt I would see it implemented in my lifetime 
>> based on the speed I see things getting addressed in the Bugzilla 
>> list.
>>
>> The idea is to break apart the essential functional elements of the 
>> repeat for each control to allow more flexibility.  This sample has a 
>> bit more refinement than what I posted yesterday in Bugzilla.
>>
>> The new keyword would be "access", but could be something else.
>>
>> An example of the use of the new keyword syntax would be:
>>
>> access each line X in arrayX --initial setup of pointers and X value
>> access each item Y in arrayY --initial setup of pointers and Y value
>> repeat for number of lines of arrayX times --same as a repeat for each
>>    put X & comma & Y & return after ArrayXY --merged array
>>    next line X --puts the next line value in X
>>    next item Y --if arrayY has fewer elements than arrayX, then empty
>>                --is supplied, could also put "End of String" in the result
>> end repeat
>>
>> Another advantage of this syntax is that it provides for more 
>> flexibility in structure of loops.  You could repeat forever, then 
>> exit repeat when you run out of values (based on getting an empty 
>> back).  The possibilities for high speed sequential access data 
>> processing are much expanded which opens up more possibilities for 
>> Revolution.
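>>
>> For comparison, here is a rough sketch of the same merge using only 
>> existing syntax.  It assumes arrayY is a comma-delimited list, and 
>> tY and i are names made up for the example.  It splits a copy of 
>> arrayY into a numerically keyed array so each element can be looked 
>> up directly, at the cost of extra memory for the array structure:
>>
>> put arrayY into tY --work on a copy so arrayY stays a list
>> split tY by comma --tY[1], tY[2], ... hold the items of arrayY
>> put 0 into i
>> repeat for each line X in arrayX
>>    add 1 to i
>>    put X & comma & tY[i] & return after ArrayXY --merged array
>> end repeat
>>
>> If arrayY has fewer elements than arrayX, tY[i] evaluates to empty, 
>> which matches the behavior proposed above.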
>>
>> I would love to get your feedback or other ideas about solving this 
>> problem.
>>
>> Dennis
>>
>> _______________________________________________
>> use-revolution mailing list
>> use-revolution at lists.runrev.com
>> http://lists.runrev.com/mailman/listinfo/use-revolution
>>
>>
> - -----------------------------------------------------------
> Frank D. Engel, Jr.  <fde101 at fjrhome.net>
>
> $ ln -s /usr/share/kjvbible /usr/manual
> $ true | cat /usr/manual | grep "John 3:16"
> John 3:16 For God so loved the world, that he gave his only begotten 
> Son, that whosoever believeth in him should not perish, but have 
> everlasting life.
> $
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.2.4 (Darwin)
>
> iD8DBQFCXDfm7aqtWrR9cZoRAnz6AKCMKYLJsg+P7IO3z+2MRHdEgTrjiQCeIS0s
> T8tEaGjSTychxi01VZJKQVw=
> =ltcj
> -----END PGP SIGNATURE-----
>
