Making Revolution faster with really big arrays
Frank D. Engel, Jr.
fde101 at fjrhome.net
Tue Apr 12 18:27:17 EDT 2005
That is only significant if Rev takes advantage of the 64-bit address
space, which I seriously doubt it does. Your Rev process will still be
limited to 2GB of address space, regardless of how much RAM is in the
system. Until they release a 64-bit version of Rev, of course...
If your task is that processor-intensive and your data set that large,
you should consider a lower-level language like Pascal or Ada. A
scripting language, no matter how fast it is, is not ideal for such
intensive operations on large data sets.
On Apr 12, 2005, at 5:30 PM, Dennis Brown wrote:
> Thanks Frank,
>
> Actually I have a 64-bit G5 machine with 3.5GB of RAM that can handle that much
> data, and I could add a couple more gigabytes if I needed to. It crashes
> before I even get 1GB loaded into RAM (I can monitor the number of free
> pages of RAM). I tried processing it the way you suggest. However, at
> the speed it was going, it would have taken 4 or 5 days to get the
> first pass of my data processed. That is because if you specify a
> line or item chunk in a big array, Rev counts separators from the
> beginning of the string to find the spot you want, every time, even if you just want
> the next line. That means that, on average, you end up doing thousands
> of times more work than a single-pass repeat for each would. The way I
> wrote it, it only required about two hours for the initial pass, and
> about two minutes for a single pass through one data item in the
> array. However, now I need to process more than one data item at a
> time, and that means I can use repeat for each on only one item and
> will have to use chunk expressions for the others. That will slow me
> back down to many days per pass, and I have hundreds of passes to do
> -- not very interactive! See you in a few years...
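>
> To illustrate the difference (the variable names here are just placeholders):
>
> -- slow: "line i of tData" rescans tData from the start on every access
> repeat with i = 1 to the number of lines of tData
>   put line i of tData into tRow
>   -- ...process tRow...
> end repeat
>
> -- fast: repeat for each walks tData once, remembering its place
> repeat for each line tRow in tData
>   -- ...process tRow...
> end repeat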
>
> Dennis
>
>
> On Apr 12, 2005, at 5:04 PM, Frank D. Engel, Jr. wrote:
>
>>
>> Rev's arrays are associative. When you use an array with an index like
>> [X, Y, Z], you are really saying: make a string whose contents are X,
>> Y, and Z separated by commas, then use that string as the index for the
>> array. These array indexes take up memory, along with your data. In
>> fact, depending on what type of data you are trying to process, they
>> likely take up more. Even without the overhead of the structures
>> used to represent the arrays, your array will likely take up well
>> over 2GB of RAM. On a 32-bit system, you are normally limited to
>> either 2GB or 3GB of memory per process (almost always 2GB, but some
>> Windows versions -- mostly server versions -- can be configured for
>> 3GB per process), so that array would take more memory than all of
>> your data PLUS Revolution PLUS your stack(s) PLUS some code used by
>> the runtime libraries from the OS ... you get the idea.
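>>
>> For instance (a minimal illustration; the variable name is made up):
>>
>> put "hello" into tArray[1,2,3]
>> put tArray["1,2,3"] -- the very same element: the key is just the string "1,2,3"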
>>
>> You'll never be able to fit that entire array into memory *as an
>> array* in Rev.
>>
>> Have you considered loading it into a single string and parsing the
>> data inline while managing it in your code?
>>
>> Try something like:
>>
>> put URL "file:/path/to/MyFile.txt" into x
>>
>> Then parse the data from x:
>>
>> put word 1 of item 2 of line 6 of x into y
>>
>> And so on...
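>>
>> Or, if you need to touch every line, a single pass with repeat for
>> each avoids re-scanning the string (just a sketch -- the tab
>> delimiter and field positions are assumptions about your data):
>>
>> put URL "file:/path/to/MyFile.txt" into x
>> set the itemDelimiter to tab -- assuming tab-separated fields
>> repeat for each line tLine in x
>>   put item 2 of tLine into y
>>   -- ...work with y here...
>> end repeat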
>>
>>
>> On Apr 12, 2005, at 4:36 PM, Dennis Brown wrote:
>>
>>> Hi all,
>>>
>>> I just joined this list. What a great resource for sharing ideas
>>> and getting help.
>>>
>>> I am actively writing a bunch of Transcript code to sequentially
>>> process some very large arrays. I had to figure out how to handle a
>>> gig of data. At first I tried to load the file data into a data
>>> array[X,Y,Z], but it takes a while to load and process for random
>>> access, and it takes a lot of extra space for the structure. I also
>>> could never get all the data loaded in without crashing Revolution
>>> and my whole system (yes, I have plenty of extra RAM).
>>>
>>> The scheme I ended up with is based on the fact that the only fast
>>> way I could find to process a large amount of data is with the
>>> repeat for each control structure. I broke my data into a bunch of
>>> 10,000-line by 2,500-item arrays. Each one holds a single data item
>>> (in this case it relates to stock market data). That way I can
>>> process a single data item in one sequential pass through the array
>>> (usually building another array in the process). I was impressed by
>>> how fast it goes for these 40MB files. However, this technique only
>>> covers a subset of the types of operations I need to do. The problem
>>> is that you can only specify a single item at a time to work with
>>> repeat for each. In many cases, I need to have two or more data
>>> items available for the calculations. I have to pull a few rabbits
>>> out of my hat and jump through a lot of hoops to do this and still
>>> go faster than a snail. That is a crying shame. I believe (but
>>> don't know for sure) that all the primitive operations are in the
>>> runtime to make it possible to do this in a simple way, if we could
>>> just access them from the compiler.
>>>
>>> So I came up with an idea for a proposed language extension. I put
>>> the idea in Bugzilla yesterday; today I thought I should ask others
>>> whether they like the idea, have a better one, or could help me
>>> work around not having this feature in the meantime, since I doubt
>>> I would see it implemented in my lifetime, judging by the speed at
>>> which things get addressed in the Bugzilla list.
>>>
>>> The idea is to break apart the essential functional elements of the
>>> repeat for each control structure to allow more flexibility. This
>>> sample has a bit more refinement than what I posted yesterday in
>>> Bugzilla.
>>>
>>> The new keyword would be "access", but it could be something else.
>>>
>>> An example of the new keyword's syntax would be:
>>>
>>> access each line X in arrayX --initial setup of pointers and X value
>>> access each item Y in arrayY --initial setup of pointers and Y value
>>> repeat for number of lines of arrayX times --same as a repeat for each
>>>   put X & comma & Y & return after arrayXY --merged array
>>>   next line X --puts the next line value in X
>>>   next item Y --if arrayY has fewer elements than arrayX, empty is supplied; could also put "End of String" in the result
>>> end repeat
>>>
>>> Another advantage of this syntax is that it provides more
>>> flexibility in the structure of loops. You could repeat forever,
>>> then exit repeat when you run out of values (based on getting empty
>>> back). The possibilities for high-speed sequential-access data
>>> processing are much expanded, which opens up more possibilities for
>>> Revolution.
>>>
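>>> In the meantime, one workaround (just a sketch -- it costs the time
>>> and memory of converting the second variable into a real array) is
>>> to split the second array into a numerically keyed array, so it can
>>> be indexed directly while the first one is walked with repeat for
>>> each:
>>>
>>> split arrayY by comma -- arrayY[1], arrayY[2], ... one element per item
>>> put 0 into tIndex
>>> repeat for each line X in arrayX
>>>   add 1 to tIndex
>>>   put X & comma & arrayY[tIndex] & return after arrayXY
>>> end repeat
>>>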
>>> I would love to get your feedback or other ideas about solving this
>>> problem.
>>>
>>> Dennis
>>>
-----------------------------------------------------------
Frank D. Engel, Jr. <fde101 at fjrhome.net>
$ ln -s /usr/share/kjvbible /usr/manual
$ true | cat /usr/manual | grep "John 3:16"
John 3:16 For God so loved the world, that he gave his only begotten
Son, that whosoever believeth in him should not perish, but have
everlasting life.
$