Making Revolution faster with really big arrays

Brian Yennie briany at qldlearning.com
Tue Apr 12 22:06:23 EDT 2005


Dennis,

I have to agree with Pierre here. If you are looking for random access 
to many thousands of records taking up gigabytes of memory, a database 
engine is, IMO, the only logical choice.

A simple MySQL/PostgreSQL/Valentina/etc database indexed by line number 
(or stock symbol) would be very fast.
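
For instance, something along these lines using Rev's database library
(untested; the database, table, and column names are made up):

## open a connection, pull one symbol's rows, then close up
## "stocks", "quotes", and "symbol" are placeholder names
put revOpenDatabase("mysql", "localhost", "stocks", "myUser", "myPass") into tConnID
put "SELECT * FROM quotes WHERE symbol = 'AAPL'" into tSQL
put revDataFromQuery(tab, return, tConnID, tSQL) into tRows
revCloseDatabase tConnID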

Without indexing your data or fitting all of it into random-access 
in-memory data structures, you're fighting a painful battle. If your 
algorithm's running time scales linearly with the data, it will simply 
run too slowly; and if the data itself keeps growing, you'll run out of 
memory. On the other 
hand, database engines can potentially handle _terabytes_ of data and 
give you random access in milliseconds. You simply won't beat that in 
Transcript.

One thing you could consider, if you don't want a whole database engine 
to deal with, is indexing the data yourself - which will give you some 
of the algorithmic benefits of a database engine. That is, make one pass 
that stores the start and end offsets of each line in an index, and then 
use those offsets to grab lines directly. Something like (untested):

## index the line starts and ends in one pass
put 1 into lineNumber
put 1 into charNum
put 1 into lineStarts[1]
repeat for each char c in tData
   if (c = return) then
      put (charNum - 1) into lineEnds[lineNumber]
      put (charNum + 1) into lineStarts[lineNumber + 1]
      add 1 to lineNumber
   end if
   add 1 to charNum
end repeat
## charNum is now one past the last char, so back up by one
## to close out a final line that has no trailing return
if (c <> return) then put (charNum - 1) into lineEnds[lineNumber]

## get line x via random char access
put char lineStarts[x] to lineEnds[x] of tData into lineX
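
Once the index is built, random access is cheap, so you can combine any 
lines you like in a single pass - say, a calculation that needs the 
current line and the next one (again untested):

## walk the data, pairing each line with the one after it
repeat with x = 1 to (the number of lines of tData) - 1
   put char lineStarts[x] to lineEnds[x] of tData into lineA
   put char lineStarts[x + 1] to lineEnds[x + 1] of tData into lineB
   ## replace this with whatever calculation needs both lines
   put lineA & comma & lineB & return after tMerged
end repeat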

- Brian

> Thanks Pierre,
>
> I considered that also.  A database application would certainly handle 
> the amount of data, but databases are really meant for finding and 
> sorting various fields, not for the kind of processing I am doing.  
> The disk access would slow down the process.
>
> Dennis
>
> On Apr 12, 2005, at 5:27 PM, Pierre Sahores wrote:
>
>> Welcome to the Revolution Dennis,
>>
>> Why not take advantage of Rev's native ability to manage the process 
>> by storing the data in an ACID-compliant database like PostgreSQL or 
>> OpenBase?  That's how I would handle such amounts of data myself.  
>> Transcript for the fine-grained, high-speed in-RAM calculations, and 
>> SQL for extracting the right preset subsets of data, could open up an 
>> interesting data-management approach for your process, in terms of 
>> both calculation speed and safety.
>>
>> Best,
>>
>> On Apr 12, 2005, at 10:36 PM, Dennis Brown wrote:
>>
>>> Hi all,
>>>
>>> I just joined this list.  What a great resource for sharing ideas 
>>> and getting help.
>>>
>>> I am actively writing a bunch of Transcript code to sequentially 
>>> process some very large arrays.  I had to figure out how to handle a 
>>> gig of data.  At first I tried to load the file data into a data 
>>> array[X,Y,Z] but it takes a while to load and processes for random 
>>> access and it takes a lot of extra space for the structure.  I also 
>>> could never get all the data loaded in without crashing Revolution 
>>> and my whole system (yes, I have plenty of extra RAM).
>>>
>>> The scheme I ended up with is based on the fact that the only fast 
>>> way I could find to process a large amount of data is with the 
>>> repeat for each control structure.  I broke my data into a bunch of 
>>> 10,000 line by 2500 item arrays.  Each one holds a single data item 
>>> (in this case it relates to stock market data).  That way I can 
>>> process a single data item in one sequential pass through the array 
>>> (usually building another array in the process).  I was impressed at 
>>> how fast it goes for these 40MB files.  However, this technique only 
>>> covers a subset of the operations I need to do.  The problem is 
>>> that repeat for each only lets you step through a single container 
>>> at a time.  In many cases, I need two or more data items available 
>>> for the calculations.  I have to pull a few rabbits out of my hat 
>>> and jump through a lot of hoops to do this and still go faster than 
>>> a snail.  That is a crying shame.  I believe (but 
>>> don't know for sure) that all the primitive operations are in the 
>>> runtime to make it possible to do this in a simple way if we could 
>>> just access them from the compiler. So I came up with an idea for a 
>>> proposed language extension.  I put the idea in Bugzilla yesterday, 
>>> then today, I thought I should ask others if they liked the idea, 
>>> had a better idea, or could help me work around not having this 
>>> feature in the mean time, since I doubt I would see it implemented 
>>> in my lifetime based on the speed I see things getting addressed in 
>>> the Bugzilla list.
>>>
>>> The idea is to break apart the essential functional elements of the 
>>> repeat for each control to allow more flexibility.  This sample has 
>>> a bit more refinement than what I posted yesterday in Bugzilla.
>>>
>>> The new keyword would be "access", but could be something else.
>>>
>>> An example of the new keyword's syntax would be:
>>>
>>> access each line X in arrayX -- initial setup of pointers and X value
>>> access each item Y in arrayY -- initial setup of pointers and Y value
>>> repeat for the number of lines of arrayX times -- same as repeat for each
>>>    put X & comma & Y & return after arrayXY -- merged array
>>>    next line X -- puts the next line value in X
>>>    next item Y -- if arrayY has fewer elements than arrayX, empty is
>>> supplied; could also put "End of String" in the result
>>> end repeat
>>>
>>> Another advantage of this syntax is that it allows more flexible 
>>> loop structures.  You could repeat forever, then exit repeat when 
>>> you run out of values (based on getting empty back).  The 
>>> possibilities for high-speed sequential-access data processing are 
>>> much expanded, which opens up more possibilities for Revolution.
>>>
>>> I would love to get your feedback or other ideas about solving this 
>>> problem.
>>>
>>> Dennis
>>>
>> -- 
>> Best regards, Pierre Sahores
>>
>> 100, rue de Paris
>> F - 77140 Nemours
>>
>> psahores+ at +easynet.fr
>> sc+ at +sahores-conseil.com
>>
>> GSM:   +33 6 03 95 77 70
>> Pro:      +33 1 64 45 05 33
>> Fax:      +33 1 64 45 05 33
>>
>> <http://www.sahores-conseil.com/>
>>
>> WEB/VoD/ACID-DB services over IP
>> "Mutualiser les deltas de productivité"
>>
>>
>
