Algorithm time...

Glen Bojsza gbojsza at gmail.com
Sun Dec 9 03:33:40 EST 2012


Once again I hope the algorithm experts can speed up the following...
currently, on just 50,000 points it takes 10 seconds, as I indicate below.
I know why it is slow, which is why I am looking for better solutions.

Key points are:

- datasets are a minimum of 10,000 lines
- datasets are always a multiple of 1,000
- datasets have two columns of data which are separated by a tab

Goal:

generate a new dataset that ALWAYS reduces to 2,000 lines, based on the
max and min of groups of the original dataset

So in the case of the minimum dataset, you would divide 10,000 by 1,000,
which gives a group size of 10.

You then take the original dataset from start to finish in groups
of 10:
- find the min of the group of 10
- find the max of the group of 10
- put them in the order they hold in the original dataset (the min
value might be on line 8 and the max value on line 3, so in building the
new dataset you would put the max value first and then the min value...
this can change for each group of 10)
- column 1 is linear ascending so it can be used to sort the new dataset

caveats:

- column 1 is a needed value, but it is NOT the column whose max and min
are being examined... those are in column 2
- once the order of the two values is determined, their lines form the
new dataset

Example:

1    23
2    12
3    9
4    77
5    2
6    13
7    44
8    83
9    2
10    37

In this example the result would be as follows. (Note: if more than one
value ties for the min or max, the first value found is used.)

5    2
8    83
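
Expressed as code, the rule for a single group looks like this minimal
sketch (untested; the handler name reduceGroup is only illustrative):

function reduceGroup pGroup
   set the itemDelimiter to tab
   put empty into tMinLine
   put empty into tMaxLine
   repeat for each line tLine in pGroup
      put item 2 of tLine into tVal
      -- strict comparisons keep the first value found on ties
      if tMinLine is empty or tVal < item 2 of tMinLine then
         put tLine into tMinLine
      end if
      if tMaxLine is empty or tVal > item 2 of tMaxLine then
         put tLine into tMaxLine
      end if
   end repeat
   -- column 1 ascends, so it tells us which value came first
   if item 1 of tMinLine < item 1 of tMaxLine then
      return tMinLine & cr & tMaxLine
   end if
   return tMaxLine & cr & tMinLine
end reduceGroup

Fed the ten example lines, it returns exactly the two result lines above.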


What I have tried:

the original dataset is in a fld "mydata"; column 2 of fld "mydata" has
been copied into a fld "mydataNew"

The problem I am trying to solve is matching values sequentially, always
from the start of the data. To avoid rematching earlier values, I put an
"x" after each processed value in the column-2 field so that wholeMatches
keeps working.

This is where I am losing my time (I believe):

====================

on mouseUp
   put the seconds into startTime
   put the number of lines of fld "mydataNew" into tcount
   put fld "mydataNew" into gbmem
   put tcount / 1000 into tgroup -- lines per group
   set the wholeMatches to true
   set the itemDel to tab
   put 0 into x
   repeat for 1000 times
      put empty into tempMM
      put x + 1 into startline
      put startline + tgroup - 1 into endline -- exactly tgroup lines per group
      put endline into x
      put line startline to endline of gbmem into grptest
      replace cr with comma in grptest
      put min(grptest) into mymin
      put lineOffset(mymin, gbmem) into minLine
      put max(grptest) into mymax
      put lineOffset(mymax, gbmem) into maxLine
      put minLine & tab & mymin & cr after tempMM
      put maxLine & tab & mymax & cr after tempMM
      put tempMM & cr after finalMM
      -- mark the group's lines so wholeMatches cannot rematch them
      repeat with w = startline to endline
         put "x" after line w in gbmem
      end repeat
   end repeat
   sort finalMM numeric
   filter finalMM without empty
   put cr & line (fld "gbcurrent") of fld "mydata" after finalMM
   put finalMM into fld "finalPointsNew"
   put the seconds into endTime
   put endTime - startTime
end mouseUp

I believe this should be doable in less than 1 second, and in consistent
time, since the repeat loop always runs exactly 1,000 times; only the
group size changes.
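
Where I believe the time goes: lineOffset rescans gbmem from the top for
every min and max, and the "x" marking rewrites lines one at a time, so
the handler is roughly O(n^2) overall. A single pass with "repeat for
each line" avoids both. Below is an untested sketch of that idea,
assuming the two-column data is in fld "mydata" as in the example above
(the fld "gbcurrent" append is left out):

on mouseUp
   put the milliseconds into startTime
   put fld "mydata" into tData
   put the number of lines of tData into tcount
   put tcount div 1000 into tgroup -- lines per group
   set the itemDelimiter to tab
   put 0 into tPos
   put empty into finalMM
   repeat for each line tLine in tData
      add 1 to tPos
      put item 2 of tLine into tVal
      if tPos mod tgroup = 1 then
         -- first line of a new group: reset the running min and max
         put tVal into tMin
         put tVal into tMax
         put tLine into tMinLine
         put tLine into tMaxLine
         put tPos into tMinPos
         put tPos into tMaxPos
      else
         -- strict comparisons keep the first value found on ties
         if tVal < tMin then
            put tVal into tMin
            put tLine into tMinLine
            put tPos into tMinPos
         end if
         if tVal > tMax then
            put tVal into tMax
            put tLine into tMaxLine
            put tPos into tMaxPos
         end if
      end if
      if tPos mod tgroup = 0 then
         -- group complete: emit min and max in their original order
         if tMinPos < tMaxPos then
            put tMinLine & cr & tMaxLine & cr after finalMM
         else
            put tMaxLine & cr & tMinLine & cr after finalMM
         end if
      end if
   end repeat
   delete the last char of finalMM
   put finalMM into fld "finalPointsNew"
   put the milliseconds - startTime
end mouseUp

Because the groups are emitted in order and column 1 ascends, the output
is already sorted, so the sort and filter steps disappear as well.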

thanks,

Glen


