Optimize This!

Alex Tweedly alex at tweedly.net
Mon May 1 06:07:46 EDT 2006


Todd Geist wrote:

> Hello Everyone,
>
> I had the need to search large binary files for a string. The files  
> could be over a gigabyte in size, so I decided not to load the whole  
> file into ram but digest in chunks instead.  This is the routine I  
> came up with, it seems to work very quickly but I am wondering if  
> some of you might be able to speed it up.
>
I can see two issues - one of correctness and one of speed.

1. if the string spans the block boundary, this currently fails to find 
it; so you need to adjust the "n=0" case so that instead of doing
    put pStart + tSize into pStart
you do
   put pStart + tSize - the number of chars in pString into pStart

Note that you must then also change the overall exit test - after 
reading the next block, instead of
  IF it is "" THEN
    return 0
  END IF
you need to do the equivalent of
  IF the number of chars in it < the number of chars in pString THEN
    return 0
  END IF

The speed improvement is more complex:
2. If you find the string near the start of a block, then you adjust 
pStart - and then re-read much of the same block from the file.
It would be better to loop within a block once you've read it - and only 
go back to the file system for more data when you need it.

and one of personal preference
3.  I never use the "it" variable more than one line after when it is 
set, so the fact that you want to both search it and check for it being 
empty would be enough for me to assign it to a variable.  Which is OK, 
because in order to do item 2 we're going to need to do that anyway :-)


warning - code typed in and not tested (something I rarely do, but I 
still haven't had my first coffee, so I'd waste hours getting my testing 
right :-))

FUNCTION GetPositionInBinaryFile pPath, pString, pStart, pOccurrence
  put the number of chars in pString into tStringLength
  put 20000 into tSize
  IF tStringLength = 0 THEN
    return 0
  END IF
  IF tStringLength > tSize THEN
    return 0
  END IF
  open file pPath for binary read
  IF pStart = "" THEN
    PUT 1 into pStart
  END IF
  IF pOccurrence = "" THEN
    PUT 1 into pOccurrence
  END IF
  put 0 into tFoundCount
  REPEAT
    read from file pPath at pStart for tSize
    put it into tData
    IF the number of chars in tData < tStringLength THEN
      return 0
    END IF
    put 0 into tCharsToSkip
    REPEAT
      put offset(pString, pData, tCharsToSkip) into n
      IF n > 0 THEN
        add  1 to tFoundCount
        IF tFoundCount = pOccurrence THEN
          put pStart + tCharsToSkip + n into tPos
          Return tPos
        ELSE
          add n + tStringLength to tCharsToSkip   -- what about 
overlapping sub-strings ??!!
        END IF
      ELSE
        -- no more to be found in this block
        EXIT REPEAT
      END IF
    END REPEAT
    -- we get here when n = 0, i.e. no more occurrences in this block
    put pStart + tSize - tStringLength into pStart
  END REPEAT
  close file pPath
END GetPositionInBinaryFile

There is one more performance improvement possible - but so minor I 
wouldn't bother ....

The adjustment of pStart by "-tStringLength" means that we are 
re-reading part of the file each time - which causes minor inefficiency 
in itself, and also because it requires re-positioning the file read 
point. To avoid that we would need to do something like
     delete char 1 to tSize - tStringLength of tData
and then when we read the next block, do
    put it after tData
and adjust pStart accordingly.

It's kind of messy, so unless you often search for very long pString 
(say more than 16K bytes) I'd ignore that.



The other issue mentioned in the comments in my code is the issue of 
overlapping string occurrences.

If your pString is say "abab" and your data is "abababc" then you could say
 match at position 1, match at position 3
or simply
  match at position 1

The code above takes the second approach. If you wanted the former, then 
you'd instead want to do
    add n + 1 to tCharsToSkip


-- 
Alex Tweedly       http://www.tweedly.net



-- 
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.385 / Virus Database: 268.5.1/327 - Release Date: 28/04/2006




More information about the use-livecode mailing list