Optimize This!
Alex Tweedly
alex at tweedly.net
Mon May 1 06:07:46 EDT 2006
Todd Geist wrote:
> Hello Everyone,
>
> I had the need to search large binary files for a string. The files
> could be over a gigabyte in size, so I decided not to load the whole
> file into ram but digest in chunks instead. This is the routine I
> came up with, it seems to work very quickly but I am wondering if
> some of you might be able to speed it up.
>
I can see two issues - one of correctness and one of speed.
1. if the string spans the block boundary, this currently fails to find
it; so you need to adjust the "n=0" case so that instead of doing
put pStart + tSize into pStart
you do
put pStart + tSize - the number of chars in pString into pStart
Note that you must then also change the overall exit test - after
reading the next block, instead of
IF it is "" THEN
return 0
END IF
you need to do the equivalent of
IF the number of chars in it < the number of chars in pString THEN
return 0
END IF
The speed improvement is more complex:
2. If you find the string near the start of a block, then you adjust
pStart - and then re-read much of the same block from the file.
It would be better to loop within a block once you've read it - and only
go back to the file system for more data when you need it.
and one of personal preference
3. I never use the "it" variable more than one line after when it is
set, so the fact that you want to both search it and check for it being
empty would be enough for me to assign it to a variable. Which is OK,
because in order to do item 2 we're going to need to do that anyway :-)
warning - code typed in and not tested (something I rarely do, but I
still haven't had my first coffee, so I'd waste hours getting my testing
right :-))
FUNCTION GetPositionInBinaryFile pPath, pString, pStart, pOccurrence
put the number of chars in pString into tStringLength
put 20000 into tSize
IF tStringLength = 0 THEN
return 0
END IF
IF tStringLength > tSize THEN
return 0
END IF
open file pPath for binary read
IF pStart = "" THEN
PUT 1 into pStart
END IF
IF pOccurrence = "" THEN
PUT 1 into pOccurrence
END IF
put 0 into tFoundCount
REPEAT
read from file pPath at pStart for tSize
put it into tData
IF the number of chars in tData < tStringLength THEN
return 0
END IF
put 0 into tCharsToSkip
REPEAT
put offset(pString, pData, tCharsToSkip) into n
IF n > 0 THEN
add 1 to tFoundCount
IF tFoundCount = pOccurrence THEN
put pStart + tCharsToSkip + n into tPos
Return tPos
ELSE
add n + tStringLength to tCharsToSkip -- what about
overlapping sub-strings ??!!
END IF
ELSE
-- no more to be found in this block
EXIT REPEAT
END IF
END REPEAT
-- we get here when n = 0, i.e. no more occurrences in this block
put pStart + tSize - tStringLength into pStart
END REPEAT
close file pPath
END GetPositionInBinaryFile
There is one more performance improvement possible - but so minor I
wouldn't bother ....
The adjustment of pStart by "-tStringLength" means that we are
re-reading part of the file each time - which causes minor inefficiency
in itself, and also because it requires re-positioning the file read
point. To avoid that we would need to do something like
delete char 1 to tSize - tStringLength of tData
and then when we read the next block, do
put it after tData
and adjust pStart accordingly.
It's kind of messy, so unless you often search for very long pString
(say more than 16K bytes) I'd ignore that.
The other issue mentioned in the comments in my code is the issue of
overlapping string occurrences.
If your pString is say "abab" and your data is "abababc" then you could say
match at position 1, match at position 3
or simply
match at position 1
The code above takes the second approach. If you wanted the former, then
you'd instead want to do
add n + 1 to tCharsToSkip
--
Alex Tweedly http://www.tweedly.net
--
No virus found in this outgoing message.
Checked by AVG Free Edition.
Version: 7.1.385 / Virus Database: 268.5.1/327 - Release Date: 28/04/2006
More information about the use-livecode
mailing list