Text manipulation - underlying data structures
Karl Petersen
karlpet at mac.com
Fri May 17 11:31:01 EDT 2002
At 1:46 PM -0700 5/16/02, Rob Cozens wrote:
>Karl Petersen wrote a HyperTalk script or external that indexed
>strings in some fashion. Perhaps he'll respond in more detail.
Rob probably refers to an external I wrote that uses an index file to
store the start-of-line positions of text that is stored in a second
data file. To retrieve line 44,222 of the data file, you can ask the
external to look it up and read that line of the data file.
The external makes it possible to quickly retrieve one line from a
very large file without reading the entire file into memory.
It's possible to do the same with a script that uses fixed-length
numbers to store the start-of-line positions in an index. If all
index numbers are the same length, a script can easily calculate the
location of any index member, read that number from the index file,
then use it to read one line from the data file.
The file-reading scripts all use this form of the read-file command:
read from file <somePathname> at <someNumber> until return
Based on index numbers 6 chars long, an index file might look
something like this:
000000
000200
000430
000710
001220
...
So to read line 3 from the data file, a script first calculates the
location of that index member in the index file, reads 000430 from
the index file, then reads the data file starting there.
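The lookup just described can be sketched in a few lines. This is a rough illustration in Python, not the external or a HyperTalk script. It assumes that each index entry is six digits followed by one return character, that the stored numbers are zero-based byte offsets into the data file, and that lines end in a linefeed. The names read_line, ENTRY_WIDTH and RECORD_SIZE are made up for the example.

```python
import io

ENTRY_WIDTH = 6                  # digits per index number, as in the sample index
RECORD_SIZE = ENTRY_WIDTH + 1    # assumed: each entry is followed by one return

def read_line(index_file, data_file, line_number):
    """Return line `line_number` (counting from 1) of the data file.

    Both arguments are files opened in binary mode.
    """
    # All entries are the same size, so the entry's position is plain arithmetic.
    index_file.seek((line_number - 1) * RECORD_SIZE)
    start = int(index_file.read(ENTRY_WIDTH))
    # Jump straight to the line; readline() plays the part of "until return".
    data_file.seek(start)
    return data_file.readline().rstrip(b"\n").decode("ascii")

# Tiny in-memory stand-ins for the two files.
data = io.BytesIO(b"alpha\nbeta\ngamma\n")
index = io.BytesIO(b"000000\n000006\n000011\n")
print(read_line(index, data, 3))
```

Only seven bytes of the index and one line of the data are read, no matter how large either file is.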
In Rev, one might store index numbers as binary data, allowing the
index file to be smaller. It might even allow the index to be stored
in memory. Rev's "repeat for each line L of <data>" is so fast it
should be possible to build an index quickly, a task normally
requiring an external. Reading the index/data files is fast and easy;
building the index is the slow part.
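As a rough sketch of the binary-index idea, here is one way it might look in Python rather than Rev. It assumes 4-byte big-endian unsigned offsets (so data files under 4 GB), linefeed line endings, and zero-based byte offsets. build_index, read_indexed_line and OFFSET are hypothetical names chosen for the example.

```python
import io
import struct

OFFSET = struct.Struct(">I")   # assumed format: one 4-byte unsigned offset per line

def build_index(data_file):
    """Scan the data file once from its start and return the packed offsets."""
    packed = bytearray()
    position = 0
    for line in data_file:         # a binary file yields one line per iteration
        packed += OFFSET.pack(position)
        position += len(line)      # the next line starts where this one ends
    return bytes(packed)

def read_indexed_line(index_bytes, data_file, line_number):
    """Return line `line_number` (counting from 1) using an in-memory index."""
    (start,) = OFFSET.unpack_from(index_bytes, (line_number - 1) * OFFSET.size)
    data_file.seek(start)
    return data_file.readline().rstrip(b"\n").decode("ascii")

data = io.BytesIO(b"alpha\nbeta\ngamma\n")
index = build_index(data)
print(len(index))                         # three entries of 4 bytes each
print(read_indexed_line(index, data, 3))
```

Each entry takes 4 bytes instead of the 7 used by a six-digit number plus its return, which is what makes keeping the whole index in memory more plausible.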
Karl