Text manipulation - underlying data structures

Karl Petersen karlpet at mac.com
Fri May 17 11:31:01 EDT 2002


At 1:46 PM -0700 5/16/02, Rob Cozens wrote:
>Karl Petersen wrote a HyperTalk script or external that indexed 
>strings in some fashion.  Perhaps he'll respond in more detail.

Rob probably refers to an external I wrote that uses an index file to 
store the start-of-line positions of text that is stored in a second 
data file. To retrieve line 44,222 of the data file, you can ask the 
external to look it up and read that line of the data file.

The external makes it possible to quickly retrieve one line from a 
very large file without reading the entire file into memory.

It's possible to do the same with a script that uses fixed-length 
numbers to store the start-of-line positions in an index. If all 
index numbers are the same length, a script can easily calculate the 
location of any index member, read that number from the index file, 
then use it to read one line from the data file.

The file-reading scripts all use this form of the read-file command:
   read from file <somePathname> at <someNumber> until return

Based on index numbers 6 chars long, an index file might look 
something like this:

000000
000200
000430
000710
001220
...

So to read line 3 from the data file, a script first calculates the 
location of that index member in the index file, reads 000430 from 
the index file, then reads the data file starting there.
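
For anyone who wants to see the arithmetic spelled out, here is a rough sketch of that lookup in Python rather than in HyperTalk or Rev script. It rests on assumptions the description above does not state: each 6-digit index entry is followed by a single newline (so every entry occupies 7 bytes), the stored numbers are 0-based byte offsets into the data file, and lines end in "\n". The name read_line and both constants are made up for the illustration.

```python
ENTRY_WIDTH = 6                  # digits per index entry, as in the example above
RECORD_SIZE = ENTRY_WIDTH + 1    # assumption: one "\n" follows each entry


def read_line(data_path, index_path, line_number):
    """Return line `line_number` (1-based) of the data file, without its newline.

    Only one index entry and one data line are read; neither file is
    loaded into memory as a whole.
    """
    with open(index_path, "rb") as index_file:
        # Entry n starts (n - 1) * RECORD_SIZE bytes into the index file.
        index_file.seek((line_number - 1) * RECORD_SIZE)
        start = int(index_file.read(ENTRY_WIDTH))

    with open(data_path, "rb") as data_file:
        # Jump straight to the start of the wanted line and read up to "\n".
        data_file.seek(start)
        return data_file.readline().rstrip(b"\n").decode("ascii")
```

With the sample index above, asking for line 3 seeks to byte 14 of the index file, reads 000430, and then reads the data file from byte 430 up to the next line ending.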

In Rev, one might store index numbers as binary data, allowing the 
index file to be smaller. It might even allow the index to be stored 
in memory. Rev's "repeat for each line L of <data>" is so fast it 
should be possible to build an index quickly, a task normally 
requiring an external. Reading the index/data files is fast and easy; 
building the index is the slow part.
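
As a rough illustration of the binary-index idea, again in Python rather than Rev script: the sketch below scans the data file once, writes each start-of-line offset as a 4-byte big-endian unsigned integer, and can later load the whole index into memory. The 4-byte layout, the "\n" line endings, and the names build_index, load_index and read_line_at are all assumptions made for the example, not anything taken from the external. Four bytes per entry caps the data file at 4 GB, but it is smaller than a 6-digit number plus a return.

```python
import struct

ENTRY = struct.Struct(">I")   # assumption: 4-byte big-endian unsigned offset


def build_index(data_path, index_path):
    """Scan the data file once, writing the byte offset of each line start.

    Returns the number of lines indexed.
    """
    offset = 0
    count = 0
    with open(data_path, "rb") as data_file, open(index_path, "wb") as index_file:
        for line in data_file:            # binary mode splits on b"\n"
            index_file.write(ENTRY.pack(offset))
            offset += len(line)           # len includes the line ending
            count += 1
    return count


def load_index(index_path):
    """Read the whole binary index into memory as a list of integer offsets."""
    with open(index_path, "rb") as index_file:
        raw = index_file.read()
    return [entry[0] for entry in ENTRY.iter_unpack(raw)]


def read_line_at(data_path, offsets, line_number):
    """Return line `line_number` (1-based) as bytes, using in-memory offsets."""
    with open(data_path, "rb") as data_file:
        data_file.seek(offsets[line_number - 1])
        return data_file.readline().rstrip(b"\n")
```

Once load_index has run, finding any line costs one list lookup and one seek, so the only expensive step left is the single pass in build_index.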

Karl


