Filtering unicode text

Neville Smythe neville.smythe at optusnet.com.au
Sat Jul 27 20:31:42 EDT 2024


David Glasgow wrote

> I have an app I haven't touched for a while that makes heavy use of filter of string variables up to 1,000,000 lines (but often only hundreds to tens of thousands of lines). In my case finding all lines containing the to-be-found string is a benefit
> 
> I have long intended to see if I can speed things up a bit.  Should I go back and look at converting string lists to arrays, then using filter, and finally converting back to a variable?  I suppose I could do this contingent upon number of lines just in case time penalties and benefits are not linear? 

I can’t say how Mark's technique would scale up to millions or hundreds of thousands of lines, but certainly in my case of around 1500 lines I got a 10x speedup, from an unacceptable 20 minutes to 2 minutes, processing a text file which had suddenly acquired a single character requiring Unicode.

There is a caveat… I said line 1 of the keys is the first found line. That is not correct. Since arrays are stored in an internally determined way, the lines will be reported in an unpredictable order. So you may need to add the overhead of a sort. Sort is still fast even for Unicode text, though scaling to millions of lines…I don’t know; hopefully the number of found lines would be small, so that wouldn’t be a problem. Just don’t search for “the”.
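
For anyone wanting to try this, here is a rough sketch of the array approach with the sort added, as I understand the technique. It is not Mark's actual code: the function name linesContaining and its parameters pText and pFind are made up for illustration, and it assumes a LiveCode version whose filter command accepts "elements of" an array.

```livecode
-- Hypothetical helper: returns the lines of pText that contain pFind,
-- restored to their original order.
function linesContaining pText, pFind
   local tArray, tKeys, tKey, tResult
   put pText into tArray
   split tArray by return   -- tArray[1], tArray[2], ... now hold the lines
   -- keep only the elements whose value matches the wildcard pattern
   filter elements of tArray with ("*" & pFind & "*")
   -- the surviving keys are the original line numbers, in no guaranteed order
   put the keys of tArray into tKeys
   sort lines of tKeys ascending numeric
   repeat for each line tKey in tKeys
      put tArray[tKey] & return after tResult
   end repeat
   delete the last char of tResult   -- drop the trailing return
   return tResult
end linesContaining
```

The point of the design is that each found line is fetched with tArray[tKey], a direct array lookup, rather than "line k of tText", which has to count line endings from the start of the text each time. Note that pFind is used inside a wildcard pattern, so characters such as * ? and [ in the search string would be treated as wildcards.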

The take-away lesson is to avoid anything which involves repeatedly finding line-endings in Unicode text, even if implicitly. [A note to Mark W.: I still think the algorithm for “line k of tText” would be worth making more efficient - as I read your comments it uses a general case search processor for Unicode which has to take account of a large number of possible variants of representations of characters.]

If your text is plain ASCII or native (or if you could process an ASCII version of the text for finding strings) the benefits of converting to an array may be less striking. But since the speed-up comes from the random access to the found lines I wouldn’t be surprised to find an advantage even there. The implementation of arrays seems to be extraordinarily efficient. [An OT thought just struck me - is that why NoSQL databases as used in AWC work? I know nothing about NoSQL or AWC but if I go on board with Create I may have to learn.]

 
Neville Smythe







More information about the use-livecode mailing list