AW: Re: Regex help needed...

Paul Dupuis paul at researchware.com
Sat Jan 30 19:28:22 EST 2016


Wow. I would not have expected such a significant difference. Regex has
been around a long time, and lots of smart computer science types have
spent time coming up with ways to optimize its performance for pattern
matching. I assumed (falsely) that regex-based filters in LC would be on
par with, or even superior to, a custom function using chunks. This leads me to:

1) wondering if LC's hooks to whatever regex tool they are using under
the hood are as good as they should be
AND
2) planning on rewriting my code to use chunks.

Thanks for the post.


On 1/30/2016 6:45 PM, Richard Gaskin wrote:
> Regex is wonderfully compact to write relative to equivalent routines
> using chunk expressions, but that compactness is sometimes paid for in
> execution time.
>
> When I come across a good regex example like the one you provided, if
> I have a moment I like to test things out to see where regex is faster
> and where it isn't.  It's really great for many things, but carries
> quite a bit of overhead.
>
> Of course for this test to be relevant it assumes that most of the
> specifiers in the regex expression are merely to identify the elements
> you're looking for, and that the data is expected to fit the
> definition you provided.
>
> Given that, it's possible to make the regex a bit simpler (see foo2
> below), but only with a modest boost to performance.  It can probably
> be simplified more, but the chunk-based alternative performed so well
> I didn't bother exploring the regex side any further.
>
> Writing a lengthier handler that uses chunk expressions seems to yield
> the same results you reported, running between 12 and 60 times faster
> (depending on the percentage of lines tested that match the criteria
> being looked for).
>
> For one-offs like validating email addresses regex can be an excellent
> fit, and even some larger tasks depending on the specifics.
>
> But for iterating across lists I've often been delightfully surprised
> by LiveCode's gracefully efficient chunk handling.
>
> Testing your original data replicated to become 250 lines long, and
> looking for page 1 among them, the script below yields:
>
> Regex: 9261 ms
> RegexLite: 7958 ms
> Chunks: 197 ms
> Chunks faster than orig regex by: 47.01 times
> Chunks faster than lite regex by: 40.4 times
> Same result? true
>
>
> on mouseUp
>   put fld 1 into tList
>   put 1 into tPage --< change this for different tests
>   put 1000 into n
>   --
>   -- Test 1: original regex
>   put the millisecs into t
>   repeat n
>     put foo1(tPage, tList) into r1
>   end repeat
>   put the millisecs - t into t1
>   --
>   -- Test 2: lighter regex
>   put the millisecs into t
>   repeat n
>     put foo2(tPage, tList) into r2
>   end repeat
>   put the millisecs - t into t2
>   --
>   -- Test 3: chunks
>   put the millisecs into t
>   repeat n
>     put foo3(tPage, tList) into r3
>   end repeat
>   put the millisecs - t into t3
>   --
>   -- Display results:
>   set the numberformat to "0.##"
>   put "Regex: "&t1 &" ms"&cr \
>         &"RegexLite: "&t2 &" ms"&cr \
>         &"Chunks: "& t3 &" ms"&cr \
>         &"Chunks faster than orig regex by: "&(t1 / t3)&" times" &cr \
>         &"Chunks faster than lite regex by: "&(t2 / t3)&" times" &cr \
>         &"Same result? "& (r1=r3) &cr&cr& r1 &cr&cr& r3
> end mouseUp
>
>
> function foo1 pPage, tList
>   put "(.+\t"&pPage&",\d+,\d+,\d+)|(.+\t\d+,\d+,"&pPage&",\d+)|" \
>         &"(.+\t"&pPage&",\d*\.?\d*,\d*\.?\d*,\d*\.?\d*,\d*\.?\d*)" \
>         into tMatchPattern
>   filter lines of tList with regex pattern tMatchPattern
>   return tList
> end foo1
>
>
> function foo2 pPage, tList
>   put "(.+\t"&pPage&",*)|(.+\t\d+,\d+,"&pPage&",*)|(.+\t"&pPage&",*)" \
>         into tMatchPattern
>   filter lines of tList with regex pattern tMatchPattern
>   return tList
> end foo2
>
>
>
> function foo3 pPage, tList
>   repeat for each line tLine in tList
>     set the itemdel to tab
>     put item 3 of tLine into t1
>     put pPage &"," into tPageMarker
>     if "." is in t1 then
>       if (t1 begins with tPageMarker) then
>         put tLine &cr after tNuList
>       end if
>     else
>       if (t1 begins with tPageMarker) OR \
>             (item 4 of tLine begins with tPageMarker) then
>         put tLine &cr after tNuList
>       end if
>     end if
>   end repeat
>   delete last char of tNuList
>   return tNuList
> end foo3
>
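
A possible further tweak to the chunk-based handler quoted above, offered
as an untested sketch rather than a measured result: the item delimiter
and the page marker do not change from line to line, so they can be set
once before the loop. The handler name foo3Hoisted and the variable
tField are made up for this illustration; the matching logic is meant to
be the same as in foo3.

```livecode
-- Hypothetical variant of foo3 (name and tField are illustrative only).
-- Assumes the same data layout as foo3: tab-delimited lines whose
-- items 3 and 4 hold comma-separated values starting with a page number.
function foo3Hoisted pPage, pList
  -- Loop-invariant setup, done once instead of once per line
  set the itemDelimiter to tab
  put pPage & "," into tPageMarker
  repeat for each line tLine in pList
    put item 3 of tLine into tField
    if tField begins with tPageMarker then
      -- Item 3 starts with the page: a match in both of foo3's branches
      put tLine & cr after tNuList
    else if ("." is not in tField) and \
          (item 4 of tLine begins with tPageMarker) then
      -- Item 4 is only consulted when item 3 has no decimal point
      put tLine & cr after tNuList
    end if
  end repeat
  -- Drop the trailing return added after the last match
  delete last char of tNuList
  return tNuList
end foo3Hoisted
```

To check it, one could add a fourth timing block to the mouseUp handler
that calls foo3Hoisted(tPage, tList) and compare its result with r3.
Whether it is measurably faster than foo3 is not something the figures
above can tell us.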