AW: Re: Regex help needed...

Richard Gaskin ambassador at fourthworld.com
Sat Jan 30 18:45:53 EST 2016


Regex is wonderfully compact to write relative to equivalent routines 
using chunk expressions, but that compactness is sometimes paid for in 
execution time.

When I come across a good regex example like the one you provided, if I 
have a moment I like to test things out to see where regex is faster and 
where it isn't.  It's really great for many things, but carries quite a 
bit of overhead.

Of course, this test is only relevant if most of the specifiers in the 
regex serve merely to identify the elements you're looking for, and if 
the data can be expected to fit the definition you provided.

Given that, it's possible to make the regex a bit simpler (see foo2 
below), but only with a modest boost to performance.  It can probably be 
simplified more, but the chunk-based alternative performed so well I 
didn't bother exploring the regex side any further.

Writing a lengthier handler that uses chunk expressions seems to yield 
the same results you reported, running between 12 and 60 times faster 
(depending on the percentage of lines tested that match the criteria 
being looked for).

For one-offs like validating email addresses, regex can be an excellent 
fit, and even for some larger tasks, depending on the specifics.
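
As a rough illustration of that kind of one-off check, here is a minimal 
sketch using LiveCode's matchText function.  The handler name 
looksLikeEmail and the pattern are my own assumptions for illustration, 
and the pattern is deliberately loose (something, an "@", then a domain 
containing a dot) rather than a full validation of address syntax:

```livecode
-- Hypothetical helper, not from the original thread.
function looksLikeEmail pAddress
   -- One or more chars that are neither "@" nor whitespace, then "@",
   -- then a domain part that contains at least one "."
   return matchText(pAddress, "^[^@\s]+@[^@\s]+\.[^@\s]+$")
end looksLikeEmail
```

A single call such as looksLikeEmail(fld "Email") runs the regex once, so 
the overhead discussed here hardly matters in that setting.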

But for iterating across lists I've often been delightfully surprised by 
LiveCode's gracefully efficient chunk handling.

Testing your original data replicated to become 250 lines long, and 
looking for page 1 among them, the script below yields:

Regex: 9261 ms
RegexLite: 7958 ms
Chunks: 197 ms
Chunks faster than orig regex by: 47.01 times
Chunks faster than lite regex by: 40.4 times
Same result? true


on mouseUp
   put fld 1 into tList
   put 1 into tPage --< change this for different tests
   put 1000 into n
   --
   -- Test 1: original regex
   put the millisecs into t
   repeat n
     put foo1(tPage, tList) into r1
   end repeat
   put the millisecs - t into t1
   --
   -- Test 2: lighter regex
   put the millisecs into t
   repeat n
     put foo2(tPage, tList) into r2
   end repeat
   put the millisecs - t into t2
   --
   -- Test 3: chunks
   put the millisecs into t
   repeat n
     put foo3(tPage, tList) into r3
   end repeat
   put the millisecs - t into t3
   --
   -- Display results:
   set the numberformat to "0.##"
   put "Regex: "&t1 &" ms"&cr \
         &"RegexLite: "&t2 &" ms"&cr \
         &"Chunks: "& t3 &" ms"&cr \
         &"Chunks faster than orig regex by: "&(t1 / t3)&" times" &cr \
         &"Chunks faster than lite regex by: "&(t2 / t3)&" times" &cr \
         &"Same result? "& (r1=r3) &cr&cr& r1 &cr&cr& r3
end mouseUp


function foo1 pPage, tList
   put "(.+\t"&pPage&",\d+,\d+,\d+)|(.+\t\d+,\d+,"&pPage&",\d+)|" \
         & "(.+\t"&pPage&",\d*\.?\d*,\d*\.?\d*,\d*\.?\d*,\d*\.?\d*)" \
         into tMatchPattern
   filter lines of tList with regex pattern tMatchPattern
   return tList
end foo1


function foo2 pPage, tList
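   -- Note: in regex ",*" means "zero or more commas" (it is not a
   -- wildcard), so this pattern accepts any field after the first
   -- that begins with pPage.  It is looser than the pattern in foo1.
   put "(.+\t"&pPage&",*)|(.+\t\d+,\d+,"&pPage&",*)|(.+\t"&pPage&",*)" \
         into tMatchPattern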
   filter lines of tList with regex pattern tMatchPattern
   return tList
end foo2



function foo3 pPage, tList
   repeat for each line tLine in tList
     set the itemdel to tab
     put item 3 of tLine into t1
     put pPage &"," into tPageMarker
     if "." is in t1 then
       if (t1 begins with tPageMarker) then
         put tLine &cr after tNuList
       end if
     else
       if (t1 begins with tPageMarker) OR \
             (item 4 of tLine begins with tPageMarker) then
         put tLine &cr after tNuList
       end if
     end if
   end repeat
   delete last char of tNuList
   return tNuList
end foo3
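
Note that foo3 tests the start of items 3 and 4 of each line, which is 
enough to reproduce the regex result on this test data.  For anyone who 
wants a chunk-based version that follows the stated rules more literally 
(item 1 or item 3 of <numberCol2> in the four-integer format, item 1 in 
the decimal format), here is a sketch.  The handler name filterByPage is 
hypothetical, and I have not timed this variant:

```livecode
-- Hypothetical helper (not from the original thread): filters pList
-- by testing only <numberCol2>, i.e. tab-delimited item 4 of each line.
function filterByPage pPage, pList
   local tResult, tCol2
   repeat for each line tLine in pList
      set the itemDelimiter to tab
      put item 4 of tLine into tCol2
      set the itemDelimiter to comma
      if "." is in tCol2 then
         -- Second format: <integer>,<decimal>,... so test item 1 only
         if item 1 of tCol2 is pPage then
            put tLine & cr after tResult
         end if
      else
         -- First format: four integers, so test item 1 or item 3
         if (item 1 of tCol2 is pPage) or (item 3 of tCol2 is pPage) then
            put tLine & cr after tResult
         end if
      end if
   end repeat
   delete last char of tResult -- drop the trailing return
   return tResult
end filterByPage
```

Because whole items are compared rather than prefixes, a page value of 5 
should not be confused with a value such as 5398 in this version.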










-- 
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  Ambassador at FourthWorld.com                http://www.FourthWorld.com


Paul Dupuis wrote:
> Never mind. Solved it.
>
> It was the pattern for the 2nd format. Fixed with
> "(.+\t"&pPage&",\d+,\d+,\d+)|(.+\t\d+,\d+,"&pPage&",\d+)|(.+\t"&pPage&",\d*\.?\d*,\d*\.?\d*,\d*\.?\d*,\d*\.?\d*)"
>
> On 1/30/2016 3:17 PM, Paul Dupuis wrote:
>> I need some regex help.
>>
>> I have a list that is of the form:
>> <number><tab><text><tab><numberCol1><tab><numberCol2>
>> i.e.
>> 1    Testing    1,747    1,1,1,747
>> 2    Testing    752,1800    1,752,1,1800
>> 3    Testing    5398,5846    2,320,2,768
>> 4    Testing    3,111.951,683.915,302.268,385.751    3,111.951,683.915,302.268,385.751
>>
>> <numberCol2> can have a list of numbers in 1 of 2 formats:
>> A comma separated list of 4 integers, i.e.
>> <integer1>,<integer2>,<integer3>,<integer4>
>> OR
>> A comma separated list of 1 integer, followed by 4 decimal numbers, i.e.
>> <integer>,<decimal>,<decimal>,<decimal>,<decimal>
>>
>> I need to filter the lines of this list with a REGEX pattern to get lines
>> WHERE a value pPage matches certain places in <numberCol2>, specifically:
>> where pPage is equal to either <integer1> or <integer3> in the first
>> format(i.e. item 1 or item 3)
>> OR
>> where pPage is equal to <integer> in the second format(i.e. item 1)
>>
>> So my code is:
>> put
>> "((.+\t"&pPage&",\d+,\d+,\d+)|(.+\t\d+,\d+,"&pPage&",\d+)|(.+\t"&pPage&",?[0-9]*\.?[0-9]+,?[0-9]*\.?[0-9]+,?[0-9]*\.?[0-9]+,?[0-9]*\.?[0-9]+))"
>> into tMatchPattern
>> filter lines of tList with regex pattern tMatchPattern
>>
>> If pPage is 1 then I should get:
>> 1    Testing    1,747    1,1,1,747
>> 2    Testing    752,1800    1,752,1,1800
>> and I do. If pPage is 2 then I should get:
>> 3    Testing    5398,5846    2,320,2,768
>> and I do. If pPage is 3 then I should get:
>> 4    Testing    3,111.951,683.915,302.268,385.751    3,111.951,683.915,302.268,385.751
>> and I do. If pPage is 4 then I should get an empty list, and I do, but
>> when pPage is 5, I am expecting an empty list and I get
>> 3    Testing    5398,5846    2,320,2,768
>>
>> So something is wrong with my Regex, but I can not figure out what? It
>> looks like it is matching against <numberCol1> in the last case
>> (pPage=5) but it should not since there are only 2 items in the list
>> rather than 4 or 5.
>>
>> I am using LiveCode 6.7.6
>>




