Search / replace

Kay C Lan lan.kc.macmail at gmail.com
Tue Feb 23 12:03:47 EST 2010


Folks,

because I was the first one to suggest the use of offset() I feel it's
my duty to advise all that although it was an answer to the question,
it isn't a suitable solution to the problem.

I'm a little surprised that Richard The Benchmark Obsessive hasn't
already jumped all over this but as my plane home is delayed out of
New York I've had a bit of time to run some tests and here are the
results.

Firstly I created a script that would write a set number of lines of
data. A line of data randomly consists of 1 to 10 'sentences'. Each
sentence randomly consists of 10 to 20 'words'. Each word randomly
consists of 3 to 8 chars. Each char is a random value from 33 to 122
converted with numToChar() - therefore the [ char also appears
randomly in the data.
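
For anyone who wants to reproduce the data outside Rev, here is a rough Python sketch of the generator described above. It is not the original script: the function names, the seed parameter, and joining words and sentences with a single space are all assumptions, and chr() stands in for numToChar().

```python
import random

def make_word(rng):
    # 3 to 8 chars, each a code from 33 to 122 inclusive (so [ can appear)
    return "".join(chr(rng.randint(33, 122)) for _ in range(rng.randint(3, 8)))

def make_sentence(rng):
    # 10 to 20 words; joining with one space is an assumption
    return " ".join(make_word(rng) for _ in range(rng.randint(10, 20)))

def make_line(rng):
    # 1 to 10 sentences per line
    return " ".join(make_sentence(rng) for _ in range(rng.randint(1, 10)))

def make_data(line_count, seed=0):
    # A fixed seed (hypothetical parameter) makes a run repeatable
    rng = random.Random(seed)
    return "\n".join(make_line(rng) for _ in range(line_count))
```

This sketch only builds the random words; it does not insert the n[ markers.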

Each line randomly either has no 'n[' (n = 1-9) in it, or has n[ at
the beginning, and/or in the content, and/or at the end. Each n[
randomly has a space before and/or after it. The sample single line
below contains ten n[, eight of which have spaces on both sides and two
of which are immediately followed by a char. It happens that it doesn't
contain an example of an n[ immediately following a char. There are
also eleven [ in the line that do not have a digit before them.
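
The two kinds of [ described above can be counted mechanically. A small Python sketch, assuming 'n[' means a digit 1-9 directly followed by [ and a 'bare' [ means one with no digit directly before it (the names NUMBERED, BARE and count_brackets are made up for this illustration):

```python
import re

NUMBERED = re.compile(r"[1-9]\[")    # a digit 1-9 immediately followed by [
BARE = re.compile(r"(?<![0-9])\[")   # a [ with no digit directly before it

def count_brackets(line):
    # Returns (count of n[, count of bare [) for one line of data
    return len(NUMBERED.findall(line)), len(BARE.findall(line))
```

Note that under these assumptions a 0[ falls into neither group.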

Here is a sample line:

ILld7l~ 4M*g$n 0DTg8 uQz]Y7 /n4\) qA$kU?` _Y~J 70ep{gM& w")jS [Wc3+1.h
jmhym3qt -4u66S vy&t6M R<Lp!Q{f 2[,>unyj4, ZSVjE3I! (|.4AIM# 4IQqp
:1P1cn} -M/]f Qfb ZZIJ}Q .gkP~5S| "rQ!#J6M @]n='u=j YDp) "_tN =!YP
k?;? )AT0g<7 YDvb]9 7[AK:@D _sLg$E4 XODJ>qe~ Qxs/V D7z-Af Un!, JPfXt<
Rxl}-6N o};WIV S%:p4 2[ +<uV Qg25nw Bwr|Lp` eApl [r]h{|y, *hBZpK
V~e*G[ 4y@.W e*R" ;}(2>Q!@ &}H ^v(Jd77/  2[ f*O/o5 V_wl_ 15_ {-MnS{b
90vL6 QnU/D /NgC_~B) ,Dy59_O# PBFC7( ks6W@D8_ 0Za L?aIyNq ^8rD3 t;bM[c
~K{Io8 7[ $n:!Em<: */v-H 28-p4 CWef{. wsKZ, YS'=0+{ M;J^ A\u"W qaBk9~H
NIlgjs(& v=% FF`9 ['n{GT* Iwetn ohAqLg k4r[/pR 2$e*u(B+ qQBL }`Y:/eD}
2[ {iJC'a ~/f` WF_'q,m V:#o *gWn|Wlk x^pV MVg+e? ib\v@6@ 3j~ E[bGn>+A
^3?U"} jE`u fZ:}/h=  9[ <):Jb% vnQo%\N M2PCx.3 =o+gY.ST co>>"&
[K.b-DSU +}IV_ LixZ KF}DX Dqo d,j_ CLMy vytAhx9O ,oV 5[ \RzCJ G&G
Lc999 -wh1{ e~U~|H hKh LWV@_tfr o%6 %S`W PKL^y .?=?9Pf\ 5[ H@e cH{\=r
$"`KEy >={aB4 *iY -0-] {[e 0(s77^ :x;VI IY~m~~J 4[ _o%NQ ~UG w`@!
a^)WxL 2hKw#X 0w[u6 8;+U wl5K[N7) X2X#j<@ {-a0{eg

After creating the data I put it into a customProp. From there I
placed it into a local variable, from which I ran the scripts listed
in this thread - referencing the variable, not a field. The result I
placed into another customProp.
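
For reference, the shape of the timing harness is roughly as follows. This is a Python sketch rather than the Rev script that was actually run; time_runs and summarize are hypothetical names, and rounding to whole milliseconds is an assumption.

```python
import time

def time_runs(func, data, runs=5):
    """Run func(data) `runs` times; return elapsed whole milliseconds per run."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func(data)  # work on the in-memory variable, not on a field
        times.append(round((time.perf_counter() - start) * 1000))
    return times

def summarize(times, label):
    """Format one result line: the runs, then Min, Max, Avg and the label."""
    avg = round(sum(times) / len(times))
    runs = "".join(f"{t}," for t in times)
    return f"{runs} Min: {min(times)} Max: {max(times)} Avg: {avg} - {label}"
```

Each candidate script would be passed in as func, for example summarize(time_runs(some_test, data), "some Test").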

Here are the results - all times in millisecs.

The number of Lines: 100
The number of Words: 6377
The number of Chars: 45908
96,96,96,96,97, Min: 96 Max: 97 Avg: 96 - Jim Bufalini's offset Test
31,30,30,31,30, Min: 30 Max: 31 Avg: 30 - Igor de Oliveira Couto's repeat Test
2,2,2,2,2, Min: 2 Max: 2 Avg: 2 - Jim Ault's lineDel Test
12,11,11,11,11, Min: 11 Max: 12 Avg: 11 - Mike Bonner's regex Test
4,4,3,4,3, Min: 3 Max: 4 Avg: 4 - Kee Nethery's repeat Test

The number of Lines: 1000
The number of Words: 64241
The number of Chars: 466296
9613,9610,9597,9596,9608, Min: 9596 Max: 9613 Avg: 9608 - Jim Bufalini's offset Test
833,831,834,832,832, Min: 831 Max: 834 Avg: 832 - Igor de Oliveira Couto's repeat Test
17,18,18,18,18, Min: 17 Max: 18 Avg: 18 - Jim Ault's lineDel Test
85,85,87,89,86, Min: 85 Max: 89 Avg: 86 - Mike Bonner's regex Test
28,27,28,28,28, Min: 27 Max: 28 Avg: 28 - Kee Nethery's repeat Test

The number of Lines: 10000
The number of Words: 626405
The number of Chars: 4585675
--didn't even bother with offset
--only did the test once
426252, Igor de Oliveira Couto's repeat Test
238, Jim Ault's lineDel Test
848, Mike Bonner's regex Test
292, Kee Nethery's repeat Test

As can be clearly seen, the offset() version does not scale well at
all: ten times the lines took about a hundred times as long (96 ms to
9608 ms). The repeat version offered by Igor is not much better once
you start to work with large amounts of data (832 ms at 1000 lines,
426252 ms at 10000). Clearly Jim A's solution scales the best, with
Kee's not far behind.
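
To put a number on 'scales', one rough measure is the log10 of the time ratio for each tenfold step in line count: about 1 means linear growth, about 2 means quadratic. A Python sketch using the figures above (the 10000-line figures are single runs, and the short labels and the growth_exponents name are made up here):

```python
import math

# Times in ms at 100, 1000 and 10000 lines, taken from the results above.
# The offset test has no 10000-line figure.
TIMES = {
    "offset (Jim B)": [96, 9608],
    "repeat (Igor)": [30, 832, 426252],
    "lineDel (Jim A)": [2, 18, 238],
    "regex (Mike)": [11, 86, 848],
    "repeat (Kee)": [4, 28, 292],
}

def growth_exponents(times):
    """For each 10x step in line count, return log10 of the time ratio."""
    return [math.log10(b / a) for a, b in zip(times, times[1:])]

for name, times in TIMES.items():
    print(name, [round(e, 2) for e in growth_exponents(times)])
```

The very small timings (2 ms, 4 ms) are near the clock resolution, so the first exponent for the fastest scripts should be read loosely.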

Unfortunately the story doesn't end there. I compared the output of
only 10 lines, using BBEdit's Find Differences... and this is what I
got.

Jim B = Igor
Jim B <> Jim A (2 lines)
Jim B <> Mike (1 line)
Jim B <> Kee (1 line)
Igor <> Jim A (2 lines) --
Igor <> Mike (1 line) --
Igor <> Kee (1 line) -- all makes sense as it's the same as Jim B's
Jim A <> Mike (1 line)
Jim A <> Kee (1 line)
Mike <> Kee (1 line)
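
A pairwise comparison like the one above can also be scripted. The Python sketch below is a simpler stand-in for BBEdit's Find Differences: it compares outputs position by position, which is an assumption about how the differences should be counted, and the function names are hypothetical.

```python
from itertools import combinations

def differing_lines(a, b):
    """Count line positions where two outputs differ (extra lines count too)."""
    la, lb = a.split("\n"), b.split("\n")
    mismatched = sum(x != y for x, y in zip(la, lb))
    return mismatched + abs(len(la) - len(lb))

def compare_all(outputs):
    """outputs maps a label to that script's result text."""
    report = []
    for (na, a), (nb, b) in combinations(outputs.items(), 2):
        n = differing_lines(a, b)
        if n == 0:
            report.append(f"{na} = {nb}")
        else:
            suffix = "line" if n == 1 else "lines"
            report.append(f"{na} <> {nb} ({n} {suffix})")
    return report
```

Feeding it a dictionary of the five result strings gives all ten pairings in one go.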

I don't have the time to find exactly where the errors are, but I think
one of the problems with Jim A's solution is the use of the Rev term
'word' in his script, as clearly my words don't match what Rev thinks a
word is. Jim A's might work in a real world case though.

I quickly note that everyone removed the random [ (those without a
preceding number) except for Kee, whose script left them all in place.

So until further refinements are made, maybe Mike is the current
solution winner.

Gotta go, they're calling my flight… at last :-|
