MatchText, MatchChunk and the needle in the haystack
Jim Ault
JimAultWins at yahoo.com
Tue Mar 20 11:29:48 EDT 2007
> Jim, Dave, Devin
>
> Thanks for your help in making me think harder about this. I literally
> woke up out of a dream this morning and knew right away what was wrong
> with the script. There was one error that would have persistently been a
> problem that I have fixed now.
Glad it worked out so well. Data mining is a tricky business, especially if
the originator allows delimiters to also be content (such as commas and
hyphens). The one change I would make in your routine is to use a tab
instead of a comma as the delimiter, since the comma is such a common
character, but that depends on your data set. I assume that you are not
encountering any commas in the data.
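A minimal sketch of that change, assuming each record is written as the date, then a tab, then the rest of the record. The function name splitRecord and its variable names are made up for illustration and are not part of the script quoted below:

```livecode
function splitRecord pRecord
   -- assumption: the record looks like  date <tab> rest-of-record
   -- the itemDelimiter is local to this handler, so nothing else is affected
   set the itemDelimiter to tab
   put item 1 of pRecord into tDate
   put item 2 to -1 of pRecord into tRest -- any commas in here are left alone
   return tDate & return & tRest
end splitRecord
```

On the writing side, the only difference in the mouseUp handler would be "put tab after char varRecord+6 of textBlock" in place of "put comma after char varRecord+6 of textBlock".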
I love the mornings when I wake up and realize the answer to a programming
puzzle. No matter what the weather, it is a sunny day for me :-)
Jim Ault
Las Vegas
On 3/20/07 3:12 AM, "Bryan McCormick" <bryan at deepfoo.com> wrote:
> Jim, Dave, Devin
>
> Thanks for your help in making me think harder about this. I literally
> woke up out of a dream this morning and knew right away what was wrong
> with the script. There was one error that would have persistently been a
> problem that I have fixed now.
>
> In the interests of anyone else who encounters a similar horrible string
> task, the solution is provided below.
>
> One more thing. You all get credit for making me think harder about what
> else was in the files that might have been a random char throwing things
> off.
>
> Now, I did go and change the script to make it simpler. I realized I
> only needed to find the hyphen at the start of the date and simply
> advance forward past the next hyphen in the date string. Since we were
> dealing with fixed length records forward from the first hyphen (three
> char month, hyphen, two char year) this was the simplest way.
>
> Genius? I thought so.
>
> As luck would have it, I had hit upon the few records that were problem
> children right off the bat.
>
> It turned out that a few of the records had the word "in-line" with a
> hyphen which threw off the whole thing. So there is a separate script
> when the file is read in that checks now for nulls, odd-ball ascii
> codes, and our friend "in-line". I was lucky in this case that the
> records were so simple. The alternative would have been to keep the
> "-Jan-...-Dec-" chunks and walk through the file 12 times. No big deal I
> suppose and it could always be done that way if one had different chunks
> to search for.
>
> Anyway, here is the finished script with comments. I hope it helps
> others who might have similar issues. I have over 5000 of these files to
> do which will now take about ten minutes versus the agony (and days) I'd
> have had to endure if there had been no community here to draw upon for
> help and if rev was not so darn handy.
>
> By the way the script that adds the return character also puts in a
> comma in the right place after the date so that I have another delimiter
> to work with and the record in the end is comma delimited with a return
> character as the record marker. Much better than the ugly long single
> string I started out with.
>
> Thanks All.
>
> ------------------------------------------
>
>
> on mouseUp
>    put fld 1 into textBlock
>    put makeOffsets("-",textBlock,1) into varOffsets
>    sort lines of varOffsets numeric descending
>    -- this is the only way it works as otherwise the char count gets thrown
>    -- off. essentially we are working up from the end of the string forward
>    repeat for each line varRecord in varOffsets
>       put char varRecord-2 to varRecord-1 of textBlock into eval
>       if char 1 of eval is a number and char 2 of eval is a number then
>          put comma after char varRecord+6 of textBlock
>          put cr before char varRecord-2 of textBlock
>       else
>          if char 1 of eval is not a number and char 2 of eval is a number then
>             put comma after char varRecord+6 of textBlock
>             put cr before char varRecord-1 of textBlock
>          end if
>       end if
>    end repeat
>    put textBlock into fld 1
> end mouseUp
>
> function makeOffsets varChunk,textBlock,posStart
>    if posStart = empty then
>       put 1 into pos
>    else
>       put posStart into pos
>    end if
>    repeat until varOffset = 0
>       put offset(varChunk, textBlock, pos) into varOffset
>       if varOffset <> 0 then
>          put varOffset+pos&return after newText
>          -- this was what was mucked-up in the original script
>          -- have to add the prior pos to the new one since we
>          -- are using the "skip chars" option, which makes the
>          -- returned offset relative to the skipped chars
>          add varOffset+length(varChunk)+6 to pos
>          -- i could get away with adding a fixed number in this
>          -- case since the date was never going to be shorter than
>          -- six chars + the found offset + chunk, ("-") in this case
>       else
>          exit repeat
>       end if
>    end repeat
>    return newText
> end makeOffsets
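
For anyone reusing the idea with a different needle, here is a minimal generic sketch of the same offset-with-skip loop. The name allOffsets and its variable names are hypothetical and not from the script above, and it leaves out the fixed +6 shortcut, so it reports every hyphen rather than only the first hyphen of each date. What it illustrates is that offset() counts from the end of the skipped characters, so the skip has to be added back to get an absolute position:

```livecode
function allOffsets pNeedle, pHaystack
   put 0 into tSkip -- number of chars to skip before searching
   put empty into tList
   repeat forever
      -- the result is relative to the chars skipped, not to char 1
      put offset(pNeedle, pHaystack, tSkip) into tFound
      if tFound = 0 then exit repeat
      add tFound to tSkip -- tSkip is now the absolute position of the match
      put tSkip & return after tList
   end repeat
   delete last char of tList -- drop the trailing return
   return tList
end allOffsets
```

With the string "ab-cd-ef", offset("-", "ab-cd-ef", 3) searches "cd-ef" and returns 3, so the absolute position is 3 + 3 = 6. Calling allOffsets("-", "ab-cd-ef") should therefore return 3 and 6 on separate lines.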