MatchText, MatchChunk and the needle in the haystack

Jim Ault JimAultWins at yahoo.com
Tue Mar 20 11:29:48 EDT 2007


> Jim, Dave, Devin
> 
> Thanks for your help in making me think harder about this. I literally
> woke up out of a dream this morning and knew right away what was wrong
> with the script. There was one error that would have persistently been a
> problem that I have fixed now.

Glad it worked out so well.  Data mining is a tricky business, especially if
the originator allows delimiters to also be content (such as commas and
hyphens).  The one change I would make in your routine is to use a tab
instead of a comma as the delimiter, since the comma is a common character
in data, but that depends on your data set.  I assume that you are not
encountering any commas in the data.
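
A minimal sketch of the tab variant, assuming the script's two
"put comma after char varRecord+6 of textBlock" lines are changed to
"put tab after ...". The function name listDates and its parameter are
hypothetical, used only to show reading the records back:

```livecode
-- Sketch only: returns the date field of each record, one per line.
-- Assumes each record is "date <tab> rest of record".
function listDates pText
   set the itemDelimiter to tab
   repeat for each line tRecord in pText
      if tRecord is empty then next repeat
      -- item 1 is the date; item 2 to -1 is the remainder,
      -- which can now contain commas without splitting the record
      put item 1 of tRecord & return after tDates
   end repeat
   delete last char of tDates
   return tDates
end listDates
```

Because the itemDelimiter is set inside the handler, the comma default is
untouched everywhere else.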

I love the mornings when I wake up and realize the answer to a programming
puzzle.  No matter what the weather, it is a sunny day for me :-)

Jim Ault
Las Vegas


On 3/20/07 3:12 AM, "Bryan McCormick" <bryan at deepfoo.com> wrote:

> Jim, Dave, Devin
> 
> Thanks for your help in making me think harder about this. I literally
> woke up out of a dream this morning and knew right away what was wrong
> with the script. There was one error that would have persistently been a
> problem that I have fixed now.
> 
> In the interests of anyone else who encounters a similar horrible string
> task, the solution is provided below.
> 
> One more thing. You all get credit for making me think harder about what
> else was in the files that might have been a random char throwing things
> off.
> 
> Now, I did go and change the script to make it simpler. I realized I
> only needed to find the hyphen at the start of the date and simply
> advance forward past the next hyphen in the date string. Since we were
> dealing with fixed length records forward from the first hyphen (three
> char month, hyphen, two char year) this was the simplest way.
> 
> Genius? I thought so.
> 
> As luck would have it, I had hit upon the few records that were problem
> children right off the bat.
> 
> It turned out that a few of the records had the word "in-line" with a
> hyphen which threw off the whole thing. So there is a separate script
> when the file is read in that checks now for nulls, odd-ball ascii
> codes, and our friend "in-line". I was lucky in this case that the
> records were so simple. The alternative would have been to keep the
> "-Jan-...-Dec-" chunks and walk through the file 12 times. No big deal I
> suppose and it could always be done that way if one had different chunks
> to search for.
> 
> Anyway, here is the finished script with comments. I hope it helps
> others who might have similar issues. I have over 5000 of these files to
> do which will now take about ten minutes versus the agony (and days) I'd
> have had to endure if there had been no community here to draw upon for
> help and if rev was not so darn handy.
> 
> By the way the script that adds the return character also puts in a
> comma in the right place after the date so that I have another delimiter
> to work with and the record in the end is comma delimited with a return
> character as the record marker. Much better than the ugly long single
> string I started out with.
> 
> Thanks All.
> 
> ------------------------------------------
> 
> 
> on mouseUp
>    put fld 1 into textBlock
>    put makeOffsets("-",textBlock,1) into varOffsets
>    sort lines of varOffsets numeric descending
>    -- this is the only way it works as otherwise the char count gets thrown
>    -- off. essentially we are working up from the end of the string forward
>    repeat for each line varRecord in varOffsets
>      put char varRecord-2 to varRecord-1 of textBlock into eval
>      if char 1 of eval is a number and char 2 of eval is a number  then
>        put comma after char varRecord+6 of textBlock
>        put cr  before char varRecord-2 of textBlock
>      else
>        if char 1 of eval is not a number and char 2 of eval is a number then
>          put comma after char varRecord+6 of textBlock
>          put cr before char varRecord-1 of textBlock
>        end if
>      end if
>    end repeat
>    put textBlock into fld 1
> end mouseUp
> 
> function makeOffsets varChunk,textBlock,posStart
>    if posStart = empty then
>      put 1 into pos
>    else
>      put posStart into pos
>    end if
>    repeat until varOffset = 0
>      put offset(varChunk, textBlock, pos) into varOffset
>      if varOffset <>0 then
>        put varOffset+pos&return after newText
>        -- this was what was mucked-up in the original script:
>        -- offset() with the "skip chars" option returns a position
>        -- relative to the skipped chars, so the prior pos has to be
>        -- added to get the position in the whole string
>        add varOffset+length(varChunk)+6 to pos
>        -- i could get away with adding a fixed number in this
>        -- case since the date was never going to be shorter than
>        -- six chars + the found offset + chunk, ("-") in this case
>      else
>        exit repeat
>      end if
>    end repeat
>    return newText
> end makeOffsets
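
The "skip chars" arithmetic that makeOffsets relies on can be shown in
isolation. This is a hedged sketch, not Bryan's routine: the name allOffsets
and its variables are hypothetical, and it advances only past each match
instead of jumping a fixed date length:

```livecode
-- Sketch only: list the absolute position of every match, one per line.
function allOffsets pNeedle, pHaystack
   put 0 into tSkip
   repeat forever
      -- offset() counts from the char after the skipped ones
      put offset(pNeedle, pHaystack, tSkip) into tFound
      if tFound = 0 then exit repeat
      -- skipped chars + relative offset = absolute position
      add tFound to tSkip
      put tSkip & return after tList
   end repeat
   delete last char of tList
   return tList
end allOffsets
```

For example, allOffsets("-", "a-b-c") gives 2 and 4 on separate lines.
Starting the skip count at 0 means a match in the very first character is
not missed.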

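The separate clean-up pass mentioned in the message is not shown there. A
rough sketch of what such a pass could look like follows. The name
cleanBlock, the choice of which control characters count as "odd-ball", and
the replacement of "in-line" with "inline" are all assumptions:

```livecode
-- Sketch only: remove characters that would confuse the hyphen search.
function cleanBlock pText
   -- drop nulls
   replace numToChar(0) with empty in pText
   -- assumed fix: take the hyphen out of the one known non-date word
   replace "in-line" with "inline" in pText
   -- strip other control chars, keeping tab (9), LF (10) and CR (13)
   repeat with i = 1 to 31
      if i is among the items of "9,10,13" then next repeat
      replace numToChar(i) with empty in pText
   end repeat
   return pText
end cleanBlock
```

It would be called once on the text read from the file, before any offsets
are collected, so that the positions refer to the cleaned string.
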



