Small regex project for pay [CLOSED]

Richard Gaskin ambassador at fourthworld.com
Thu Mar 17 13:12:18 EDT 2016


That's a handy handler, Peter, but I think it would need to be enhanced 
to accommodate Paul's request here, as his algo needs to account for not 
only white space but also punctuation.  Trickier still, it needs to 
accommodate punctuation across multiple languages, so the list of 
characters to be checked could be quite lengthy and perhaps difficult 
to anticipate for all possible use cases.

Personally, I wouldn't bother with any language-parsing tasks in 
anything prior to v7.0, given the power of trueWord.  As Mark Waddingham 
has noted here, most of the increase in the engine size between v6 and 
v7 is Unicode libraries and tables whose purpose is to handle exactly 
this sort of problem.
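
For illustration, here is a minimal sketch of a whole-word match built on 
the trueWord chunk, assuming v7.0 or later.  The handler name 
trueWordMatches is made up for this example, and it returns trueWord 
numbers rather than the character offsets Paul's algo may ultimately 
need, so treat it as a starting point only:

```livecode
function trueWordMatches pSearchWord, pText
   -- Hypothetical helper: returns a cr-delimited list of the trueWord
   -- numbers at which pSearchWord occurs as a whole word in pText.
   -- Word boundaries are left to the engine's trueWord chunk, so no
   -- hand-built list of white space or punctuation is needed.
   local tWordNum, tOutList
   put 0 into tWordNum
   repeat for each trueWord tWord in pText
      add 1 to tWordNum
      -- "is" ignores case unless the caseSensitive property is set to true
      if tWord is pSearchWord then put tWordNum & cr after tOutList
   end repeat
   return line 1 to -1 of tOutList -- drop trailing cr
end trueWordMatches
```

Each result can then be addressed as "trueWord N of pText" for 
whatever further processing is needed.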

V6 and v7 have been identified as reaching EOL as soon as v8.0 goes 
final.  All serious apps I work on here are being developed in v8, 
shipping for now with either v6.x or 7.x as needed depending on the 
specifics of the app at hand.  But the moment v8.0 goes final I'll be 
able to have confidence that it'll do what I need because I've already 
run my work through this new engine and have already submitted bug 
reports that have already been addressed.

Waiting to run my work in v8.0 until after v8.0 Stable is released would 
only increase the chances that some uncommon thing my app depends on has 
met with a regression I didn't identify when I had the chance, pushing 
back my own time-to-market by having to wait for a v8.1.

With more than 2,500 bug fixes and enhancements between v6.0 and v8.0, 
there's plenty there to keep me motivated about the upgrade.

-- 
  Richard Gaskin
  Fourth World Systems
  Software Design and Development for the Desktop, Mobile, and the Web
  ____________________________________________________________________
  Ambassador at FourthWorld.com                http://www.FourthWorld.com


Peter M. Brigham wrote:

> On Mar 17, 2016, at 8:20 AM, David Bovill wrote:
>
>> Hi Peter, any chance of sharing it?
>
> Sure. Below is the offsets function that returns all the offsets of a string in a container. Then all you have to do is something like this:
>
> function getStringChunks pSearchStr,pText,beginsWholeWord,endsWholeWord
>    if beginsWholeWord = empty then put false into beginsWholeWord
>    if endsWholeWord = empty then put false into endsWholeWord
>    -- default to simple offsets, not whole word offsets
>    put offsets(pSearchStr,pText) into offSts
>    replace comma with cr in offSts
>    put len(pSearchStr) into strLen
>    put cr & space & tab & " " into wSpace
>    -- include non-breaking space
>    repeat for each line i in offSts
>       put char i-1 of pText into charBefore
>       put char i+strLen of pText into charAfter
>       if beginsWholeWord and not (charBefore is in wSpace) then next repeat
>       if endsWholeWord and not (charAfter is in wSpace) then next repeat
>       put i & comma & (i+strLen-1) & cr after outList
>    end repeat
>    return line 1 to -1 of outList
> end getStringChunks
>
> Pass beginsWholeWord = true and endsWholeWord = true for wholeMatches.
> Might not be really fast for pText of 100K+ characters, but should be quite efficient on smaller texts. Often LC's chunking functions are faster than regex anyway.
>
> ---------
>
> function offsets str, pContainer
>    -- returns a comma-delimited list of all the offsets of str in pContainer
>    -- returns 0 if not found
>    -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
>    --     ie, overlapping offsets are not counted
>    -- note: to get the last occurrence of a string in a container (often useful)
>    --     use "item -1 of offsets(...)"
>
>    if str is not in pContainer then return 0
>    put 0 into startPoint
>    repeat
>       put offset(str,pContainer,startPoint) into thisOffset
>       if thisOffset = 0 then exit repeat
>       add thisOffset to startPoint
>       put startPoint & comma after offsetList
>       add length(str)-1 to startPoint
>    end repeat
>    return item 1 to -1 of offsetList -- delete trailing comma
> end offsets
>
> -- Peter
>
> Peter M. Brigham
> pmbrig at gmail.com
> http://home.comcast.net/~pmbrig




