Small regex project for pay [CLOSED]
Richard Gaskin
ambassador at fourthworld.com
Thu Mar 17 13:12:18 EDT 2016
That's a handy handler, Peter, but I think it would need to be enhanced
to accommodate Paul's request here, as his algo needs to account for not
only white space but also punctuation. Trickier still, it needs to
accommodate punctuation across multiple languages, so the range of
characters to be checked could be quite lengthy and perhaps
difficult to anticipate for all possible use cases.
Personally, I wouldn't bother with any language-parsing tasks in
anything prior to v7.0, given the power of trueWord. As Mark Waddingham
has noted here, most of the increase in the engine size between v6 and
v7 is Unicode libraries and tables whose purpose is to handle exactly
this sort of problem.
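To illustrate the trueWord point, here is a minimal sketch, assuming
v7.0 or later. The handler name countWholeWord is hypothetical (it is
not part of the engine), and it only counts whole-word matches rather
than returning character offsets as Peter's handler does:

```livecode
-- hypothetical helper: counts whole-word occurrences of pWord in pText
-- assumes LiveCode 7.0 or later, where the trueWord chunk is available
function countWholeWord pWord, pText
   local tCount
   put 0 into tCount
   -- trueWord chunks are split on Unicode word breaks, so white space
   -- and adjacent punctuation should not end up inside tWord
   repeat for each trueWord tWord in pText
      if tWord is pWord then add 1 to tCount
   end repeat
   return tCount
end countWholeWord
```

The comparison follows the caseSensitive property, which is false by
default, so set it to true first if case matters.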
v6 and v7 have been identified as reaching EOL as soon as v8.0 goes
final. All serious apps I work on here are being developed in v8,
shipping for now with either v6.x or 7.x as needed depending on the
specifics of the app at hand. But the moment v8.0 goes final I'll be
able to have confidence that it'll do what I need because I've already
run my work through this new engine and have already submitted bug
reports that have already been addressed.
Waiting to run my work in v8.0 until after v8.0 Stable is released would
only increase my chances that some uncommon thing my app depends on has met
with a regression I didn't identify when I had the chance, pushing back
my own time-to-market by having to wait for a v8.1.
With more than 2,500 bug fixes and enhancements between v6.0 and v8.0,
there's plenty there to keep me motivated about the upgrade.
--
Richard Gaskin
Fourth World Systems
Software Design and Development for the Desktop, Mobile, and the Web
____________________________________________________________________
Ambassador at FourthWorld.com http://www.FourthWorld.com
Peter M. Brigham wrote:
> On Mar 17, 2016, at 8:20 AM, David Bovill wrote:
>
>> Hi Peter, any chance of sharing it?
>
> Sure. Below is the offsets function that returns all the offsets of a string in a container. Then all you have to do is something like this:
>
> function getStringChunks pSearchStr,pText,beginsWholeWord,endsWholeWord
> if beginsWholeWord = empty then put false into beginsWholeWord
> if endsWholeWord = empty then put false into endsWholeWord
> -- default to simple offsets, not whole word offsets
> put offsets(pSearchStr,pText) into offSts
> if offSts = 0 then return empty -- offsets() returns 0 when pSearchStr is not found
> replace comma with cr in offSts
> put len(pSearchStr) into strLen
> put cr & space & tab & " " into wSpace
> -- include non-breaking space
> repeat for each line i in offSts
> put char i-1 of pText into charBefore
> put char i+strLen of pText into charAfter
> if beginsWholeWord and not (charBefore is in wSpace) then next repeat
> if endsWholeWord and not (charAfter is in wSpace) then next repeat
> put i & comma & (i+strLen-1) & cr after outList
> end repeat
> return line 1 to -1 of outList
> end getStringChunks
>
> Pass beginsWholeWord = true and endsWholeWord = true for wholeMatches.
> Might not be really fast for pText of 100K+ characters, but should be quite efficient on smaller texts. Often LC's chunking functions are faster than regex anyway.
>
> ---------
>
> function offsets str, pContainer
> -- returns a comma-delimited list of all the offsets of str in pContainer
> -- returns 0 if not found
> -- note: offsets("xx","xxxxxx") returns "1,3,5" not "1,2,3,4,5"
> -- ie, overlapping offsets are not counted
> -- note: to get the last occurrence of a string in a container (often useful)
> -- use "item -1 of offsets(...)"
>
> if str is not in pContainer then return 0
> put 0 into startPoint
> repeat
> put offset(str,pContainer,startPoint) into thisOffset
> if thisOffset = 0 then exit repeat
> add thisOffset to startPoint
> put startPoint & comma after offsetList
> add length(str)-1 to startPoint
> end repeat
> return item 1 to -1 of offsetList -- delete trailing comma
> end offsets
>
> -- Peter
>
> Peter M. Brigham
> pmbrig at gmail.com
> http://home.comcast.net/~pmbrig