Internet communication through a Firewall

Tue Sep 28 23:48:48 EDT 2004

On 9/28/04 4:20 PM, "J. Landman Gay" <jacque at hyperactivesw.com> wrote:

> On 9/28/04 2:55 PM, Ken Ray wrote:
> 
>> function trim what
>>   local tText
>>   get matchText(what,"(?s)\s*(\S.*\S)\s*",tText)
>>   return tText
>> end trim
> 
> Being somewhat grep-challenged, could you explain more what the above
> does? 

Sure. To translate it into English, what the regex says is:

"Search for any amount of white space (spaces, tabs, crs, non-breaking
spaces, etc.) up to a non-whitespace character. Start capturing all of the
characters starting at that point until you reach a character where what
follows it is just another bunch of whitespace-only characters. Stop there,
and return what you've captured."

For those abnormally interested in learning regex (grin), here's the
breakdown:

(?s) -- This turns on "dot-all" (single-line) mode, which lets "." match
line breaks as well as every other character. Without it, the ".*" inside
the capture stops at the first CR, so multi-line text would not be captured
whole. It has nothing to do with greediness: "*" is greedy by default,
meaning it grabs as many characters as it can while still letting the rest
of the pattern match, and a search always reports the leftmost match (if the
string is "The red coat is red" and you match for "red", you get the first
"red").
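
For anyone who wants to see what (?s) actually changes, here is a small
sketch. It uses Python's re module rather than Transcript (that's my
assumption for illustration; the variable names are made up), but the
pattern is the same one used in trim. It runs the pattern with and without
(?s) on text that contains a line break, and also checks which "red" a plain
search reports in "The red coat is red".

```python
import re

text = "  first line\nsecond line  "

# Same pattern as in trim, once with (?s) and once without it.
with_s = re.search(r"(?s)\s*(\S.*\S)\s*", text).group(1)
without_s = re.search(r"\s*(\S.*\S)\s*", text).group(1)

print(repr(with_s))     # 'first line\nsecond line'
print(repr(without_s))  # 'first line'

# A search reports the leftmost match, with or without (?s).
red_at = re.search(r"red", "The red coat is red").start()
print(red_at)           # 4, i.e. the first "red"
```

Without (?s) the capture ends at the line break, because "." refuses to
cross it.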

\s  -- This stands for any whitespace character (tab, CRs, spaces,
non-breaking spaces, etc.).

*  -- This means "0 or more of the preceding item". When attached to
\s, this means "0 or more whitespace characters".

( )  -- This is the data to capture and return, in the variable(s) at the
end of the "matchText" function (in the code above, this would be "tText").

\S  -- This stands for any non-whitespace character.

.  -- This stands for *any* character at all.

So:

(\S.*\S)  -- This means to capture a chunk of text that starts and ends with
a non-whitespace character with anything else in between. Since regex needs
to satisfy the entire pattern, marrying this with the \s* before and after
the captured portion means that it will only be a match if what precedes the
captured portion is ONLY whitespace, and what follows the captured portion
is ONLY whitespace.
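
Putting the pieces together, here is a rough Python equivalent of the trim
function (a sketch only; Python's re stands in for matchText, and the names
TRIM, TRIM_ONE and the pattern argument are hypothetical). One thing it makes
easy to see: because the capture needs a \S at both ends, text with only a
single non-whitespace character does not match at all. TRIM_ONE is one
possible variant, not from the original handler, that makes the ".*\S" part
optional.

```python
import re

# Same pattern as the trim handler in this thread.
TRIM = re.compile(r"(?s)\s*(\S.*\S)\s*")

# Hypothetical variant: the ".*\S" part is optional, so a lone
# non-whitespace character can match too.
TRIM_ONE = re.compile(r"(?s)\s*(\S(?:.*\S)?)\s*")

def trim(what, pattern=TRIM):
    """Return the captured text, or "" when the pattern does not match."""
    m = pattern.search(what)
    return m.group(1) if m else ""

print(repr(trim("  \t hello world \n ")))  # 'hello world'
print(repr(trim(" a ")))                   # '' (needs two \S characters)
print(repr(trim(" a ", TRIM_ONE)))         # 'a'
```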

> Is there a difference between that and this:
> 
>    put word 1 to -1 of tText into tText

One minor difference - your code above won't strip off non-breaking spaces
or odd combinations where a word is followed by a space and then a
non-breaking space. Compare:

The code:

on mouseUp
  put tab & cr & space & numToChar(202) & "Jacque" & numToChar(202) & \
    tab & cr & space & "Ken" && numToChar(202) & tab & cr & space into temp
  put "|" & word 1 to -1 of temp & "|"
end mouseUp

returns:

| Jacque    
 Ken  |

where the character before "Jacque" is a non-breaking space, and the two
characters following "Ken" are a regular space followed by a non-breaking
space.

The code:

on mouseUp
  local tText
  put tab & cr & space & numToChar(202) & "Jacque" & numToChar(202) & \
    tab & cr & space & "Ken" && numToChar(202) & tab & cr & space into temp
  get matchText(temp,"(?s)\s*(\S.*\S)\s*",tText)
  put "|" & tText & "|"
end mouseUp

returns:

|Jacque     
 Ken|

True, these are rare circumstances, but since I do a bunch of work with web
stuff where non-breaking spaces are used to move text into a particular
place on a page, it is important that I strip *everything*.
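
The same comparison can be sketched in Python. The assumptions here are
mine: U+00A0 stands in for numToChar(202), and stripping only space, tab and
newline stands in for "word 1 to -1" (on the assumption that those are the
characters the word chunk treats as delimiters). Python's \s does count
U+00A0 as whitespace in text patterns.

```python
import re

NBSP = "\u00a0"  # stand-in for numToChar(202)

# Same layout as the test string in the handlers above.
temp = ("\t\n " + NBSP + "Jacque" + NBSP +
        "\t\n " + "Ken" + " " + NBSP + "\t\n ")

# Rough stand-in for "word 1 to -1": strips space, tab, newline only.
word_trim = temp.strip(" \t\n")

# The regex from trim; \s also covers the non-breaking space here.
regex_trim = re.search(r"(?s)\s*(\S.*\S)\s*", temp).group(1)

print(repr(word_trim))   # keeps the non-breaking spaces at both ends
print(repr(regex_trim))  # starts at "Jacque", ends at "Ken"
```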

> That's my standard way of removing leading and trailing white space. My
> way doesn't remove duplicate spaces between words, however.

The regex above doesn't either; it's only concerned about leading and
trailing whitespace.
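
A quick check of that, again as a Python sketch with made-up names:
trimming leaves the runs of spaces between words alone, and collapsing them
would be a separate step.

```python
import re

def trim(what):
    """Strip leading/trailing whitespace with the pattern from this thread."""
    m = re.search(r"(?s)\s*(\S.*\S)\s*", what)
    return m.group(1) if m else ""

sample = "  two   spaces  inside  "
trimmed = trim(sample)                 # inner runs of spaces are untouched
collapsed = " ".join(trimmed.split())  # separate, optional clean-up step

print(repr(trimmed))    # 'two   spaces  inside'
print(repr(collapsed))  # 'two spaces inside'
```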

OK, class dismissed!

:-)

Ken Ray
Sons of Thunder Software
Web site: http://www.sonsothunder.com/
Email: kray at sonsothunder.com