A couple of handlers...
Jim Ault
JimAultWins at yahoo.com
Thu Jan 11 11:47:21 EST 2007
On 1/11/07 5:14 AM, "David Bovill" <david at openpartnership.net> wrote:
> As I am not much good at regular expressions I thought I would share my
> ignorance with others :) Here is the best I can do with regard to extracting
> links from htmltext in fields - they only work on a single line and they do
> not find links with variable whitespace as you may get in html web pages.
>
> on html_DeconstructNameLink nextHtmlLine, @someText, @someLink
> -- <a name="/Users/david/Movies/crossingTheBridge.mp4">Crossing The
> Bridge</a>
>
> put "<a name=" & quote & "([^>]*)" & quote & ">([^<]*)</a>" into someReg
> return matchText(nextHtmlLine, someReg, someLink, someText)
> end html_DeconstructNameLink
>
> on html_DeconstructRefLink nextHtmlLine, @someText, @someLink
> -- <a href="/Users/david/Movies/crossingTheBridge.mp4">Crossing The
> Bridge</a>
>
> put "<a href=" & quote & "([^>]*)" & quote & ">([^<]*)</a>" into someReg
> return matchText(nextHtmlLine, someReg, someLink, someText)
> end html_DeconstructRefLink
>
> Is there a better way?
As much as I like and use RegEx, there are better ways that I use for
links, depending on the web content you encounter. Some pages are generated by
JavaScript, PHP, or another server program and come out very nicely consistent.
Others are built from templates and are haphazardly composed.
One of the starting points I have sent to the list in the past couple of months
is the non-RegEx method:
replace cr with empty in pageText --remove all cr's
replace "<a" with (cr & "<a") in pageText
replace "</a" with (cr & "</a") in pageText
filter pageText with "<a*"
-- now you only have a list of <A> tags
filter pageText with "*href*"
-- now you only have the <A with HREF
-- assuming you want the "http" only
replace "http:" with (cr & "http:") in pageText
filter pageText with "http:*"
-- now all lines start with "http:"
What may work in your pages is:
repeat for each line LNN in pageText
  put word 1 of LNN & cr after newList
end repeat
delete last char of newList
-- the gotcha would be spaces in the link
This is a little more robust:
replace "http:" with (cr & "http:") in pageText
filter pageText with "http:*"
-- keep only the lines that start with "http:"
set the itemDel to quote
repeat for each line LNN in pageText
  put item 1 of LNN & cr after newList
end repeat
delete last char of newList
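
Pulled together, the steps above might look something like the function below. Treat it as a rough sketch rather than a finished handler: the name extractHttpLinks is made up for illustration, and it assumes every href value is wrapped in double quotes and starts with "http:".

```livecode
-- Hypothetical helper (the name is not from any library): returns one
-- "http:" link per line from the html passed in as pageText.
-- Assumes href values are double-quoted.
function extractHttpLinks pageText
  -- put every <a ...> and </a> on a line of its own
  replace cr with empty in pageText
  replace "<a" with (cr & "<a") in pageText
  replace "</a" with (cr & "</a") in pageText
  -- keep the opening tags, then only those with an href
  filter pageText with "<a*"
  filter pageText with "*href*"
  -- start a new line at each link and drop everything else
  replace "http:" with (cr & "http:") in pageText
  filter pageText with "http:*"
  -- the link ends at the closing quote of the href value
  set the itemDelimiter to quote
  put empty into newList
  repeat for each line LNN in pageText
    put item 1 of LNN & cr after newList
  end repeat
  delete last char of newList
  return newList
end extractHttpLinks
```

Something like put extractHttpLinks(the htmlText of field 1) in the message box would then list the links. Relative links and "https:" links are skipped, since only the literal "http:" is matched, and the "<a*" filter also lets through other tags that begin with "<a" if they happen to contain "href".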
Hope this gets you close.
There are more examples I have posted in the past, so you might want to
search the archives on my name to find those threads.
http://www.mail-archive.com/use-revolution@lists.runrev.com/
Jim Ault
Las Vegas