htmlText, xHTML and revXML

Jim Ault JimAultWins at yahoo.com
Fri Dec 22 21:15:44 EST 2006


On 12/22/06 4:23 PM, "David Bovill" <david at openpartnership.net> wrote:
> On an aside note - despite searching for a very long time the archives I
> cannot find the previous post on how to parse HTML to extract all image
> links or href links... I remember some clever replacing and filtering going
> on... but I forget the sequence...
> 
> Anyone have some scripts for extracting all anchors (ie "a name="http:....">
> ) or href/image links from htmltext?

This depends on your html page.

Here are the general steps I start with, then refine later to reach my goal.
put sourceTxt into htmlPage
replace cr with empty in htmlPage --text is now one line
replace "href=" with "href="&cr in htmlPage
replace "</a" with "</a"&cr in htmlPage
filter htmlPage with "*http://*"
set the itemdel to ">"
repeat for each line LNN in htmlPage
   put item 1 of LNN & cr after newLinkList
end repeat
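
The link steps above could be wrapped in a function, roughly like this. It is only a sketch: the names extractLinks, pHtml, tLinks and tLine are made up for illustration, and it assumes every href value is wrapped in double quotes. It uses the quote as the item delimiter instead of ">", so the quotes and any trailing attributes are dropped.

```livecode
function extractLinks pHtml
   local tLinks
   replace cr with empty in pHtml -- text is now one line
   replace "href=" with "href=" & cr in pHtml -- each link value starts a new line
   replace "</a" with "</a" & cr in pHtml -- cut off whatever follows the link
   filter pHtml with "*http://*"
   set the itemDelimiter to quote
   repeat for each line tLine in pHtml
      -- a line that begins with a quote came straight after "href="
      if char 1 of tLine is quote then
         -- item 1 is the empty text before the opening quote, item 2 is the URL
         put item 2 of tLine & cr after tLinks
      end if
   end repeat
   delete the last char of tLinks -- drop the trailing cr
   return tLinks
end function
```

You could then call it with something like:   put extractLinks(sourceTxt) into newLinkList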

put sourceTxt into htmlPage2
replace cr with empty in htmlPage2 --text is now one line
replace "img src=" with "img" & cr & "src=" in htmlPage2 --use "imgsrc=" only if spaces were stripped first
replace ".jpg" with ".jpg"&cr in htmlPage2
replace ".gif" with ".gif"&cr in htmlPage2
filter htmlPage2 with "src=*"
set the itemDel to "="
repeat for each line LNN in htmlPage2
   put item 2 of LNN & cr after newImgList
end repeat
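
A variation on the image steps, as a rough sketch only. Instead of breaking lines on ".jpg" and ".gif", it breaks before each "<img" and reads the quoted value after the first "src=", so other extensions are picked up too. The names extractImages, pHtml, tImages, tRest and tLine are hypothetical, and it assumes the src value is in double quotes and is the first "src=" in the tag.

```livecode
function extractImages pHtml
   local tImages, tRest
   replace cr with empty in pHtml -- text is now one line
   replace "<img" with cr & "<img" in pHtml -- each img tag starts a new line
   filter pHtml with "<img*src=*"
   set the itemDelimiter to quote
   repeat for each line tLine in pHtml
      -- keep only the text that follows the first "src="
      put char (offset("src=", tLine) + 4) to -1 of tLine into tRest
      -- tRest should begin with the opening quote, so item 2 is the path
      if char 1 of tRest is quote then put item 2 of tRest & cr after tImages
   end repeat
   delete the last char of tImages -- drop the trailing cr
   return tImages
end function
```

Called the same way:   put extractImages(sourceTxt) into newImgList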

You will need to refine this to match your html page and your goals.
Not all 'img' tags are inside link tags.
Not all links contain "http://" (relative links, for example, do not).

Other variations can occur, so I would recommend doing the following test:
[1]  do these 3 steps, then read the result to see how to proceed
replace cr with empty in htmlPage --text is one line
replace "href=" with "href="&cr in htmlPage
replace "</a" with "</a"&cr in htmlPage

[2]  do these 4 steps, then read the result to see how to proceed
replace cr with empty in htmlPage --text is one line
replace "href=" with "href="&cr in htmlPage
replace ".jpg" with ".jpg"&cr in htmlPage
replace ".gif" with ".gif"&cr in htmlPage

Look for lines that have multiple hits that did not get separated, etc.
Look for spacing variations around the equals sign, such as "href =", "href = ", "href= ".
The reason you cannot remove spaces for the whole container is that img src
links could contain folder names with spaces.  If you know spaces won't
occur in your links or imgs, then add this line right after the first 'put'
line (note that "img src=" then becomes "imgsrc=", so search for that form)...

replace space with empty in htmlPage
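
If spaces can occur in your paths, a gentler option is to tidy only the spacing around "href=" and leave the rest of the text alone. A minimal sketch, with a made-up function name (tidyHrefSpacing) and parameter (pHtml). It covers only the three single-space variants listed above, not runs of several spaces or tabs.

```livecode
function tidyHrefSpacing pHtml
   -- collapse the three spacing variants into plain "href="
   replace "href = " with "href=" in pHtml
   replace "href =" with "href=" in pHtml
   replace "href= " with "href=" in pHtml
   return pHtml
end function
```

For example:   put tidyHrefSpacing(sourceTxt) into htmlPage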

Hope this helps


Jim Ault
Las Vegas





