Anyone got one of these?

Trevor DeVore lists at mangomultimedia.com
Fri Jan 26 15:13:34 EST 2007


On Jan 26, 2007, at 10:56 AM, Chipp Walters wrote:

> function stripAllTagsBut pHtml,pTagsList
>  --> pTagsList IS A LIST OF TAGS NOT TO EXCLUDE FROM PARSING
>  --> EX. LINE 1 OF pTagsList CAN BE "img" AND LINE 2 CAN BE "b", etc..
>
>
> It's used to strip all tags from HTML but those in the pTagsList  
> parameter.
>
> IOW, it can be used to grab the HTML of a page, and strip everything
> but the img tags.
>
> I'm starting to write it, but thought I'd ask-- just in case.

Well, I have one I've been working on that takes a list of things to
strip.  Maybe you could modify it to fit your needs.  The first
version was much more compact and used matchText.  Then I stress
tested it and it was slooooow, and since I have to call it quite often
and with large amounts of text, I came up with the attached version.


-- 
Trevor DeVore
Blue Mango Learning Systems - www.bluemangolearning.com
trevor at bluemangolearning.com



/**
  * Cleanses a string of the specified Revolution HTML tags.
  *
  * @param  pHTML               HTML to act on.
  * @param  pStripFilter        Comma-delimited list of tags to strip:
  *                             p,size,face,lang,color,bgcolor,b,i,u,strike,sub,sup,
  *                             box,threedbox,expanded,condensed,img,a.
  * @param  pStripTrailingCR    Pass true to strip any trailing CR from the end of pHTML.
  *
  * @return The cleaned HTML.
  */
FUNCTION str_stripHTML pHTML, pStripFilter, pStripTrailingCR
     local tProp,tFontFilter,tInlineFilter,tAttributeFilter,tStart,tEnd
     local tSkip,tOffset1,tOffset2,tDeleteChars,i

     put 0 into tSkip --> START SEARCHING FROM THE BEGINNING OF pHTML

     set the wholematches to true

     --> PROCESS pStripFilter
     REPEAT for each item tProp in pStripFilter
         IF tProp is among the items of "p,b,i,u,strike,sub,sup,box,threedbox,expanded,condensed" THEN
             put tProp &comma after tAttributeFilter
         ELSE IF tProp is among the items of "face,size,color,bgcolor,lang" THEN
             put tProp &comma after tFontFilter
         ELSE IF tProp is among the items of "img,a" THEN
             put tProp & comma after tInlineFilter
         END IF
     END REPEAT

     --> PROCESS
     REPEAT forever --> OK, I TRIED USING MATCHCHUNK WITH THIS BUT IT WAS A GAZILLION TIMES SLOWER
         put offset("<font", pHTML, tSkip) into tOffset1
         IF tOffset1 > 0 THEN
             put offset(">", pHTML, tSkip + tOffset1) into tOffset2 --> GET CLOSING TAG

             --> LOOP THROUGH PROPS AND ERASE
             REPEAT for each item tProp in tFontFilter
                 put offset(space & tProp & "=" & quote, pHTML, tSkip + tOffset1) into tStart
                 IF tStart > 0 AND tSkip + tOffset1 + tStart < tSkip + tOffset1 + tOffset2 THEN --> ONLY LOOK FOR PROPS IN CURRENT FONT TAG
                     get tSkip + tOffset1 + tStart + length(tProp) + 2
                     put offset(quote, pHTML, it) into tEnd
                     IF tEnd > 0 THEN
                         put tSkip + tStart + tOffset1 & comma & it + tEnd & cr after tDeleteChars
                     END IF
                 END IF
             END REPEAT

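             --> SORT BY START POSITION, THEN MOVE BACKWARDS THROUGH LIST AND DELETE
             --> (ENTRIES ARE IN FILTER ORDER, WHICH MAY NOT MATCH ATTRIBUTE ORDER IN THE TAG)
             sort lines of tDeleteChars ascending numeric by item 1 of each
             REPEAT with i = the number of lines of tDeleteChars down to 1
                 delete char (item 1 of line i of tDeleteChars) to (item 2 of line i of tDeleteChars) of pHTML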
             END REPEAT
             put empty into tDeleteChars
         ELSE
             exit REPEAT
         END IF
         add tOffset1 + 4 to tSkip
     END REPEAT

     REPEAT for each item tProp in tAttributeFilter
         replace "<"&tProp&">" with empty in pHTML
         replace "</"&tProp&">" with empty in pHTML
     END REPEAT

     REPEAT for each item tProp in tInlineFilter
         REPEAT forever
             put offset("<"&tProp, pHTML) into tStart
             IF tStart > 0 THEN
                 put offset(">", pHTML, tStart) into tEnd
                 IF tEnd > 0 THEN
                     delete char tStart to (tStart+tEnd) of pHTML
                 ELSE
                     exit REPEAT
                 END IF
             ELSE
                 exit REPEAT
             END IF
         END REPEAT
     END REPEAT

     IF "a" is among the items of tInlineFilter THEN
         replace "</a>" with empty in pHTML
     END IF

     --> REMOVE ANY LONELY <FONT> TAGS
     REPEAT forever
         put offset("<font>", pHTML) into tStart
         IF tStart > 0 THEN
             put offset("</font>", pHTML, tStart) into tEnd
             IF tEnd > 0 THEN
                 delete char tStart+tEnd to tStart+tEnd+6 of pHTML
                 delete char tStart to tStart+5 of pHTML
             ELSE
                 exit REPEAT
             END IF
         ELSE
             exit REPEAT
         END IF
     END REPEAT

     --> REMOVE TRAILING RETURNS
     IF pStripTrailingCR is true THEN
         REPEAT forever
             IF char -8 to -1 of pHTML is cr&"<p></p>" THEN
                 delete char -8 to -1 of pHTML
             ELSE
                 exit REPEAT
             END IF
         END REPEAT
     END IF

     return pHTML
END str_stripHTML
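
One possible way to get the "strip everything but" behaviour asked about above is to wrap str_stripHTML and invert the list. This is only a sketch under assumptions: the wrapper body is not from the original post, it assumes pTagsList has one tag per line as described in the question, and it only knows about the Revolution htmlText tags that str_stripHTML handles, so arbitrary tags from a web page (div, table, script and so on) would be left untouched.

```livecode
function stripAllTagsBut pHtml, pTagsList
     local tAllTags, tStripFilter, tTag

     -- every tag/attribute str_stripHTML knows how to remove
     put "p,size,face,lang,color,bgcolor,b,i,u,strike,sub,sup,box,threedbox,expanded,condensed,img,a" into tAllTags

     -- build the strip filter from everything that is NOT in the keep list
     repeat for each item tTag in tAllTags
          if tTag is not among the lines of pTagsList then
               put tTag & comma after tStripFilter
          end if
     end repeat
     delete the last char of tStripFilter -- drop the trailing comma

     return str_stripHTML(pHtml, tStripFilter, false)
end stripAllTagsBut
```

Called as stripAllTagsBut(tHTML, "img" & cr & "b"), this would pass every supported tag except img and b to str_stripHTML.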



