remove html tags from text

Mark Smith mark at maseurope.net
Sat Sep 9 03:58:04 EDT 2006


On 9 Sep 2006, at 03:57, Richard Gaskin wrote:
>
> So I have two questions about the sort of variable-based methods
> for filtering SGML-style tags and using a field object to do the same:
>
> 1. Which is more forgiving of html which may not be well-formed?
>
> 2. Which is faster?
>

A quick initial test. The HTML used was what was returned from

get URL "http://google.com"

I amended a tag that was <b>Web</b>
to  <b>Web 5<10 </b>
and then
<b>Web 10>5 </b>


I ran 100 iterations of each of the following:

1)
function stripHtmlTagsUsingField tHtml
   set the htmlText of fld "hiddenFld" to tHtml
   return the text of fld "hiddenFld"
end stripHtmlTagsUsingField

This took 370 ms.
It failed on 5<10 (the tag content returned was "Web 5", though the
following content was OK).
It succeeded with 10>5.

2)
function stsStripHTML what
   -- my addition to Ken's handler, to handle tags containing cr
   replace cr with empty in what
   put replaceText(what,"<.*?>","") into noHTML
   return noHTML
end stsStripHTML

This took 920 ms.
Same results as 1).

3)
function stripHtmlTags tHtml
   replace cr with empty in tHtml -- in case of multi-line tags
   replace "<" with cr & "<" in tHtml
   replace ">" with ">" & cr in tHtml
   filter tHtml without "<*>"
   repeat for each line LNN in tHtml
     put word 1 to -1 of LNN  & cr after newHtml
   end repeat
   filter newHtml without empty
   replace cr with space in newHtml
   return newHtml
end stripHtmlTags

This took 45 ms.
It semi-succeeded with the amended tag content: 10>5 became 10> 5
(an additional space) and 5<10 became 5 <10.

The hidden field approach was the only one that translated HTML
entities.
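
For anyone wanting to repeat the timing, a harness along these lines
should do it. This is only a sketch, not the exact script used for the
numbers above: the handler name timeStripper is made up for
illustration, and stripHtmlTags can be swapped for either of the other
two functions.

on timeStripper
   -- hypothetical test handler; fetch the page once, outside the timed loop
   put URL "http://google.com" into tHtml
   put the milliseconds into tStart
   repeat 100 times
      get stripHtmlTags(tHtml) -- swap in the function under test
   end repeat
   put the milliseconds - tStart into tElapsed
   put "100 iterations took" && tElapsed && "ms" -- goes to the message box
end timeStripper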

Best,

Mark






