Getting the text content of a HTML page

Jim Ault JimAultWins at yahoo.com
Mon Aug 4 12:25:23 EDT 2008


On 8/4/08 8:25 AM, "Richard Gaskin" <ambassador at fourthworld.com> wrote:

> viktoras didziulis wrote:
>> one more way to do things using regular expressions:
>> 
>> put the replaceText(myText,"</?[A-Za-z]+>","") into myText
>> 
>> will simply replace all tags with empty string. Where myText is the text
>> where replacements have to be made. </?[A-Za-z]+> is a regular
>> expression matching most html tags and "" is empty replacement string.
> 
> Always looking for potential optimizations, I was going to benchmark
> that here but couldn't get it to work, even after removing "the". :(
-------------------------------
>> put the replaceText(myText,"</?[A-Za-z]+>","") into myText

The problem with this is that [A-Za-z]+ only accepts letters,
not spaces, digits, quotes, slashes or equal signs,
so it finds fewer matches: any tag with attributes or a trailing " /" is skipped

oops, these don't match and won't be replaced with empty  --------------
<img src="somebody.jpg" width="160">
<img src="somebody.jpg" width="160" />
<div class="mainFormat">
<table cellpadding="" width="100%">
<b />
<hr />

works on this tag  -------------
<B>Making this bold</B>
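
For anyone who wants to check this, here is a minimal sketch that runs matchText over a few of the tags above. The function name testTagPattern and the variable names are only for illustration; put it in a stack script and call it from the message box with  put testTagPattern()

```livecode
function testTagPattern
   put "</?[A-Za-z]+>" into tPattern
   -- quote is the built-in constant for the double-quote character
   put "<B>" & return & "</B>" & return & "<hr />" & return & \
         "<div class=" & quote & "mainFormat" & quote & ">" into tSamples
   put empty into tReport
   repeat for each line tTag in tSamples
      -- matchText returns true when the pattern is found anywhere in tTag
      put tTag & tab & matchText(tTag, tPattern) & return after tReport
   end repeat
   return tReport
end testTagPattern
```

Only <B> and </B> should come back true, since the other two contain a space, a slash after the name, or an attribute.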

put "" into newString
put "(?U)<.*>" into regEx
put replaceText(myText,regEx,newString) into myText

By the way, (?U) says "make the shortest match possible"
(?Ui) says "make the shortest match, ignoring case"
(?Usi) says "make the shortest match, ignoring case, and let the dot match
line returns too", so a match can run across lines  ( without the s, the dot
stops at a line return and the match has to stay on the same line )
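
To see what the s option changes, here is a small sketch using a tag that is broken across two lines. The sample text and the function name compareDotAll are made up for the test, and the comments describe standard PCRE behavior, so it is worth confirming in your own version.

```livecode
function compareDotAll
   -- a tag split over two lines, followed by a normal closing tag
   put "<a" & return & "href=x>link</a>" into tText
   -- without s the dot cannot cross the return, so the split tag should be left in place
   put replaceText(tText, "(?U)<.*>", empty) into tSameLine
   -- with s the dot matches the return as well, so both tags should be removed
   put replaceText(tText, "(?Us)<.*>", empty) into tAcrossLines
   return tSameLine & return & "-----" & return & tAcrossLines
end compareDotAll
```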


So
(?U)<.*>
says "find a < char, then keep scanning, accepting any character, only for
as long as it takes to reach the next >"
The dot means any character, the * means any number of them (including none)
the   (?  says "this is a directive you must follow, Mr RegEx Engine"
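If you do not use (?U) [stands for "Ungreedy"], the default behavior is to
find the longest possible match.  For HTML, when the s option lets the dot
match line returns, that would mean the entire document is selected as one
chunk, because
<html>
<head>
</head>
<body>
</body>
</html>
would qualify from the first  <  to the last  >  using the expression
        put "(?s)<.*>" into regEx
Without the s option the same thing happens within each line: everything
from the first  <  to the last  >  on that line is matched, text included.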
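
The difference is easiest to see on a single line. A minimal sketch, with made-up sample text and a made-up function name (compareGreedy):

```livecode
function compareGreedy
   put "<b>bold</b> and <i>italic</i>" into tText
   -- greedy: .* runs from the first < to the last >, so everything is removed
   put replaceText(tText, "<.*>", empty) into tGreedy
   -- ungreedy: each match stops at the nearest >, so only the tags are removed
   put replaceText(tText, "(?U)<.*>", empty) into tUngreedy
   -- a negated class does the same job without (?U) and is not stopped by line returns
   put replaceText(tText, "<[^>]*>", empty) into tNegated
   return "[" & tGreedy & "]" & return & "[" & tUngreedy & "]" & return & "[" & tNegated & "]"
end compareGreedy
```

The first result should be empty between the brackets, and the other two should read  bold and italic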

Hope this helps those diving in to 'get' regular expressions.  The
benchmarking you do will likely show that regular expressions carry some
overhead, since the engine can scan the text forward and then back up
(backtrack), depending on the complexity of the expression, the text block
being scanned, and the number of successful matches.  This is why filter and
repeat for each are so often the more efficient choice.
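
As a starting point for that benchmark, here is one way to strip tags with repeat for each instead of a regular expression. It is only a sketch under a couple of assumptions (the function name stripTagsNoRegex is made up, and a < with no matching > is treated as plain text), so time it against replaceText on your own data before relying on it.

```livecode
function stripTagsNoRegex pHtml
   put empty into tResult
   put true into tFirst
   -- splitting on "<" means every item after the first begins inside a tag
   set the itemDelimiter to "<"
   repeat for each item tChunk in pHtml
      if tFirst then
         -- text before the first "<" is kept untouched
         put tChunk after tResult
         put false into tFirst
      else
         put offset(">", tChunk) into tEnd
         if tEnd > 0 then
            -- drop the tag itself, keep whatever follows its ">"
            put char (tEnd + 1) to -1 of tChunk after tResult
         else
            -- no closing ">": not a tag, so restore the "<" and keep the text
            put "<" & tChunk after tResult
         end if
      end if
   end repeat
   return tResult
end stripTagsNoRegex
```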

Jim Ault
Las Vegas

