reading and converting web page HTML text

Peter Brigham MD pmbrig at gmail.com
Sun Mar 7 13:26:17 EST 2010


On Mar 6, 2010, at 7:13 PM, Jim Ault wrote:

> On Mar 6, 2010, at 2:35 PM, Mark Stuart wrote:
>
>>
>> Hi François,
>>
>> Thanx for your quick reply.
>> I added Sarah's script into my application and ran it.
>>
>> The function halted with an error on (&quot;), because it is not a
>> number. I think Sarah's function is looking for a number after the
>> ampersand, correct?
>> So I'm handling the (&quot;) as an exception for now by using this
>> script:
>>
>> if theText contains "&quot;" then
>>  replace "&quot;" with quote in theText
>> end if
>>
>> and then call the decodeEntities(theText) function.
>>
>> I'm sure I'll come across other HTML text like this, but don't know
>> how to handle it really.
>
>
> Basically, I would go to a site that shows all html entities, make a  
> list of those, and do a replace using a repeat loop.
>
> Google 'html entities' to get the possibilities.
>
> Jim Ault
> Las Vegas

I was curious about this so I looked in the dictionary entry for  
"HTMLtext", in which there is a list of named HTML entities that Rev  
is supposed to recognize. In my version of the dictionary this list is  
missing the ampersand before almost all the entries and also mostly  
doesn't show the characters referred to (I'm submitting a user note on  
this). I cleaned up this list meanwhile and it is available at:
http://home.comcast.net/%7Epmbrig/HTMLcharEncoding.dmg (Mac)
http://home.comcast.net/%7Epmbrig/HTMLcharEncoding.rev.zip (Windows)
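
For what it's worth, Jim's idea of looping over a list of named entities and replacing each one could look roughly like the sketch below. It is written in Python rather than Rev script, the table is a hypothetical handful of entries rather than the full list, and the name decode_named_entities is made up for illustration. (Python's standard library also has html.unescape, which covers the full set.)

```python
# Hypothetical, deliberately tiny entity table; a real one would hold the full list.
NAMED_ENTITIES = {
    "&quot;": '"',
    "&lt;": "<",
    "&gt;": ">",
    "&Eacute;": "É",
    "&eacute;": "é",
}


def decode_named_entities(text):
    """Replace each known named entity in text with the character it stands for."""
    for entity, char in NAMED_ENTITIES.items():
        # str.replace is case-sensitive, so &Eacute; and &eacute; stay distinct.
        text = text.replace(entity, char)
    # Decode &amp; last so that "&amp;quot;" ends up as the literal text "&quot;".
    return text.replace("&amp;", "&")
```

Decoding the ampersand last is the one ordering detail that matters here; doing it first would turn already-escaped text into entities that then get decoded a second time.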

The data is also stored in custom properties of the stack. To find the
HTML encoding for a character c, use:
the chartoHTML[c] of this stack
but set the caseSensitive to true before reading the property, or it
won't distinguish between "É" and "é".
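
To show why case sensitivity matters for this kind of lookup, here is a small Python sketch (not Rev script). CHAR_TO_HTML is a hypothetical stand-in for the chartoHTML custom property with only three entries, and both function names are invented for the example. It does not try to predict which entity Rev would actually return with the caseSensitive left at false; it only shows that the two characters become indistinguishable once case is ignored.

```python
# Hypothetical stand-in for the stack's chartoHTML custom property.
CHAR_TO_HTML = {"É": "&Eacute;", "é": "&eacute;", '"': "&quot;"}


def char_to_html(c):
    """Case-sensitive lookup: return the named entity for c, or c if none is known."""
    return CHAR_TO_HTML.get(c, c)


def char_to_html_ignoring_case(c):
    """Lookup with case folded on both sides, so "É" and "é" share one key."""
    folded = {key.lower(): value for key, value in CHAR_TO_HTML.items()}
    return folded.get(c.lower(), c)
```

With the case-sensitive version, "É" and "é" map to different entities; with the folded version they necessarily map to the same one, so one of the two answers is wrong.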

On another note, in perusing the dictionary entry for HTMLtext, I  
noticed that it says:

--------
<font>  </font>
Encloses text whose textFont, textSize, foregroundColor, or  
backgroundColor is different from the field's default. These five  
properties are represented as attributes of the <font> tag.
	* face="fontName" appears in the <font> tag if the textFont is not  
the default.
	* size="pointSize" appears if the textSize is not the default.
In standard HTML, the size attribute normally takes a value between 1  
and 7, representing a relative text size, with 3 being the normal text  
size for the web page. To accommodate this convention, when setting  
the HTMLText of a field, if the pointSize is between 1 and 7, the  
textSize of the text is set to a standard value:
pointSize    textSize
    1        8 point
    2        10 point
    3        12 point
    4        14 point
    5        17 point
    6        20 point
    7        25 point
--------

and further down it says:

--------
* The size attribute of the font tag can encode the font's point size,  
in addition to the standard 7 HTML sizes.
--------

Apparently numbers less than 8 are interpreted as HTML relative size  
and larger numbers specify point size.
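
My reading of that rule, as a Python sketch (not Rev script, and not a claim about how the engine actually implements it): values 1 through 7 go through the table quoted from the dictionary, and anything else is taken as a point size. The names HTML_SIZE_TO_POINTS and text_size_for are made up for the example.

```python
# Table quoted from the dictionary entry: HTML relative size -> point size.
HTML_SIZE_TO_POINTS = {1: 8, 2: 10, 3: 12, 4: 14, 5: 17, 6: 20, 7: 25}


def text_size_for(size_attr):
    """Map a <font size="..."> value to a textSize under the rule described above."""
    # 1 through 7 are treated as HTML relative sizes; other values pass through
    # unchanged and are taken to be point sizes already.
    return HTML_SIZE_TO_POINTS.get(size_attr, size_attr)
```

Under this reading, size="4" and size="14" both come out as 14 point, while size="8" stays at 8 point.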

Could this have something to do with the recently mentioned problems  
with font sizes on Unix platforms? If somehow the rev unix engine is  
mixing these up, then something intended to be size 14 could display  
at size 4.  But I know very little about this stuff, it's just a  
thought.

-- Peter

Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig




