reading and converting web page HTML text
Peter Brigham MD
pmbrig at gmail.com
Sun Mar 7 13:26:17 EST 2010
On Mar 6, 2010, at 7:13 PM, Jim Ault wrote:
> On Mar 6, 2010, at 2:35 PM, Mark Stuart wrote:
>
>>
>> Hi François,
>>
>> Thanx for your quick reply.
>> I added Sarah's script into my application and ran it.
>>
>> The function halted with an error on (&quot;), because it is not a
>> number. I
>> think Sarah's function is looking for a number after the ampersand,
>> correct?
>> So I'm handling the (&quot;) as an exception for now by using this
>> script:
>>
>> if theText contains "&quot;" then
>>   replace "&quot;" with quote in theText
>> end if
>>
>> and then call the decodeEntities(theText) function.
>>
>> I'm sure I'll come across other HTML text like this, but don't know
>> how to
>> handle it really.
>
>
> Basically, I would go to a site that shows all html entities, make a
> list of those, and do a replace using a repeat loop.
>
> Google 'html entities' to get the possibilities.
>
> Jim Ault
> Las Vegas
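A minimal sketch of the repeat-loop approach Jim describes, assuming a small hand-built table of named entities. The function name replaceNamedEntities and the four entries shown are only illustrative; the table would need to be extended from a full entity list:

```livecode
function replaceNamedEntities pText
   -- hypothetical table: entity, a tab, then the character it stands for
   -- "&amp;" is kept last so that "&amp;lt;" is not decoded twice
   put "&quot;" & tab & quote & return & \
         "&lt;" & tab & "<" & return & \
         "&gt;" & tab & ">" & return & \
         "&amp;" & tab & "&" into tPairs
   set the itemDelimiter to tab
   repeat for each line tPair in tPairs
      put item 1 of tPair into tEntity
      put item 2 of tPair into tChar
      replace tEntity with tChar in pText
   end repeat
   return pText
end function
```

Calling this before decodeEntities(theText) would clear out the named entities, so that only the numeric ones are left for Sarah's function to handle.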
I was curious about this so I looked in the dictionary entry for
"HTMLtext", in which there is a list of named HTML entities that Rev
is supposed to recognize. In my version of the dictionary this list is
missing the ampersand before almost all the entries and also mostly
doesn't show the characters referred to (I'm submitting a user note on
this). I cleaned up this list meanwhile and it is available at:
http://home.comcast.net/%7Epmbrig/HTMLcharEncoding.dmg (Mac)
http://home.comcast.net/%7Epmbrig/HTMLcharEncoding.rev.zip (Windows)
The data is also stored in custom properties of the stack. To find the
HTML encoding for a character c, use:
the chartoHTML[c] of this stack
but set the caseSensitive to true before doing the lookup, or it won't
recognize the difference between "É" and "é".
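As a rough sketch, the lookup could be wrapped in a function like the one below. The name htmlEntityFor is made up for illustration, and I am assuming that a character with no entry in the chartoHTML property set comes back empty and that the handler runs in the stack that holds the properties:

```livecode
function htmlEntityFor pChar
   -- the caseSensitive is local to the running handler, so set it here
   set the caseSensitive to true
   put the chartoHTML[pChar] of this stack into tEntity
   -- assumption: no entry means empty, so pass the character through
   if tEntity is empty then return pChar
   return tEntity
end function
```

Because the caseSensitive reverts to false when the handler finishes, setting it inside the function avoids having to remember it at every call.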
On another note, in perusing the dictionary entry for HTMLtext, I
noticed that it says:
--------
<font> </font>
Encloses text whose textFont, textSize, foregroundColor, or
backgroundColor is different from the field's default. These five
properties are represented as attributes of the <font> tag.
* face="fontName" appears in the <font> tag if the textFont is not
the default.
* size="pointSize" appears if the textSize is not the default.
In standard HTML, the size attribute normally takes a value between 1
and 7, representing a relative text size, with 3 being the normal text
size for the web page. To accommodate this convention, when setting
the HTMLText of a field, if the pointSize is between 1 and 7, the
textSize of the text is set to a standard value:
pointSize   textSize
    1        8 point
    2       10 point
    3       12 point
    4       14 point
    5       17 point
    6       20 point
    7       25 point
--------
and further down it says:
--------
* The size attribute of the font tag can encode the font's point size,
in addition to the standard 7 HTML sizes.
--------
Apparently values from 1 to 7 are interpreted as HTML relative sizes,
and larger numbers specify a point size.
Could this have something to do with the recently mentioned problems
with font sizes on Unix platforms? If the Rev Unix engine is somehow
mixing these up, then something intended to be size 14 could display
at size 4. I know very little about this stuff, though, so it's just a
thought.
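To make the rule concrete, here is a sketch of the mapping as I read it from the dictionary entry. This is only a model of the documented behavior, not the engine's actual code, and the function name textSizeForHTMLSize is my own:

```livecode
function textSizeForHTMLSize pSize
   -- 1 to 7 are relative HTML sizes; map them using the dictionary's table
   if pSize is an integer and pSize >= 1 and pSize <= 7 then
      return item pSize of "8,10,12,14,17,20,25"
   end if
   -- anything else is taken as a literal point size
   return pSize
end function
```

With this reading, textSizeForHTMLSize(4) and textSizeForHTMLSize(14) both give 14, so a size="14" that got treated as a relative size somewhere along the way would have nothing sensible to map to.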
-- Peter
Peter M. Brigham
pmbrig at gmail.com
http://home.comcast.net/~pmbrig