reading and converting web page HTML text

Richard Gaskin ambassador at fourthworld.com
Sun Mar 7 00:27:33 EST 2010


Mark Stuart wrote:

> Richard - the html entity that didn't "convert" was the quot, starting with
> & and ending with semi-colon ;
> (if I typed that into the email, you would only see ", as you may see in the
> following).
>
> Jim - so you are suggesting a function to convert all possible entities for
> a text chunk:
>
> function convertHTMLEntities theText
>  replace "&quot;" with quote in theText
>  replace "whatever" with "Ç" in theText
>  ...
>  ...
>  return theText
> end convertHTMLEntities


Actually I tried this with the text of your original email, using two 
fields and a button with this script:


on mouseUp
   put htmlTextToText(fld 1) into fld 2
end mouseUp

-- Named htmlTextToText to match the call in the mouseUp handler above
function htmlTextToText pHtml
   set the htmlText of the templateField to pHtml
   return the text of the templateField
end htmlTextToText
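
If the script also creates fields later on, leftover content in the
templateField could carry over into them.  A minimal sketch of a more
defensive variant follows; the name htmlTextToTextSafe and the local
tText are illustrative assumptions, not part of the script I actually
ran:

```livecode
function htmlTextToTextSafe pHtml
   local tText
   -- let the engine decode the entities via the templateField
   set the htmlText of the templateField to pHtml
   put the text of the templateField into tText
   -- restore the templateField's defaults so fields created later
   -- do not inherit this content
   reset the templateField
   return tText
end htmlTextToTextSafe
```

It is called the same way, e.g. put htmlTextToTextSafe(fld 1) into fld 2.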


In field 1 I had:

-----------
I'm reading the HTML text of a web page and parsing it. Some of the text
that I'm parsing contains (&quot;) - braces not included.

What runrev function do I use to convert that HTML text to the double quote
(") character?
There will be other characters that I also need to convert, such as
(Björnke).
After reading and parsing the text, I'll be loading a DataGrid.
----------


After running it through the function I get:

----------
I'm reading the HTML text of a web page and parsing it. Some of the text 
that I'm parsing contains (") - braces not included.  What runrev 
function do I use to convert that HTML text to the double quote (") 
character? There will be other characters that I also need to convert, 
such as (Björnke). After reading and parsing the text, I'll be loading a 
DataGrid.
----------


The htmlText property is not designed to be true HTML; it is the one 
way you can represent the contents of fields using ASCII characters with 
complete fidelity.  HTML conventions were adopted for this because of 
their simple, extensible nature.  So while the name "htmlText" often 
conjures up all sorts of web expectations it was never designed to 
fulfill, it generally works like a champ as an SGML-like representation 
of anything you can do with Rev fields.
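
To illustrate that round-trip use, here is a small sketch that copies
the styled contents of one field into another by way of htmlText.  It
assumes a card with two fields and a button holding the script; the
variable name tStyled is just for illustration:

```livecode
on mouseUp
   local tStyled
   -- capture the contents of field 1, styles included, as ASCII markup
   put the htmlText of fld 1 into tStyled
   -- rebuild the same styled contents in field 2 from that markup
   set the htmlText of fld 2 to tStyled
end mouseUp
```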

As Jim noted, there are some things you can do in HTML that aren't 
supported by Rev fields currently, so those will fail when attempting to 
use htmlText as a generic HTML-to-text converter.  But you'd be 
surprised at what you can do with it, often including many Unicode 
entities as well now that Rev supports Unicode.
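
For any entity that does not survive the trip through htmlText, an
explicit replace pass along the lines Mark sketched can fill the gap.
This is only a rough sketch: the function name replaceNamedEntities is
made up for the example, and the handful of entities shown is nowhere
near a complete list:

```livecode
function replaceNamedEntities pText
   replace "&quot;" with quote in pText
   replace "&lt;" with "<" in pText
   replace "&gt;" with ">" in pText
   -- do "&amp;" last so that text such as "&amp;quot;" is decoded
   -- only once and ends up as the literal characters "&quot;"
   replace "&amp;" with "&" in pText
   return pText
end replaceNamedEntities
```

Handling "&amp;" last is the one design choice that matters here; in any
other position an escaped ampersand could be decoded a second time.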

Try out the htmlTextToText function above and let me know where it 
doesn't work for you for anything you can display in a Rev field.

--
  Richard Gaskin
  Fourth World
  Rev training and consulting: http://www.fourthworld.com
  Webzine for Rev developers: http://www.revjournal.com
  revJournal blog: http://revjournal.com/blog.irv


