reading and converting web page HTML text

François Chaplais francois.chaplais at mines-paristech.fr
Sat Mar 6 17:10:52 EST 2010


On 6 March 2010 at 23:01, Mark Stuart wrote:

> Hi all,
> I'm reading the HTML text of a web page and parsing it. Some of the text
> that I'm parsing contains (&quot;) - braces not included.
> 
> What runrev function do I use to convert that HTML text to the double quote
> (") character?
> There will be other characters that I also need to convert, such as
> (Bj&ouml;rnke).
> After reading and parsing the text, I'll be loading a DataGrid.
> 
> I've tried some functions, but with no success.
> 
> Regards,
> Mark Stuart
> 
Digging in my mail archive, I found this post from Sarah (it puts Unicode text into a field from an HTML source, if I am correct).
HTH
--------------------------------------------------------------------

On Sun, Jul 26, 2009 at 7:18 AM, Sivakatirswami<katir at hindu.org> wrote:
> Is there a way to get htmlEntities
> 
> "&#8220;Kanwar&#8221;
> 
> The rest of their lifestyle — names, marriage rituals, dressing styles
> — continued to be the same...."
> 
> to appear correctly in a field where such entities are part of the html used
> to set the htmltext of a field?


I had to wrestle with this recently and after numerous attempts with
uniencode, unidecode, macToISO etc., I ended up writing my own
function to do it:

function decodeEntities pText
  -- nothing to do if the text holds no numeric entities
  if pText contains "&#" is false then return pText

  -- let numtochar accept code points above 255
  set the useunicode to true
  put empty into tNew
  repeat until pText is empty
     -- only "&#" starts a numeric entity; copy any other char as-is
     if char 1 to 2 of pText <> "&#" then
        put char 1 of pText after tNew
        delete char 1 of pText
     else
        put empty into tCode
        delete char 1 to 2 of pText
        -- collect the digits up to the closing ";"
        repeat until char 1 of pText = ";"
           put char 1 of pText after tCode
           delete char 1 of pText
           if pText is empty then exit repeat
        end repeat
        delete char 1 of pText
        put numtochar(tCode) into tChar
        set the unicodetext of the templatefield to tChar
        put the text of the templatefield after tNew
     end if
  end repeat

  -- restore the default before returning
  set the useunicode to false

  return tNew
end decodeEntities

Use it like this:        put decodeEntities("&#8220;Kanwar&#8221;")
which returns:         “Kanwar” (curly opening & closing quotes which
may not show in the email).
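
Note that decodeEntities only handles numeric entities of the form &#NNNN;.
Named entities such as &quot; or &ouml; are not decoded by it. One possible
way to cover them is to rewrite each name to its numeric form first and then
run the result through decodeEntities. The following is only a rough sketch:
decodeNamedEntities is a hypothetical helper name, and the small table in it
is an assumption that would have to be extended to cover every entity name
that actually occurs in the page being parsed.

```livecode
function decodeNamedEntities pText
   -- hypothetical helper, not part of the original post
   -- each line maps one named entity to its numeric equivalent
   put "&quot;,&#34;" & return & \
         "&amp;,&#38;" & return & \
         "&ouml;,&#246;" & return & \
         "&mdash;,&#8212;" into tMap
   repeat for each line tPair in tMap
      -- item 1 is the named form, item 2 the numeric form
      replace item 1 of tPair with item 2 of tPair in pText
   end repeat
   return pText
end decodeNamedEntities
```

It could then be combined with the function above, for example:
put decodeEntities(decodeNamedEntities(tHTML))
where tHTML is assumed to hold the page text. Mapping names to numeric
entities, rather than straight to characters, leaves the Unicode handling
in one place.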

I feel sure that there must be a better method, but until someone
discovers it, this function seems to do the job.

Cheers,
Sarah