Parsing (scraping) OpenGraph Tags from html HEAD

Sannyasin Brahmanathaswami brahma at hindu.org
Wed Aug 2 00:45:05 EDT 2017


Thanks Thierry

though I'm not yet sure that using regex here is better than using Jacque's method:


on parseHeader pData
   set the lineDel to "<meta property="
   repeat for each line l in pData
     if l contains "og:" then put char 1 to offset(">",l)-1 of l & cr after tList
   end repeat
   -- do something with tList
end parseHeader

Either way, it would seem prudent to extract the head before processing:

put the htmlText of widget "youtubes" into _HTML # interesting convention of underscore usage for var declaration
put char (offset("<head>",_HTML)) to (offset("</head>",_HTML) + 6) of _HTML into tHead
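If the page has no <head> at all, or the opening tag carries attributes, the one-liner above will misbehave. Here is a minimal sketch of a more defensive version. The function name extractHead and its variable names are my own invention, and falling back to the whole page when no head is found is just one possible choice.

```livecode
function extractHead pHtml
   local tStart, tEnd
   -- search for "<head" without the closing ">" so that
   -- an opening tag with attributes is matched as well
   put offset("<head", pHtml) into tStart
   put offset("</head>", pHtml) into tEnd
   -- assumption: if either tag is missing, hand back the whole page
   if tStart is 0 or tEnd is 0 or tEnd < tStart then return pHtml
   -- "</head>" is 7 chars long, so its last char sits at tEnd + 6
   return char tStart to (tEnd + 6) of pHtml
end extractHead
```

It would be called as: put extractHead(_HTML) into tHead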

Using Jacque's method just gets the list, and we need to do more coding to build the array we need.
It does return every tag, though:

"og:site_name" content="YouTube"
"og:url" content="https://www.youtube.com/user/kauaiaadheenam"
"og:title" content="Kauai's Hindu Monastery"
"og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"
"og:description" content="{where hinduism meets the future}"
"og:type" content="profile"
"og:video:tag" content="kauai"
"og:video:tag" content="hawaii"
"og:video:tag" content="hindu"
"og:video:tag" content="hinduism"
"og:video:tag" content="siva"
# And many more tags, for a total of 39 tags…

But your method keeps only one value per key, so only one video tag survives:

description:{where hinduism meets the future}
image:https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg
site_name:YouTube
title:Kauai's Hindu Monastery
type:profile
url:https://www.youtube.com/user/kauaiaadheenam
video:tag:scriptural   

# The rest of the tags, all 38 preceding ones, are lost. "scriptural" was the last one
# and so stands as the final value for the key, because the loop
# effectively retains the single key "video:tag" and replaces its value 39 times,
# leaving us with only the value of the 39th tag.
# So we would need an ordered multi-dimensional array like

OG["site_name"]
# and the other top keys, then:
OG["video"]["tags"][1]  
OG["video"]["tags"][2]  

But I'm not sure we need tags for the particular use case in question, which is to create a robust "history" of web viewing with more detail. OTOH, since we are coding for "Oh God" data, we may as well get all the tags into the array. It could be useful to have this code in the toolbox for when we *do* want all the tags from the OG set… God does not like to see partial metadata, because S/He Knows All the Metadata.
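Something like the following might do it. This is only a sketch built on Thierry's loop and regex, not tested against the live page. The function name parseOpenGraph and the variable names are mine. To keep it short, I leave the key flat ("video:tag") rather than splitting it into ["video"]["tags"], and every key gets a numbered sub-array, even when it has only one value. Like the original regex, it assumes property comes before content and both use double quotes.

```livecode
function parseOpenGraph pHtml
   local tRx, tArray, tKey, tIndex
   local p1, p2, p3, p4
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into tRx
   repeat while matchChunk(pHtml, tRx, p1, p2, p3, p4)
      put char p1 to p2 of pHtml into tKey
      -- next free numeric index for this key (1 the first time we see it)
      put the number of lines of (the keys of tArray[tKey]) + 1 into tIndex
      put char p3 to p4 of pHtml into tArray[tKey][tIndex]
      -- drop everything up to the end of this match and search again
      delete char 1 to p4 of pHtml
   end repeat
   return tArray
end parseOpenGraph
```

Usage would be along these lines:

```livecode
put parseOpenGraph(tHead) into OG
put OG["site_name"][1] into tSiteName
-- how many video tags were found
put the number of lines of (the keys of OG["video:tag"]) into tTagCount
```

Repeated keys then accumulate in page order instead of overwriting each other.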

BR






On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:

    So, here is the code:
    
       local Rx, Rslt, _Html, OG
    
       put empty into Rslt
       put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
    
       get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
       put IT into Rx
    
       repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
          put  char p3 to p4 of _Html  into OG[  char p1 to p2 of _Html ]
          delete char 1 to p4 of _Html
       end repeat
    
    
    
    and you can test it this way:
    
       combine OG using return and ":"
       put OG into fld 1
    
    
    
    
    
    HTH and feel free to ask any question...
    
    Kind regards,
    
    Thierry


