Parsing (scraping) OpenGraph Tags from html HEAD
Sannyasin Brahmanathaswami
brahma at hindu.org
Wed Aug 2 00:45:05 EDT 2017
Thanks Thierry,
though I'm not yet sure that using regex is better than using Jacque's method:
on parseHeader pData
   set the lineDel to "<meta property="
   repeat for each line l in pData
      if l contains "og:" then put char 1 to offset(">",l)-1 of l & cr after tList
   end repeat
   -- do something with tList
end parseHeader
Either way it would seem prudent to extract the head first before processing
put the htmlText of widget "youtubes" into _HTML # interesting convention of underscore usage for var declaration
put char ( offset("<head>",_HTML)) to ( ( offset("</head>",_HTML))+6) of _html into tHead
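A slightly more defensive version of the head extraction could look like the sketch below. The function name extractHead is made up for illustration; it assumes the page has a single head element and returns empty when either tag is missing, rather than a stray chunk of the page. It searches for "<head" without the closing bracket so that a head tag carrying attributes is still found.

```livecode
-- Hypothetical helper (not from the thread): returns the <head>...</head>
-- portion of pHtml, or empty if the tags cannot be found.
function extractHead pHtml
   local tStart, tEnd
   -- offset() returns 0 when the string is not found
   put offset("<head", pHtml) into tStart
   put offset("</head>", pHtml) into tEnd
   if tStart is 0 or tEnd is 0 or tEnd < tStart then return empty
   -- "</head>" is 7 chars long, so the closing tag ends at tEnd + 6
   return char tStart to (tEnd + 6) of pHtml
end extractHead
```

It would be called as: put extractHead(_HTML) into tHead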
Using Jacque's method just gets the list, and we need to do more coding to get the array we need.
It returns:
"og:site_name" content="YouTube"
"og:url" content="https://www.youtube.com/user/kauaiaadheenam"
"og:title" content="Kauai's Hindu Monastery"
"og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"
"og:description" content="{where hinduism meets the future}"
"og:type" content="profile"
"og:video:tag" content="kauai"
"og:video:tag" content="hawaii"
"og:video:tag" content="hindu"
"og:video:tag" content="hinduism"
"og:video:tag" content="siva"
# And many more tags, for a total of 39 tags…
But your method can only handle one "og:video:tag" entry. It returns:
description:{where hinduism meets the future}
image:https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg
site_name:YouTube
title:Kauai's Hindu Monastery
type:profile
url:https://www.youtube.com/user/kauaiaadheenam
video:tag:scriptural
# The rest of the tags, all 38 preceding ones, are lost -- "scriptural" was the last one
# and so stands as the final value for the key, because the loop
# effectively retains the single key "video:tag" and replaces its value 39 times,
# leaving us with only the value of the 39th tag.
# so we would need an ordered multi-dimensional array like
OG["site_name"]
# and the other top keys, then:
OG["video"]["tags"][1]
OG["video"]["tags"][2]
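Here is a rough sketch of one way to keep every repeated tag, reusing Thierry's regex and loop unchanged. The function name ogTagsToArray and the tCount bookkeeping array are my own additions, not anything from the thread. To keep it short, it does not split "video:tag" into nested "video" and "tags" keys as outlined above; instead every property gets a numbered sub-array, so a property that occurs once has its value under [1].

```livecode
-- Hypothetical helper: collects all og: properties from pHtml into an array
-- shaped like tOG["site_name"][1], tOG["video:tag"][1], tOG["video:tag"][2], ...
function ogTagsToArray pHtml
   local tRx, tOG, tCount, tKey, p1, p2, p3, p4
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into tRx
   repeat while matchChunk(pHtml, tRx, p1, p2, p3, p4)
      put char p1 to p2 of pHtml into tKey
      -- count how many times this property has been seen so far
      add 1 to tCount[tKey]
      -- store the value under the next index instead of overwriting
      put char p3 to p4 of pHtml into tOG[tKey][tCount[tKey]]
      -- drop everything up to the end of this match and search again
      delete char 1 to p4 of pHtml
   end repeat
   return tOG
end ogTagsToArray
```

Usage would be something like:
put ogTagsToArray(tHead) into OG
put OG["video:tag"][2] into tSecondTag
Since the values are appended in the order the regex finds them, the numeric keys follow the order of the tags in the page.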
But I'm not sure we need tags for the particular use case in question which is to create a robust "history" of web viewing with more detail. OTOH, since we are coding for "Oh God" data, we may as well get all the tags into the array. This could be useful later to have this code in the toolbox for when we *do* want all the tags from the OG set… God does not like to see partial metadata, because S/He Knows All the Metadata.
BR
On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:
So, here is the code:
local Rx, Rslt, _Html, OG
put empty into Rslt
put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
put IT into Rx
repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
   put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
   delete char 1 to p4 of _Html
end repeat
and you can test it this way:
combine OG using return and ":"
put OG into fld 1
HTH and feel free to ask any question...
Kind regards,
Thierry