Parsing (scraping) OpenGraph Tags from HTML HEAD
Sannyasin Brahmanathaswami
brahma at hindu.org
Sat Jul 29 16:16:23 EDT 2017
Given that:
a) trying to instantiate an XML tree from any given web page is likely to fail something like 85% of the time, because pages are simply never built to that strict a standard,
and
b) you want to extract the OpenGraph tags from the <head> of the document:
<meta property="og:site_name" content="YouTube">
<meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
<meta property="og:title" content="Kauai's Hindu Monastery">
<meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg">
<meta property="og:description" content="{where hinduism meets the future}">
c) you also cannot depend on the output being line delimited, because some CMS delivery "agents" will minify it to:
<meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">
Has anyone rolled up a parser/scraper for this? It looks like "idiot simple text extraction", but I'm trying to wrap my head around how to extract the name=value pairs and not finding anything easy: the attributes are space delimited, but there are also spaces inside the quoted strings. Maybe it is easier to target "<meta (.*?)>" using a regex with matchText, get ALL the meta tags in the HEAD, push them into an array, and then check each one: if the key contains "og:", we have an OpenGraph value.
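To make the idea concrete, here is a minimal sketch of that two-step approach: grab every meta tag first, then pull the quoted attribute pairs out of each one. It is written in Python purely for illustration, not in LiveCode. The function name extract_open_graph and both patterns are assumptions made up for this example, not part of any existing library, and the sketch assumes attribute values are quoted and contain no ">" character.

```python
import re

# Step 1: match every <meta ...> tag, whether or not the markup is line delimited.
META_TAG = re.compile(r"<meta\b([^>]*)>", re.IGNORECASE)

# Step 2: match name="value" or name='value' pairs inside one tag.
# Quoted values are captured whole, so spaces inside them are not a problem.
ATTRIBUTE = re.compile(r"""([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def extract_open_graph(html):
    """Return a dict mapping og:* property names to their content values."""
    tags = {}
    for meta in META_TAG.finditer(html):
        attrs = {}
        for name, double_quoted, single_quoted in ATTRIBUTE.findall(meta.group(1)):
            attrs[name.lower()] = double_quoted or single_quoted
        prop = attrs.get("property", "")
        # Keep only OpenGraph properties that actually carry a content value.
        if prop.startswith("og:") and "content" in attrs:
            tags[prop] = attrs["content"]
    return tags


if __name__ == "__main__":
    # Minified sample in the style of the tags quoted earlier in this message.
    sample = (
        '<meta property="og:site_name" content="YouTube">'
        '<meta property="og:title" content="Kauai\'s Hindu Monastery">'
        '<meta name="viewport" content="width=device-width">'
    )
    for key, value in extract_open_graph(sample).items():
        print(key, "=", value)
```

Splitting the work into two patterns sidesteps the space-delimiter problem: the outer match isolates each tag, and the inner match only ever sees quoted values as whole units. The same two patterns might carry over to a regex loop in LiveCode, though that is untested here.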
I'll sleep on this, but before I wake up and write 50 lines to get this done… I see the other thread on scraping pages generated by JS and suspect some wizard among us already has this done, which would save a bit of time here.
BR