Parsing (scraping) OpenGraph Tags from html HEAD
Thierry Douez
th.douez at gmail.com
Mon Jul 31 06:30:48 EDT 2017
2017-07-29 22:16 GMT+02:00 Sannyasin Brahmanathaswami:
> you want to extract from the <head> of the document the openGraph tags
>
> <meta property="og:site_name" content="YouTube">
> <meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
> <meta property="og:title" content="Kauai's Hindu Monastery">
> <meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg">
> <meta property="og:description" content="{where hinduism meets the future}">
>
> c) you also cannot depend on the output being line delimited, because some
> CMS's delivery "agents" will minimize this to
>
> <meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">
>
> Has anyone rolled up a parser/scraper for this?
> Looks like "idiot simple text extraction"
Hi,
Here is a quickly written piece of code, tested only on your URL.
I wrote this regex based on the data you provided in your email.
>
> I see the other thread on scraping pages generated by JS and suspect
> perhaps some wizard among us already has this done…would save a bit of time
> here.
>
> BR
>
Every time you see any kind of scraping/search/extraction/transformation
done in JS, you can be sure
it's possible to do it in LiveCode.
So, here is the code:
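local Rx, Rslt, _Html, OG
local p1, p2, p3, p4
put empty into Rslt
put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
-- \x{22} stands for a double quote; the s flag in (?ms) lets . match line breaks
put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into Rx
-- p1,p2 delimit the property name; p3,p4 delimit the content value
repeat while matchChunk( _Html, Rx, p1, p2, p3, p4 )
   put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
   -- drop everything up to the end of this match so the next pass finds the next tag
   delete char 1 to p4 of _Html
end repeat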
You can test it this way:
combine OG using return and ":"
put OG into fld 1
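
The same loop can also be wrapped in a function, so it runs on any string rather than on a live URL. This is only a minimal sketch: the handler name ogTagsFromHtml and the hand-built sample string are assumptions made for illustration, while the regex is the one used above. Like the original, it assumes property comes before content, that both values are double-quoted, and that the tag closes right after the content value.

```livecode
-- Hypothetical helper: returns an array keyed by the og: property name.
function ogTagsFromHtml pHtml
   local tRx, tTags, p1, p2, p3, p4
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into tRx
   repeat while matchChunk(pHtml, tRx, p1, p2, p3, p4)
      -- p1,p2 delimit the property name; p3,p4 delimit the content value
      put char p3 to p4 of pHtml into tTags[char p1 to p2 of pHtml]
      -- pHtml is a local copy, so trimming it does not touch the caller's data
      delete char 1 to p4 of pHtml
   end repeat
   return tTags
end ogTagsFromHtml

on mouseUp
   local tHtml, tTags
   -- Minified sample with no line breaks; single quotes are swapped
   -- for double quotes on the next line to keep the literal readable.
   put "<meta property='og:site_name' content='YouTube'><meta property='og:url' content='https://www.youtube.com/user/kauaiaadheenam'>" into tHtml
   replace "'" with quote in tHtml
   put ogTagsFromHtml(tHtml) into tTags
   combine tTags using return and ":"
   put tTags into fld 1
end mouseUp
```

Because each pass deletes the text up to the end of the previous match, the loop does not depend on the tags being on separate lines.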
HTH, and feel free to ask any questions...
Kind regards,
Thierry
--
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage