Parsing (scraping) OpenGraph Tags from html HEAD

Thierry Douez th.douez at gmail.com
Mon Jul 31 06:30:48 EDT 2017


2017-07-29 22:16 GMT+02:00 Sannyasin Brahmanathaswami:


> you want to extract from the <head> of the document the openGraph tags
>
> <meta property="og:site_name" content="YouTube">
> <meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
> <meta property="og:title" content="Kauai's Hindu Monastery">
> <meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg">
> <meta property="og:description" content="{where hinduism meets the future}">
>
> c) you also cannot depend on the output being line delimited, because some
> CMS's delivery "agents" will minimize this to
>
> <meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAAAAAAAAAI/AAAAAAAAAAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xffffff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">
>
> Has anyone rolled up a parser/scraper for this?

Looks like "idiot simple text extraction"



Hi,

Here is a quickly written piece of code, tested only on your URL.
I wrote this regex based on the data you provided in your email.

> I see the other thread on scraping pages generated by JS and suspect
> perhaps some wizard among us already has this done…would save a bit of time
> here.
>
> BR

Every time you see any kind of scraping/search/extraction/transformation
in JS, you can be sure it's possible to do it in LiveCode.

So, here is the code:

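   local Rx, _Html, OG, p1, p2, p3, p4

   -- fetch the page source
   put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html

   -- \x{22} is a double quote; (?ms) lets a match span line breaks
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into Rx

   -- p1,p2 delimit the tag name; p3,p4 delimit its content
   repeat while matchChunk( _Html, Rx, p1, p2, p3, p4 )
      put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
      -- drop everything up to the end of this match, then search again
      delete char 1 to p4 of _Html
   end repeat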



and you can test it this way:

   combine OG using return and ":"
   put OG into fld 1
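
If it helps, the same loop can be wrapped in a function so it can be
tried on a string instead of a live URL. This is only a sketch: the
names ogTagsFromHtml and testOgTags are made up for illustration, and
the sample markup is a shortened copy of the minified example quoted
above. Like the original regex, it assumes double-quoted attributes
with property appearing directly before content.

```livecode
-- Hypothetical helper: returns an array keyed by the og: tag name
function ogTagsFromHtml pHtml
   local tRx, tTags, p1, p2, p3, p4
   put "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>" into tRx
   -- pHtml is passed by value, so trimming it here leaves the caller's copy intact
   repeat while matchChunk(pHtml, tRx, p1, p2, p3, p4)
      put char p3 to p4 of pHtml into tTags[char p1 to p2 of pHtml]
      delete char 1 to p4 of pHtml
   end repeat
   return tTags
end ogTagsFromHtml

-- Hypothetical test handler: no line breaks between the two tags
on testOgTags
   local tHtml, tTags
   -- single quotes stand in for double quotes while building the string
   put "<meta property='og:site_name' content='YouTube'>" & \
         "<meta property='og:url' content='https://www.youtube.com/user/kauaiaadheenam'>" into tHtml
   replace "'" with quote in tHtml
   put ogTagsFromHtml(tHtml) into tTags
   combine tTags using return and ":"
   put tTags -- shows the name:value lines in the message box
end testOgTags
```

Because the loop never relies on line endings, the minified input
should be handled the same way as the line-delimited one.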





HTH, and feel free to ask any questions.

Kind regards,

Thierry

-- 
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage


