Parsing (scraping) OpenGraph Tags from html HEAD

Thierry Douez th.douez at gmail.com
Wed Aug 2 04:06:55 EDT 2017


2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:


Hi Brahmanathaswami,

> Thanks Thierry
>
> though I'm yet sure when using regEx this is better than using Jacque's
> method

Those are two different approaches,
but with the regex one you get the exact key and value of each tag,
with nothing more to do.


> Either way it would seem prudent to extract the head first before processing

Hmm, I don't really see why, but I've added a line of code for this
below.

> Using jacques method just gets the list..
> and we need to do more coding to get the array we need.
>
> But your method can only handle 1 tag.


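I was aware of that, but I didn't know what you wanted to achieve, so I
left it to the reader.
However, this has nothing to do with the regex; it comes from the code inside
the repeat loop, which stores each value in an array keyed by the tag name,
so a repeated key overwrites the earlier value.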


Here is another way to do it, changing only *1* line of code inside the loop
with the same regex as before:



  -- as BR requested, though not strictly necessary:
  -- erase everything from </head> onward
  -- (no m flag here: with it, a lazy .*?$ stops at the end of the first line)
   put replaceText( _Html, "(?s)</head>.*", empty) into _Html

   repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
      put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
      delete char 1 to p4 of _Html
   end repeat
   delete last char of Rslt -- extra cr

   put Rslt into fld 1
   answer "Got " & the number of lines of Rslt & " og: meta tags!"


To build a multi-dimensional array from the extracted tags,
a bit more work is needed inside the repeat loop,
but the extraction part stays the same.
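
One possible way to build such an array, as a minimal sketch only. It assumes
Rslt holds the tab-separated "key <tab> value" lines produced by the loop
above; the handler name ogListToArray and its variables are made up for
illustration. Repeated keys (several og:image tags, for example) are kept
under numeric sub-keys instead of overwriting each other.

```livecode
-- hypothetical helper: turns "key <tab> value" lines into tOG[key][n]
function ogListToArray pList
   local tOG, tKey, tCount, tLine
   set the itemDelimiter to tab
   repeat for each line tLine in pList
      if tLine is empty then next repeat
      put item 1 of tLine into tKey
      -- how many values are already stored under this key (0 if none)
      put the number of lines of (the keys of tOG[tKey]) into tCount
      -- item 2 to -1 keeps any tab that might appear inside the value
      put item 2 to -1 of tLine into tOG[tKey][tCount + 1]
   end repeat
   return tOG
end ogListToArray
```

It could then be used like this:

   put ogListToArray(Rslt) into OG
   put OG["image"][1] into fld 1 -- first og:image value, if the page has one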

Finally, if you are not at ease with regex, go with Jacque's way and
everything will be fine.
Fundamentally, there is not much difference between the two approaches.


Kind regards,

Thierry






> On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez wrote:
>
>     So, here is the code:
>
>        local Rx, Rslt, _Html, OG
>
>        put empty into Rslt
>        put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
>
>        get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
>        put IT into Rx
>
>        repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
>           put  char p3 to p4 of _Html  into OG[  char p1 to p2 of _Html ]
>           delete char 1 to p4 of _Html
>        end repeat
>
>
>
>     and you can test it this way:
>
>        combine OG using return and ":"
>        put OG into fld 1
>
>
>
>     HTH and feel free to ask any question...
>
>     Kind regards,
>
>     Thierry
>


-- 
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage


