Parsing (scraping) OpenGraph Tags from html HEAD

Sannyasin Brahmanathaswami brahma at hindu.org
Wed Aug 2 11:54:33 EDT 2017


Responding on top

Jacque's method only gets us a list, not an array, so one ends up having to write more code to parse the list anyway; your method is more efficient.
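
For reference, a minimal sketch of that extra parsing step. It assumes the list holds one tag per line in the form key, tab, value; the exact shape of Jacque's list is not shown in this thread, and the function name listToArray is made up for illustration:

```livecode
-- Hypothetical helper: turn "key <tab> value" lines into an array.
-- Assumes one tag per line and a single tab between key and value.
function listToArray pList
   local tArray
   put pList into tArray
   -- each line becomes one element: text before the tab is the key,
   -- the rest of the line is the value
   split tArray by return and tab
   return tArray
end listToArray
```

Called as listToArray(Rslt) on such a list, the result could then be read with something like tOG["title"]. A key that occurs on more than one line cannot keep all of its values this way, since array keys are unique.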

"Not comfortable with RegEx": ha, right. But it is worth the effort to keep the little grey cells green! I will have to study the regex; things like (?ms) are brand new to me.


Re: extracting the head first: I was under the impression that your repeat loop would have to work through the entire text of _HTML unnecessarily, and that extracting the head would reduce processing time. OTOH, Andre tells me that for this kind of operation even cell phones have CPUs more powerful than some desktop machines, so perhaps the time to loop through the entire HTML source is too trivial to consider at all.
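
If trimming the source first is still wanted, one possible sketch uses offset() instead of a regex. The function name headOnly is hypothetical, and the text is simply returned unchanged when no closing </head> tag is found:

```livecode
-- Hypothetical helper: keep only the text before the closing </head> tag.
function headOnly pHtml
   local tEnd
   -- offset() returns 0 when the string is not found
   put offset("</head>", pHtml) into tEnd
   if tEnd is 0 then return pHtml
   return char 1 to (tEnd - 1) of pHtml
end headOnly
```

It would be used as: put headOnly(_Html) into _Html, before the repeat loop. Whether this saves any measurable time is untested here.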

Thanks for the effort you put into this. We are adding OG tags to all the media on our web site (eventually) and our apps will need to parse that out in various contexts.

BR



 

On 8/1/17, 10:07 PM, "use-livecode on behalf of Thierry Douez via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:

    2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:

    Hi Brahmanathaswami,

    > Thanks Thierry
    >
    > though I'm not yet sure whether using regEx is better than using Jacque's
    > method
    >

    Those are 2 different ways,
    but with the regex one, you have the exact key and value of each tag,
    nothing more to do.
    
    
    > Either way it would seem prudent to extract the head first before processing
    >

    Mmm, I don't really see why, but I've added a line of code for this too
    below.

    >
    > Using Jacque's method just gets the list,
    > and we need to do more coding to get the array we need.
    >
    > But your method can only handle 1 tag.
    >

    I was aware of that but didn't know what you wanted to achieve, so I
    left it for the reader.
    However, this has nothing to do with the regex but with the code inside the
    repeat loop.
    
    
    Here is another way to do it, changing only *1* line of code inside the loop
    with the same regex as before:
    
    
    
       -- to please BR's wishes, but not necessary
       -- erase everything after </head>
       -- (?s) lets . match line breaks, so the greedy .* runs to the end of the text
       put replaceText( _Html, "(?s)</head>.*", empty) into _Html

       repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
          -- one line per tag: key, tab, value
          put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
          delete char 1 to p4 of _Html
       end repeat
       delete last char of Rslt -- extra cr

       put Rslt into fld 1
       answer "Got " & the number of lines of Rslt & " og: meta tags!"
    
    
    To build a multi-dimensional array from the extraction,
    a bit more work inside the repeat loop will be needed,
    but the extraction part is still valid.
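
    As a rough sketch of that extra work, assuming the goal is to keep every
    value of a repeated tag (for example several og:image entries) under a
    numeric sub-key. The function name ogTagsToArray and the variables tKey
    and tCount are introduced here for illustration only; pRx is the same
    regex as before:

    ```livecode
    -- Hypothetical helper: collect og: tags into tOG[key][n].
    function ogTagsToArray pHtml, pRx
       local tOG, tKey, tCount, p1, p2, p3, p4
       repeat while matchChunk( pHtml, pRx, p1,p2,p3,p4 )
          put char p1 to p2 of pHtml into tKey
          -- count the values already stored for this key (0 the first time)
          put the number of lines of the keys of tOG[tKey] into tCount
          put char p3 to p4 of pHtml into tOG[tKey][tCount + 1]
          delete char 1 to p4 of pHtml
       end repeat
       return tOG
    end ogTagsToArray
    ```

    With this layout, the first image would be at tOG["image"][1], the
    second at tOG["image"][2], and so on.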
    
    Finally, if you are not at ease with regex, go with Jacque's way and
    everything will be fine.
    Fundamentally, there is not much difference between the 2 ways.
    
    
    Kind regards,
    
    Thierry
    
    
    
    
    
    
    > On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez" wrote:
    >
    >     So, here is the code:
    >
    >        local Rx, Rslt, _Html, OG
    >
    >        put empty into Rslt
    >        put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
    >
    >        get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
    >        put IT into Rx
    >
    >        repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
    >           put  char p3 to p4 of _Html  into OG[  char p1 to p2 of _Html ]
    >           delete char 1 to p4 of _Html
    >        end repeat
    >
    >
    >
    >     and you can test it this way:
    >
    >        combine OG using return and ":"
    >        put OG into fld 1
    >
    >
    >
    >     HTH and feel free to ask any question...
    >
    >     Kind regards,
    >
    >     Thierry
    >
    
    
    -- 
    ------------------------------------------------
    Thierry Douez - sunny-tdz.com
    sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage


