Parsing (scraping) OpenGraph Tags from html HEAD

Thierry Douez th.douez at gmail.com
Wed Aug 2 12:22:56 EDT 2017


2017-08-02 17:54 GMT+02:00 Sannyasin Brahmanathaswami via use-livecode <
use-livecode at lists.runrev.com>:

> Responding on top
>
> Jacque's method only gets us a list, not an array, so one ends up having
> to write more code to parse the list anyway; your method is more efficient.
>
> "not comfortable with RegEx": ha, right, but it is worth the effort to keep
> the little grey cells green! I will have to study the regex; things like ?ms
> are "brand new" to me.
>

So, you win your first Regex training :)

(?ms) sets two regex options:

m means multi-line: ^ and $ also match at the start and end of each line.
s means the dot ( '.' ) also matches a return/cr/lf char.
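
To see the two options at work, here is a small sketch. It is only an
illustration and not part of the code in this thread: the handler name
testRegexOptions and its variables are made up, and it assumes cr is the
line break inside the text.

```livecode
-- Sketch only: handler and variable names are hypothetical.
on testRegexOptions
   local tText, tPlain, tMulti, tDotAll

   put "first line" & cr & "second line" into tText

   -- no options: ^ and $ anchor to the whole text only,
   -- so a pattern for the second line alone does not match
   put matchText( tText, "^second line$" ) into tPlain

   -- (?m): ^ and $ also anchor at the start and end of each line
   put matchText( tText, "(?m)^second line$" ) into tMulti

   -- (?s): the dot may cross the line break between the two lines
   put matchText( tText, "(?s)first.+second" ) into tDotAll

   answer "plain: " & tPlain & cr & "m: " & tMulti & cr & "s: " & tDotAll
end testRegexOptions
```

Note that m only matters when the pattern uses ^ or $, and s only matters
when it uses the dot.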


>
>
> re: extracting the head first: I was under the impression your repeat loop
> would have to work through the entire text of _HTML unnecessarily and that
> extracting the heads would reduce processing time.



Well, you are right, but only for the final attempt: after the last valid
match, the regex scans the rest of the html once before failing.

What costs the most is the delete inside the loop, so working only with the
<head>...</head> part of your html might be more efficient in this case. But
this is more of an LC thing.
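
If you want to cut the html down before looping, one possible sketch uses
offset instead of a regex. The function name headOnly and its variables are
hypothetical, and it assumes the page contains a literal </head> tag.

```livecode
-- Sketch only: headOnly, pHtml and tEnd are hypothetical names.
function headOnly pHtml
   local tEnd

   -- position of the first char of the closing tag, 0 if not found
   put offset( "</head>", pHtml ) into tEnd

   if tEnd > 0 then
      -- keep everything before the closing tag
      return char 1 to (tEnd - 1) of pHtml
   end if

   -- no </head> found: hand the text back unchanged
   return pHtml
end function
```

You would call it once before the repeat loop, e.g.
put headOnly( _Html ) into _Html
so that every later delete works on a much shorter string.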



> OTOH, Andre tells me that for this kind of operation, even cell phones
> have CPUs that are more powerful than some desktop machines, and so perhaps
> the time to loop through the entire html source is too trivial to consider
> at all.
>

Yep, as I said, only after the last match will the regex scan through to the
end of the html, and only once. As for robustness, restricting the
regex to the <head> part is a good idea, as you never know what some html
might contain in the future...


>
> Thanks for the effort you put into this.


You're welcome.

Kind regards,

Thierry



> We are adding OG tags to all the media on our web site (eventually) and our
> apps will need to parse that out in various contexts.
>
> BR
>
>
>
>
>
> On 8/1/17, 10:07 PM, "use-livecode on behalf of Thierry Douez via
> use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of
> use-livecode at lists.runrev.com> wrote:
>
>     2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:
>
>
>     Hi Brahmanathaswami,
>>
>     Thanks Thierry
>     >
>     > though I'm not yet sure whether using regEx is better than using
>     > Jacque's method
>     >
>
>     Those are 2 different ways,
>     but with the regex one you get the exact key and value of each tag,
>     with nothing more to do.
>
>
>     Either way it would seem prudent to extract the head first before processing
>     >
>
>     Mmm, don't really see why, but I've added a line of code for this too
>     below.
>
>>
>     >
>     > Using Jacque's method just gets the list,
>     > and we need to do more coding to get the array we need.
>     >
>     > But your method can only handle 1 tag.
>     >
>
>
>     I was aware of that, but I didn't know what you wanted to achieve, so I
>     left it for the reader.
>     However, this has nothing to do with the regex, only with the code inside
>     the repeat loop.
>
>
>     Here is another way to do it, changing only *1* line of code inside the
>     loop, with the same regex as before:
>
>
>
>       -- to please BR's wishes, but not necessary:
>       -- erase </head> and everything after it
>       -- ((?s) only: with the m option, a lazy .*?$ stops at the first line end)
>        put replaceText( _Html, "(?s)</head>.*", empty) into _Html
>
>        repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
>           put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
>           delete char 1 to p4 of _Html
>        end repeat
>        delete last char of Rslt -- extra cr
>
>        put Rslt into fld 1
>        answer "Got " & the number of lines of Rslt & " og: meta tags!"
>
>
>     Building a multi-dimensional array after the extraction
>     will need a bit more work inside the repeat loop,
>     but the extraction part is still valid.
>>
>>
>     Finally, if you are not at ease with regex, go with Jacque's way and
>     everything will be fine.
>     There is fundamentally not much difference between the 2 ways.
>
>
>     Kind regards,
>
>     Thierry
>
>
>
>
>
>
>     > On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez wrote:
>     >
>     >     So, here is the code:
>     >
>     >        local Rx, Rslt, _Html, OG
>     >
>     >        put empty into Rslt
>     >        put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
>     >
>     >        get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
>     >        put IT into Rx
>     >
>     >        repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
>     >           put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
>     >           delete char 1 to p4 of _Html
>     >        end repeat
>     >
>     >
>     >
>     >     and you can test it this way:
>     >
>     >        combine OG using return and ":"
>     >        put OG into fld 1
>     >
>     >
>     >
>     >     HTH and feel free to ask any question...
>     >
>     >     Kind regards,
>     >
>     >     Thierry
>     >
>
>
>     --
>     ------------------------------------------------
>     Thierry Douez - sunny-tdz.com
>     sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage
>     _______________________________________________
>     use-livecode mailing list
>     use-livecode at lists.runrev.com
>     Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
>     http://lists.runrev.com/mailman/listinfo/use-livecode
>
>



-- 
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage


