Parsing (scraping) OpenGraph Tags from html HEAD
Thierry Douez
th.douez at gmail.com
Wed Aug 2 12:22:56 EDT 2017
2017-08-02 17:54 GMT+02:00 Sannyasin Brahmanathaswami via use-livecode <
use-livecode at lists.runrev.com>:
> Responding on top
>
> Jacque's method only gets us a list, not an array, so one ends up having
> to write more code to parse the list anyway; your method is more efficient.
>
> "not comfortable with RegEx" Ha, right. But it is worth the effort to keep
> the little grey cells green! I will have to study the regEx… things like ?ms
> are "brand new" to me.
>
So, you win your first Regex training :)
(?ms) sets two regex options.
m means multi-line: ^ and $ also match at the start and end of each line.
s means the dot ( '.' ) can also match a return/cr/lf char.
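
To see what the s option changes, here is a small sketch you can paste into
a button script. The sample text and the variable names (tText, tValue,
tReport) are made up for illustration, and the pattern is a simplified one,
not the og: regex from this thread.

```livecode
on mouseUp
   local tText, tValue, tReport
   -- made-up sample: the element content spans two lines
   put "<title>first" & return & "second</title>" into tText

   -- no s option: the dot stops at the line break, so the match should fail
   put "without (?s): " & matchText(tText, "<title>(.+?)</title>", tValue) & return into tReport

   -- s option: the dot can cross the line break, so the match should succeed
   put "with (?s): " & matchText(tText, "(?s)<title>(.+?)</title>", tValue) after tReport

   answer tReport
end mouseUp
```

The m option makes no difference in this sample because the pattern uses
neither ^ nor $.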
>
>
> re: extracting the head first: I was under the impression your repeat loop
> would have to work through the entire text of _HTML unnecessarily and that
> extracting the heads would reduce processing time.
Well, you are right,
but only for the last pass, when the regex tries to match after the last
valid pattern. What is most costly is the delete inside the loop, so working
only with the <head>...</head> of your html might be more efficient in this
case. But this is more of an LC thing.
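
One way to cut the text down before the loop, without any regex, is to use
offset. This is only a sketch: headOnly is a hypothetical helper name, and it
assumes the page has a single <head> element that comes before any <header>
tag. If no head is found, it hands back the text unchanged.

```livecode
function headOnly pHtml
   local tStart, tEnd
   -- offset returns 0 when the string is not found
   put offset("<head", pHtml) into tStart
   put offset("</head>", pHtml) into tEnd
   if tStart > 0 and tEnd > tStart then
      -- "</head>" is 7 chars long, so its last char is at tEnd + 6
      return char tStart to (tEnd + 6) of pHtml
   end if
   -- no usable head found: return the text as it came in
   return pHtml
end headOnly
```

It could be called just before the repeat loop, for example:
put headOnly(_Html) into _Html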
> OTOH, Andre tells me that for this kind of operation, even cell phones
> have CPU's that are more powerful than some desktop machines and so perhaps
> the time to loop through the entire html source is too trivial to consider
> at all.
>
Yep, as I said, it is only after the last match that the regex scans through
to the end of the html, and only once. As for robustness, restricting the
regex to the <head> part is a good idea, as you never know what some html
might contain in the future...
>
> Thanks for the effort you put into this.
You're welcome.
Kind regards,
Thierry
> We are adding OG tags to all the media on our web site (eventually) and our
> apps will need to parse that out in various contexts.
>
> BR
>
>
>
>
>
> On 8/1/17, 10:07 PM, "use-livecode on behalf of Thierry Douez via
> use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of
> use-livecode at lists.runrev.com> wrote:
>
> 2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:
>
>
> Hi Brahmanathaswami,
>
>
> > Thanks Thierry
> >
> > though I'm not yet sure whether using regEx is better than using Jacque's
> > method
> >
> >
>
> These are 2 different ways,
> but with the regex one you get the exact key and value of each tag,
> with nothing more to do.
>
>
> > Either way it would seem prudent to extract the head first before
> > processing
> >
>
> Mmm, don't really see why, but I've added a line of code for this too
> below.
>
>
>
> >
> > Using Jacque's method just gets the list,
> > and we need to do more coding to get the array we need.
> >
> > But your method can only handle 1 tag.
> >
>
>
> I was aware of that but didn't know what you wanted to achieve, so I
> left it for the reader.
> However, this has nothing to do with the regex, only with the code inside
> the repeat loop.
>
>
> Here is another way to do it, changing only *1* line of code inside the
> loop, with the same regex as before:
>
>
>
>     -- to please BR wishes, but not necessary
>     -- erase everything from </head> onward
>     -- (?s) lets the greedy .* run across line breaks to the end of the text
>     put replaceText( _Html, "(?s)</head>.*", empty) into _Html
>
>     repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
>        put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
>        delete char 1 to p4 of _Html
>     end repeat
>     delete last char of Rslt -- extra cr
>
>     put Rslt into fld 1
>     answer "Got " & the number of lines of Rslt & " og: meta tags!"
>
>
> To build a multi-dimensional array from the extraction,
> a bit more work inside the repeat loop will be needed,
> but the extraction part stays the same.
>
>
>
>
> Finally, if you are not at ease with regex, go with Jacque's way and
> everything will be fine.
> There is fundamentally not much difference between the 2 ways.
>
>
> Kind regards,
>
> Thierry
>
>
>
>
>
>
> > On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez" wrote:
> >
> >     So, here is the code:
> >
> >        local Rx, Rslt, _Html, OG
> >
> >        put empty into Rslt
> >        put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
> >
> >        get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
> >        put IT into Rx
> >
> >        repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
> >           put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
> >           delete char 1 to p4 of _Html
> >        end repeat
> >
> >
> >
> > and you can test it this way:
> >
> > combine OG using return and ":"
> > put OG into fld 1
> >
> >
> >
> > HTH and feel free to ask any question...
> >
> > Kind regards,
> >
> > Thierry
> >
>
>
> --
> ------------------------------------------------
> Thierry Douez - sunny-tdz.com
> sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage
> _______________________________________________
> use-livecode mailing list
> use-livecode at lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
--
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage