Parsing (scraping) OpenGraph Tags from html HEAD
Sannyasin Brahmanathaswami
brahma at hindu.org
Wed Aug 2 11:54:33 EDT 2017
Responding on top
Jacque's method only gets us a list, not an array, so one ends up having to write more code to parse the list anyway; your method is more efficient.
"Not comfortable with RegEx": ha, right. But it is worth the effort to keep the little grey cells green! I will have to study the regex… things like ?ms are "brand new" to me.
Re: extracting the head first: I was under the impression that your repeat loop would otherwise have to work through the entire text of _HTML unnecessarily, and that extracting the head would reduce processing time. OTOH, Andre tells me that for this kind of operation even cell phones have CPUs more powerful than some desktop machines, so perhaps the time to loop through the entire HTML source is too trivial to consider at all.
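For anyone who wants to trim the source to the head without a regex, a minimal sketch might look like the following. It assumes _Html already holds the page source, as in Thierry's code; tEnd is a name made up here for illustration.

```livecode
-- Sketch only: keep the text up to and including </head>.
-- Assumes _Html already contains the page source; tEnd is a hypothetical name.
local tEnd
put offset( "</head>", _Html ) into tEnd
if tEnd > 0 then
   -- "</head>" is 7 chars long, so it ends at tEnd + 6
   put char 1 to tEnd + 6 of _Html into _Html
end if
```

If the page has no </head> tag, offset returns 0 and _Html is left untouched.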
Thanks for the effort you put into this. We are adding OG tags to all the media on our web site (eventually) and our apps will need to parse that out in various contexts.
BR
On 8/1/17, 10:07 PM, "use-livecode on behalf of Thierry Douez via use-livecode" <use-livecode-bounces at lists.runrev.com on behalf of use-livecode at lists.runrev.com> wrote:
2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:
Hi Brahmanathaswami,
> Thanks Thierry
>
> though I'm not yet sure that using regEx is better than using Jacque's
> method
>
These are 2 different ways,
but with the regex one you have the exact key and value of each tag;
nothing more to do.
> Either way it would seem prudent to extract the head first before processing
>
Mmm, don't really see why, but I've added a line of code for this too
below.
>
> Using Jacque's method just gets the list,
> and we need to do more coding to get the array we need.
>
> But your method can only handle 1 tag.
>
I was aware of that, but I didn't know what you wanted to achieve, so I
left it for the reader.
However, this has nothing to do with the regex; it comes from the code
inside the repeat loop.
Here is another way to do it, changing only *1* line of code inside the loop
with the same regex as before:
-- to satisfy BR's wish, but not necessary
-- erase </head> and everything after it
-- (with the m flag, a lazy .*?$ would stop at the end of the current line)
put replaceText( _Html, "(?s)</head>.*", empty) into _Html
repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
delete char 1 to p4 of _Html
end repeat
delete last char of Rslt -- extra cr
put Rslt into fld 1
answer "Got " & the number of lines of Rslt & " og: meta tags!"
To build a multi-dimensional array from the extraction,
a bit more work inside the repeat loop will be needed,
but the extraction part is still valid.
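As a rough illustration of that extra work, here is one possible sketch, not necessarily what Thierry had in mind. It assumes nested OG properties use ":" as a separator (for example "image:width") and that the regex and page source are passed in. The function name ogArrayFromHtml, the parameter and variable names, and the "value" sub-key are all invented for this example.

```livecode
-- Sketch only: build a nested array from og: tags.
-- pHtml is the page source, pRx the same regex as in the loop above.
-- ogArrayFromHtml, tOG, tKey, tValue and the "value" sub-key are hypothetical.
function ogArrayFromHtml pHtml, pRx
   local tOG, tKey, tValue, p1, p2, p3, p4
   set the itemDelimiter to ":"
   repeat while matchChunk( pHtml, pRx, p1, p2, p3, p4 )
      put char p1 to p2 of pHtml into tKey
      put char p3 to p4 of pHtml into tValue
      if the number of items of tKey > 1 then
         -- e.g. "image:width" is stored in tOG["image"]["width"]
         put tValue into tOG[ item 1 of tKey ][ item 2 to -1 of tKey ]
      else
         -- a plain key such as "image" goes under a "value" sub-key, so it
         -- does not collide with nested keys like "image:width"
         put tValue into tOG[ tKey ][ "value" ]
      end if
      delete char 1 to p4 of pHtml
   end repeat
   return tOG
end function
```

It could be called as put ogArrayFromHtml( _Html, Rx ) into OG. A tag that appears more than once simply overwrites the earlier value in this sketch.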
Finally, if you are not at ease with regex, go with Jacque's way and
everything will be fine.
Fundamentally, there is not much difference between the 2 ways.
Kind regards,
Thierry
> On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez" wrote:
>
> So, here is the code:
>
> local Rx, Rslt, _Html, OG
>
> put empty into Rslt
> put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html
>
> get "(?ms)<meta\s+property=\x{22}og:(.+?)\x{22}\s+content=\x{22}(.+?)\x{22}>"
> put IT into Rx
>
> repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
> put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
> delete char 1 to p4 of _Html
> end repeat
>
>
>
> and you can test it this way:
>
> combine OG using return and ":"
> put OG into fld 1
>
>
>
> HTH and feel free to ask any question...
>
> Kind regards,
>
> Thierry
>
--
------------------------------------------------
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage