Webby URL search and scrape question
Andreas Bergendal
andreas.bergendal at gmail.com
Sat Dec 21 07:11:13 EST 2024
Hi David,
I don't know how to obtain thousands of random URLs in a useful way, but I can contribute some ideas on how to process a URL once you have it. With LiveCode, you won't have to actually 'visit' the website in the sense of seeing it rendered in a browser, but you do need to load its content, which may still be slow, depending on the website.
To get started testing, I would do something like
put URL tURL into tHTMLcontent -- blocking fetch of the raw HTML
put char 1 to 9999 of tHTMLcontent into tHTMLcontent -- keep only the start, where the head lives
to cut down the memory burden as soon as possible. Then it would be easy to parse out the <title> and <html lang="..."> tags to find the title and language - although neither is mandatory, so you'd need alternative approaches when those tags are not present.
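As a first pass, something like this could pull both out (a rough sketch - the function names are mine, and the patterns assume reasonably well-formed HTML):

function extractTitle pHTML
   local tTitle
   if matchText(pHTML, "(?is)<title[^>]*>(.*?)</title>", tTitle) then
      return tTitle
   end if
   return empty
end extractTitle

function extractLang pHTML
   local tLang, tPattern
   -- build the pattern with the quote constant, since LiveCode strings can't contain a literal "
   put "(?is)<html[^>]*lang=[" & quote & "']?([a-zA-Z-]+)" into tPattern
   if matchText(pHTML, tPattern, tLang) then
      return tLang
   end if
   return empty
end extractLang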
Keywords are harder to determine, as they can appear (or not) in many different ways. These days, I would simply hook the stack up to an AI API and let that analyse the content, which could work for language categorisation as well.
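In outline that could be a plain POST from LiveCode (a hedged sketch - the endpoint, key and reply format below are placeholders, not a real API):

-- send the page text off for analysis; everything service-specific is a placeholder
set the httpHeaders to "Content-Type: text/plain" & return & "Authorization: Bearer YOUR_KEY"
post tHTMLcontent to URL "https://api.example.com/v1/analyse" -- placeholder endpoint
if the result is empty then
   put it into tResponse -- whatever the chosen API returns, e.g. language plus keywords
end if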
To scale up and automate the analysis of thousands of URLs, using 'put URL' would be too slow though, as it's a blocking command.
You'd have to set up parallel 'load URL' calls, have the results processed dynamically as they trickle in, and handle errors for the ones that for some reason don't load.
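A rough sketch of that pattern (assuming tURLList holds one URL per line; in practice you'd also want to cap how many loads run at once, and the "failures" field is just illustrative):

on startScraping pURLList
   repeat for each line tURL in pURLList
      load URL tURL with message "urlLoaded" -- non-blocking; the callback fires when done
   end repeat
end startScraping

on urlLoaded pURL, pStatus
   if pStatus is "cached" then
      put URL pURL into tHTML -- now served from the cache, so this doesn't block
      -- parse out title / language / keywords from tHTML here
      unload URL pURL -- free the cache entry when done
   else
      -- pStatus will be "error" or "timeout"; log it and move on
      put pURL & tab & pStatus & return after field "failures"
   end if
end urlLoaded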
Sounds like a fun project! :)
/Andreas
> On 21 Dec 2024, at 11:43, David Glasgow via use-livecode <use-livecode at lists.runrev.com> wrote:
>
> I am almost a complete novice with respect to HTML and web stuff, but
> suddenly find myself needing to change that. I have read around web
> scraping, but most material is much more ambitious than what I want to do,
> focusses on Python, and assumes slightly (?) shady business goals - like
> scraping data from your competitors' websites. My goal is pure
> research... and of course I would love to do it in LiveCode
>
> I want to work through a list of (pseudo) random URLs, harvest the page
> title and any keywords, and aggregate the results in a field/table. So I
> don't want the URLs to be found on the basis of content or top-level domain
> at all. Metaphorically: dipping into the WWW bran tub allowing all domain
> suffixes, pulling out a random page, checking it has English content,
> checking it has keywords, extracting the page title and any keywords, saving
> them to a table indexed by URL, and then moving on to the next lucky dip.
>
> The specifics I would appreciate advice on are
>
> 1/ how to get as close to a random sample of URLs as possible. There
> are websites that purport to take you to a random WWW page, but I
> couldn't work out how they pull that off - or indeed how random
> the destination really is. They also only serve one page at a time,
> whereas I want to do it a few thousand times on the bounce, ideally without
> visiting any page in the browsing sense.
>
> 2/ how might I check that the page at a URL a) is in English and b) contains keywords
>
> 3/ is it possible to extract the title and keywords from a URL using
> LiveCode 'remotely', or do I need to use a browser to visit?
>
> Thanks in advance for any advice or thoughts
>
> Cheers
>
> David G