Webby URL search and scrape question

Sat Dec 21 05:43:05 EST 2024

I am almost a complete novice with respect to HTML and web stuff, but
suddenly find myself needing to change that.  I have read around web
scraping but most stuff is much more ambitious than what I want to do,
focusses on Python and assumes slightly (?) shady business goals - like
scraping data from your competitors' websites.  My goal is pure
research... and of course I would love to do it in LiveCode.

I want to work through a list of (pseudo) random URLs, harvest the page
title and any keywords, and aggregate the results in a field/table. So I
don't want the URLs to be found on the basis of content or top level domain
at all.  Metaphorically, I want to dip into the WWW bran tub, allowing all
domain suffixes: pull out a random page, check it has English content,
check it has keywords, extract the page title, extract any keywords, save
them to a table indexed by URL and then move on to the next lucky dip.
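
To make the per-page step concrete, here is a rough sketch of what I imagine
it looking like. It is in Python only because that is what most scraping
examples use; I would still prefer LiveCode. The names PageInfo and harvest
are made up for illustration, and treating the lang attribute on the <html>
tag as the test for English is an assumption on my part (many pages omit it
or get it wrong). It works on HTML that has already been fetched, so it says
nothing about how to find or download the page.

```python
from html.parser import HTMLParser


class PageInfo(HTMLParser):
    """Collects the <title>, the meta keywords and the declared language."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.lang = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            # Assumption: the page declares its language, e.g. <html lang="en-GB">
            self.lang = (attrs.get("lang") or "").lower()
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def harvest(url, html, table):
    """Store (title, keywords) in table[url] if the page passes both checks."""
    page = PageInfo()
    page.feed(html)
    if page.lang.startswith("en") and page.keywords:
        table[url] = (page.title.strip(), page.keywords)
        return True
    return False
```

The idea would be to call harvest once per sampled URL with an ordinary
dictionary as the table, and simply skip any page for which it returns False.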

The specifics I would appreciate advice on are

1/ how to get as close to a random sample of URLs as possible.  There
are websites that purport to take you to a random www page, but I
couldn't work out how they pull that off - or indeed how random
the destination really is.  They also only do it one page at a time,
whereas I want to do it a few thousand times on the bounce, ideally without
visiting any page in the browsing sense.

2/ how might I check that the page at a URL a) is in English and b) contains
keywords?

3/ is it possible to extract the title and keywords from a URL using
LiveCode 'remotely', or do I need to use a browser to visit the page?

Thanks in advance for any advice or thoughts.

Cheers

David G

-- 
David Glasgow
Consultant Forensic & Clinical Psychologist
Honorary Professor, Nottingham Trent University
Sexual Offences, Crime and Misconduct Research Unit
Carlton Glasgow Partnership
Director, Child & Family Training, York