Webby URL search and scrape question
David Glasgow
dvglasgow at gmail.com
Sat Dec 21 05:43:05 EST 2024
I am almost a complete novice with respect to HTML and web stuff, but
suddenly find myself needing to change that. I have read around web
scraping, but most of the material is much more ambitious than what I want to do,
focusses on Python and assumes slightly (?) shady business goals - like
scraping data from your competitors' websites. My goal is pure
research... and of course I would love to do it in LiveCode.
I want to work through a list of (pseudo) random URLs, harvest the page
title and any keywords, and aggregate the results in a field/table. So I
don't want the URLs to be found on the basis of content or top level domain
at all. Metaphorically, I want to dip into the WWW bran tub, allowing all domain
suffixes, pull out a random page, check it has English content,
check it has keywords, extract the page title, extract any keywords, save
them to a table indexed by URL and then move on to the next lucky dip.
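To make the per-page step concrete, here is a rough sketch of what I imagine happening once the HTML of a page is in hand. It is in Python only because that is what the tutorials I found use, and it rests on two assumptions of mine: that "is in English" can be approximated by the lang attribute on the html tag, and that "keywords" means a meta tag named keywords. The names PageInfoParser and harvest are made up for illustration, and fetching the page is deliberately left out.

```python
from html.parser import HTMLParser


class PageInfoParser(HTMLParser):
    """Collects the <title> text, the <html lang> value and any meta keywords."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.lang = ""
        self.keywords = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            # Assumption: the declared language stands in for "is in English".
            self.lang = (attrs.get("lang") or "").lower()
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def harvest(url, html, table):
    """Store (title, keywords) under url if the page declares English and has keywords."""
    parser = PageInfoParser()
    parser.feed(html)
    parser.close()
    if parser.lang.startswith("en") and parser.keywords:
        table[url] = (parser.title.strip(), parser.keywords)
        return True
    return False


# A made-up page standing in for one lucky dip.
sample = """<html lang="en-GB"><head><title> Lucky dip </title>
<meta name="keywords" content="bran tub, random, sample"></head><body></body></html>"""
results = {}
harvest("http://example.org/", sample, results)
print(results)
```

Pages that declare no language, or that have no keywords meta tag, are simply skipped here, which may or may not be the right rule for a research sample.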
The specifics I would appreciate advice on are:
1/ how to get as close as possible to a random sample of URLs. There
are websites that purport to take you to a random www page, but I
couldn't work out how they pull that off - or indeed how random
the destination really is. They also want to do it only one by one,
whereas I want to do it a few thousand times on the bounce, ideally without
visiting any page in the browsing sense.
2/ how might I check that the page at a URL a) is in English and b) contains keywords?
3/ is it possible to extract the title and keywords from a URL using
LiveCode 'remotely', or do I need to use a browser to visit?
Thanks in advance for any advice or thoughts.
Cheers
David G
--
David Glasgow
Consultant Forensic & Clinical Psychologist
Honorary Professor, Nottingham Trent University
Sexual Offences, Crime and Misconduct Research Unit
Carlton Glasgow Partnership
Director, Child & Family Training, York
More information about the use-livecode
mailing list