How to filter a list with a variable anywhere in a line

Colin Holgate colinholgate at gmail.com
Mon Aug 13 13:05:00 EDT 2018


In that book I wrote there is a chapter on making a web scraper, something that could pull images and other media from a web page. I soon found all the articles talking about not using regex with HTML, so I used a mixture of techniques instead. Here’s the first part I wrote about it:

“A common approach when extracting a known pattern of text is to use regular expressions, often referred to as regex or regexp. At its simplest it's easy to understand, but it can get quite complex. Read the Wikipedia article if you want to understand it in depth:

http://en.wikipedia.org/wiki/Regular_expression

Another useful source of information is this Packt article on regular expressions:

http://www.packtpub.com/article/regular-expressions-python-26-text-processing

One problem though is that using regexp to parse HTML content is frowned upon. There are scores of articles online telling you outright not to parse HTML with regexp! Here's one pithy example:

http://boingboing.net/2011/11/24/why-you-shouldnt-parse-html.html

Now, parsing HTML source is exactly what we want to do here, and one solution to the problem is to mix and match, using LiveCode's other text matching and filtering abilities to do most of the work. Although it's not exactly regexp, LiveCode can use regular expressions in some of its matching and filtering functions, and they are somewhat easier to understand than full-blown regexp. So, let's begin by using those …”
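For the question in the subject line, the simplest of those filtering abilities is the filter command with a wildcard pattern. What follows is only a minimal sketch and is not taken from the book: the handler name linesContaining and its parameters are made up for illustration. It assumes the search text contains no wildcard characters of its own (*, ?, or square brackets), since filter would treat those as part of the pattern.

```livecode
-- Hypothetical helper, not from the book: keep only the lines of pList
-- that contain pNeedle somewhere in the line.
function linesContaining pList, pNeedle
   -- "*" matches any run of characters, so "*" & pNeedle & "*"
   -- matches the needle wherever it sits in a line
   filter pList with "*" & pNeedle & "*"
   return pList
end linesContaining
```

A call might look like: put linesContaining(tImageURLs, ".jpg") into tJpegs, where tImageURLs is an assumed variable holding one URL per line.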

A few pages later I do use some regex to pull text from the page:

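function getText pPageSource
   -- strip script and style blocks (contents included), then HTML comments, then any remaining tags
   put replaceText(pPageSource,"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)","") into pPageSource
   -- join the lines together and turn tabs into spaces
   replace lf with "" in pPageSource
   replace tab with " " in pPageSource
   return pPageSource
end getText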
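
As a rough usage sketch, the function could be called from a button script like the one below. The URL is a placeholder and the field name "PlainText" is an assumption; neither comes from the book.

```livecode
on mouseUp
   -- placeholder address; putting a url fetches the page source
   put url "http://www.example.com/" into tPageSource
   -- "PlainText" is a hypothetical field that shows the stripped text
   put getText(tPageSource) into field "PlainText"
end mouseUp
```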


