Ah, the power of regex - was: using the SHELL function to GREP a body of text

James Hale james at thehales.id.au
Mon Mar 21 09:04:12 EDT 2016


Recently I asked about using the shell function to access the back-reference feature of REGEX, which I thought I needed in order to parse one aspect of HTML/XHTML files.
Given the responses, I thought about it some more and realised that the "matchchunk" function with multiple search expressions could, with the aid of some LiveCode script, do what I wanted.

In the process I discovered how picky the LC parser is about what you try to tell it is REGEX.
I also discovered the wonderful RegEx builder plugin by F Rinaldi.
Rather than typing my working regexes into the script editor, I found that entering them into the RegEx builder plugin (typing, NOT PASTING) let me use it to construct and insert a correctly formed "matchchunk" script snippet.

Buoyed by my success I was then able to construct a sequence of matchchunk expressions to reveal anchor text of interest.
So far I have discovered that far from the simple

<a id="an_anchor"></a>

form I was familiar with, the ePubs I was looking at also used

<a id="an_anchor" />  -- this required converting from the XHTML form to the HTML form.
<a name="an_anchor"></a>    --??!!
<p id="an_anchor">
<HX id="an_anchor"> ('X' being an integer)

There are probably more variants (one text I am using has an ID attribute in every tag!) but none so far that are actually being used as anchors.

I then wrote a cascading set of "if then else" statements and was able to use various matchchunks to correctly expose all the anchors of interest.

While I was working on this in the early hours of the morning here in Oz, Thierry Douez contacted me and offered to have a look at what I wanted, saying that although parsing a complete HTML file is indeed a fool's errand (my words, not his), parsing particular snippets is not.

I sent Thierry details of my requirements and the code I had written.

He sent me back a stack so that I could compare his version with mine.

Thierry was able to collapse my nested matchchunks, each of which relied on multiple search expressions, into a single matchchunk using a single REGEX search expression. It was also 30% faster.
Of particular note, the REGEX used some options which struck me as really useful: allowing spaces in the expression to be ignored (great for not getting lost), parenthesizing parts of the expression without them being counted as a distinct 'found' string, and the ability to operate on a broken line (i.e. one that has a line break in the middle of the desired text string).
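
For anyone who wants to see those three options in isolation, here is a minimal sketch. It uses Python's re module rather than LiveCode's matchchunk, simply because it is easy to run, and the little patterns and sample strings are made up for illustration; they are not taken from Thierry's stack.

```python
import re

sample = '<a id="x">'

# x option: whitespace inside the pattern is ignored, so these two
# patterns are equivalent and the first is easier to read.
spaced = re.compile(r"(?x) < a \b [^>]* >")
compact = re.compile(r"<a\b[^>]*>")

# (?: ... ) groups part of the expression without counting it as a
# captured ('found') string; only the plain ( ... ) is returned.
m = re.search(r"(?x) < (?: a | p ) \b ( [^>]* ) >", '<p id="x">')

# s option: the dot also matches a line break, so a tag that is
# broken across two lines can still be found.
broken = '<a\n id="x">'
no_flag = re.search(r"<a.*?>", broken)
with_flag = re.search(r"(?s)<a.*?>", broken)

print(spaced.search(sample).group(0))
print(m.groups())
print(no_flag, with_flag.group(0) == broken)
```

The i option in the expression below simply makes the match case-insensitive, and m changes how ^ and $ treat line ends.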

For those interested in the actual REGEX:

 "(?msxi) ( < (?: [ap] | h[1-6] )\b  .*? \b (?:id|name)=theTextID   [^>]*  > (?: </a>)? )"
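
As a rough way to try the expression outside LiveCode, the sketch below runs it through Python's re module against the anchor variants listed earlier. Two assumptions are mine: that theTextID stands for the attribute value including its double quotes, and that the value is spliced into the pattern as literal text. The find_anchor helper is hypothetical and exists only for this illustration.

```python
import re

def find_anchor(html, text_id):
    """Return the tag carrying text_id as its id or name, or None."""
    # Assumption: theTextID in the original stands for the quoted value.
    quoted_id = re.escape('"' + text_id + '"')
    pattern = (
        r"(?msxi) ( < (?: [ap] | h[1-6] )\b  .*? \b (?:id|name)="
        + quoted_id
        + r"   [^>]*  > (?: </a>)? )"
    )
    found = re.search(pattern, html)
    return found.group(1) if found else None

variants = [
    '<a id="an_anchor"></a>',
    '<a id="an_anchor" />',
    '<a name="an_anchor"></a>',
    '<p id="an_anchor">',
    '<H3 ID="an_anchor">',
    '<p\n  id="an_anchor">',   # tag broken across two lines
]
for snippet in variants:
    print(repr(find_anchor(snippet, "an_anchor")))
```

Because the s option lets ".*?" run across tag boundaries as well as line breaks, this is best fed one snippet at a time rather than a whole file.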



James





