Need Help With String Pattern Matching

Quentin Long cubist at aol.com
Sun Jun 12 23:45:20 EDT 2016


Message: 14
Date: Sat, 11 Jun 2016 15:48:00 -0400
From: Gregory Lypny <gregory.lypny at videotron.ca>
To: LiveCode Discussion List <use-livecode at lists.runrev.com>
Subject: Need Help With String Pattern Matching
Message-ID: <19A0E5FC-E4CE-42E8-9DD1-1B4D9040B7F9 at videotron.ca>
Content-Type: text/plain; charset=utf-8

> Hello everyone,
>
> I used to do some basic text analysis of files where the lines containing strings of interest were consistent and therefore easy to spot. I am now working on files where the chunk of text that contains the data I want is more ambiguous.…
>
> The chunk starts with the word *owner* or the phrase *beneficial owner*.
>
> The chunk ends with *all directors* or *less than one percent*.
>
> The chunk contains all of the following:
> - At least four or five big numbers, e.g., 234,879
> - At least two percentages, e.g., 3.4%, or percentage signs

MatchChunk uses regular expressions ("regex" for short). I don't claim to be a master of regex, but hopefully the following will be of some help to you.

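First off, "owner" or "beneficial owner". That calls for parentheses and a vertical bar (square brackets would only match one single character out of the set listed inside them), like so:

(owner|beneficial owner)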
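
If you want to check how square brackets and parentheses differ, here's a small sketch. I'm using Python's "re" module as a stand-in because it's easy to run; as far as I know, the pattern syntax shown here works the same way in the regex flavor that matchChunk uses. The sample string is made up.

```python
import re

# Square brackets define a character class: ONE character from the listed set.
char_class = re.compile(r"[owner|beneficial owner]")
# Parentheses plus a vertical bar define alternation between whole phrases.
alternation = re.compile(r"(owner|beneficial owner)")

sample = "Name of beneficial owner"  # made-up sample text

class_hit = char_class.search(sample).group(0)
alt_hit = alternation.search(sample).group(0)

print(repr(class_hit))  # 'a' (just the first character found in the set)
print(repr(alt_hit))    # 'beneficial owner'
```

The bracket version "succeeds" on almost any text, which is why it looks like it works until you check what it actually matched.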

Since that's the start of the chunks you're interested in, you'll put that at the beginning of your regex filter. Next is "all directors" or "less than one percent". That's going to be similar:

(all directors|less than one percent)

And *that* bit goes at the *end* of your regex filter. In between the start-bit and the end-bit, you have "four or five big numbers", and "percentages" or "percentage signs". "Big number" isn't really a well-defined concept, but here's one way to go for "big numbers": 

[0-9][0-9],[0-9][0-9][0-9]

In regex, that bit will match any string that *contains* two digits, a comma, and three more digits. Because the pattern isn't anchored, it doesn't care what comes before or after those six characters. So it'll match XX,XXX (where "X" is any digit at all); it'll match XXX,XXX (the pattern just latches onto the last two digits before the comma); it'll match XX,XXXX (it latches onto the first three digits after the comma); and so on. Note that this bit *will not* match XXXXX, because that's a string of five digits in a row *without* any commas, and it won't match X,XXX either, because there's only one digit before the comma. As for percentages, this will work for matching a percent sign:

%

And this will work for matching a single digit followed by a percent sign:

[0-9]%
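
Here's a quick sketch of what the "big number" and "digit plus percent sign" patterns actually pick up, again with Python's "re" module as a stand-in and a made-up sample line. The two "stricter" patterns at the end are my own assumption about what you might want (whole comma-grouped numbers and whole percentages); adjust or drop them as needed.

```python
import re

# The two patterns from the discussion above.
big_number = re.compile(r"[0-9][0-9],[0-9][0-9][0-9]")
percentage = re.compile(r"[0-9]%")

# Stricter variants (an assumption, not part of the original recipe):
# 1-3 digits followed by one or more ",ddd" groups, and digits with an
# optional decimal part followed by a percent sign.
full_number = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")
full_percentage = re.compile(r"[0-9]+(?:\.[0-9]+)?%")

text = "Smith 234,879 3.4% Jones 12345 1,234,567 12%"  # made-up sample

big_hits = big_number.findall(text)
pct_hits = percentage.findall(text)
full_numbers = full_number.findall(text)
full_pcts = full_percentage.findall(text)

print(big_hits)      # ['34,879', '34,567']
print(pct_hits)      # ['4%', '2%']
print(full_numbers)  # ['234,879', '1,234,567']
print(full_pcts)     # ['3.4%', '12%']
```

For spotting or counting, the simple patterns are enough; the stricter ones only matter if you want to pull out the complete values.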

I'm going to assume that you don't know exactly where the "big number"s or "percentage"s will be within the chunks you're interested in, or how many characters will occur in between the bits of interest. If you want your regex filter to ignore what occurs between the bits of interest, this will do the trick:

.*

The period will match any character (except a newline character), and the asterisk is regex for "at least 0 of that thing just previous". So if you want to match Big Number followed by Percentage, this should do the trick:

[0-9][0-9],[0-9][0-9][0-9].*[0-9]%
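
One thing to watch with the period-asterisk spacer: since the period doesn't match a newline, the combined pattern only works when the number and the percentage sit on the same line. Here's a small sketch in Python (a stand-in; the sample strings are made up) showing the difference, and the "dot matches everything" switch that fixes it. In Python that's the re.DOTALL flag; many regex flavors also accept "(?s)" at the start of the pattern, but check that against your own setup.

```python
import re

pattern = r"[0-9][0-9],[0-9][0-9][0-9].*[0-9]%"

one_line = "Smith 234,879 shares 3.4%"    # made-up sample
two_lines = "Smith 234,879 shares\n3.4%"  # same data split over two lines

same_line = re.search(pattern, one_line) is not None
across_lines = re.search(pattern, two_lines) is not None
across_lines_dotall = re.search(pattern, two_lines, re.DOTALL) is not None

print(same_line)            # True
print(across_lines)         # False: "." stops at the newline
print(across_lines_dotall)  # True: with DOTALL, "." also matches newlines
```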

If you at least know what order your Big Numbers and Percentages are going to be found in, you can build a regex filter for that sequence by fitting the bits together like Lego bricks, with the period-asterisk "spacer" in between the important bits.
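
If you *don't* know the order, one alternative is to do it in two steps: first grab everything from a start phrase to the nearest end phrase, then count the big numbers and percent signs inside what you grabbed. Here's a minimal sketch of that idea in Python (a stand-in for whatever you'd write with matchChunk). The function name find_chunks, the thresholds, and the sample text are all made up for illustration, and I'm assuming case shouldn't matter.

```python
import re

START = r"(?:owner|beneficial owner)"
END = r"(?:all directors|less than one percent)"
BIG_NUMBER = r"[0-9][0-9],[0-9][0-9][0-9]"
PERCENT = r"%"

# ".*?" is the lazy spacer: it stops at the NEAREST end phrase.
# DOTALL lets the spacer cross line breaks; IGNORECASE is an assumption.
CHUNK = re.compile(START + r".*?" + END, re.DOTALL | re.IGNORECASE)

def find_chunks(text, min_numbers=4, min_percents=2):
    """Return candidate chunks that contain enough numbers and percent signs."""
    chunks = []
    for match in CHUNK.finditer(text):
        chunk = match.group(0)
        enough_numbers = len(re.findall(BIG_NUMBER, chunk)) >= min_numbers
        enough_percents = len(re.findall(PERCENT, chunk)) >= min_percents
        if enough_numbers and enough_percents:
            chunks.append(chunk)
    return chunks

# Made-up sample: one real table, then an unrelated sentence that also
# happens to contain a start phrase and an end phrase.
sample = (
    "Name of beneficial owner   Shares   Percent\n"
    "A. Smith   234,879   3.4%\n"
    "B. Jones   112,500   1.6%\n"
    "C. Wu   98,000   1.4%\n"
    "D. Roy   45,210   *\n"
    "All directors as a group\n"
    "The owner of the building met all directors.\n"
)

chunks = find_chunks(sample)
print(len(chunks))  # 1: the unrelated sentence has no numbers, so it is dropped
```

Be aware that the chunk starts at the *first* start phrase it finds, so a stray "owner" earlier in the file can make a chunk longer than you'd like; the counts still have to pass, but it's worth eyeballing the results.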
   
"Bewitched" + "Charlie's Angels" - Charlie = "At Arm's Length"
    
Read the webcomic at [ http://www.atarmslength.net ]!
    
If you like "At Arm's Length", support it at [ http://www.patreon.com/DarkwingDude ].



