Topic: Trying to understand Regex (Read 2120 times)

Vincent Kars · « **on:** May 10, 2012, 01:16:54 pm »

MrC once handed me an expression.
Regex([Composition], /#(Op\.|BWV) ?([^(]+)#/, -1)/
if(isequal([R1], op., 1), FixCase([R1],2),[R1]) [R2]

A bit simplified:
Regex([Composition], /#(Hob|Op\.|op|BWV|D\.|KV|woo\.) ?([^)(" ]+)#/, -1)/ [R1] [R2]

It derives Opus numbers from the composition.
The good news: it works
The bad news: I don’t understand why.
This is the syntax.
Regex(String to test, Regular expression, Mode, Case sensitivity)

Mode:
0 no string output, only a logical 1 or zero.
-1 no output (silent mode). Sounds silly as I do need output
/ [R1] [R2] contains the output of the 2 sub expressions

(Hob|Op\.|op|BWV|D\.|KV|woo\.)
Matches Hob or Op. or op or BWV or D.
I have strings like
String Quartet No. 14 in D minor ('Death and the Maiden') - D. 810
This simply looks for where e.g. D. starts so the start of an opus number.

[^)(" ]
This is an exclusion list, don’t incorporate anything starting with any character between the []

) ?
Is unclear to me. Obviously the space has a meaning.
+) likewise

Obviously there are 2 sub expressions, but it is totally unclear to me how they are related.
The first one is the start but the "stop" is unclear to me.

MrC · « **Reply #1 on:** May 10, 2012, 09:54:29 pm »

Sorry for the delay in responding. I was out working in the yard all day (and ate 2 fresh grapefruits and an orange - yum).

Here's the breakdown of the RE:

Regex([Composition], /#(Hob|Op\.|op|BWV|D\.|KV|woo\.) ?([^)(" ]+)#/, -1)/ [R1] [R2]

The initial capture group:

(Hob|Op\.|op|BWV|D\.|KV|woo\.)

matches Hob or Op. or op or BWV or D. or KV or woo. and remembers it (because it is a capture). What was matched will be in MC's [R1] pseudo-field.

The next piece, the " ?", allows for an optional space to follow the text above (I've used quotes here so you can see/read the <space> followed by the question mark). The <space> is the character to match, and the ? makes it optional (i.e. 0 or 1 of them). It could have been written as \s? too. This allows the space that is in the middle of your "D. 810", or no space, as in "op32". (I don't have at hand all the variances that were present in your input, so just made up the op32 for an example.)
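For anyone who wants to poke at the optional space outside MC, here is a rough sketch. It assumes Python's re module as a stand-in (not MC's engine), uses made-up sample strings, and swaps the real second group for a plain \d+ to keep the focus on the " ?" piece.

```python
import re

# Keyword, then an optional single space, then one or more digits.
# \d+ is a simplification of the real second group, used only for this demo.
pattern = re.compile(r"(D\.|op) ?(\d+)")

for text in ["D. 810", "D.810", "op32", "op 32"]:
    m = pattern.search(text)
    print(repr(text), "->", m.groups())

# Two spaces are one too many for " ?", so this one does not match.
print(pattern.search("D.  810"))
```

With or without the single space, the same two pieces are captured; only the double-space case falls through.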

The next piece:

([^)(" ]+)

is another capture group, requiring a match on anything that is NOT any of the following characters enclosed in the [ ] character group:

) ( " <space>

and the + quantifier means match one or more of them. Again, the capture is remembered for later use, this time in MC's [R2].

I used the -1 Regex() mode because we don't want output from Regex() itself - rather, we want to perform the match, and we'll just use the captures stored in [R1] and [R2]. By not outputting immediately, you can use the [R#] values at your convenience, and in the order you want, later.
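The whole thing can be tried outside MC as well. This is only a sketch under assumptions: Python's re module stands in for MC's engine, the helper name opus_parts is made up for illustration, and matching here is case sensitive, which may not be how your Regex() call is set up.

```python
import re

# Same pattern as in the Regex() call, minus MC's /# ... #/ delimiters.
OPUS = re.compile(r'(Hob|Op\.|op|BWV|D\.|KV|woo\.) ?([^)(" ]+)')

def opus_parts(composition):
    """Return the two captures (the [R1]/[R2] equivalents), or None if no match."""
    m = OPUS.search(composition)
    return m.groups() if m else None

title = "String Quartet No. 14 in D minor ('Death and the Maiden') - D. 810"
print(opus_parts(title))
print(opus_parts("Cantata BWV140"))
print(opus_parts("Symphony"))
```

The "stop" of the second capture is simply the first character that is in the excluded set (or the end of the string), which is why nothing more than the number is picked up.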

The trick when writing REs is to try to notice distinct patterns in the input, and then reduce these patterns to simple RE constructs such that a match will occur. In your case, Composition was rather straightforward, as it was basically some word (Op, BWV, etc.) followed by some number or other (which was easier to describe by saying NOT certain characters).

marko · « **Reply #2 on:** May 11, 2012, 12:54:54 am »

So, the hat, "^" implies "Start of line", I think, because it almost worked for me when I was looking for vowels. I think it was the only bit I got right!!

In that expression, we used a character group too, so, if I had wanted a match on consonants instead, we would have used [^aAeEiIoOuU] yes?

and if I had wanted to match vowels and hats, it would have been [\^aAeEiIoOuU]?

another question regarding the initial capture group above, if you don't mind...

In my head, that is the range -> (Hob|Op\.|op|BWV|D\.|KV|woo\.)
If I had been trying to write this myself, I would have wrapped that in a second layer of parenthesis to get MC to capture the result in [R1], like so, ((Hob|Op\.|op|BWV|D\.|KV|woo\.))

Is this just a case of me misunderstanding the cheat sheet?

-marko

MrC · « **Reply #3 on:** May 11, 2012, 11:25:05 am »

Quote from: marko on May 11, 2012, 12:54:54 am

So, the hat, "^" implies "Start of line", I think, because it almost worked for me when I was looking for vowels. I think it was the only bit I got right!!

The ^ character means start of line, except when it is inside a character group. As the first character inside a character group, it inverts the listed characters, to mean NOT those characters. Anywhere else inside the group, it is just an ordinary character.

Quote from: marko on May 11, 2012, 12:54:54 am

In that expression, we used a character group too, so, if I had wanted a match on consonants instead, we would have used [^aAeEiIoOuU] yes?

Yes, exactly.

Quote from: marko on May 11, 2012, 12:54:54 am

and if I had wanted to match vowels and hats, it would have been [\^aAeEiIoOuU]?

Better would be to just place the hat at the end, so as to avoid confusion and the extra backslash escape: [aAeEiIoOuU^]
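The three roles of the hat can be seen side by side in a small sketch. It assumes Python's re module rather than MC's engine, and the sample text "audio^" is made up.

```python
import re

text = "audio^"

# Outside a character group: ^ anchors the match to the start.
at_start = re.findall(r"^[a-z]", text)

# First inside a group: ^ negates, so this matches anything NOT a listed vowel.
not_vowels = re.findall(r"[^aAeEiIoOuU]", text)

# Later inside a group (or escaped): ^ is just a literal hat.
vowels_and_hat = re.findall(r"[aAeEiIoOuU^]", text)
escaped_hat = re.findall(r"[\^aAeEiIoOuU]", text)

print(at_start, not_vowels, vowels_and_hat, escaped_hat)
```

Note that the negated group also returns the hat itself: it matches any character that is not a listed vowel, so digits, spaces and punctuation count too, not only consonants.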

Quote from: marko

another question regarding the initial capture group above, if you don't mind...

In my head, that is the range -> (Hob|Op\.|op|BWV|D\.|KV|woo\.)

This would not be a range, but instead is a capture group, which happens to contain the list of alternations:

Hob|Op\.|op|BWV|D\.|KV|woo\.

From the capture's point of view, it is no different than the simpler expression: (foo).

Quote from: marko

If I had been trying to write this myself, I would have wrapped that in a second layer of parenthesis to get MC to capture the result in [R1], like so, ((Hob|Op\.|op|BWV|D\.|KV|woo\.))

Is this just a case of me misunderstanding the cheat sheet?

That would give you two captures, both containing the same contents.
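A quick way to see the duplicate capture, again as a sketch that assumes Python's re module instead of MC, with a shortened keyword list and a made-up sample string:

```python
import re

sample = "Cantata BWV 140"

# One layer of parentheses: one capture for the keyword, one for the number.
single = re.search(r"(Hob|BWV|KV) ?(\d+)", sample)

# Two layers: the outer and inner group both capture the same keyword,
# and the number moves from the second capture to the third.
double = re.search(r"((Hob|BWV|KV)) ?(\d+)", sample)

print(single.groups())
print(double.groups())
```

So the extra layer does no harm to the match, but it shifts the numbering of every capture after it.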

Which cheat sheet?

Vincent Kars · « **Reply #4 on:** May 11, 2012, 11:30:21 am »

MrC
Thanks for the clarification. Will experiment a bit more.

I use http://www.regular-expressions.info/tutorial.html
Any other suggestions?

MrC · « **Reply #5 on:** May 11, 2012, 11:38:06 am »

That's a perfectly fine reference. But keep in mind, the features of MC's implementation will differ. The definitive support is here:

http://msdn.microsoft.com/en-us/library/bb982727.aspx

and MC is using the ECMAScript variant.

marko · « **Reply #6 on:** May 11, 2012, 12:32:52 pm »

"Which cheat sheet?"

Version 2 from addedbytes.com

MrC · « **Reply #7 on:** May 11, 2012, 01:36:01 pm »

Quote from: marko on May 11, 2012, 12:32:52 pm

"Which cheat sheet?"

Version 2 from addedbytes.com

I quickly looked at it, and made some strikeouts and highlights where appropriate. See attached.
