Topic: A Lesson In Regular Expressions (regex) (Read 5302 times)

blgentry · « **on:** June 03, 2015, 04:04:22 pm »

In another thread someone asked for a technique to remove some characters from a file name:

http://yabb.jriver.com/interact/index.php?topic=97989.0

I gave a regular expression to do this. I'm using this post to explain the regular expression and hopefully serve as a short introduction to how regex works.

The assignment was to take a string that has a dash and a space ("- ") in front of it and return only the part after the dash and the space. For example:

"- Speak To Me" would become "Speak To Me". In addition, not every string we are going to process will have the "- ". Some will just be normal, and our expression needs to deal with those too.

Regular Expressions are a pattern matching and grouping tool. They let you do things like slice strings up into different pieces, reorder them, manipulate them, etc. So regex is very well suited to this task. Fundamentally what we are trying to do is take an expression like this one:

- SomeCharactersHere

..and return only the second part, or "SomeCharactersHere". Let's take that expression and slowly build it up into a proper Regular Expression. Regular Expressions have their own set of special characters that mean certain things, and they are very useful. The first one we are going to use is the "match any character" special character. This is simply a period or a dot. Now our expression becomes:

Code: [Select]

- .

That's not very useful though because just one dot only matches one character. We want to match *any* number of characters after the "- " sequence. Luckily there's a modifier we can use that says "take the last character and allow it to repeat one or more times." That is the plus or "+". Now we have:

Code: [Select]

- .+
..and we are getting somewhere! Let's go further. We want to be able to separate out the last part. That is, the part that doesn't have the "- " in it. Regex lets us group things with parentheses. Let's put them in here:

Code: [Select]

- (.+)
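Regex will let us refer to a group later. Anything that's inside of ( ) is captured as a group, which I'll call a Reference or a Back Reference (the usual regex term is "capture group"; strictly, a back reference is how you refer to one later). They are numbered from left to right, and we can have as many as we need. But let's group off that sequence of "- " too. Now our expression is: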

Code: [Select]

(- )(.+)
So now we can refer to the first part as Reference #1 and the second part as Reference #2. The second Reference is really what we want to print out of this when we are done, so we can transform that field to remove those pesky leading characters. Speaking of, remember when I said that the leading "- " was optional? We need to set up the regex to deal with that. As it is, the regex will only match if it sees the "- " followed by 1 or more other characters.
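
If you want to poke at the steps so far outside of MC, here is a minimal sketch in Python. It assumes Python's `re` module treats `.`, `+` and `( )` the same way MC's regex engine does for these simple patterns. The helper name `try_pattern` is made up for this illustration and is not part of MC.

```python
import re

# The three patterns built so far in the walkthrough.
patterns = ["- .", "- .+", "(- )(.+)"]
# One name with the "- " prefix and one without it.
names = ["- Speak To Me", "Breathe"]

def try_pattern(pattern, name):
    """Return (whole match, captured groups), or None when the pattern does not match."""
    m = re.match(pattern, name)
    if m is None:
        return None
    return m.group(0), m.groups()

for pattern in patterns:
    for name in names:
        print(repr(pattern), "on", repr(name), "->", try_pattern(pattern, name))
```

The single dot only grabs "- S", the `.+` version grabs the whole name, and the grouped version splits it into "- " and "Speak To Me". None of the three match "Breathe", which is the problem the next step deals with.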

Remember the modifier we used earlier, the plus? That means "repeat the thing before this one or more times". There are other modifiers that are similar. We are going to use the one that means "repeat the thing that comes right before this ZERO or more times." That sounds kind of like "optional", right? For our purposes that's exactly what it is (it would also accept the "- " repeated several times, which does no harm here). That modifier is the star, or *. So our expression now turns into:

Code: [Select]

(- )*(.+)
That * means to repeat the stuff inside that first set of parentheses zero or more times. We have almost a complete regular expression at this point. However, MC needs to know where the actual regex characters begin and end, so that its expression language doesn't try to interpret the parentheses and other special characters itself. In JRiver MC, /# and #/ are the start and end character sequences for this. So now our expression becomes:

Code: [Select]

/#(- )*(.+)#/
Getting really close now. So let's review it, left to right.

/# is the start sequence. Then (- ) means to match a sequence of a dash and then a space and to group those two characters into a reference. Then the * says that the group of (- ) can appear zero or more times. Now (.+) means to match ANY character ONE or more times and to group all of those characters into a reference for us. Incidentally, that's the second reference in the expression. Finally the #/ sequence tells MC that the regular expression is done.
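
To see what each reference ends up holding, here is a rough Python sketch of the same pattern. It is only the part between /# and #/, since those markers belong to MC and not to the regex itself. The function name `split_name` is hypothetical, and how MC reports a group that did not take part in the match is not shown here; Python reports it as `None`.

```python
import re

# The regex from the walkthrough, without MC's /# and #/ markers.
PATTERN = re.compile(r"(- )*(.+)")

def split_name(name):
    """Return (reference 1, reference 2) for a name, or (None, None) if nothing matches."""
    m = PATTERN.match(name)
    if m is None:
        return (None, None)
    return (m.group(1), m.group(2))

print(split_name("- Speak To Me"))    # prefix present
print(split_name("Speak To Me"))      # prefix absent, reference 1 is empty
print(split_name("- - Speak To Me"))  # the * also swallows a repeated prefix
```

In all three cases the second reference comes out as "Speak To Me", which is the piece we are after.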

That's the meat of the expression, but we need to put it inside of MC's regex() function. Regex() takes 4 arguments. The first one is the string we want to process. In this case we are going to use the [Name] field. The second argument is easy, it's the Regular Expression itself. The thing we just spent all this time building up. Just take the last two arguments on faith for a moment. Here's the regex call we are going to make:

Code: [Select]

regex([Name],/#(- )*(.+)#/, -1, 0)
At this point (and any time you are working on a regex) it would be helpful to make an Expression Column in one of your views and cut and paste this regex into it. You should also edit one of your song names to have a dash and then a space in front of it, so we have something to test with. When you do, you're going to see that the new expression column is totally blank! That's because the regex function, as called here, doesn't return anything, so nothing gets printed out in the column. What we REALLY want is that stuff in the second set of parentheses... the second Back Reference, remember? Regex references are accessed as [R1], [R2], [R3], etc. Let's tell it to print the second Reference or [R2]:

Code: [Select]

regex([Name],/#(- )*(.+)#/, -1, 0)[R2]
You should now see, in your expression column, all of your original song names intact, and the one you added with the "- " in front of it should now have the "- " sequence removed!
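
As a sanity check away from your library, the same before-and-after can be sketched in Python. This assumes Python's `re` behaves like MC's engine for this pattern; `strip_leading_dash` and the sample names are invented for the example.

```python
import re

def strip_leading_dash(name):
    """Return the second reference of (- )*(.+), or the name unchanged if nothing matches."""
    m = re.match(r"(- )*(.+)", name)
    return m.group(2) if m else name

# Only the first name has the leading "- "; the last has a dash in the middle.
names = ["- Speak To Me", "Breathe", "Us - And Them"]
cleaned = [strip_leading_dash(n) for n in names]

for before, after in zip(names, cleaned):
    print(repr(before), "->", repr(after))
```

Only the name that starts with "- " changes. A dash in the middle of a name is left alone, because the (- ) group is only tried at the front of the string.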

If you want to use this to transform several of your [Name] fields to remove the "- ", all you have to do is highlight them, go to the Tag Editor and paste the expression into the [Name] field with an "=" in front of it, like:

Code: [Select]

=regex([Name],/#(- )*(.+)#/, -1, 0)[R2]
You can use this to experiment with your own Regular Expressions. Using an expression column is totally non-destructive, so you can play around as much as you want and not break anything. Hopefully this will get you started towards writing and using your own Regular Expressions.

Good luck!

Brian.

6233638 · « **Reply #1 on:** June 03, 2015, 04:49:12 pm »

Thanks for taking the time to post this.

ferday · « **Reply #2 on:** June 03, 2015, 07:00:33 pm »

Regex has been on my list, this post is my motivation. Appreciate it Brian

RoderickGI · « **Reply #3 on:** June 03, 2015, 07:46:42 pm »

I kind of like your original expression version better, as it isn't necessary to explicitly output [R2].

This one,

Code: [Select]

=regex([Name],/#(- )*(.+)#/,2,0)

rather than this one

Code: [Select]

=regex([Name],/#(- )*(.+)#/, -1, 0)[R2]

I'm sure there is a purist reason for using the latter, such as it being more universal, but you may as well use the built-in features of MC.
http://wiki.jriver.com/index.php/Expression_Language#Regex.28.E2.80.A6.29:_Regular_expression_pattern_matching_and_capture

Nice explanation of the build up of the Regex sequence.

blgentry · « **Reply #4 on:** June 03, 2015, 09:07:01 pm »

^ Yeah, I *almost* wrote the explanation of how to get to that version. I guess I should just type up the extra details now, so here goes:

Our last regex was this:

Code: [Select]

=regex([Name],/#(- )*(.+)#/, -1, 0)[R2]
Which, as Roderick said, explicitly has the [R2] at the end so that the second Reference gets printed for us to see and use. It's exactly what we want. But we can make our expression a little more compact by using the arguments of regex(). Regex, as I said, takes 4 arguments. You already know the first 2. But I'm including them again for review. Regex arguments:

1. String to be worked on. Usually a field like [Name]
2. The actual regex string that has the pattern matching in it.
3. A number that tells regex() what to output. When set to -1 like my original example, regex() is "silent". It outputs nothing. When set to 0 it works as a logic function. It outputs 1 if the pattern matches and 0 if the pattern doesn't match. But here's where it gets interesting. If you make this argument 1 it returns [R1]. If you make it 2, it returns [R2]. If you make it 3, it outputs [R3]. ...and so on.
4. This argument tells regex() whether to ignore case (uppercase versus lowercase) or to distinguish between cases. 0 means ignore case, 1 means obey case.
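
For anyone who thinks better in code, here is a rough Python stand-in for the four arguments as described in the list above. `mc_regex` is a made-up name, not MC's implementation. Returning an empty string when a requested reference is missing or the pattern does not match is my assumption, as is using `re.search`.

```python
import re

def mc_regex(text, pattern, mode=0, case_sensitive=0):
    """Hypothetical imitation of regex(text, pattern, mode, case) from the list above."""
    # Argument 4: 0 means ignore case, 1 means obey case.
    flags = 0 if case_sensitive else re.IGNORECASE
    m = re.search(pattern, text, flags)
    if mode == -1:
        return ""                     # silent: outputs nothing
    if mode == 0:
        return "1" if m else "0"      # logic test: did the pattern match?
    # mode 1, 2, 3, ...: return that reference (assumed empty if unavailable)
    if m is None or mode > m.re.groups:
        return ""
    return m.group(mode) or ""

print(mc_regex("- Speak To Me", r"(- )*(.+)", -1, 0))  # silent
print(mc_regex("- Speak To Me", r"(- )*(.+)", 0, 0))   # match test
print(mc_regex("- Speak To Me", r"(- )*(.+)", 2, 0))   # second reference
```

The last call plays the role of [R2] in the earlier form: mode 2 hands back the second reference directly.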

So we can rewrite our expression above by omitting the [R2], and changing the 3rd argument to a 2. Like so:

Code: [Select]

=regex([Name],/#(- )*(.+)#/, 2, 0)
It's a more compact form. Like Roderick said, you might as well use the features if they are there. Both regex() examples I've given here do *exactly* the same thing. It's all a matter of syntax. ...and now you know a few more details about the arguments of regex().

Brian.

ferday · « **Reply #5 on:** June 04, 2015, 12:13:28 am »

I really enjoy the command line style, I get the concept of regex syntax now thanks to this post

is there a recommended /x map for MC? online search is finding some conflicting platform info and I can't find the one I want
