INTERACT FORUM


Author Topic: Regex() expression language section ready for review...  (Read 10101 times)

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Regex() expression language section ready for review...
« on: August 28, 2011, 06:11:01 pm »

I've created a first draft of the Regex() section of the wiki's Media Center Expression Language page.  Be the first to get a peek:
wiki.jriver.com/index.php/MrC-temp

Edit: the main wiki expression page is now updated: Regex() @ Expression language page on Wiki

Try out the examples if you're interested.

This section won't be a comprehensive overview of regular expressions.  An additional Wiki page will be created to provide more focus on regular expressions.  This section is mostly focused on basic construction and usage of Regex() within expressions.
Logged
The opinions I express represent my own folly.

JimH

  • Administrator
  • Citizen of the Universe
  • *****
  • Posts: 72438
  • Where did I put my teeth?
Re: Regex() expression language section ready for review...
« Reply #1 on: August 28, 2011, 07:07:36 pm »

Thanks again, Mike.
Logged

kensn

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 1362
Re: Regex() expression language section ready for review...
« Reply #2 on: August 28, 2011, 09:01:14 pm »

Great write up...

Ken
Logged
If(IsEmpty([Coffee Cup]), Coffee, Drink)

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #3 on: August 28, 2011, 09:45:26 pm »

Thanks!
Logged
The opinions I express represent my own folly.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #4 on: August 29, 2011, 01:41:32 am »

I've replied to your PM, it's all good :)

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #5 on: August 29, 2011, 02:33:17 am »

Great job!

Maybe an example where a MC field is referenced within the RE would be illustrative.
Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #6 on: August 29, 2011, 11:15:48 am »

How about like this one:

Code: [Select]
if(Regex([Album], /([Artist]/)), [R1] /([Album]/), / *** Artist NOT in Album)
Logged
The opinions I express represent my own folly.

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #7 on: August 29, 2011, 04:07:25 pm »

Yes, something like that. Ideally the RE part should contain also some "genuine" RE characters that needed to be escaped, but I cannot come up with a realistic example right now. And it should be shown how this could be achieved using the /# way as an alternative.
Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #8 on: August 29, 2011, 04:27:42 pm »

Yes, something like that. Ideally the RE part should contain also some "genuine" RE characters that needed to be escaped, but I cannot come up with a realistic example right now. And it should be shown how this could be achieved using the /# way as an alternative.

It has them... the ( and ) are RE chars, which need to be MC-escaped by /.

  /([Artist]/)

I'll show the alternate form too:

Code: [Select]
if(Regex([Album], /#(#/[Artist]/#)#/), [R1] /([Album]/), / *** Artist NOT in Album)

I have a couple of other expression ideas for this example.
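For anyone who wants to try the same idea outside MC, here is a minimal Python sketch. The function name and the sample strings are made up for illustration, and `re.escape` is a Python habit for splicing arbitrary text into a pattern, not something the MC expression above does.

```python
import re


def artist_in_album(album: str, artist: str) -> str:
    """Mimic the expression above: report the artist if it occurs in the album title."""
    # Escape the artist so characters such as ( ) . + in a name are taken literally,
    # then wrap it in ( ) so the matched text is captured as group 1.
    match = re.search("(" + re.escape(artist) + ")", album)
    if match:
        return f"{match.group(1)} ({album})"
    return "*** Artist NOT in Album"
```

Calling `artist_in_album("The Best of Blondie", "Blondie")` returns `Blondie (The Best of Blondie)`.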
Logged
The opinions I express represent my own folly.

JustinChase

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 3276
  • Getting older every day
Re: Regex() expression language section ready for review...
« Reply #9 on: August 29, 2011, 07:19:02 pm »

I'm guessing it's understood that this is over the heads of most people :)

with that said, I have a basic understanding of expressions, and can usually follow along, and figure out the parts of the syntax, then can usually piece it together enough to understand it.  usually.

not here :(

I can mostly follow this...
Code: [Select]
Regex([Name], /#(Big.*Man)#/, 1)

Matches track names that contain Big followed by Man, with anything (including nothing) in between, and outputs the matched tracks. Sample output:

    Big Butter and Egg Man
    Big Man
    Big Manager
    It's a Bigman Thing

‾‾‾‾

but the .* confuses me a bit.  I assume the * is for "anything(including nothing) in between" part, but I don't understand the need for the .

then the next one baffles me...
Code: [Select]
Regex([Artist], /#([(].+)$#/, 1)

Matches against the Artist field and returns items that contain an opening (left) parenthesis followed by additional characters until the end of the artist string. Only the sub-string from any opening parenthesis until the end of the string will be returned, since this is the only captured portion.

Sample output:

    (Brian Eno/U2)
    (feat. DJ Cam)
    (Otis Day & The Knights)
    (w/Emmylou Harris)

‾‾‾‾

this part only...
([(].+)$

I assume the first ( is actually to group the rest until the ) before $, acting as traditional parenthesis

then another ( is wrapped in [] with another .   - I don't understand the need for this part.  I also don't understand the .  as I mentioned above.

the + seems like it might actually be for that purpose to add this to the $ part, which I also don't understand.  I'd think it's a wildcard, but isn't the * the wildcard above?

If pressed, I'd say the [(] is designating the character to be searched for (don't understand the syntax, and could not create that without the example), and I'd guess the + means include all characters until ) in the $ (sting?), but couldn't say why any of it is in that order, or why the * wildcard isn't used.

Again; this is probably understood fine by those that regularly work with expressions, and maybe it's best if I don't mess with it, since I don't understand it.

But, I'd like to (understand it), and you asked for feedback, so I'm just giving you my take.  I really like the Wiki overall, and the descriptions are generally fine, but it seems to assume a knowledge level that I don't have yet, and couldn't find reference to.

If you wished to make it more clear to the ignorant, perhaps an explanation of why those parts function the way they do, with that example would be useful.

Perhaps this syntax is covered elsewhere in the Wiki and I missed it, but I triple-checked the section on Regex() above this, but saw nothing but the /# explained (as far as symbols go).

Again, not criticism, just some friendly feedback  ;D

thanks :)
Logged
pretend this is something funny

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #10 on: August 29, 2011, 08:14:05 pm »

No worries JustinChase (which I invariably hear in my head as Just In Case when I read it).

Good feedback.  Let me comment.

There are two concepts here: 1) the syntax of the expressions, and 2) usage of the expressions.  Marco's excellent wiki page generally does (1) with some examples of (2) to help out.  The entry in that page for Regex() also does (1), with a little more (2) than usual because of some special circumstances - there is a language within a language, and that requires special usage and attention.

Another Wiki page will be devoted entirely to a basic overview of regular expressions themselves (that is, arg. 2 of the Regex() function, the Regular expression component), and their usage within MC (that would be concept (2) above).  Some of your questions are really about a concept (3), which is, the syntax and meaning of the regular expression language itself.

The language of RE's is just too large a topic to fit into the MC expression language page.

Your questions about asterisk, dot, question mark, brackets, etc. are all expected.  You're coming from a Wildcard frame of reference, and are thinking they are the same.  They aren't, so you'll want to push aside your wildcard knowledge when considering RE's, which use a much richer, broader language (albeit a compact, seemingly cryptic one).

There is a thread on the beta board which has a number of discussions, overviews, and explanations, but I have yet to move its contents over to the MC16 board.

Just for you:

. matches exactly 1 character
* means any number, including 0, of the previous thing (called an atom)
+ like *, but means 1 or more
? means the previous thing (atom) is optional
[ ] means a Character Class, and these brackets generally turn off special meaning of other characters, w/some exceptions
( ) are Captures, which mean group and remember (and their contents can be recalled later - this is useful)
^ means Match at Beginning of Line
$ means Match at End of Line

and there are more.
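The two wiki examples asked about can be tried in Python as a rough sketch. The track and artist strings are sample data, and `re.IGNORECASE` is an assumption made only because "Bigman" appears in the wiki's sample output.

```python
import re

track_names = [
    "Big Butter and Egg Man",
    "Big Man",
    "Big Manager",
    "It's a Bigman Thing",
    "Little Man",
]

# "." is any single character and "*" repeats it zero or more times,
# so "Big.*Man" allows anything (or nothing) between the two words.
big_man = re.compile(r"Big.*Man", re.IGNORECASE)
matching_tracks = [name for name in track_names if big_man.search(name)]

# "[(]" is a one-character class holding a literal "(", ".+" is one or more
# of any character, "$" anchors at the end, and the outer "( )" captures it all.
paren_tail = re.compile(r"([(].+)$")


def tail_from_paren(artist: str) -> str:
    """Return the text from the first "(" to the end, or "" when there is none."""
    match = paren_tail.search(artist)
    return match.group(1) if match else ""
```

Here `matching_tracks` holds the first four names, and `tail_from_paren("Passengers (Brian Eno/U2)")` gives `(Brian Eno/U2)`.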

So, hang in there just a bit longer.  You can also read about the RE language, which really is not complex, here:

http://msdn.microsoft.com/en-us/library/bb982727.aspx
http://www.grymoire.com/Unix/Regular.html

and grab a cheat sheet:

http://www.addedbytes.com/cheat-sheets/regular-expressions-cheat-sheet/
Logged
The opinions I express represent my own folly.

JustinChase

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 3276
  • Getting older every day
Re: Regex() expression language section ready for review...
« Reply #11 on: August 29, 2011, 10:05:59 pm »

No worries JustinChase (which I invariably hear in my head as Just In Case when I read it).

haha, that's funny!!!

That's actually how I came up with that 'name', probably 12 or 15 years ago. I thought it'd be a 'clever' twist to Just in Case; but no one, until now, has ever said that to me.  funny (to me :))

There is a thread on the beta board which has a number of discussions, overviews, and explanations, but I have yet to move its contents over to the MC16 board.


yeah, some days I miss the beta board more than others  :P

Anyway, thanks for the explanation!  I hate being ignorant, so I appreciate the help to learn :)

yeah, definitely unusual (to me), but I can see how it can be very powerful too. I'm glad I wasn't just missing the obvious there ;)

I'll have to take a look at your guides and cheat sheet soon.  it looks like that language would be useful in more than just MC

I look forward to your next well presented wiki.

thank you again for all that you do around here, I appreciate it.
Logged
pretend this is something funny

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #12 on: August 30, 2011, 10:48:31 am »

if [:punct:] = contains punctuation, why do we wrap it inside square brackets, [[:punct:]]? I tried the example without the second layer hoping to answer my own question and the results were, shall we say, interesting, but offered me no clue as to what was going on!!

-marko

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #13 on: August 30, 2011, 10:57:59 am »

The [] designate a Character Range.  So [A-Z] represents the character range A through Z, and [0-9] the character range 0 through 9.

But there are named character classes too, such as [:digit:] and [:upper:] and [:alnum:].   The brackets here are part of the named character class.

Otherwise, how would you distinguish:

   [:punct:]

the named character class from:

   [cnptu:]

a character range that contains only the chars c, n, p, t, u and : ?

So, defined named character classes look like:

   [:punct:]

and to use them, they are placed like other characters ranges, inside brackets:

   [[:punct:]]

And you can add more characters, and any order:

      [ab[:punct:]cd[:digit:]ABCD]

which means a, b, c, d, A, B, C, D, punctuation, and digits.
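A small Python sketch of the same distinction follows. Python's `re` module has no named classes, so the sketch assumes ASCII punctuation from `string.punctuation` as a stand-in for [:punct:]; the variable names are illustrative only.

```python
import re
import string

# ASCII punctuation, escaped so that ] \ ^ - stay literal inside a class.
PUNCT = re.escape(string.punctuation)

# [:punct:] on its own is just a set of the characters : p u n c t
plain_set = re.findall(r"[:punct:]", "cup: 1")

# Rough equivalents of [[:punct:]], [^[:punct:]] and [ab[:punct:]cd[:digit:]ABCD]
any_punct = re.compile(f"[{PUNCT}]")
no_punct = re.compile(f"[^{PUNCT}]")
mixed = re.compile(f"[ab{PUNCT}cd0-9ABCD]")
```

`plain_set` comes back as `['c', 'u', 'p', ':']`, which is the "interesting" result you get when the outer brackets are left off.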
Logged
The opinions I express represent my own folly.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #14 on: August 30, 2011, 11:03:32 am »

I updated my terminology above to better clarify the response.
Logged
The opinions I express represent my own folly.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #15 on: August 30, 2011, 11:04:15 am »

Crystal clear now, thank you.

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #16 on: August 30, 2011, 11:04:27 am »

I wrote this, but MrC was, as always, faster and more elaborate but I go ahead and post this anyway.

I made the same mistake. [:punct:] finds any one of the characters inside the brackets which is not what we want. I guess the [:punct:] has to be within brackets so that you can negate it with ^, i.e. [^[:punct:]]. One can wonder though why other characters were not chosen for those named classes, like {:punct:}.
Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #17 on: August 30, 2011, 11:23:11 am »

Esoteric historic stuff follows...

Well, {} were already in use for quantifiers (and now for other named entities and extensions).  And \{... has meaning too, so requiring folks to escape { and } inside a character range -- [\{] -- would have been excessive and problematic.  Remember, inside a character range, almost all characters have any special meta-character meaning disabled - * is just an asterisk, not a meta-character meaning 0 or more.  The exceptions are right bracket "]", dash "-", backslash "\", and caret "^".  Once RE language designers start extending those rules, existing REs can break, so great care was taken to ensure existing REs ran correctly.  This was difficult.

The number of grouping entities and metacharacters in RE has over the years become strained.  And there are plenty of variants of RE implementations: Basic, Extended, PCRE, POSIX, TR1, etc.  Some of these have ventured down their own paths.  For example, PCRE (following Perl) allows [:^alnum:] as the negated version of [:alnum:].
Logged
The opinions I express represent my own folly.

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #18 on: August 30, 2011, 11:33:13 am »

Interesting. I am not trying to rewrite the RE language, just learn it and accept it the way it is (i.e. powerful, but sometimes confusing)...

May I ask two things which I do not understand (and no, I have not done extensive testing, but was trying the short cut of asking someone who knows):

1. What if I look for a [:punct:] character but not & or ;? I guess ([[:punct:]]|[^&;]) will not work.

2. How do I get all entries which do not end with ] or )?

EDIT: Corrected the RE in p. 1.

EDIT 2: Finally figured p. 2 out myself - Regex([name],/#(.*[^\])]$)#/,1). My previous attempts at this failed due to a stupid user error, as always.
Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #19 on: August 30, 2011, 12:38:17 pm »

...I am not trying to rewrite the RE language, ...

Understood - I updated my historical explanation above, clarifying that I mean RE designers, not literally "you" vagskal.  ("you" colloquially in English often means "anybody", as in "once you go down that path...")

1. What if I look for a [:punct:] character but not & or ;? I guess ([[:punct:]]|[^&;]) will not work.

The alternation says match this OR that, and if one satisfies the condition, the RE engine is happy.

Keep in mind - the RE engine really wants to succeed, so when it finds a match, it uses it.  It only backtracks and rejects the match if subsequent matches further along in the string fail.  So the alternation you have above won't work.

You can use negative lookahead assertions to test that something is or isn't going to match, without consuming characters (these are zero-width assertions, just like ^, $ and \b).  For example, the following:

   (?![&;])[[:punct:]]

ensures the character being considered now is not either & or ; and if that is true, proceed trying to match the next pattern, which here is [[:punct:]].
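Here is a minimal Python sketch of why the alternation fails and the lookahead works. As Python has no [:punct:], ASCII punctuation from `string.punctuation` is assumed as a substitute, and the sample text is invented.

```python
import re
import string

PUNCT = re.escape(string.punctuation)

# The alternation: "&" is punctuation, so the first branch happily accepts it.
either = re.compile(f"[{PUNCT}]|[^&;]")

# The lookahead first checks "not & or ;" without consuming anything,
# then the character class must match at that same position.
punct_not_amp_semi = re.compile(f"(?![&;])[{PUNCT}]")

hits = punct_not_amp_semi.findall("rock & roll; live!")
```

`either.fullmatch("&")` succeeds, which is the unwanted result, while `hits` contains only `"!"`.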
Logged
The opinions I express represent my own folly.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #20 on: August 30, 2011, 01:36:17 pm »

Imagine any standard MC list field...

If I wanted to match, and capture, xxxanything; and ditch the rest how would I get the expression engine to stop at the first semicolon?

regex([field],/#(xxx.*;)#/,0) catches the "xxx.*" part, but then captures everything up to the last instance of ";" in the string.
I've tried many things, without success. (Well, without achieving my goal, a few of my failures did teach me other things, so not all bad) :)

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #21 on: August 30, 2011, 02:14:24 pm »

Given your description:

   xxxanything;

really means that anything either does or does not contain a semicolon.  If it did, you'd really have:

   xxxanything_non-semi; anything

and that pretty much spells out in english what you want in RE language:

   xxx[^;]+;

where [^;]+ means one or more non-semicolon characters.  With the subsequent semicolon, the RE engine would match at the first semicolon (assuming xxx is a literal xxx and not some other match anything expression).

As an aside, and important later, RE's have a concept of greediness.  They consume as much as they can, such that the match succeeds.  Simply put, the RE worker-bee keeps consuming characters until it hits a failing dead-end.  Then it backs up, and tries something else.

To force the RE worker-bee to be not so greedy, use a ? after a * or +, to mean:

*?   0 or more times, but as few as possible (non-greedy repetition)
+?  1 or more times, but as few as possible (non-greedy repetition)

So a captured, non-greedy RE would be:

   (xxx.*?;)
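The three variants can be compared side by side in Python. This is only a sketch with an invented list-style string; the quantifiers behave the same way in Python's `re` as described above.

```python
import re

field = "xxxParis; London; Rome;"

# Greedy: ".*" runs to the end, then backs up only as far as the LAST semicolon.
greedy = re.search(r"xxx.*;", field).group()

# Non-greedy: ".*?" takes as little as possible, so it stops at the FIRST semicolon.
lazy = re.search(r"xxx.*?;", field).group()

# Negated class: "[^;]+" cannot cross a semicolon at all.
negated = re.search(r"xxx[^;]+;", field).group()
```

`greedy` is the whole string, while `lazy` and `negated` are both `xxxParis;`.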
Logged
The opinions I express represent my own folly.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #22 on: August 31, 2011, 01:32:43 am »

Magic reply. I'm getting all that expression help I've dished out over the years back with bells on here!! :)

This is a bit like those energy saving neon bulbs we bought, you know the kind? You switch the lights on, and about 30 minutes later they start kind of glowing ;)

I was approaching this in very linear fashion, seemed most logical to me, and I was toiling....
Quote
really means that anything either does or does not contain a semicolon.
You're right, it does, but I simply never thought of it like that, and never would have without your help.

I read about 'greediness' and 'laziness' on the 'cheat sheet' site you linked to and thought I'd nailed it, but it wasn't to be... I wasn't putting the '?' in the right place. DOH!
I got it to work my way in the end, but your answer is so much more elegant. I have to start thinking like a RE engine... :)

-marko

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #23 on: August 31, 2011, 01:38:34 am »

... I have to start thinking like a RE engine... :)

Yeah, that's exactly it.  It is like a 5-year old racing around the house, who wants to run, and you have to constantly slow him down.

Good to hear of your great progress!
Logged
The opinions I express represent my own folly.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #24 on: August 31, 2011, 03:07:26 am »

Indeed. There was another problem, in that if "xxxanything" came at the end of the string, that was it, it just ended, there was no semicolon to key from, so for those cases, I was getting failures. I tried a few things, messing around with pipes '|' for "Or" and double captures, and couldn't quite get there.

I now have:
Code: [Select]
if(regex([keywords],/#(!Places[^;]+)#/,0),

replace([R1],!Places\,),
Failure\No Places)&datatype=[list]
Let's see... This works because I'm saying, "Give me everything that comes after !Places that's not a semicolon", right? So, when it finds one, it stops, or when it gets to the end of the string, it stops. How neat is that?

This does the same job as a ridiculously long expression, in a tiny fraction of the time taken by that long expression. Amazing stuff.

Edit:
Now, about that "global" switch you mentioned a few days back...

Is it worth asking for this ability? I personally really, really need it to be able to do the same as the above and pull all the people out of my nested keywords field. I'm not certain what I'm asking for though...
Would a 'g' switch also capture (up to nine) instances automatically, for example, or do we just need that 'g' switch plus a touch of MrC magic?

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #25 on: August 31, 2011, 05:43:37 am »

I mean RE designers, not literally "you" vagskal.

I understood that. No offence taken.

You can use negative lookahead assertions

Thanks! I had not come around to the lookaround functions, but I see now how they can be necessary sometimes.
Logged

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #26 on: August 31, 2011, 06:28:46 am »

The \b anchor seems to think that an accented character (é, for example) is a word boundary. This expression finds Jalkéus at the end of the artist field:
Code: [Select]
Regex([Artist], /#\b(us)$#/,1,1)
Is the \b anchor supposed to work that way?

Another oddity is that this expression will not find the song "What's Going On", or anything else:
Code: [Select]
Regex([Name], /#\b(On)$#/,1,1)
Am I doing something stupid again?
Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #27 on: August 31, 2011, 10:57:55 am »

The \b anchor seems to think that an accented character (é, for example) is a word boundary. This expression finds Jalkéus at the end of the artist field:
Code: [Select]
Regex([Artist], /#\b(us)$#/,1,1)
Is the \b anchor supposed to work that way?

\b means Word Boundary, and a word contains A-Za-z0-9_ characters.  It is a limited form of word, so you'll need another assertion here.
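The effect can be reproduced in Python, as a sketch only. Python's `re` treats accented letters as word characters by default, so the `re.ASCII` flag is used here on the assumption that it approximates the A-Za-z0-9_ definition of a word described above.

```python
import re

artist = "Jalk\u00e9us"  # "Jalkéus" with a precomposed é

# With an ASCII-only notion of "word", é is a non-word character,
# so a boundary exists between é and "us" and the pattern matches.
ascii_hit = re.search(r"\b(us)$", artist, re.ASCII)

# With Unicode-aware word characters, é belongs to the word: no boundary, no match.
unicode_hit = re.search(r"\b(us)$", artist)

# The second expression from the question does match a plain title.
on_hit = re.search(r"\b(On)$", "What's Going On")
```

`ascii_hit` matches, `unicode_hit` is `None`, and `on_hit.group(1)` is `On`.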


Another oddity is that this expression will not find the song "What's Going On", or anything else:
Code: [Select]
Regex([Name], /#\b(On)$#/,1,1)
Am I doing something stupid again?

This one works for me.  I have 61 such songs, including What's Going On.
Logged
The opinions I express represent my own folly.

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #28 on: August 31, 2011, 11:40:34 am »

Thanks for the reply and your patience!

I guess the word boundary anchor and the predefined char classes are not as useful as I first thought. Do you know of an easy way to define a range that includes also accented characters? The Swedish alphabet, for example, ends with zåäö. Will a-ö include åäö? The tutorial I used indicates that the \w thing will include also accented characters. I guess [^\w], or simply \W, could be used instead of \b.

Do you know of a site that enumerates exactly which characters each predefined char class encompasses, like the [:punct:] class. I cannot find such listings when following the links you have provided.

Forget about the other RE. It was just me being stupid again and the RE engine being picky about having every single char entered the right way.

Logged

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #29 on: August 31, 2011, 01:14:53 pm »

Indeed. There was another problem, in that if "xxxanything" came at the end of the string, that was it, it just ended, there was no semicolon to key from, so for those cases, I was getting failures. I tried a few things, messing around with pipes '|' for "Or" and double captures, and couldn't quite get there.

Alternation, looking for semicolon, or end of line, would solve this problem:

   xxx[^;]+(;|$)
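A quick Python sketch of that alternation, with made-up field values and a hypothetical helper name:

```python
import re

# After the non-semicolon run, accept either a semicolon or the end of the string.
item = re.compile(r"xxx[^;]+(;|$)")


def first_item(field: str) -> str:
    """Return the first xxx... item, with its semicolon when one follows."""
    match = item.search(field)
    return match.group(0) if match else ""
```

`first_item("xxxone; two; three")` returns `xxxone;` and `first_item("one; xxxlast")` returns `xxxlast`, so the item is found whether or not it ends the string.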

I now have:
Code: [Select]
if(regex([keywords],/#(!Places[^;]+)#/,0),

replace([R1],!Places\,),
Failure\No Places)&datatype=[list]
Let's see... This works because I'm saying, "Give me everything that comes after !Places that's not a semicolon", right? So, when it finds one, it stops, or when it gets to the end of the string, it stops. How neat is that?

Right, it Matches !Places followed by 1 or more non-semicolon's, as far as it can go...

Neat.

This does the same job as a ridiculously long expression, in a tiny fraction of the time taken by that long expression. Amazing stuff.

That's exactly why I pushed so hard for them. :-)  The existing left, right, removeright, etc. functions, while trivial to use, are too specific *when the job is generalized string matching and extraction*.  Nice to have them around, but too limiting to be useful across a broader range of problems.

Now, about that "global" switch you mentioned a few days back...

Is it worth asking for this ability? I personally really, really need it to be able to do the same as the above and pull all the people out of my nested keywords field. I'm not certain what I'm asking for though...
Would a 'g' switch also capture (up to nine) instances automatically, for example, or do we just need that 'g' switch plus a touch of MrC magic?

Matt probably spent a lot of time implementing RE support, and I certainly don't want to push here.  I understand they have their priorities and respect that.

I personally think it would complete the package, and would be tremendously useful.

But let me clarify what global is and is not, and what is required to be useful.  Say you have your string (e.g. your keywords)

   abc;def;gh;ijkl;

and you want to capture the stuff in between the semicolons (we'll take the semicolons too, for an easier RE).  We can always repeatedly match such patterns with grouping, and a quantifier:

   ^([^;]+;)+$

or match exactly, say, 6:

   ^([^;]+;){6}$

But with capturing comes the wrinkle.  You can capture the first and last one easily, and can even get the Nth one (a bit clumsy, but fine), but there is no way in RE's by themselves to get all of them as independent captures.
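The wrinkle is easy to see in Python. The following sketch uses the sample string from above; note that it has four items, so the exact-count pattern is tried with 4 as well as 6.

```python
import re

keywords = "abc;def;gh;ijkl;"

# The group repeats four times, but only the final repetition is remembered.
repeated = re.search(r"^([^;]+;)+$", keywords)
last_capture = repeated.group(1)

# Capturing the first item needs its own, unrepeated group.
first_capture = re.match(r"([^;]+;)", keywords).group(1)

# Exact counts: the sample has four items, so {4} matches and {6} does not.
exactly_four = re.search(r"^([^;]+;){4}$", keywords) is not None
exactly_six = re.search(r"^([^;]+;){6}$", keywords) is not None
```

`last_capture` is `ijkl;` and `first_capture` is `abc;`; nothing in a single match hands back `def;` and `gh;` as separate captures.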

Instead, to implement global capture, the developers using the RE engine write a loop, which progressively runs the RE over spans of the string.  The RE engine has the ability to remember where it last stopped, so the developer can re-call the RE with the sub-string for subsequent matches.  So it would look like:

  while not at end of string {
        match current start of string against RE   <-- captures happen here
        set start of string to last position used by RE
  }

Of particular note, there is an implicit last-capture idea here.  The captures occur during matching, so only the last captured item would be available.  So this would not be any more powerful than just writing an expression that captures the last match (and we know we can do this).
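As a rough illustration of that loop, here it is in Python. The helper name is hypothetical, and unlike an expression language with a fixed set of capture variables, the sketch can simply collect each capture in a list (Python's own `re.finditer` does much the same job).

```python
import re


def capture_all(pattern: re.Pattern, text: str) -> list:
    """Run the pattern repeatedly, resuming where the previous match stopped."""
    found = []
    position = 0
    while position < len(text):
        match = pattern.match(text, position)   # captures happen here
        if match is None or match.end() == position:
            break                               # no match, or no progress
        found.append(match.group(1))
        position = match.end()                  # resume after this match
    return found


items = capture_all(re.compile(r"([^;]+);"), "abc;def;gh;ijkl;")
```

`items` ends up as `['abc', 'def', 'gh', 'ijkl']`.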

The power comes when replacement (aka substitution) is implemented, and the global concept applies here.  With substitution comes the ability to replace matched patterns with specified strings (which can even include previous captures!).   We could, for example, append our keywords with X (where the stuff inside the last / / chars is the replacement text).

   keywords: abc;def;gh;ijkl;

   subst [keywords],  /([^;]+);/,  /\1 X;/    keywords now: abc X;def;gh;ijkl;

or we could even remove the keyword:

   subst [keywords],  /([^;]+;)/,  //            keywords now: def;gh;ijkl;

But what if we wanted to do that everywhere in the string?  This is where global comes in:

   Gsubst [keywords],  /([^;]+);/,  /\1 X;/     keywords now: abc X;def X;gh X;ijkl X;

or we could, like above, replace the keyword with nothing at all:

   Gsubst [keywords],  /([^;]+;)/,  //           keywords now: <empty>

Again, this is implemented in the loop construct mentioned above, where at each iteration, the captures are set, their positions in the string remembered by the RE engine, and the implementer does the necessary mechanics to replace the sub-string with the specified replacement text.  So, global only has meaning with substitution.
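Python's `re.sub` happens to offer both behaviours, so the difference between a single and a global substitution can be sketched there. subst and Gsubst are not real MC functions; in this sketch the semicolon is kept outside the capture so that the X lands before it.

```python
import re

keywords = "abc;def;gh;ijkl;"
item = re.compile(r"([^;]+);")

# count=1 plays the role of a single substitution: only the first match changes.
first_tagged = item.sub(r"\1 X;", keywords, count=1)
first_removed = item.sub("", keywords, count=1)

# Without a count, every match is replaced: the "global" behaviour.
all_tagged = item.sub(r"\1 X;", keywords)
all_removed = item.sub("", keywords)
```

`first_tagged` is `abc X;def;gh;ijkl;`, `all_tagged` is `abc X;def X;gh X;ijkl X;`, and `all_removed` is the empty string.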

When would this be useful? 

 1. when you want to globally replace matches (including with nothing, or amending the captures)
 2. when you have sub-strings in text you want removed, so that you can capture what remains

Let's say I have some file names, such as:

   foo_bar-some__thing__ (v22).pdf

and you want to get rid of the parens, underscores, and dashes, and replace each occurrence by a single space:

   Gsubst [filename (name)],  /[()_-]+/,  / /           filename now: foo bar some thing v22.pdf

Without global, you can only do this generally with one of the occurrences.
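Trying the file name in Python shows one subtlety worth knowing. Taken literally, the class [()_-] does not include the space, so the existing space and the closing parenthesis each leave an extra blank. The sketch below is one possible way to reach the tidy name, assuming the extension is split off first; the helper name is made up.

```python
import os
import re

filename = "foo_bar-some__thing__ (v22).pdf"

# The pattern as written: the original space survives next to the replacements.
literal = re.sub(r"[()_-]+", " ", filename)


def tidy(name: str) -> str:
    """Collapse runs of space, parens, underscore and dash, keeping the extension."""
    stem, extension = os.path.splitext(name)
    return re.sub(r"[ ()_-]+", " ", stem).strip() + extension


cleaned = tidy(filename)
```

`cleaned` is `foo bar some thing v22.pdf`, whereas `literal` keeps three spaces before `v22` and one before `.pdf`.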

Or, I have a list of key / value pairs in some text field, such as performers, and the pairs are instrument: name as follows:

   trombone: sue; vocals: sally; drums: sam;

With

   Gsubst [performers],  /([^:]+): ([^;]+);/,  /\2;/

we'd obtain the list of performers sans instruments:

   sue;sally;sam;

and

   Gsubst [performers],  /([^:]+): ([^;]+);/,  /\1;/

generates the instruments:

   trombone; vocals; drums;

So, I think it is tremendously useful, and completes the generalization of today's specific functions (this supersedes and generalizes Replace() and RemoveCharacters(), which are too specific, and supplements ListBuild()).
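The key / value example can be checked with Python's `re.sub`, which substitutes globally by default. This is a sketch only: with the sample written as instrument: name, group 1 holds the instrument (plus any leading space) and group 2 holds the performer.

```python
import re

performers = "trombone: sue; vocals: sally; drums: sam;"
pair = re.compile(r"([^:]+): ([^;]+);")

# Keep only the second capture of every pair: the performer names.
names_only = pair.sub(r"\2;", performers)

# Keep only the first capture of every pair: the instruments.
instruments_only = pair.sub(r"\1;", performers)
```

`names_only` is `sue;sally;sam;` and `instruments_only` is `trombone; vocals; drums;`. The spaces survive in the second result because they are part of what group 1 matched.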

Some would argue that the examples are contrived.  Well, they are a little, but only for the sake of brevity and explanation of a more general concept - making the examples more complex doesn't serve much purpose other than making them more difficult to read.  On the other hand, we've already seen some requests (you, me, others) who would find this quite useful.  And then there are those who won't consider MC because they want this ability and other tools such as mp3tag support this (so they go there).

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #30 on: August 31, 2011, 01:36:32 pm »

Quote
Thanks for the reply and your patience!

I guess the word boundary anchor and the predefined char classes are not as useful as I first thought. Do you know of an easy way to define a range that includes also accented characters? The Swedish alphabet, for example, ends with zåäö. Will a-ö include åäö? The tutorial I used indicates that the \w thing will include also accented characters. I guess [^\w], or simply \W, could be used instead of \b.

Do you know of a site that enumerates exactly which characters each predefined char class encompasses, like the [:punct:] class? I cannot find such listings when following the links you have provided.

My pleasure.

Certain portions of these classes are locale dependent.  You can also use \w as you mention, and you can use Unicode ranges if you want to, such as:

   Regex([Name], /#([\u00c0-\u02b8])#/, 1)

Use the Windows Character Map to see the range.  Basically, a range is just a monotonically increasing sequence of begin - end, where begin and end are character specifiers, and the sequence is defined as per ASCII or Unicode.  Be sure to check out Character Range here:

http://msdn.microsoft.com/en-us/library/bb982727.aspx

as this will generalize and broaden your understanding of what character ranges are (and are not), and what you can do with character ranges. 

When REs were developed, it was essentially a strictly ASCII world.  This stuff all got messy once Unicode was considered.  By then, though, certain sequences such as \w, \b, etc. were already defined.
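
To make the range question concrete, here is a small sketch in Python (an assumption: Python's re module stands in for MC's engine, and the two may treat \w and locale differently). It checks whether a-ö covers åäö by code point, what else slips into that range, and what the explicit Unicode range from the expression above picks up.

```python
import re

# Ranges follow code-point order: 'a' is U+0061 and o-umlaut is U+00F6.
a_to_o_umlaut = re.compile("[a-\u00f6]")

swedish = "\u00e5\u00e4\u00f6"  # a-ring, a-umlaut, o-umlaut
swedish_in_range = all(a_to_o_umlaut.fullmatch(ch) for ch in swedish)

# The same range also admits everything else between the end points,
# for example '{' (U+007B) and the multiplication sign (U+00D7).
strays_in_range = all(a_to_o_umlaut.fullmatch(ch) for ch in "{\u00d7")

# In Python 3, \w on str patterns is Unicode-aware, so it accepts them too.
swedish_are_word_chars = all(re.fullmatch(r"\w", ch) for ch in swedish)

# The explicit range used above: U+00C0 through U+02B8.
accented = re.compile("[\u00c0-\u02b8]")
first_accented = accented.search("Bj\u00f6rk").group()

print(swedish_in_range, strays_in_range, swedish_are_word_chars, first_accented)
```

So a-ö does include åäö, but only because the range sweeps up every code point in between, punctuation included. An explicit list such as [a-zåäö] is tighter when only letters are wanted.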

rick.ca

  • Citizen of the Universe
  • *****
  • Posts: 3729
Re: Regex() expression language section ready for review...
« Reply #31 on: August 31, 2011, 03:17:53 pm »

Quote
So, I think it is tremendously useful, and completes the generalization of today's specific functions (this supersedes and generalizes Replace() and RemoveCharacters(), which are too specific, and supplements ListBuild()).

I would take this a step further, and suggest that, for the average user, this would just provide what they expect or hope would be possible in the first place. For example, understanding the current function can do things like replace all occurrences of "dog" with "cat" in a string field like [Name], such a user might reasonably (from their perspective) ask, "Why can't I do the same thing with [Keywords]?" In fact, if their data is already "clean," they may more readily see applications for list fields that require this capability.

Whatever the implementation, there are benefits to be had for anyone willing to invest a little time learning how to use regex. But the most significant and readily available benefits probably don't require a particularly deep understanding of regex. It seems to me, however, those benefits do require the implementation include the "global" capability.

marko

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 9139
Re: Regex() expression language section ready for review...
« Reply #32 on: September 07, 2011, 03:00:40 am »

Figuring it may help to keep as many regex questions as possible in one thread, I'll ask this here. Also, I know this is not factually correct; it's purely for the learning curve...

In answering another question, MrC gave the following regex example in reply:
Start simply; you don't need to create a new field, as you can use existing data.

Create a new expression column in a panes view.  Enter the following as the expression:

   If(Regex([File Type], /#^(flac|wav)$#/), Lossless, Lossy)

It looks for file types flac or wav, and outputs the label Lossless, otherwise it outputs the label Lossy.  Add more lossless types as necessary, each separated by the | character as above.
   If(Regex([File Type], /#^(flac|wav|ogg|mpc)$#/),Lossless,Lossy)

This is great. Anyone with the most rudimentary knowledge of MC expressions should be able to quickly see that this neatly replaces the need to use four nested "if(isequal(" expressions.

questions:
  • The 'beginning of line' and 'end of line' instructions ('^' and '$')... What do they mean exactly? I ask because if I remove them from the expression, If(Regex([File Type], /#(flac|wav|ogg|mpc)#/),Lossless,Lossy) it still works. I thought maybe they were there merely as 'good practice', so, I knocked up an alternative...
    If(Regex([File Type], /#^[3cpm]$#/),Lossless,Lossy)
    This should mark mp3 and mpc as lossless, but it fails. If I remove the hat and the dollar characters, it works. Why is that? At the end of this post, I thought I was close to understanding, but my findings down there do not bear out here, so the question still stands :)

  • Comparing the two examples above: If(Regex([File Type], /#[3cpm]#/),Lossless,Lossy) and If(Regex([File Type], /#(flac|wav|ogg|mpc)#/),Lossless,Lossy):
    The first will positively match mp3 and mpc, but if the characters in the second example are jumbled up, 'calf', for example, would not be a positive match for 'flac'. Is this "just the way it works"?

  • Continuing with the examples above, using [^3cpm] would reverse the expression results, mp3 and mpc would now be flagged as lossy files. Is there a way to say "Not flac or wav or ogg or mpc"? [^(flac|wav|ogg|mpc)] perhaps?


To further my learning, I tried to get the same match from the [filename] field: if(regex([filename],/#.*(fla|wav|mpc|ogg)#/),Lossless,Lossy). At first, it would have appeared to have worked, but closer examination proved otherwise, and demonstrates the reason that fully understanding this stuff before letting it loose on your library is, well, kind of important :)

MrC, can you correct, expand, confirm my theory:
Regex matches the entire string given, in this case, the filename field, against a given pattern. I have asked it to find anything in the filename (.*) followed by either fla or wav or mpc or ogg. If any of these are found, mark it as lossless, otherwise, mark it as lossy. Some mp3s were being marked as lossless because they contained things such as "Catriona and the Waves" (don't ask!!) or "E Flat Minor". Adding the dot extension marker to the pattern to match takes care of these, so, remembering that the dot needs to be escaped....
if(regex([filename],/#.*\.(fla|wav|mpc|ogg)#/),Lossless,Lossy)
anything, followed by a dot followed by fla or wav or mpc or ogg is marked lossless, everything else is marked lossy. The hat and the dollar do make a difference here. As given, this works, but if I add the hat and the dollar, 'fla' no longer positively matches 'flac'. I think I may have answered my own question here, because, removing the "\." and adding the hat and the dollar means that "Catriona and the Waves" no longer returns a false positive...

"^" is the regex equivalent of MC's [Starts with"
"$" is the regex equivalent of MC's "Ends with]

If that's right, "WooHoo!" :D

I've just previewed this post, and I find it messy and noisy. I've left it as-is because it's already taken too long to compose, and clearly demonstrates me flailing around, trying to get to grips with regex!! Hopefully it's comprehensible, despite the noise.

-marko

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #33 on: September 07, 2011, 03:45:16 am »

Maybe I can give you a couple of clues from my limited experience until MrC answers.

^[3cpm]$ does not work because what is inside the [] will match only one (1) character, either 3, c, p or m. And the ^ and $ will in this case consequently match only fields with just one (1) character.

.*(fla|wav|mpc|ogg) does not work because the alternatives separated by | can match anywhere in the field. .* means zero or more characters, so "fla" anywhere in a field, even at the very beginning, would produce a hit. (flac|wav|mpc|ogg)$ should work because then the RegEx is looking only at the end of each field.

Vincent Kars

  • Citizen of the Universe
  • *****
  • Posts: 1154
Re: Regex() expression language section ready for review...
« Reply #34 on: September 07, 2011, 05:24:21 am »

When building a list to emulate multiple Artist you need to use a common delimiter e.g. ;
As I have other delimiters I use something like
ListBuild(1,;,Replace(Replace(Replace(Replace(Replace([Album Artist];[Artist];[Composer];[Conductor],//,;),/ ,;),/,),.),;;,;))&datatype=
    I have the feeling that these nested Replace can be replaced by 1 Regex but as I’m totally unable to understand Regex can some kind soul supply an example?

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #35 on: September 07, 2011, 05:47:55 am »

Quote from: Vincent Kars
I have the feeling that these nested Replace can be replaced by 1 Regex

I do not think this will be possible/really useful until the MC implementation of RegEx supports global search, i.e. the ability to find all instances of "/", for example, in the fields [Album Artist];[Artist];[Composer];[Conductor]. MC currently does not support global search.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #36 on: September 07, 2011, 11:25:45 am »

I see vagskal responded nicely, so my additions are supplemental.

Quote from: marko
In answering another question, MrC gave the following regex example in reply:

   If(Regex([File Type], /#^(flac|wav|ogg|mpc)$#/),Lossless,Lossy)

This is great. Anyone with the most rudimentary knowledge of MC expressions should be able to quickly see that this neatly replaces the need to use four nested "if(isequal(" expressions.


Exactly.  And this is such a perfect example, I'll add it to one of the docs.

Quote from: marko
questions:
  • The 'beginning of line' and 'end of line' instructions ('^' and '$')... What do they mean exactly?

Their exact meaning - match at the beginning or ending, respectively, of a line.  These match a position, not any characters (and hence are called zero-width assertions - they consume no width in characters, and assert something is true).  They are true when the position of the RE engine within the string being matched is exactly the beginning or end of a line (and I won't go into exactly what is a line here, as it isn't relevant in MC's current implementation).

Quote from: marko
I ask because if I remove them from the expression, If(Regex([File Type], /#(flac|wav|ogg|mpc)#/),Lossless,Lossy) it still works. I thought maybe they were there merely as 'good practice', so, I knocked up an alternative...


They work in this case because file types happen to be short, and our examples are clear.  However, if there was a filetype named mpcL (eg. some lossy version of mpc), then your "mpc" alternate would match it too, and would cause the expression to report erroneously.  So the anchors are to ensure that the full file type is matched, exactly, and no sub-string matches are allowed.
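
A quick way to see the difference the anchors make, sketched in Python (an assumption: Python's re stands in for MC's engine; label() is a made-up helper that mimics the If(Regex(...), Lossless, Lossy) expression, and "mpcL" is the hypothetical file type from the paragraph above):

```python
import re

ANCHORED = re.compile(r"^(flac|wav|ogg|mpc)$")
UNANCHORED = re.compile(r"(flac|wav|ogg|mpc)")

def label(pattern, file_type):
    """Mimic If(Regex(...), Lossless, Lossy) for one file type string."""
    return "Lossless" if pattern.search(file_type) else "Lossy"

for file_type in ("flac", "mp3", "mpcL"):
    print(file_type, label(UNANCHORED, file_type), label(ANCHORED, file_type))
```

Only the anchored pattern reports the made-up "mpcL" type as Lossy; the unanchored one is satisfied by the "mpc" sub-string.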

Quote from: marko
If(Regex([File Type], /#^[3cpm]$#/),Lossless,Lossy)
This should mark mp3 and mpc as lossless, but it fails. If I remove the hat and the dollar characters, it works. Why is that? At the end of this post, I thought I was close to understanding, but my findings down there do not bear out here, so the question still stands :)


The [ ] construct is a single character match, where any of the characters listed inside are acceptable.  Since you have no file types named exactly "3" or "c" or "p" or "m", the RE must fail.

Quote from: marko
  • Comparing the two examples above: If(Regex([File Type], /#[3cpm]#/),Lossless,Lossy) and If(Regex([File Type], /#(flac|wav|ogg|mpc)#/),Lossless,Lossy):
    The first will positively match mp3 and mpc, but if the characters in the second example are jumbled up, 'calf', for example, would not be a positive match for 'flac'. Is this "just the way it works"?

I think the above explanation will satisfy this question.  Remember, brackets [ ] mean single character match, and while they look like just some alternate form of grouping, they are not (aside from the obvious fact that they do contain a list of acceptable characters to match).
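
The single-character nature of [ ] is easy to check by experiment. A minimal sketch in Python (assuming Python's re as a stand-in for MC's engine; matches() is a made-up helper):

```python
import re

def matches(pattern, text):
    """True when the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None

# Anchored: the whole string must be exactly one of 3, c, p, m.
anchored_class_mp3 = matches(r"^[3cpm]$", "mp3")

# Unanchored: any single 3, c, p, or m anywhere is enough...
loose_class_mp3 = matches(r"[3cpm]", "mp3")
# ...which also means "flac" qualifies, because it contains a "c".
loose_class_flac = matches(r"[3cpm]", "flac")

# Alternation compares whole strings in order, so jumbled letters fail.
alternation_calf = matches(r"(flac|wav|ogg|mpc)", "calf")

# A quantifier turns the class into "one or more of these characters",
# which accepts "mp3" but equally accepts nonsense such as "ppp".
repeated_class_mp3 = matches(r"^[3cpm]+$", "mp3")
repeated_class_ppp = matches(r"^[3cpm]+$", "ppp")

print(anchored_class_mp3, loose_class_mp3, loose_class_flac)
print(alternation_calf, repeated_class_mp3, repeated_class_ppp)
```

Note the unanchored [3cpm] is not really "mp3 or mpc" at all: it reports a hit for flac as well. For whole-word choices, alternation with anchors is the safer tool.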

Quote from: marko
  • Continuing with the examples above, using [^3cpm] would reverse the expression results, mp3 and mpc would now be flagged as lossy files. Is there a way to say "Not flac or wav or ogg or mpc"? [^(flac|wav|ogg|mpc)] perhaps?

You can use negative look ahead, but that expression is more complex than it's worth.  Better to invert the outcome of the RE than to try to reverse the logic of an RE.
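
For comparison, both approaches sketched in Python (an assumption: Python's re stands in for MC's engine, and the function names are made up). In MC terms, inverting the outcome just means swapping the two result arguments of If().

```python
import re

LOSSLESS = re.compile(r"^(flac|wav|ogg|mpc)$")
# Negative look-ahead version: succeed only if the whole string
# is NOT one of the listed types.
NOT_LOSSLESS = re.compile(r"^(?!(?:flac|wav|ogg|mpc)$).*$")

def is_lossy_inverted(file_type):
    """Run the simple RE and flip the result."""
    return LOSSLESS.search(file_type) is None

def is_lossy_lookahead(file_type):
    """Let the RE itself express the negation."""
    return NOT_LOSSLESS.search(file_type) is not None

for file_type in ("flac", "mp3", "flacx", ""):
    print(repr(file_type), is_lossy_inverted(file_type), is_lossy_lookahead(file_type))
```

Both give the same answers on these inputs, but the first pattern can be read at a glance and the second cannot.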

Quote from: marko
To further my learning, I tried to get the same match from the [filename] field: if(regex([filename],/#.*(fla|wav|mpc|ogg)#/),Lossless,Lossy). At first, it would have appeared to have worked, but closer examination proved otherwise, and demonstrates the reason that fully understanding this stuff before letting it loose on your library is, well, kind of important :)


This is where talking to yourself like the RE engine works helps avoid hopeful RE construction.  You were hoping .* would know to match all the front matter, and leave what was left to the suffixes listed in your alternation strings.  But what the expression really says is match zero or more of any character, repeat that as many times as you can, and then try to match one of the listed alternation strings and these strings can match anywhere.  So the pattern would match "aflac", "flac", "notaflac.ogg", and "flacflacfahfah.wav".

Quote from: marko
MrC, can you correct, expand, confirm my theory:
Regex matches the entire string given, in this case, the filename field, against a given pattern. I have asked it to find anything in the filename (.*) followed by either fla or wav or mpc or ogg. If any of these are found, mark it as lossless, otherwise, mark it as lossy. Some mp3s were being marked as lossless because they contained things such as "Catriona and the Waves" (don't ask!!) or "E Flat Minor". Adding the dot extension marker to the pattern to match takes care of these, so, remembering that the dot needs to be escaped....


Actually, Regex() will use the given string as input to match against the provided RE, but the provided RE may specify matching all possibilities ranging from nothing to everything.  The goal, and often trick, is to write an RE pattern that forces the exact matches you want, and no more or less.  In your ...Waves example, yes, the "wav" in the alternation pattern matched, so the RE was done - it matched.  The assertion that 0 or more characters, followed by the exact characters "wav" was true; as soon as that occurs, the RE engine is done.

Quote from: marko
if(regex([filename],/#.*\.(fla|wav|mpc|ogg)#/),Lossless,Lossy)
anything, followed by a dot followed by fla or wav or mpc or ogg is marked lossless, everything else is marked lossy. The hat and the dollar do make a difference here. As given, this works, but if I add the hat and the dollar, 'fla' no longer positively matches 'flac'. I think I may have answered my own question here, because, removing the "\." and adding the hat and the dollar means that "Catriona and the Waves" no longer returns a false positive...


Yes, that's why the anchors are available: they force the match to the beginning or end of a line, thus eliminating sub-matches.

Of note, the expression to match lossless suffixes against the filename field is more simply stated as:

   regex([filename], /#\.(flac|wav|mpc|ogg)$#/)

Notice the anchor at the end, and no leading .* is necessary (in fact, in this case it is entirely redundant and unless you are capturing the front matter, irrelevant).
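
A small sketch in Python to compare the patterns on file names (assumptions: Python's re stands in for MC's engine, matching is made case-insensitive since "Waves" and "Flat" produced hits in marko's library, and the file names are invented for illustration):

```python
import re

LOOSE = re.compile(r".*(fla|wav|mpc|ogg)", re.IGNORECASE)
SUFFIX = re.compile(r"\.(flac|wav|mpc|ogg)$", re.IGNORECASE)
SUFFIX_WITH_PREFIX = re.compile(r".*\.(flac|wav|mpc|ogg)$", re.IGNORECASE)

names = [
    "Catriona and the Waves - 01.mp3",
    "Sonata in E Flat Minor.mp3",
    "notaflac.ogg",
    "track.flac",
]

# For each name: (loose hit, suffix hit, suffix hit with leading .*)
results = {
    name: (
        bool(LOOSE.search(name)),
        bool(SUFFIX.search(name)),
        bool(SUFFIX_WITH_PREFIX.search(name)),
    )
    for name in names
}

for name, hits in results.items():
    print(name, hits)
```

The loose pattern fires on all four names, while the anchored suffix pattern only accepts the last two. The leading .* changes nothing about which names match, which is why it can be dropped unless the front matter is being captured.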

Quote from: marko
"^" is the regex equivalent of MC's [Starts with"
"$" is the regex equivalent of MC's "Ends with]

If that's right, "WooHoo!" :D


Pretty much, but I won't get into why this isn't entirely accurate just yet.  For now, it is good enough, and definitely deserves a WooHoo!

Quote from: marko
I've just previewed this post, and I find it messy and noisy. I've left it as-is because it's already taken too long to compose, and clearly demonstrates me flailing around, trying to get to grips with regex!! Hopefully it's comprehensible, despite the noise.

-marko

The post was great - and surely helps folks who have the same questions and confusions.

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #37 on: September 07, 2011, 11:36:43 am »

Quote from: MrC
my additions are supplemental.

You are very humble. :)

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #38 on: September 07, 2011, 11:39:14 am »

Quote from: vagskal
I do not think this will be possible/really useful until the MC implementation of RegEx supports global search, i.e. the ability to find all instances of "/", for example, in the fields [Album Artist];[Artist];[Composer];[Conductor]. MC currently does not support global search.

To clarify a minor point, it is substitution that is necessary.

The RE library employed by Matt / MC comes with a regex() function, which is in use now.  It also provides a regex_replace() function, which would perform substitutions (one or more "globally").  It is that function that we'd like to see made available in a RegexReplace() or RegexSubstitute() function.  This would a) make global substitution generally possible, and b) reduce expressions such as Vincent Kars'

   Replace(Replace(Replace(Replace(Replace([Album Artist];[Artist];[Composer];[Conductor],//,;),/ ,;),/,),.),;;,;))...

into something more manageable such as:

   RegexReplace([Album Artist];[Artist];[Composer];[Conductor], /#[/;]#/, ;, GLOBAL)

( I have not evaluated all the delimiters specified in the original expression above, but the idea holds regardless. )

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #39 on: September 09, 2011, 09:10:50 am »

I thought I should share a couple of RegEx() that I have found useful.

1. The MC grouping shows only numbers, A-Z, and "Others". Accented characters like ÅÄÖ are sorted under "Others" together with (*'[, as in (Multiple Artists). For my Album Artist grouping I wanted all characters to show, a separate entry for (Multiple Artists), all numbers sorted under #, and leading The, A and An as well as non-word, non-digit characters (as in *NSYNC, 'Til Tuesday and ...And You Will Know Us by the Trail of Dead) to be ignored when grouping. This RE does that:
Code:
If(IsEqual([Album Artist (auto)],/(Multiple Artists/),1),[Album Artist (auto)],If(IsRange(Regex([Album Artist (auto)],/#(the |a |an )?.*?([\d\w])#/,2),1-9),#,[R2]))
I like having artists beginning with La, Le and Les sorted under L, but if you do not you can substitute (the |a |an ) with (the |a |an |le |la |les ).

The (the |a |an ) looks for any of those words followed by a space and puts whatever it finds in container [R1]. This means that those words, when they occur at the beginning of the field, are removed (along with the trailing space). | means "or". Since not all artists begin with one of those words, we must make the group optional by putting a "?" after the ). The .* part skips over any non-word, non-digit characters: . means any character and * means repeat zero or as many times as possible. The following ? proved necessary, but I cannot explain why. \d means any digit and \w any word character. The surrounding square brackets mean that any one (1) word character or digit will provide a hit; I could have used (\d|\w) instead, like at the beginning of the RE. This makes the preceding .* part "eat" every character until a word character or a digit is found. The () surrounding the square brackets capture the single word character or digit and put it in container [R2], which is then used later in the MC expression.

2. MC's FormatDate([Date],decade) function only works with date type fields (and you cannot easily convert the content of a non date type field into a date type field). I have a custom integer type field ([Trackyear]) with the year a recording was originally released that I use for compilation albums. I wanted to group by decade using the original release year if any. This RE does that:
Code:
If(Regex(If(IsEmpty([Trackyear]),[Year],[Trackyear]),/#([12][0-9][0-9])#/,0),[R1]0's,Missing)
The ([12][0-9][0-9]) part could be substituted with (\d{3}). \d means any one (1) digit and {3} means repeat exactly 3 times.

I actually also have a custom integer field ([Live Date]) where I enter the year of recording for live performances and for cases where the studio recording took place long before the release. This RE takes into account also that field:
Code:
If(Regex(If(IsEmpty([Live Date]),If(IsEmpty([Trackyear]),[Year],[Trackyear]),[Live Date]),/#(\d{3})#/,0),[R1]0's,Missing)
I am sure MrC will correct any mistakes and point out possible enhancements.
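
For experimenting with the decade logic outside MC, a minimal sketch in Python (assumptions: Python's re stands in for MC's engine, field values arrive as plain strings, and pick_year() and decade() are made-up helper names):

```python
import re

def pick_year(live_date, track_year, year):
    """First non-empty value, in the same order as the nested If(IsEmpty(...))."""
    return live_date or track_year or year

def decade(year_text, pattern=r"(\d{3})"):
    """Capture the first three digits and append 0's, else report Missing."""
    m = re.search(pattern, year_text)
    return f"{m.group(1)}0's" if m else "Missing"

STRICT = r"([12][0-9][0-9])"  # first digit must be 1 or 2

print(decade(pick_year("", "1987", "2005")))
print(decade(pick_year("1969", "1987", "2005")))
print(decade(pick_year("", "", "")))
print(decade("0987"), decade("0987", STRICT))
```

The last line shows that the two patterns are close but not identical: (\d{3}) accepts any three digits, so an odd value such as 0987 yields a decade, whereas ([12][0-9][0-9]) rejects it. For clean four-digit years the results are the same.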

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #40 on: September 09, 2011, 04:12:53 pm »

Nice example.

Quote from: vagskal
...The following ? proved necessary, but I cannot explain why. \d means any digit and \w any word character.

This occurs because the RE engine wants to consume as much as it can.  The RE .*[\d\w] allowed the RE engine to consume everything up until a final digit or word character, and thus be TRUE.  So adding the non-greedy qualifier to the repetition allowed it to match minimally.

Care needs to be taken whenever using 0 or more quantifiers, because they allow for many paths.  Always try to use something that suggests to the RE engine which path to take.  For example, do you really want ANYTHING to match just prior to the \d or \w characters, or do you really mean not-\d and not-\w?  Using a more restrictive RE such as:

   Regex([Album Artist (auto)], /#^(the |a |an )?[^\d\w]*([\d\w])#/, 2)

probably aligns better with the requirements and allows you to remove the non-greedy qualifier.  Also, whenever you can, use anchors such as ^ and $ when you really mean to match at the beginning and/or end.  This significantly reduces the number of possible matches that must be attempted, especially when using quantifiers that allow 0 items (i.e. * and ?).  When the alternates "the ", "a ", and "an " do not match at the beginning of the string, the RE engine must try all other locations within the string to see if they might match, and when they do, then further test the remainder of the RE to see if it can still succeed.  If it cannot, it backs up and starts again down the next possible path.  Think of looking for an insect in a tree by following the trunk up through each branch, through each twig, to each leaf, one branch/twig/leaf at a time until you've covered them all.  It is much less effort if someone indicates left-most branch, top twig, right leaf.
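
The three variants can be compared side by side. A minimal sketch in Python (assumptions: Python's re stands in for MC's engine, matching is case-insensitive so that "The" matches "the ", and grouping_char() is a made-up helper returning capture 2):

```python
import re

GREEDY = r"(the |a |an )?.*([\d\w])"
LAZY = r"(the |a |an )?.*?([\d\w])"
RESTRICTIVE = r"^(the |a |an )?[^\d\w]*([\d\w])"

def grouping_char(pattern, artist):
    """Return capture 2, the character the artist would be grouped under."""
    m = re.search(pattern, artist, re.IGNORECASE)
    return m.group(2) if m else None

for artist in ("The Beatles", "*NSYNC",
               "...And You Will Know Us by the Trail of Dead"):
    print(artist,
          grouping_char(GREEDY, artist),
          grouping_char(LAZY, artist),
          grouping_char(RESTRICTIVE, artist))
```

With the greedy .* the capture lands on the last word character of the name (the "s" of Beatles), because .* first swallows everything and then gives back just one character. The lazy and the restrictive versions both pick the first word character after the optional article.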

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #41 on: September 10, 2011, 04:13:00 am »

Thanks, MrC.

Now I am trying to clean up the sorting in my Album Artist pane with this expression:
Code:
If(IsEqual([Album Artist (auto)],/(Multiple Artists/),1),[Album Artist (auto)],Regex([Album Artist (auto)],/#/#^[^\d\w?]*([\d\w\?].*?)(?<=[\[("])[\])"]?(.*)#/,-1)[R1][R2])
But I cannot get it to work. The idea is to strip non word characters and digits from the beginning of the field and strip the first ], ) or " if a [, ( or " exists earlier in the field.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #42 on: September 10, 2011, 11:56:08 am »

Unfortunately, there are no look-behinds in this RE implementation.  And in general, they are fixed width anyway.

When you want to find a sequence such as:

   front_matter "bar" back_matter

You're really looking for:

  front_matter (quote non-quote+ quote)? back_matter

By lumping all the mid-stuff together into a single optional component where each sub-component must match, it eliminates the multiple optional sub-spans, which tend to do the wrong thing because there are too many possibilities, one of which won't be what you want.

Try this RE:

   ^[^?\w]*([?\w].*?)(?:["(\[]([^("\]\[)]+)[")\]])(.*)$

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #43 on: September 11, 2011, 04:36:02 am »

Quote from: MrC
Unfortunately, there are no look-behinds in this RE implementation.

No wonder I could not get that to work. I do hope Matt tells us what RE flavour is used and points us to a reference for the functions in that flavour.

Quote from: MrC
   ^[^?\w]*([?\w].*?)(?:["(\[]([^("\]\[)]+)[")\]])(.*)$

Thanks for the instructive example! I did not know that I could capture a part within a non capture (?:...) group.

I am afraid the expression does not do what I am after. First, it outputs nothing if the field does not contain "([)], which is the case with most of my artists.

I would like to strip the first ], ) or " only if a [, ( or " exists earlier in the field and has already been stripped.
[Low Budget] Blues Band => Low Budget Blues Band
[re:jazz] => re:jazz
The Charlatans [UK] => The Charlatans [UK]
"Fast" Eddie Clark => Fast Eddie Clark
Joe "Mr. Piano" Henderson => Joe "Mr. Piano" Henderson
...And You Will Know Us by the Trail of Dead => And You Will Know Us by the Trail of Dead

Is this possible?

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #44 on: September 11, 2011, 12:29:08 pm »

There may be an ambiguity in the input.  It appears you want brackets and the like removed, unless they are at the end (The Charlatans [UK]), so long as they are not also at the front ([re:jazz])?

Can you clarify?

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #45 on: September 11, 2011, 12:54:17 pm »

Thanks for your assistance! I cannot figure this out myself.

I have checked the data and as regards [] you are correct (I have only artists beginning with or ending with [...], no artists like "Henry [The Frogman] Clark"). With () it is different. I have artists beginning with (...), where the () should be removed, artists with (...) in the middle, "Clarence (Sonny) Clark", and artists ending with (), "Satie, Erik (1866-1925)". A quick check indicates that the () in the middle could/should be replaced by "", so if this bit is difficult leave it out.

Before trying out my failed expression I checked that I did only have artists with a pair of either "", () or [] (and not two of those pairs).

(I am actually trying out a field of my track artists, album artists and composers combined in one list type field , i.e. all people involved. I know this will not work until global search/replace is implemented, but I wanted to take into account all people variations I have.)

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #46 on: September 11, 2011, 05:14:32 pm »

So I'll group the ambiguous cases into the Remove Brackets category. 

case 1: [Low Budget] Blues Band  => remove brackets at beginning
case 2: The Charlatans [UK]      => do not remove brackets at end
case 3: [re:jazz]                => CONFLICT: case 1, but also case 2

That is, case 3 cannot satisfy both case 1 and case 2 rules, so a new rule would need to be created.  But ignoring that, here's an example of how you might satisfy the requirement.  The idea is to take the output of one Regex and pass it to another, but conditionally:

regex(if(regex([Artist],
     /#^(.*?)(?:["(\[](?=[^)("\]\[]+[")\]]))(?:([^)("\]\[]+)[")\]])(.*)$#/, 0),
       [R1][R2][R3], [Artist]),
  /#^[^\w\d\["()\]]*(.*)$#/, 1)

The inner Regex looks for minimal amounts of front-matter, then uses positive look-ahead to scan for an opening quote, bracket, or paren followed by stuff followed by a closing quote, bracket, or paren.  If that is seen, it consumes up to the closing quote, bracket, or paren, and then grabs possible back-matter.

This is placed in a conditional, and either the matched values or the original [Artist] is returned, and this is then passed to another Regex which eliminates the initial leading punctuation chars.

The reason for breaking this up into two Regex's is that it eliminates the difficulty of defining an RE to match the front-matter, while not interfering with the subsequent text.  The RE becomes too complex to manage in one shot.

note: Replace Artist with Album Artist (auto), which I used to simplify my tests.
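
The same two-stage idea, sketched in Python for anyone who wants to step through it (assumptions: Python's re stands in for MC's engine, the two patterns are copied from the expression above, and clean_artist() is a made-up name):

```python
import re

# Stage 1: front-matter, a quoted/bracketed span, back-matter.
INNER = re.compile(
    r'^(.*?)(?:["(\[](?=[^)("\]\[]+[")\]]))(?:([^)("\]\[]+)[")\]])(.*)$'
)
# Stage 2: drop leading characters that are neither word characters
# nor one of the quote/bracket/paren characters.
OUTER = re.compile(r'^[^\w\d\["()\]]*(.*)$')

def clean_artist(artist):
    """Feed the conditional result of the first RE into the second."""
    m = INNER.search(artist)
    stage1 = "".join(m.groups()) if m else artist  # [R1][R2][R3] or the original
    return OUTER.search(stage1).group(1)

for artist in ("[Low Budget] Blues Band", "The Charlatans [UK]",
               "...And You Will Know Us by the Trail of Dead"):
    print(clean_artist(artist))
```

Running it shows that the brackets are removed wherever the pair occurs, so a trailing [UK] loses its brackets as well.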

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #47 on: September 12, 2011, 03:37:31 am »

Thanks, MrC! I am learning a lot.

I totally agree with your assessment that the RE becomes too complex to manage in one shot.  :)

Your two step approach made me think. Why not keep it as simple as possible?

regex(if(regex([Artist],
     /#^[\[("](.*)?[\])"]+(.*)$#/, 0),
       [R1][R2], [Artist]),
  /#^[^\w\d?]*(.*)$#/, 1)

The first RE catches only fields that begin with [(" and strips those characters at the beginning of the field and the following ])". Then the next RE can strip any other non word and non digit characters (and not ?, for ? and the Mysterians) at the beginning of the field. This seems to do what I am after.

I think your expression strips [] also at the end of the field, i.e. the [UK] part of The Charlatans [UK].
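
Here is the simplified version as a Python sketch, checked against the wish list from my earlier post (assumptions: Python's re stands in for MC's engine and clean_artist() is a made-up name):

```python
import re

# Stage 1: only fields that BEGIN with [, ( or " are touched.
LEADING_PAIR = re.compile(r'^[\[("](.*)?[\])"]+(.*)$')
# Stage 2: strip remaining leading non-word characters, but keep "?".
LEADING_JUNK = re.compile(r'^[^\w\d?]*(.*)$')

def clean_artist(artist):
    """Apply stage 1 when it matches, then always apply stage 2."""
    m = LEADING_PAIR.search(artist)
    stage1 = (m.group(1) or "") + m.group(2) if m else artist
    return LEADING_JUNK.search(stage1).group(1)

wish_list = {
    "[Low Budget] Blues Band": "Low Budget Blues Band",
    "[re:jazz]": "re:jazz",
    "The Charlatans [UK]": "The Charlatans [UK]",
    '"Fast" Eddie Clark': "Fast Eddie Clark",
    'Joe "Mr. Piano" Henderson': 'Joe "Mr. Piano" Henderson',
    "...And You Will Know Us by the Trail of Dead":
        "And You Will Know Us by the Trail of Dead",
    "? and the Mysterians": "? and the Mysterians",
}

for raw, wanted in wish_list.items():
    print(raw, "=>", clean_artist(raw), "OK" if clean_artist(raw) == wanted else "DIFFERS")
```

One thing to keep in mind: the (.*) in the first RE is greedy, so it runs to the last closing character in the field. That is fine as long as each artist has at most one pair, which is the case in my library.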

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Regex() expression language section ready for review...
« Reply #48 on: September 12, 2011, 11:38:03 am »

You're welcome.  I like your simplification!  It also defines/encodes a final rule to not strip trailing [stuff] as in "The Charlatans [UK]".

I realized at one point yesterday that \w\d is redundant - only \w is necessary since it includes \d ( i.e. \w is the character range: [a-zA-Z0-9_] ).  But somehow I added it back to the outermost RE.

This was a good, instructive example of how to take the conditional output of one Regex() and pass it to another Regex() for additional processing.
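
The redundancy of \d next to \w is easy to confirm. A minimal sketch in Python (assumptions: Python's re stands in for MC's engine, and the ASCII flag is used so that \w means exactly [a-zA-Z0-9_] as stated above; the sample string is made up):

```python
import re

digits = "0123456789"

# Every digit is already a word character...
digits_are_word_chars = all(re.fullmatch(r"\w", ch, re.ASCII) for ch in digits)
# ...and so is the underscore.
underscore_is_word_char = re.fullmatch(r"\w", "_", re.ASCII) is not None

# So a class listing both behaves exactly like one listing only \w.
sample = "...And 1 You_Will? [Know]"
with_both = re.sub(r"[^\w\d]", "", sample, flags=re.ASCII)
with_word_only = re.sub(r"[^\w]", "", sample, flags=re.ASCII)

print(digits_are_word_chars, underscore_is_word_char)
print(with_both, with_word_only)
```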

vagskal

  • Citizen of the Universe
  • *****
  • Posts: 1227
Re: Regex() expression language section ready for review...
« Reply #49 on: September 12, 2011, 11:51:09 am »

Quote from: MrC
\w is the character range: [a-zA-Z0-9_] ).

So a word character can be a digit? Strange. I wondered why your previous RE worked as regards numbers in the beginning.

One caveat with having this expression in a pane is that scroll by typing in the pane does not work reliably. Often the first letter typed, or the last letters, are not recognized.