INTERACT FORUM

Author Topic: Parsing a MUSICIANCREDITS or TMCL field  (Read 2741 times)

tiberiuspv

  • Recent member
  • *
  • Posts: 49
Parsing a MUSICIANCREDITS or TMCL field
« on: January 13, 2013, 02:23:42 pm »

Most of my music files have a MUSICIANCREDITS (for FLAC) or TMCL (for MP3) tag. The value of the field is a list of null-separated pairs, where each pair consists of an instrument and a musician, separated by a NULL. This is as prescribed by the MP3 tagging spec. My intent is to parse that field to provide a replacement for a multiple artist field (as soloists in the standard MC scheme).
First step is to define a MUSICIANCREDITS field as text and import the values from the file tags. MC replaces the NULLs in the field value with \x0D\x0A when reading the raw tag, which is fine. Then, I replace all those newlines with a semicolon, which I can do in an expression column with:
Code:
replace([MUSICIANCREDITS],/#
#/,;)
where the newline is escaped by the slash-sharp sequence. The first problem is that I cannot figure out how to do that in a calculated field expression, as I cannot insert the newline character (typing it closes the dialog, cut-and-paste stops at the newline).
The next step is to pick up only the soloists, not the instruments. Converting to a list like in
Code:
listcombine([MUSICIANCREDITS],,/#
#/,;)&datatype=[list]
removes all duplicates and leaves you with a list which has each instrument and each soloist. Is there a simple way to remove the instruments (assuming I know exactly all the possible instruments)?
Alternatively, I could replace every other newline with a backslash, and the other newlines with a semicolon, to get a hierarchy instrument->soloist. But I can't figure out how to do that for 2 reasons: can't figure out how to put a newline in a regex, and can't figure out how to do global replacements using regex.
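For reference, the pairing and hierarchy idea above can be sketched outside MC. This is a minimal Python sketch, not MC expression syntax. It assumes the raw value uses CR/LF as the separator (as MC presents it), and the function names parse_credits and to_hierarchy are made up for illustration.

```python
def parse_credits(raw, sep="\r\n"):
    """Split a flat role/name/role/name string into (role, name) pairs."""
    items = [item for item in raw.split(sep) if item]
    # Even positions hold the roles, odd positions hold the names.
    return list(zip(items[0::2], items[1::2]))


def to_hierarchy(pairs):
    """Join the pairs as role\\name entries, separated by semicolons."""
    return ";".join(f"{role}\\{name}" for role, name in pairs)


# Illustrative sample value only.
raw = "Soprano\r\nDorothea Röschmann\r\nAlto\r\nAndreas Scholl"
pairs = parse_credits(raw)
hierarchy = to_hierarchy(pairs)
```

Every other separator ends up as a backslash and the rest as semicolons, which is the instrument->soloist shape described above.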
Am I missing the obvious? Thanks for any suggestion.

MrC

  • Citizen of the Universe
  • *****
  • Posts: 10462
  • Your life is short. Give me your money.
Re: Parsing a MUSICIANCREDITS or TMCL field
« Reply #1 on: January 13, 2013, 02:28:16 pm »

Quick question - have you tried making MUSICIANCREDITS a list field?

There is no way to insert newlines, btw.
The opinions I express represent my own folly.

tiberiuspv

« Reply #2 on: January 13, 2013, 02:59:29 pm »

Yes, but it leaves it as a single item containing CR/LF sequences (which is not surprising, since lists are supposed to be semicolon-separated, hence my replace approach). If I could find a way to replace a newline sequence with a semicolon in a calculated field definition, I would be fine. Is there any escape syntax that would allow me to do that? It seems the only escape syntax in expressions is /#...#/ or /x where x is any single character. Is there a way to escape an arbitrary hex character, in the spirit of \x0D?
I also tried another tack: hexify the value, replace %0D%0A. But I can't figure out a way to un-hexify the result without a lot of pain and suffering (quite a few accents, umlauts, and the like...).

MrC

« Reply #3 on: January 13, 2013, 03:06:18 pm »

MC's input plug-ins should handle the conversion from the non-printables.  List values should be converted and imported as semicolon-separated values.

Do you have a file I can test with here?

MrC

« Reply #4 on: January 13, 2013, 03:21:21 pm »

Before you spend too much time on this, let me also add that there is no global replacement via regex().  But there are some clever techniques you can use to massage the data a bit.  Forget the hexify() route, and trying to change unprintable characters.  If you're seeing these, something is amiss and should be investigated (hence my request for a sample file).

To help, I'll need to see what your values look like.

tiberiuspv

« Reply #5 on: January 13, 2013, 03:46:04 pm »

Sure. What's the best way to upload a file? Or should I e-mail it to you?
Thanks for the help...

MrC

« Reply #6 on: January 13, 2013, 04:07:15 pm »

If you use Dropbox, you can place it there and PM me the location.

If the file is not too large for email, you can send it to the address I PM you.

tiberiuspv

« Reply #7 on: January 13, 2013, 04:15:43 pm »

Email sent, thanks.

MrC

« Reply #8 on: January 13, 2013, 04:52:25 pm »

So your FLAC file has values separated by 0d 0a (denoted by * below):

   MUSICIANCREDITS=Soprano*Dorothea Röschmann*Alto*Andreas Scholl*Tenor*Werner Güra*Bass*Klaus Häger*Choir*RIAS-Kammerchor

Typically FLAC files would have multiple key=value pairs:

MUSICIANCREDITS=Soprano Dorothea Röschmann
MUSICIANCREDITS=Alto Andreas Scholl
MUSICIANCREDITS=Tenor Werner Güra
MUSICIANCREDITS=Bass Klaus Häger
MUSICIANCREDITS=Choir RIAS-Kammerchor

with some separator in between the role and the name.

You could use the following expression to convert the string into an MC list:

replace(clean([MUSICIANCREDITS],3), / / , ;)

This gives you a semicolon-separated list (the clean() function replaces each non-printable with a space, so your two-character 0d 0a sequence is converted into two spaces).

At this point, there is no convenient or general way to globally reformat the values.  You could use regex() on up to 9 captures, or use ListItem() to pull even and odd values.  It is cumbersome.

Here are the first 5 artists, with the roles stripped out:

regex(replace(clean([MUSICIANCREDITS],3), / / , ;);, /#(?:[^;]+;([^;]+;))(?:[^;]+;([^;]+;))?(?:[^;]+;([^;]+;))?(?:[^;]+;([^;]+;))?(?:[^;]+;([^;]+;))?(?:[^;]+;([^;]+;))?#/,-1)[R1] [R2] [R3] [R4] [R5]
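To make the clean()/replace() step and the odd-position selection concrete, here is a rough Python emulation. It relies only on the behavior described above (each non-printable becomes one space); it is not MC's actual implementation, and the helper names clean_mode3 and soloists are hypothetical.

```python
import re


def clean_mode3(text):
    # Assumption taken from the description above: every control
    # character is replaced by exactly one space.
    return re.sub(r"[\x00-\x1f]", " ", text)


def soloists(raw):
    # CR/LF becomes two spaces, and two spaces become a semicolon.
    as_list = clean_mode3(raw).replace("  ", ";")
    items = as_list.split(";")
    # Odd positions (1, 3, 5, ...) hold the names; even positions hold roles.
    return items[1::2]


# Illustrative sample value only.
raw = ("Soprano\r\nDorothea Röschmann\r\nAlto\r\nAndreas Scholl"
       "\r\nChoir\r\nRIAS-Kammerchor")
names = soloists(raw)
```

Slicing has no equivalent of the 9-capture limit, which is why this is short in a general-purpose language and cumbersome in an expression.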

tiberiuspv

« Reply #9 on: January 13, 2013, 05:19:27 pm »

Clean() is the key - thanks for pointing that out. I did not know it also got rid of all non-printables (not mentioned in the excellent expressions wiki page). That solves most of my problem. Since my tagging is pretty picky, I should never have double spaces inside a name, so that is not an issue. I could also first replace all spaces with some special character, then apply clean, then replace the spaces back.
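The space-protection idea can be sketched as follows. This is a Python illustration of the three replace steps, not MC syntax. The placeholder character is an arbitrary choice assumed never to occur in the tags, and clean_nonprintables is a hypothetical stand-in for Clean() mode 3 as described in this thread.

```python
import re

# Hypothetical marker; assumed never to appear in any tag value.
PLACEHOLDER = "\u00a6"


def clean_nonprintables(text):
    # Stand-in for Clean() mode 3 as described in the thread:
    # each control character becomes one space.
    return re.sub(r"[\x00-\x1f]", " ", text)


def credits_to_list(raw):
    protected = raw.replace(" ", PLACEHOLDER)   # 1. hide the real spaces
    cleaned = clean_nonprintables(protected)    # 2. CR/LF -> two spaces
    listed = cleaned.replace("  ", ";")         # 3. two spaces -> semicolon
    return listed.replace(PLACEHOLDER, " ")     # 4. restore the real spaces


# The made-up name contains a double space that must survive.
raw = "Soprano\r\nAnna  Maria Test\r\nAlto\r\nAndreas Scholl"
result = credits_to_list(raw)
```

Because the genuine spaces are hidden before step 3, a double space inside a name is no longer mistaken for a separator.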
Getting the roles mixed in with the soloists is not a big problem, since there is a limited number of them. Picking every other value with regex() or ListItem() is, as you mention, cumbersome and hard to extend to the general case.
So, my current solution is
listbuild(1,;,replace(clean([MUSICIANCREDITS],3),/ / ,;))&datatype=
as a new calculated field. I'll play with it on my full collection and see how it works out.
Thanks for the help.

MrC

« Reply #10 on: January 13, 2013, 05:22:36 pm »

It occurred to me you might *not* want to use Clean() as that can muck with your contributor names.  So here's a regex() that handles the 0d 0a sequence directly.  This fudges a bit, but it will be fine - it looks for a two-character sequence of characters in the range of 0a to 0d.  Here's the ugly beast:

regex([MUSICIANCREDITS],
/#(?:[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))(?:[\x0a-\x0d]{2}[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))?(?:[\x0a-\x0d]{2}[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))?(?:[\x0a-\x0d]{2}[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))?(?:[\x0a-\x0d]{2}[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))?#/,-1)/
[R1]; [R2]; [R3]; [R4]; [R5]

Now if only there was a global regexreplace() function:

regexreplace([MUSICIANCREDITS], /#(?:[\x0a-\x0d]{2})?(?:[^\x0a-\x0d]+[\x0a-\x0d]{2}([^\x0a-\x0d]+))#/, \1, g)
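For comparison, this is roughly what such a global replacement looks like in a language that has one. It is a small Python sketch using re.sub, not MC syntax, and the function name strip_roles is hypothetical. The pattern is adapted to match an explicit CR/LF rather than a character range, and the replacement appends a semicolon so the names stay separated.

```python
import re

# One role/name pair, optionally preceded by the CR/LF left over from
# the previous pair; only the name is captured.
PAIR = re.compile(r"(?:\r\n)?[^\r\n]+\r\n([^\r\n]+)")


def strip_roles(raw):
    # re.sub replaces every match, i.e. it is global by default.
    return PAIR.sub(r"\1;", raw).rstrip(";")


# Illustrative sample value only.
raw = ("Soprano\r\nDorothea Röschmann\r\nAlto\r\nAndreas Scholl"
       "\r\nChoir\r\nRIAS-Kammerchor")
names = strip_roles(raw)
```

The number of pairs is unbounded here, unlike the fixed set of captures in the regex() expression.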

tiberiuspv

« Reply #11 on: January 13, 2013, 05:36:02 pm »

Yes, a global regexreplace() would be a godsend... This is probably the one missing function in the MC expression portfolio. I suspect the development team is right now more focused on the video functionality than on the core engine feature set. That makes complete sense from a business standpoint (larger market) and is much less risky (core engine changes are always major bug sources due to old undocumented behaviors that break code in unpredictable ways...). So I'm not too hopeful of seeing it any time soon.
I'll probably stick to the clean() approach, because I do have quite a few tracks with >9 soloists, which is the limit of how far you can scale the regex() approach.

Side question: in the replace() function, is there a way to escape characters in the \x0a style? It seems that the only escape sequences at that level are /#...#/ and /{single_character_in_base_ascii}.

MrC

« Reply #12 on: January 13, 2013, 05:38:47 pm »

No hex char tokens, I'm afraid.  Forward slash is the general escape character, and /##/ was introduced with Regex().

MrC

« Reply #13 on: January 13, 2013, 05:44:02 pm »

Quote from tiberiuspv: "Clean() is the key - thanks for pointing that out. I did not know it also got rid of all non-printables (not mentioned in the excellent expressions wiki page)."

I updated the Clean() mode 3 description on the wiki page.  Thanks for pointing this out.

tiberiuspv

« Reply #14 on: January 13, 2013, 06:48:28 pm »

No problem. Thanks a lot for the expression page - it is really excellent.
I just did this on my entire library and it works like a charm. It did point out quite a few tagging mistakes (TIPL vs TMCL confusions, some field typos here and there), so I have my work cut out for the rest of the day fixing all those...
Keeping a clean and consistent tagging structure for a medium to large classical collection is a major pain. I am now trying to convert to MC as my main environment, but I still need to maintain compatibility with my existing Squeezebox-based setup and multiple third-party taggers. I may move to MC as my tagging environment, but I am not comfortable enough yet. My usual process is to do most of the tagging work in Excel, and copy the tags over using an Excel-friendly tagger. When you have a hammer, everything looks like a nail  :) .
Thanks again for your help.

MrC

« Reply #15 on: January 13, 2013, 07:02:23 pm »

There are a number of us SB users here, so you're in good hands.  I only tag in MC.  Rarely will I use mp3tag, for testing or for a rare, esoteric need.

There are lots of shortcuts in MC for tagging - if you find yourself working hard, step back, and see if there is an easier way (or ask for help).