INTERACT FORUM


Author Topic: How to find nearly duplicate tagged fields  (Read 4245 times)

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
How to find nearly duplicate tagged fields
« on: June 04, 2016, 11:23:34 pm »

I've run into a situation where my classical [Composition] field (can't remember if this is a stock field or one that I created) has a number of near-duplicate entries, usually in a given composition's opus number when that number is a single digit. For example, I had Schumann's Carnaval for Piano Op. 09 but also Carnaval for Piano Op. 9 (no zero before the 9 in the latter, if you didn't notice). This happened because I had to uninstall and reinstall Musichi (from which I populate [Composer] and [Composition]), which has a "Clean" feature that will search the [Name], [Filename], and some other fields to populate [Composer], [Artist], and [Composition]. However, the program allows you to alter how the [Composition] field will look (Concerto for Violin in D major Op. 23, say, vs. Violin Concerto Op. 23 in D major), and after the reinstall I didn't set at least one of the settings the way I had it before.

What I'm wondering if I can do (probably via an expression) is search for any near-duplicate compositions within a given composer's set of compositions, like the Schumann example above, without manually having to go through each one.  I think the only problem areas are with single-digit opus numbers, as the basic naming order I had before the reinstall is right, e.g. Name, Instruments, Rank, Tonality, etc. Maybe by the first word in [Composition] within a composer? Or is this another MCUtils job?

Thanks!

ferday

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 1732
Re: How to find nearly duplicate tagged fields
« Reply #1 on: June 05, 2016, 12:50:30 am »

You can build a smartlist on any field, for example:

~dup=[composition]

You can use multiple fields (separated by commas), but they all have to match exactly.

So you'd still need some manual effort, since these aren't *exactly* a match.  I think it may be possible... I'll think on it until blgentry comes up with a great solution :)
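Since an exact-match duplicate search misses "Op. 09" vs. "Op. 9", one way around it is to compare a normalized copy of each title instead of the title itself. The following is only a rough sketch in Python, outside Media Center; it assumes the library has been exported as (composer, composition) pairs, and the helper names `normalize` and `near_duplicates` are made up for illustration.

```python
import re
from collections import defaultdict

def normalize(title):
    """Build a comparison key: drop leading zeros, ignore case and extra spaces."""
    key = re.sub(r"\b0+(\d)", r"\1", title)   # "09" -> "9", "007" -> "7"
    return " ".join(key.lower().split())

def near_duplicates(tracks):
    """Group titles per composer by their normalized key."""
    groups = defaultdict(set)
    for composer, title in tracks:
        groups[(composer, normalize(title))].add(title)
    # Keep only keys that were reached by more than one distinct spelling.
    return {key: sorted(titles) for key, titles in groups.items() if len(titles) > 1}

tracks = [
    ("Schumann", "Carnaval for Piano Op. 09"),
    ("Schumann", "Carnaval for Piano Op. 9"),
    ("Verdi", "Aida (1871)"),
]
dupes = near_duplicates(tracks)
```

Here `dupes` holds one entry for Schumann that lists both spellings, which is the list you would then fix by hand.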

Werner

  • Junior Woodchuck
  • **
  • Posts: 72
Re: How to find nearly duplicate tagged fields
« Reply #2 on: June 05, 2016, 06:31:57 am »

You can do it manually. The effort is not too big: take a pane view and add a pane with "Composition" in the upper part. Now you see all the titles from your library there. The near-duplicates will not be very far from each other. Click on the one you want to correct and click once more (not a double click, but a little more slowly). Now the name is highlighted and you can change it, so all the tracks are changed at once. Do this for all the near-duplicates and your library is clean.
(I hope this is intelligible in spite of my adventurous English...)

blgentry

  • Regular Member
  • Citizen of the Universe
  • *****
  • Posts: 8014
Re: How to find nearly duplicate tagged fields
« Reply #3 on: June 05, 2016, 07:53:41 am »

I don't know enough about classical music cataloging to be very sure about this.  But it sort of looks like the OP number is the main identifier and the rest is just words around it.  If that's true, maybe you should try to capture the OP number to a different field, and clean it in the process using the number of digits you want to standardize on.

After you do that, you could identify any fields that had the OP number "wrong" in terms of the number of digits and substitute in the correct number.  Or perhaps the second step would be to try to parse out the other identifying words like "(performance type) for (instruments) in (key)" .  If you had all of that parsed correctly, you could then rebuild the composition field based on the other data you had:  OP number, performance type, instruments, and key.
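As an illustration of the "standardize on a number of digits" idea, here is a small Python sketch rather than a Media Center expression. The prefix list and the function name `pad_catalog_number` are assumptions for the example; adjust them to whatever catalogue abbreviations your library actually uses.

```python
import re

# Catalogue prefixes assumed for this example; extend as needed.
PATTERN = re.compile(r"\b(Op|RV|K|Hob|D)\b\.?\s*(\d+)", re.IGNORECASE)

def pad_catalog_number(title, width=2):
    """Rewrite e.g. 'Op. 9' as 'Op. 09' (width=2); width=1 strips the padding."""
    def fix(match):
        prefix = match.group(1)            # keep the prefix as it was typed
        number = int(match.group(2))       # int() drops any leading zeros
        return f"{prefix}. {number:0{width}d}"
    return PATTERN.sub(fix, title)

padded = pad_catalog_number("Carnaval for Piano Op. 9")
unpadded = pad_catalog_number("Carnaval for Piano Op. 09", width=1)
```

The word boundaries around the prefix keep a single letter such as "D" or "K" from matching inside ordinary words, and a key like "in D major" is left alone because no digits follow it.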

Regex would almost certainly have to be used in parsing the existing composition field.  I can probably help with that.

Just some thoughts this morning.

Brian.

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #4 on: June 05, 2016, 08:40:42 am »

@ferday
Yeah, I actually tried the smartlist thing but couldn't get it to work because the compositions are almost duplicates but not quite.  I think you'd have to have part of the string constant and look for that, like the first 8 or 10 characters in the string.

@Werner
Thanks for the suggestion: that might work, though it would probably take longer than my patience would allow, just because there are hundreds and hundreds of compositions in my main classical view.

@blgentry
It's tricky with opus numbers, though, because not all compositions have them, or they're not always called "opus." For example, Mozart had his works cataloged by someone named "Kochel" (if memory serves), so everything is K. x and not Op. x. Same with Haydn or Schubert, too ("Hob. x" and "D. x" respectively). Anyway, I think the Regex you'd have to do would be a certain number of characters at the beginning of the string--how many, though, is also tricky, since some composition names are short (e.g., Verdi's opera "Aida (1871)") and some are really long (Vivaldi's "Concerto da camera for Violins & Cello & Oboe & Violin & Bassoon in D Major RV.  90 "Il gardellino""--also note the "RV" here instead of "opus"). I don't know: maybe it's not possible, and I'll have to do it manually. Sigh.

blgentry

  • Regular Member
  • Citizen of the Universe
  • *****
  • Posts: 8014
Re: How to find nearly duplicate tagged fields
« Reply #5 on: June 05, 2016, 08:54:59 am »

Regex can have optional parts and the number of characters doesn't matter.  As a very simple example, you could write something like this:

/#.+(OP|RV|K)(\.?\s*)([0-9]+)#/

That would capture anything starting with OP, or RV, or K with an optional "." , optional spaces, and then one or more digits.  You could then output what you wanted, probably something like:

[R1]. [R3]

This would output the prefix, a period, a single space, and then the digits.  Everything else in the field would be totally ignored.

This example is simplified.  It would probably have to be tweaked several times to account for different structures in your composition fields.  I think it's do-able as a project to extract Opus (and friends) numbers.
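For anyone who wants to try the pattern outside Media Center first, here is a hedged Python rendering of what is described above: a prefix, an optional period, optional whitespace, then digits. The function name `catalog_id` is made up for the example, and case is ignored so that "Op" matches as well as "OP".

```python
import re

# Group 1: catalogue prefix; group 2: optional period and spaces; group 3: digits.
PATTERN = re.compile(r".+(OP|RV|K)(\.?\s*)([0-9]+)", re.IGNORECASE)

def catalog_id(composition):
    """Return 'prefix. digits', or None when no catalogue number is found."""
    match = PATTERN.search(composition)
    if match is None:
        return None
    # Equivalent of outputting "[R1]. [R3]": group 2 is matched but not reused.
    return f"{match.group(1)}. {match.group(3)}"

violin = catalog_id("Concerto for Violin in D major Op. 23")
vivaldi = catalog_id('Concerto da camera in D Major RV.  90 "Il gardellino"')
opera = catalog_id("Aida (1871)")
```

Titles without one of the listed prefixes, such as the Verdi opera, simply return None, so they can be filtered out and handled separately.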

Brian.

ferday

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 1732
Re: How to find nearly duplicate tagged fields
« Reply #6 on: June 05, 2016, 10:42:52 am »

By using an expression to filter a list into smaller chunks, at least the manual labour is lessened somewhat.  Blgentry's regex would do a good job of minimising the list length, I would guess.  You could then parse that list further and extract into separate tags.

I can't offer any help with MCUtils right now, but it is at the top of my list of things to learn; it very well may offer a better solution for you.

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #7 on: June 05, 2016, 12:35:39 pm »

Thanks all for the help!

Brian, do I just plug what you wrote into a Regex() parenthesis? I'm asking because I keep getting a "syntax error" in my expression view. Is it something to do with the [R1], etc.? Sorry, Regex is new to me, and therefore it kind of makes my head hurt!

[Edit] Realized I wasn't including [Composition] in the expression. Duh! 

blgentry

  • Regular Member
  • Citizen of the Universe
  • *****
  • Posts: 8014
Re: How to find nearly duplicate tagged fields
« Reply #8 on: June 05, 2016, 01:06:08 pm »

My regex above is untested.  It's also just the expression itself, which is the second parameter to the regex() function.  It sounds like maybe you have it working now.  If you want more specific help, post the expression you're working with and I'll see if I can help.

Good luck.

Brian.

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #9 on: June 05, 2016, 01:13:33 pm »

Here it is:  Regex([Composition], /#.+(OP|RV|K)(\.?\s*)([0-9]+)#/)

I used it and found 1 Beethoven Sonata that I was able to change, but I'll probably have to test it on some of the files I tagged in Musichi since I reinstalled.

[Edit] Just edited the above by adding in "|D|HOB" and found a few more . . . .

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #10 on: June 05, 2016, 01:48:47 pm »

Added a screen shot below of compositions that meet the Regex criteria ("1").

blgentry

  • Regular Member
  • Citizen of the Universe
  • *****
  • Posts: 8014
Re: How to find nearly duplicate tagged fields
« Reply #11 on: June 05, 2016, 03:29:43 pm »

You're running regex kind of like a search.  That's fine and will show you what matches.  But it's not what I had intended.  I had intended that you'd use the regex to pull apart the composition and put parts of it in other fields.  In this case, the OP number is what I was trying to extract.

To use regex in this manner, you want to put the regex in an expression column, so you can see what it is going to output.  You'll want to run it with additional parameters, which suppress its "1" and "0" output and then allow you to pick parts of the regex match out and output them instead.  Something like:

Regex([Composition], /#.+(OP|RV|K)(\.?\s*)([0-9]+)#/, -1, 0)[R1]. [R3]

For some further info here are some links:

A short lesson I wrote on Regex:
http://yabb.jriver.com/interact/index.php?topic=97996.0

The expression language reference on regex:
http://wiki.jriver.com/index.php/String_Manipulation_Functions#Regex

Good luck and let me know if I can help.

Brian.

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #12 on: June 05, 2016, 07:59:16 pm »

Wow! Thanks! I'd read the wiki stuff on Regex, but your tutorial made all the crazy symbols clear. Maybe someone should link what you wrote to the wiki? I'm sure it would help others. 

I'll play around with the expression you offered, but a question about a symbol not covered in your tutorial, though maybe it's on the wiki: what does \s do in [R2]?

blgentry

  • Regular Member
  • Citizen of the Universe
  • *****
  • Posts: 8014
Re: How to find nearly duplicate tagged fields
« Reply #13 on: June 05, 2016, 08:05:37 pm »

Glad the tutorial was helpful.

The \s sequence is a character type.  It means "space characters".  I believe it will match on spaces, tabs, and perhaps embedded newlines: anything that is "white space".  Another that is similar is \d, which means "digit characters" (aka numbers).

Because I'm thinking of it, it's worth mentioning something I noticed:  Some of your OP identifiers have a form like this:

OP. 10/2

I'm not sure what the number after the slash means, but if it's important to capture the slash and the number after it, the regex needs to be altered a bit to include slashes too.  It's not hard to do.  If it's worth capturing, all you have to do is add it to the character range.  The range looks like this right now:

[0-9]

You could just change it to:

[0-9/]

I haven't tested it, but that should work. 
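To show the difference the extra slash makes, here is a quick Python check rather than a Media Center expression. The patterns are cut down to the "Op" prefix only, and the third, stricter variant is my own addition, not something from the expression above.

```python
import re

# Digits only, digits plus "/", and a stricter "number, optional /number" form.
digits_only = re.compile(r"(OP)\.?\s*([0-9]+)", re.IGNORECASE)
with_slash = re.compile(r"(OP)\.?\s*([0-9/]+)", re.IGNORECASE)
strict = re.compile(r"(OP)\.?\s*([0-9]+(?:/[0-9]+)?)", re.IGNORECASE)

title = "Nocturne Op. 48/2"
plain = digits_only.search(title).group(2)    # stops at the slash
loose = with_slash.search(title).group(2)     # keeps the slash and what follows
exact = strict.search(title).group(2)         # same here, but rejects a stray "/"
```

The simple `[0-9/]` class is fine for well-formed data; the stricter form only matters if some titles contain a slash with no number after it.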

Brian.

timwtheov

  • Galactic Citizen
  • ****
  • Posts: 354
Re: How to find nearly duplicate tagged fields
« Reply #14 on: June 05, 2016, 11:12:59 pm »

Thanks again: that makes sense, with regard to \s and \d.

With the opus number thing you noticed, some opus numbers have more than one work associated with them: Chopin's many solo piano works, for example, tend to be labeled this way, with maybe more than 1 Nocturne on 1 opus number (Nocturne #14 is the second piece in the Op. 48 set--hence, Op. 48/2).  These numbers (opus numbers, I mean) are simply work numbers (opus means "work" in Latin), sometimes labeled thus by the composers but more often labeled that way by a publisher or cataloger after the fact.  You tutor me/us in Regex/expression language, I tutor in classical music!

Anyway, I don't think the expression needs to be altered for this issue because I don't think that second number was affected by my set-up error after the Musichi reinstall.  But thanks though!   