INTERACT FORUM


Author Topic: Duplicates Finder PlugIn  (Read 23576 times)

jleerigby

  • Guest
Duplicates Finder PlugIn
« on: January 12, 2004, 09:31:52 am »

Don't get all excited by the title.  There's no way I could build it!

I'm looking for a plugin that will do a better job of identifying potential duplicates than the current duplicates feature in MC.

I know that we (King and me) discussed this before in this thread and agreed that it would be best if MC integrated the functionality rather than rely on a plugin.  However, I think we all know that they have bigger changes to concentrate on at the moment so we could be waiting a really long time.

I think that, if developed, this could be one of the most useful 'must have' plugins for MC since Lyrics Finder and Playing Now.

The 'Duplicates Finder' would search your library and tag potential duplicates.  Then you would use a view scheme to review items that have been tagged by the plugin and decide whether to:
  • delete them
  • tag them as confirmed duplicates and then exclude these from all your view schemes and smartlists except album-based ones (that's what I do).
The plug in would do a better job than MC of finding possible dups as it would not need to find an exact match.  An example of one way to do this would be to run some code against every artist / track title to produce a check name e.g.:
  • Remove all characters within brackets
  • Remove all characters (inc spaces) except 0-9 and a-z
  • Concatenate the first 5 characters (or less) of the artist name with the first 5 characters (or less) of the track title.


e.g. Fleetwood Mac - You make lovin' fun becomes FleetYouma and
Fleetwood Mac - You make loving fun (Radio edit) becomes FleetYouma

The above rules are just an example.  Someone much cleverer than me could come up with something that would be even slicker at finding duplicates.  In fact you could have multiple sets of rules and have the plug in run each in sequence, recording the matches as it goes along.
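A minimal Python sketch of the three example rules, for anyone who wants to try them on a track list. The function name `check_name` and the `width` parameter are made up for illustration and are not part of MC or its SDK. Letter case is kept as in the FleetYouma example; lowercasing both parts first would be a reasonable variation.

```python
import re


def check_name(artist: str, title: str, width: int = 5) -> str:
    """Build a fuzzy 'check name' from an artist and a track title."""

    def squash(text: str) -> str:
        # Rule 1: drop anything inside round or square brackets.
        text = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", text)
        # Rule 2: keep only letters and digits (spaces go too).
        return re.sub(r"[^0-9A-Za-z]", "", text)

    # Rule 3: first `width` characters (or fewer) of each part, joined.
    return squash(artist)[:width] + squash(title)[:width]


print(check_name("Fleetwood Mac", "You make lovin' fun"))
print(check_name("Fleetwood Mac", "You make loving fun (Radio edit)"))
```

Both calls give the same check name, so the two tracks would be flagged as possible duplicates of each other.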

The plug in could then tag all tracks (in some nominated custom field) whose check name appears more than once with a tag indicating they are duplicates.  Alternatively, the plug in could simply add the check name (FleetYouma) to the custom field and users could then find and review the duplicates using MC's built-in dups feature.

The plug in would offer an option to skip tracks where the nominated field is not blank.  This way you can change the tag to something like 'Play', 'Don't play' or 'NoDup' once you've reviewed it.
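The tagging step and the skip option could be sketched like this in Python. It assumes tracks are plain dictionaries and that the nominated field is called `Duplicate`; both are stand-ins for whatever the real plugin SDK exposes, and the tag value `PossibleDup` is invented for the example.

```python
from collections import defaultdict


def flag_duplicates(tracks, key_func, field="Duplicate", skip_reviewed=True):
    """Tag every track whose key is shared with at least one other track.

    tracks   : list of dicts (hypothetical stand-in for library entries)
    key_func : callable that turns a track into its check name
    field    : name of the nominated custom field
    """
    groups = defaultdict(list)
    for track in tracks:
        # Skip tracks the user has already reviewed (field not blank).
        if skip_reviewed and track.get(field):
            continue
        groups[key_func(track)].append(track)

    flagged = []
    for members in groups.values():
        if len(members) > 1:
            for track in members:
                track[field] = "PossibleDup"
                flagged.append(track)
    return flagged
```

A track already marked 'NoDup' or 'Don't play' is left alone on later runs, which is the behaviour described above.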

Does this make sense?

Now then, are there any coders out there that fancy a blast at this (ki..cough..cough.....ng).  Like I said, I think it would become a 'must have' utility for MC users, especially those with large libraries.

Logged

nila

  • Guest
Re:Duplicates Finder PlugIn
« Reply #1 on: January 12, 2004, 09:40:07 am »

There's already a duplicate finder plugin out there?
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #2 on: January 12, 2004, 01:24:18 pm »

Quote
There's already a duplicate finder plugin out there?

Where?  I can't find it.
Logged

nila

  • Guest
Re:Duplicates Finder PlugIn
« Reply #3 on: January 12, 2004, 02:57:26 pm »

http://www.musicex.com/cgi-bin/downloads/plugins.pl?type=10&start=0&end=10&page=1

There you go.

The .reg file needs to be edited.
Create your own with this in it:

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\JRiver\Media Jukebox\Plugins\Interface\DubFinder]
"IVersion"=dword:00000001
"Company"="2via Beratung"
"Version"="1.0.01"
"URL"="http://www.2via.de"
"Copyright"="(C) Copyright 2002 by 2via Beratung"
"PluginMode"=dword:00000000
"ProdID"="MJDubFinder.DubFinder"

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\JRiver\Media Jukebox\Plugins\Interface\DubFinder]
"IVersion"=dword:00000001
"Company"="2via Beratung"
"Version"="1.0.01"
"URL"="http://www.2via.de"
"Copyright"="(C) Copyright 2002 by 2via Beratung"
"PluginMode"=dword:00000000
"ProdID"="MJDubFinder.DubFinder"


That should make it work.

Couple of bugs I ran into but it seemed ok.
The guy's e-mail is there too so you can e-mail him.
Maybe he'll provide the source code.
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #4 on: January 12, 2004, 04:45:52 pm »

Thanks Nila.  I just finished reinstalling and setting up MC again due to some problems I had which I think were due to plugins so I'm a bit nervous about installing them (especially when it says for V8!).  Can you tell me:

- Have you used it with 10 yet?
- What were the bugs?
- Does it do a similar job to what I described?

Thanks for your help and for the regedit.
Logged

mooseman

  • Regular Member
  • Junior Woodchuck
  • **
  • Posts: 64
  • I'm a llama!
Re:Duplicates Finder PlugIn
« Reply #5 on: January 13, 2004, 12:55:52 am »

I get a plugin cannot be found or created error.

Where is this internal dup finder people keep talking about, I can't find it. ?
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #6 on: January 13, 2004, 02:11:56 am »

It's a rule (under Modifier) that you apply to a smartlist or view scheme.  You can tell MC to list any tracks where for example [artist, name] is held more than once in the library.  Because it relies on an exact match it's not very effective.
Logged

nila

  • Guest
Re:Duplicates Finder PlugIn
« Reply #7 on: January 13, 2004, 02:47:45 am »

JLee - what errors were you getting that you reckon are from plugins?

To be honest I REALLY doubt your problems were from plugins.
As far as I'm aware none of them are powerful or low level enough to have any real effect on your system at all. The only thing they MIGHT affect is MC and even then that could be fixed by uninstalling them.

They are an easy scapegoat I know but I'm pretty sure they're the wrong scapegoat.
I've installed it on v10 and it didn't work too great for me.
It used to work great though back with v9 when I had it.
The guy provides his e-mail address and so MIGHT be willing to update it to make sure it worked for v10 or if you asked him and he still had it he'd probably be willing to give you the source code so that one of the other plugin makers could update it.

Also, it saying for v8 doesn't mean a whole lot.
The SDK has not been modified much since then and even the few changes that were made were just enhancements, nothing disappeared so all functionality from then should still work.



Mooseman - did you create the reg file with the info I gave above and then install the reg file?

Also, if that doesn't work then browse to the install dir and type:
regsvr32 <plugin_name.ocx> in a dos prompt.
That should fix it.

Make sure you have the VB runtimes installed.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #8 on: January 13, 2004, 08:41:20 am »

This duplicates plugin is a good idea. I would like to add another twist to it.

Would it be cool to find dupes given a list of tracks ?

That way if one knows what's on a compilation (it could be in txt form from, say, the web or elsewhere), feed the list in and see how many of the tracks were already in the library.

Getting the track name or artist in the right order in the list won't be too much of an issue as it could do a search for both and display whatever it found.
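A rough Python sketch of that list check. It assumes each line of the text list looks like "something - something" and that the library is available as (artist, title) pairs; the function name and both of those shapes are assumptions for the example, not anything MC provides. Using an unordered pair as the lookup key is what makes the artist/title order irrelevant.

```python
def find_in_library(wanted_lines, library):
    """Split a compilation track list into (found, missing).

    wanted_lines : lines such as "Artist - Title" or "Title - Artist"
    library      : iterable of (artist, title) pairs
    """

    def key(a, b):
        # frozenset ignores order, so "A - B" and "B - A" give the same key.
        return frozenset((a.strip().lower(), b.strip().lower()))

    known = {key(artist, title) for artist, title in library}

    found, missing = [], []
    for line in wanted_lines:
        if " - " not in line:
            missing.append(line)  # cannot be split, leave for manual review
            continue
        left, right = line.split(" - ", 1)
        (found if key(left, right) in known else missing).append(line)
    return found, missing
```

This only catches exact (case-insensitive) matches; it could be combined with a fuzzier key if the list and the tags are spelled differently.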

Currently it's possible to find dupes by entering strings in the search bar but that's slow, tedious and only practical on a track by track basis.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #9 on: January 13, 2004, 11:34:37 am »

I have been thinking about this

Last night i was doing some reading also since i am here in atlanta (using the hotel computer right now)

Let's say I find out where the media data starts, after the ID3v2 tag.

Then let's say I take 128 bytes off the end (the ID3v1 tag) and do an MD5 fingerprint on the media data.

Or maybe only do an MD5 fingerprint on the first 1024 bytes of the media data after the ID3v2 tag. No matter if the file was cut off, I should get the same fingerprint if the file was encoded with the same encoder. Then a file that was cut off, say 1\2 way thru, would have the same fingerprint as a file that was fully intact.

There should be a marker in ID3v2 tags to tell you where the tag ends, and the ID3v1 tag is always 128 bytes at the end of the MP3. The problem comes in with other formats; they are not always the same.
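For MP3s the idea can be sketched in Python as follows. The ID3v2 header is 10 bytes ("ID3", two version bytes, a flags byte, then a 4-byte "synchsafe" size using 7 bits per byte), and an ID3v1 tag is the last 128 bytes starting with "TAG". The function names and the `limit` parameter are invented for the example, and the sketch assumes there are no other tag types (APE tags, ID3v2 tags appended at the end) in the file.

```python
import hashlib
from typing import Optional, Tuple


def audio_bounds(data: bytes) -> Tuple[int, int]:
    """Return (start, end) offsets of the audio between ID3v2 and ID3v1 tags."""
    start, end = 0, len(data)

    # ID3v2 at the front: decode the synchsafe size from header bytes 6..9.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # ID3v2.4 footer flag adds another 10 bytes
            start += 10
        start = min(start, end)  # guard against a corrupt size field

    # ID3v1 at the back: fixed 128 bytes beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return start, end


def audio_md5(data: bytes, limit: Optional[int] = None) -> str:
    """MD5 of the audio only; `limit` hashes just the first N audio bytes."""
    start, end = audio_bounds(data)
    if limit is not None:
        end = min(end, start + limit)
    return hashlib.md5(data[start:end]).hexdigest().upper()
```

Calling `audio_md5(data, limit=1024)` corresponds to the "first 1024 bytes" variant, which gives the same result for a complete file and a truncated copy as long as both were produced by the same encoder run.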

If I could get this info added to the SDK (where does the tag start\end) it could be done. Or it could be done just for MP3s at this point, since I have a better understanding of how they work.

Any input on this?
Logged
Retired Military, Airborne, Air Assault, And Flight Wings.
Model Trains, Internet, Ham Radio, Music
https://MyAAGrapevines.com
https://centercitybbs.com
Fayetteville, NC, USA

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #10 on: January 13, 2004, 12:02:28 pm »

best thing is to try it out..but i am skeptical cos of the inability to detect the same track for diff encoders.

To get it to work will require some sort of pattern recognition, you know, like when they get those tapes of that mad terrorist with the long beard and do a speech analysis on it and say it's really him.

you get what i am saying king...the software needs to be able to somehow recognise the audio and say yah...that's a dupe !!
Logged

nila

  • Guest
Re:Duplicates Finder PlugIn
« Reply #11 on: January 13, 2004, 12:17:40 pm »

lol.

There is some shareware that can supposedly do CRC's for just the audio - they must know how to do what you're talking about.

I reckon though it all sounds too complex.

Just do it based on name and artist name and try different permutations of it etc.


Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums - then match THESE against all the other files - cuts down on the amount of work having to be done.

Then you can just break the name up into strings separated with " " and see how many parts of the name match.
Then return probability of it being a match.
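One simple way to turn "how many parts of the name match" into a number, sketched in Python. The score here is the share of distinct words the two names have in common (a Jaccard ratio), which is a similarity score rather than a true probability; the function name is made up for the example.

```python
def token_match_score(name_a: str, name_b: str) -> float:
    """Share of distinct words the two names have in common (0.0 to 1.0)."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(token_match_score("You make lovin' fun", "You make loving fun"))
```

A threshold (say, anything above 0.5 gets listed for review) would then decide what counts as a possible match; the right cut-off is something to tune against a real library.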
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #12 on: January 13, 2004, 03:01:22 pm »

Quote
Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums - then match THESE against all the other files - cuts down on the amount of work having to be done.

Then you can just break the name up into strings separated with " " and see how many parts of the name match.
Then return probability of it being a match.

That sounds like a practical way of doing it.

Or if king is really hell bent on trying out his idea. He should try and find a hashing algorithm that will produce CRC values that are close to the original and then use the concept of probability to say how close they are to the original.

Problem is i think the CRCs used tend to give values that are very far apart even for nearly identical input, so it's hard to say with any probability how similar tracks might be.

In other words CRCs are good to show if tracks don't identically match but may not be so good at showing how similar they are.
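A quick Python illustration of that point, using the standard library's `zlib.crc32` and `hashlib.md5`. The helper name `digests` and the sample data are made up; the data is just a stand-in for audio bytes.

```python
import hashlib
import zlib


def digests(data: bytes):
    """Return (CRC-32, MD5) of the data as upper-case hex strings."""
    crc = format(zlib.crc32(data), "08X")
    md5 = hashlib.md5(data).hexdigest().upper()
    return crc, md5


original = bytes(range(256)) * 4

# Flip a single bit in one byte of a copy.
altered = bytearray(original)
altered[100] ^= 0x01

print(digests(original))
print(digests(bytes(altered)))
```

The two pairs of values come out unrelated even though the inputs differ by one bit, so comparing the digests says "identical or not" and nothing about how close the two files are.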
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #13 on: January 13, 2004, 04:49:31 pm »

Quote
Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums

Please No!  This would defeat the object of having the dups finder.  One of the main advantages is to be able to keep greatest hits albums in your library but tag the duplicated tracks as dups so you can exclude them from other non album-based view schemes.  E.g. If I list all tracks by Madonna, despite the fact that many of her tracks appear on original albums and greatest hits albums and compilations, when I look in the artist view I never see the same track listed twice.  That's because in my library this view scheme is set up to exclude the tag [Duplicate]='Don't Play'.

I just need a slicker way of identifying the tracks that need this [Duplicate] tag updating.

King.  It's really great that you're even thinking about helping us.  I agree with Nila that something that compares [Artist],[Name] using some clever rules will be more robust.
Logged

nila

  • Guest
Re:Duplicates Finder PlugIn
« Reply #14 on: January 14, 2004, 02:55:50 am »

And also, crc checking would take AGES per file as it'd mean a LOT of physical disk reading.
Times that by 10,000 songs and that's a SLOW way of checking.

With the names you'd just need to do one call to get the filename and then from there on the rest would just be processing it so it'd all be running in RAM. A LOT faster.
Logged

urlybird

  • Regular Member
  • Junior Woodchuck
  • **
  • Posts: 57
  • .^.^.^......... ..^.^.^..
Re:Duplicates Finder PlugIn
« Reply #15 on: January 14, 2004, 11:44:06 am »

I have used Dpeg for Video and jpg files. Haven't tried it on Audio files but it is supposed to work. Check it out via trial at http://www.somewareonthe.net/gotdupes/.

It does duplicate checking through a variety of comparisons (eg filenames, CRC, MD5 etc).

Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #16 on: January 17, 2004, 07:53:46 am »

Quote
I have used Dpeg for Video and jpg files. Haven't tried it on Audio files but it is supposed to work

Care to test it out for us on audio ?

...since u have it. try removing up to 30s from the beginning & end and see whether it detects similar tracks.

Or better encode using two diff encoders and see whether it can tell the diff (or not) between the same track versions.

I looked at the page and they tend to emphasize images over audio leading me to suspect it does not do audio as well.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #17 on: January 17, 2004, 08:15:11 am »

I will Test this out later, it looks like it could be something we have been looking for.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #18 on: January 17, 2004, 04:10:54 pm »

It works.
It has problems
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #19 on: January 18, 2004, 04:02:46 am »

Care to be more specific..... King !

When does it work and where does it not ?
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #20 on: January 18, 2004, 06:33:48 am »

out of 36,000 files it found 1 match using CRC Or MD5 and there are many more it could have matched on.

so if there is 1 byte that is changed on the other file it does not match.

it does have a by name match, but MC10 has that and it is much faster in MC10

It does not Exclude the Id3v2 tags it counts them into the checksum

if a file is cut off it is not the same file

If you have listed the matches after scanning all the files and then want to try another matching method, you must rescan all the files again, and it takes some time to scan 36,000 files.

"Cursory Scan" speeds up the scan, but also makes it crash.

Update:

I have talked to the programmer, Bradley Davidson, and am attempting to get him to make some changes to make the program work better.

He has a problem with removing the tags and doing a CRC check, but in my last e-mail I told him he could just make a copy of the MP3, strip the non-MPEG data, then do the CRC\MD5 check and life will be good.

I also dropped him the address of this thread.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #21 on: January 18, 2004, 11:51:39 am »

Good Job King !!  :)

i follow you with the stripping tags info  & stuff but as i said in the past, u need a way to tell how similar stuff is. Maybe the fuzzy logic part of his program could work here.

crcs are good for telling whether stuff is an identical match or not. A way needs to be found to tell how similar stuff is with a probability rating.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #22 on: January 18, 2004, 12:26:01 pm »

Well, I was wondering if you took maybe the first 10 seconds (selectable) of a show and made a fingerprint of the sine wave; all that data could be recorded and used as a fingerprint, and then this data could also be used to maybe make a match.

This is the reason I would like to find a way to get VU data for a VU meter; this VU data could be used to make dupe checks.

Once you had that data, a CRC and\or MD5 fingerprint could be done on that data (to make it shorter).
Logged

midknyte

  • Regular Member
  • Recent member
  • *
  • Posts: 5
  • Yo!
Re:Duplicates Finder PlugIn
« Reply #23 on: January 18, 2004, 01:10:17 pm »

Ok, it is time that I jump into this thread so that what I say does not get lost in translation...

CRC/MD5 Calculation, so far, is not a *problem*, but rather just a difference of opinion and implementation which I am more than willing to change to meet the needs of you the users.

I am also interested in exploring more meaningful ways of drawing matches from things such as substring matching of ID tags that make sense to this audience and perhaps meta tags based on waveform analysis.  Fuzzy logic similar to what I have done with images...

It is true that the program serves images as a primary focus - that is what spawned its creation. However, it has been requested over the last year or so that I extend the matching to other filetypes - music in particular.  Hence its ability now to search for other file type families (even custom).  It is a very flexible and extensible program, and right now you only need to teach me how you want to use it so that I can add more Search Modes to suit your needs.

Play nice.  Don't gang up on me.  You have my interest and attention.

Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #24 on: January 18, 2004, 02:12:55 pm »

Bradley Davidson Aka: midknyte

Is The Programmer Of Dpeg

"d'peg! - The Duplicate Media Manager"

His Link:
http://www.somewareonthe.net/gotdupes/

Quote
Don't gang up on me.

I am sure that will not happen

If Anyone has anything to say Now Would Be A Good Time.
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #25 on: January 18, 2004, 03:08:40 pm »

Thanks King and midknyte for taking the time to look at this.  I can't add much to what I said before as you lost me when you started talking about CRC and MD5.  

As mentioned before, a 'clever' comparison of the artist name and track title would do the job for me, and would probably run much more quickly too.  I'll keep watching this thread with interest though.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #26 on: January 18, 2004, 03:51:54 pm »

Quote
you lost me when you started talking about CRC and MD5

MD5, SHA-1 and SHA-384 are hashes used to verify data integrity, but they can also be used to compare data: if the hashes are the same then it is the same file.

CRC = Cyclic Redundancy Check, and it was one of the first methods used when transferring data by a high speed 300 baud modem with X-Modem and X-Modem CRC. Plain X-Modem sent the data in 128-byte blocks, each with a simple one-byte checksum: the sender added up the values of the bytes in the block and sent that sum with the block. X-Modem CRC replaced the sum with a 16-bit CRC computed over the same bytes. When the block got to the other side, the receiving computer would work out the value from the data it received and compare it to the one sent; if it was the same then data good, save it, get next block.

So basically it is just a way to check the data.
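To make the difference concrete, here is a small Python sketch of both block checks. The one-byte checksum is a plain sum modulo 256; the CRC uses polynomial 0x1021 with a zero start value, which is the variant usually listed as CRC-16/XMODEM. The function names are made up for the example.

```python
def checksum8(block: bytes) -> int:
    """One-byte additive checksum: sum of all byte values, modulo 256."""
    return sum(block) % 256


def crc16_xmodem(block: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x1021 and initial value 0."""
    crc = 0
    for byte in block:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Swapping two bytes fools the simple sum but not the CRC.
print(checksum8(b"AB") == checksum8(b"BA"))
print(crc16_xmodem(b"AB") == crc16_xmodem(b"BA"))
```

The first comparison is true and the second is false, which is why the CRC version caught transfer errors that the plain checksum missed.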

as midknyte said

Quote
meta tags based on waveform analysis

maybe another way and that would be to compare a waveform of the sounds in the media file and then compare them with others.

I am however not an expert in this like midknyte is
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #27 on: January 18, 2004, 04:23:04 pm »

Yes - but what if the track is basically the same but different.  That didn't come out very well!

I mean what if I have a version from an album and another version from a compilation (like the radio edit).  One might be slightly shorter than the other but not different enough for me to justify keeping 2 copies.  That's why I thought that artist name / track title comparison would still be best.  The rules used could be configurable e.g include comparison of duration too and specify +/- n seconds as a parameter.
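The configurable duration rule might look like this in Python. Tracks are assumed to be dictionaries carrying a precomputed `check_name` and a `duration` in seconds; those field names and the default tolerance are illustrative only.

```python
def likely_same_track(track_a, track_b, tolerance_seconds=10):
    """True when the check names match and the durations are within +/- n seconds."""
    same_name = track_a["check_name"] == track_b["check_name"]
    close_enough = abs(track_a["duration"] - track_b["duration"]) <= tolerance_seconds
    return same_name and close_enough
```

With a tolerance of 10 seconds, an album version at 3:36 and a radio edit at 3:31 would pair up, while a 7-minute extended mix with the same check name would not.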
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #28 on: January 18, 2004, 05:47:52 pm »

Quote
That's why I thought that artist name / track title comparison would still be best.

Well that's well and good, but you're assuming that the names\fields are correct.

what happens if you have

(Sittin' On) The Dock Of The Bay - Otis Redding

but the file is really

(Sittin' On) The Dock Of The Bay - Sergio Mendes And Brasil '66

or even something like

The Who-My Generation - The Very Best Of-19-Who Are You.mp3

and mislabeled. By using tags you could be deleting a file you do not want to delete, when what you need is to fix the tags.

by using something that has a bit more intelligence you can then find that your mislabeled file
Quote
The Who-My Generation - The Very Best Of-19-Who Are You.mp3
matches another song byte for byte. You might then find this file has been tagged wrong by someone and you can then correct the tags or again delete it.

A tag-based system has its place, but another system for finding dups is needed. This becomes even more important for users who have many files and may have downloaded some of them from the net.

there are some P2P programs that use this method to rack up users under one file name. the file is the same but the tags may change or are wrong but if one user has a totally different song name than 30 other users the chances are that that one user is wrong.

Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #29 on: January 18, 2004, 05:55:18 pm »

I can see the merit in both approaches.  

In my case it's unlikely that the files would be mistagged as badly as you describe as I've done a pretty careful job of tagging on the whole.

It's more likely a typo in the name or some extra data that's causing the mismatch - typically:
- extra stuff in brackets
- apostrophes (Cant / Can't)
- & / And
- Dr. / Dr
etc.

If I could weed these out I'd find a LOT of duplicates and be very happy.
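Those four kinds of mismatch can be ironed out before comparing. A rough Python sketch follows; the function name is made up, the rules cover only the cases listed above, and accented letters are treated as punctuation here, which a real version would want to handle better.

```python
import re


def normalise(text: str) -> str:
    """Reduce an artist name or title to a comparison-friendly form."""
    text = text.lower()
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)  # extra stuff in brackets
    text = text.replace("&", " and ")                  # & / And
    text = re.sub(r"['’`]", "", text)                  # Cant / Can't
    text = re.sub(r"[^0-9a-z ]", " ", text)            # Dr. / Dr and other punctuation
    return " ".join(text.split())                      # tidy up the spacing


print(normalise("Can't Stop (Radio Edit)"))
print(normalise("Dr. Hook & The Medicine Show"))
```

Two tracks would then count as possible duplicates when both the normalised artist and the normalised title come out equal.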
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #30 on: January 18, 2004, 06:48:19 pm »

Quote
In my case it's unlikely that the files would be mistagged as badly as you describe

yes this is true if you take it direct from CD.

with more and more ways to download media files and it will increase in the future. a better way of weeding out the dups is needed, or at least more power to the user if needed.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #31 on: January 18, 2004, 09:00:56 pm »

Quote
there are some P2P programs that use this method to rack up users under one file name. the file is the same but the tags may change or are wrong but if one user has a totally different song name than 30 other users the chances are that that one user is wrong.

Does this work if the same track is encoded differently ?

If you look at programs like wavelab, when a wav file is opened it creates another file (internal) called a peaks file. This is a representation of the actual waveform, and it allows wavelab to re-open the file faster at a later date. Not sure if cooledit has the same method too.

If a similar way can be found to generate a peaks ( call it whatever u want) and then a system was found to create a fingerprint of it. And there was a way to compare 2 diff fingerprints and say to what extent they matched or not. we might have what you are looking for ....King.
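A toy Python sketch of the peaks idea, to show what "compare 2 fingerprints and say to what extent they match" could mean. It assumes the audio has already been decoded into a list of sample values, which is the hard part and is not shown. The function names, the bucket count and the scoring are all invented for the example, and this is nowhere near a real audio fingerprint.

```python
def peak_envelope(samples, buckets=32):
    """Loudest absolute sample in each of `buckets` equal slices, scaled to 0..1."""
    if not samples:
        return [0.0] * buckets
    n = len(samples)
    peaks = []
    for i in range(buckets):
        lo = i * n // buckets
        hi = max(lo + 1, (i + 1) * n // buckets)
        peaks.append(max(abs(s) for s in samples[lo:hi]))
    top = max(peaks) or 1.0  # avoid dividing by zero on pure silence
    return [p / top for p in peaks]


def envelope_similarity(env_a, env_b):
    """1.0 for identical envelopes, lower as they drift apart."""
    diffs = [abs(a - b) for a, b in zip(env_a, env_b)]
    return 1.0 - sum(diffs) / len(diffs)
```

Because each envelope is scaled to its own loudest peak, the same recording at a lower volume still scores close to 1.0, while a byte-level CRC or MD5 of the two files would not match at all. Whether such a coarse envelope survives different encoders, or trimmed starts and ends, would need testing on real files.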

Using CRC is fine, but given two CRC values of the supposedly same file (diff encoded tho), can you divine how similar those 2 files are ?

..The answer to that question will determine how well you can find a dupe blindly.
Logged

midknyte

  • Regular Member
  • Recent member
  • *
  • Posts: 5
  • Yo!
Re:Duplicates Finder PlugIn
« Reply #32 on: January 18, 2004, 11:07:09 pm »

You're getting ahead of yourselves.

"What if the file is not byte for byte the same?"  Same for images, hence d'peg! does not delete on its own unless it can confirm that they are indeed duplicate files.  So whatever we come up with, it will be safe.

And what if they are close?  Human review phase of the program...

Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #33 on: January 19, 2004, 05:00:44 am »

simple test

- encode track A to 160kbs
- encode track A to 128 kbs

both encodes are the same track.

Is there a software program out there that can detect that the two are the same or similar using a mechanism based on their content rather than tags ?

a blind check so to speak. Found a cpl of papers while googling..

Finding similar things quickly in large collections

Measurement of Similarity in Music

Not to slight JLee's suggestions on munging track names, this is by far the easier way of dupe checking.

If we can't find a solution to the former, might just have to do it his way.

Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #34 on: January 19, 2004, 11:57:56 am »

Today I sat down and created a plug-in for Media Center that will basically do the same thing Dpeg does.

By that I mean it will scan an MPEG file and come up with an MD5 fingerprint. It will stick this info in a field the user wants (I created a field called MD5Fingerprint) and save it to the ID3v2 tag.

Sample MD5: 8A8728890FCC22D46164925027F81C8C
Name: Faded Rose, The
Artist: Adventures_Of_The_Falcon_-_1945_To_1952
Duration: 24:27





This is all well and good, but since saving the fingerprint just changed the file, the MD5 of the whole file no longer matches.

What needs to happen is to find a way to exclude the ID3v2 and ID3v1 tags from the scan, since they will change depending on what side of the bed the user gets up on.

Since the SDK does not support deleting ID3v1 and ID3v2 tags I may need to find a way around this.

My idea is to copy the MP3 to Temp\temp.mp3,

then remove the ID3v2 and ID3v1 tags, then scan the file.

Since this will be the same file without the tags, use that MD5 and save it to the original.

=============================================

Comments?
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #35 on: January 19, 2004, 12:40:43 pm »

proof of concept, does it pass the simple test i mentioned.

You can worry about the other details afterwards :)
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #36 on: January 19, 2004, 12:58:24 pm »

No, and the only way it would is if there was a waveform type system

I think that's how JRiver's Fingerprint works, the problem is we do not have access to the Fingerprint data, and i would not know how to create such a program.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #37 on: January 19, 2004, 01:40:39 pm »

Quote
No, and the only way it would is if there was a waveform type system

Right.

Quote
by using something that has a bit more intelligence you can then find that your mislabeled file

What kind of intelligence would MD5 give ?  I'm lost here king.

you would have a unique ID, but then how would this help to find dupes.

Quote
I think that's how JRiver's Fingerprint works,

JRiver Fingerprint ?? did not know they had an app like that.


Quote
the problem is we do not have access to the Fingerprint data, and i would not know how to create such a program.

Its tricky for sure. i was googling on this topic and came up with interesting stuff.

This ideal thing we want is this

If you live in the UK, you can dial up a number and let your phone listen to a track and it will ID it for you here

Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #38 on: January 19, 2004, 01:54:54 pm »

Quote
What kind of intelligence would MD5 give

Well, it will compare the MPEG data only; if it is a match then it is an exact copy no matter what the tags say.

What I can also do with this data now is create my own database on my server, and maybe save tags that could be pulled up by users.

BTW I just solved the ID3v1\ID3v2 problem.

My program now copies the file to a temp folder, then removes the ID3v1 and ID3v2 tags, and the MPEG data is evaluated (this should never change). In the tests I just made it works well.

This however will work with MP3, but will not work on other file types.

This, as we talked about, will not match across bit rates, encoders that generated the MP3, etc.

Quote
JRiver Fingerprint ?? did not know they had an app like that.

This is built into MC9 and MC10. If you select files, right click and use the option to submit to YADB, it will evaluate the media file and create a fingerprint. This will then be submitted to YADB and is used for By File Lookups.

Most people do not know this option is there, and it has nothing to do with CD submitting. I also think they could\should allow a user to auto submit this info to YADB after a file is encoded. At one point the comment was "Good Idea" but I never saw it happen in MC9.

Also, the fingerprint data was available at one time in the database but was removed. I guess they figured no one would need it or they did not want others to get the data that was generated (pick one).
Logged
Retired Military, Airborne, Air Assault, And Flight Wings.
Model Trains, Internet, Ham Radio, Music
https://MyAAGrapevines.com
https://centercitybbs.com
Fayetteville, NC, USA

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #39 on: January 19, 2004, 02:22:45 pm »

Quote
It will compare the Mpeg data only if it is a match then it is an exact copy no matter what the tags say.

Correct

..but is it likely that you have tracks that are exactly the same?  (given the same encoding rate, encoder, etc.)

I'm waiting for you to test this out on your MASSIVE library and to see your comments.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #40 on: January 19, 2004, 02:54:10 pm »

Quote
..but is it likely that you have tracks that are exactly the same ?

Yes, people who download files often do.

I for one have many OTR files with the same title. The problem is that a file is part 1 of 22, and the "part 1 of 22" is missing from the tag, so it is possible that I could have dups of the shows.

It is also possible, as I said, that users could have the wrong tags in the file.

By using the MD5Fingerprint field and MC9's\MC10's ~dup option to find the duplicate files, I could then make sure I have no dups and check whether each file is really tagged correctly.

I am running about 5,000 files through it now. None of them should have dups; if there are any, it should be just a few.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #41 on: January 19, 2004, 04:23:52 pm »

Test Results:

After 550 files

6 dups found

1 file found that has the wrong tags, and is not even the same artist but was a match with another file with the correct tags.
Logged

midknyte

  • Regular Member
  • Recent member
  • *
  • Posts: 5
  • Yo!
Re:Duplicates Finder PlugIn
« Reply #42 on: January 19, 2004, 06:12:02 pm »

The aforementioned problem with the exe has been fixed and a new version posted.  It would crash if one ran a cursory scan where a folder had a single quote in it (which prematurely terminated an SQL statement)...
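
For anyone who hits the same kind of bug: the usual way to keep a single quote in a folder name from ending an SQL statement early is to pass the value as a query parameter instead of pasting it into the SQL text. A minimal sketch follows, assuming a made-up `files` table and using Python's built-in `sqlite3` module as a stand-in; the database and schema the program actually uses are not stated in this thread.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (folder TEXT, name TEXT)")


def add_file(conn, folder, name):
    # The "?" placeholders let the driver handle quoting, so a folder
    # name containing ' cannot terminate the statement early.
    conn.execute("INSERT INTO files (folder, name) VALUES (?, ?)", (folder, name))


def files_in(conn, folder):
    rows = conn.execute(
        "SELECT name FROM files WHERE folder = ? ORDER BY name", (folder,)
    )
    return [row[0] for row in rows]


add_file(conn, "C:\\Music\\Guns N' Roses", "track01.mp3")
```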

I will be changing the checksum routines to strip the ID tags.  A very good observation.

Those interested, please come by and download a copy and play with the ID info string and substring matching that the program has already and then suggest to me what semi-intelligent sub matching of this info would be helpful in your real world situations.

http://www.GotDupes.com

Does anyone know of any dll's or ocx's that do or would facilitate plotting of the VU info of a song file?

Thanks...

Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #43 on: January 20, 2004, 02:47:43 am »

Great progress King.  Can't wait to give it a try.

This bit worries me a bit though
Quote
this as we talked about will not evaluate between bit rates, encoders that generated the mp3 etc..

....'cos I think many of my dups come from different sources, e.g. a track on a compilation is likely to have been encoded differently to a track on the artist's album.  Also I have a lot of NOW! albums and these tend to have slightly different versions of the tracks than what is on the album.

Is there any chance you could make 2 versions of the plug in?  One that compares artist/trackname (smartly) and one that compares fingerprints.  That way we could run both one after the other.  Then we could start the human review stage after running both and get an increased chance of success.

I'm gonna go away and try to think of some smart rules (excel formula style).

Still I'm well impressed with what you've achieved so far.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #44 on: January 20, 2004, 05:58:34 am »

Quote
One that compares artist/trackname (smartly) and one that compares fingerprints.

The program I made does not compare the files. You can do that, like I said, in MC10 using the Duplicate function and the field that holds the MD5 checksum.

I have found a few files that came up with the same MD5 and are not the same file. This can be overcome by including duration in your ~Duplicate search in MC10. I was also thinking that I could add an option (normally on) that would include the duration when computing the checksum; this should give fewer false matches (maybe).
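
As a rough sketch of the checksum-plus-duration idea, the matching can be thought of as grouping files on a combined key. The Python below is an illustration only: the function name `find_duplicates` and the `(path, checksum, duration)` tuple layout are assumptions, and it is not how MC10's ~Duplicate search works internally.

```python
from collections import defaultdict


def find_duplicates(tracks, use_duration=True):
    """Group file paths whose checksum (and optionally duration) are equal.

    tracks: iterable of (path, checksum, duration_in_seconds) tuples.
    Returns a list of groups, each holding two or more paths.
    """
    groups = defaultdict(list)
    for path, checksum, duration in tracks:
        # Adding duration to the key splits files that only share a checksum.
        key = (checksum, duration) if use_duration else (checksum,)
        groups[key].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Running it once with `use_duration=True` and once with `use_duration=False` shows which matches depend on the checksum alone.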

midknyte's program will allow you to pick a second match option. I'm not sure if duration is one of them, but it might be a good option for him to include.

midknyte

About the VU: I've been looking too, so I had the same thought I guess.

The file may need to be played for a few seconds (set by the user?) with min\max settings.
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #45 on: January 20, 2004, 07:57:10 am »

Duration will not help for me for the same reasons I stated above e.g:
Quote
Also I have a lot of NOW! albums and these tend to have slightly different versions of the tracks than what is on the album.

I respect your views on this and the fact that it is you guys who have the knowledge and are prepared to put in the work on this.  However, if this plug in goes in the 'fingerprint' direction then it will not work for me so I'll respectfully bow out here.

I'll export my library to Excel again and come up with some quick sexy formulas and macros that will trawl through, find the potential dups and offer a dup-by-dup on-screen review.  Then I'll print out the ones I mark as dups and manually tag them.

For info I've come up with a formula that seems to be working really well with artist and track title.  First it strips out
  • all chars within brackets
  • all these words: and, mister, mr, misses, mrs, doctor, dr, the, feat, featuring, ft, with, th, st, street, road, rd
  • spaces and all non a-z / 0-9 characters
Then it does 4 passes to find dups, and if any one of these four finds another track with the same value it's flagged for review:
  • pass 1 concatenates the first 5 char of artist name with first 5 char of track name
  • pass 2 = first 5 of artist with last 5 of track
  • pass 3 = last 5 of artist with first 5 of track
  • pass 4 = last 5 char of artist with last 5 char of track.
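
The rules above could be sketched in Python roughly as follows. This is a guess at the logic, not the actual Excel formula: the names `normalise`, `check_names` and `flag_for_review` are made up, the listed words are assumed to be removed only when they appear as whole words, and pass 3 is taken to be last 5 of artist with first 5 of track (an assumption, so that the four passes are all different).

```python
import re
from collections import defaultdict

STOP_WORDS = {
    "and", "mister", "mr", "misses", "mrs", "doctor", "dr", "the", "feat",
    "featuring", "ft", "with", "th", "st", "street", "road", "rd",
}


def normalise(text):
    """Lower-case, drop bracketed text, drop stop words, keep only a-z / 0-9."""
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text.lower())
    words = re.findall(r"[a-z0-9]+", text)
    return "".join(word for word in words if word not in STOP_WORDS)


def check_names(artist, title):
    """Return the four check names, one per pass."""
    a, t = normalise(artist), normalise(title)
    return (
        a[:5] + t[:5],    # pass 1: first 5 of artist + first 5 of track
        a[:5] + t[-5:],   # pass 2: first 5 of artist + last 5 of track
        a[-5:] + t[:5],   # pass 3 (assumed): last 5 of artist + first 5 of track
        a[-5:] + t[-5:],  # pass 4: last 5 of artist + last 5 of track
    )


def flag_for_review(tracks):
    """Return indexes of tracks that share a check name with another track in any pass."""
    buckets = defaultdict(set)
    for index, (artist, title) in enumerate(tracks):
        for pass_no, key in enumerate(check_names(artist, title)):
            buckets[(pass_no, key)].add(index)
    flagged = set()
    for members in buckets.values():
        if len(members) > 1:
            flagged |= members
    return sorted(flagged)
```

With the Fleetwood Mac example from the first post, both versions of the track get the pass 1 value `fleetyouma`, so both are flagged for human review.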


Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #46 on: January 20, 2004, 09:53:57 am »

JLee

That's what we are talking about, and midknyte is trying to think of a way that what you're saying can be done by taking a waveform sample of the music.

If you have not tried his program you should give it a shot, since it does have a lot of the things you're asking for, short of the waveform sample.

Maybe if he can figure out how to do waveform sampling he could include it with the other options you requested.

I am sure if you give midknyte some time and some input on what users need and why, with samples, he might add it to his program.

I, however, am not trying to replace his program. My program is only meant to match exact data from the media file, which is not left up to user input error like what you want.

Like I said, his program does some of the things you requested.

========================================================

On another note, scanning 50,000+ MP3s, 3,087 of them match, so there are about 1,543 dups, taking into account the fingerprint and the duration of the file.

If you take the duration out there are 3,135 files that match, so using just the checksum does have an error rate on the matches (48 extra files here).
Logged

Zarius

  • Regular Member
  • World Citizen
  • ***
  • Posts: 178
  • Addicted to smilies.
Re:Duplicates Finder PlugIn
« Reply #47 on: January 20, 2004, 07:26:20 pm »

KingSparta:

I like that plugin :)  Any chance of grabbing it for a test?  Also, I thought MD5 was pretty accurate... guess I was wrong :)

I can see advantages in both forms of dupe checking and look forward to the development of midknyte's program too :)
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20063
Re:Duplicates Finder PlugIn
« Reply #48 on: January 20, 2004, 07:45:16 pm »

When I Get Done Playing With It.

The Pictures Above Have Been Updated.
Logged

Zarius

  • Regular Member
  • World Citizen
  • ***
  • Posts: 178
  • Addicted to smilies.
Re:Duplicates Finder PlugIn
« Reply #49 on: January 20, 2004, 09:53:29 pm »

Okay, looking forward to giving it a whirl through my library :)
Logged