INTERACT FORUM

Windows => Third Party Plug-ins, Programs, and Skins => Topic started by: jleerigby on January 12, 2004, 09:31:52 am

Title: Duplicates Finder PlugIn
Post by: jleerigby on January 12, 2004, 09:31:52 am
Don't get all excited by the title.  There's no way I could build it!

I'm looking for a plugin that will do a better job of identifying potential duplicates than the current duplicates feature in MC.

I know that we (King and me) discussed this before in this thread (http://yabb.jriver.com/interact/index.php?board=3;action=display;threadid=17901;start=msg125297#msg125297) and agreed that it would be best if MC integrated the functionality rather than rely on a plugin.  However, I think we all know that they have bigger changes to concentrate on at the moment so we could be waiting a really long time.

I think that, if developed, this could be one of the most useful 'must have' plugins for MC since Lyrics Finder and Playing Now.

The 'Duplicates Finder' would search your library and tag potential duplicates.  Then you would use a view scheme to review items that have been tagged by the plugin and decide whether to:

The plug in would do a better job than MC of finding possible dups as it would not need to find an exact match.  An example of one way to do this would be to run some code against every artist / track title to produce a check name:

e.g. Fleetwood Mac - You make lovin' fun becomes FleetYouma and
Fleetwood Mac - You make loving fun (Radio edit) becomes FleetYouma

The above rules are just an example.  Someone much cleverer than me could come up with something that would be even slicker at finding duplicates.  In fact you could have multiple sets of rules and have the plug in run each in sequence, recording the matches as it goes along.

The plug in could then tag all tracks (in some nominated custom field) whose check name appears more than once with a tag indicating they are duplicates.  Alternatively, the plug in could simply add the check name (FleetYouma) to the custom field and users could then find and review the duplicates using MC's built-in dups feature.
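A minimal sketch of the check-name idea, for illustration only. The rule used here (first five non-space characters of the artist, then of the title) is inferred from the FleetYouma example and is an assumption, as are the function names `check_name` and `potential_dups`; a real plug in would read tracks through the MC SDK rather than from a list.

```python
from collections import defaultdict


def check_name(artist, title, width=5):
    """Build a check name: first `width` non-space characters of each field."""
    def head(text):
        return "".join(text.split())[:width]
    return head(artist) + head(title)


def potential_dups(tracks, width=5):
    """Group (artist, title) pairs whose check names collide."""
    groups = defaultdict(list)
    for artist, title in tracks:
        # Lower-casing the key is an extra assumption so case differences still collide.
        key = check_name(artist, title, width).lower()
        groups[key].append((artist, title))
    # Only check names that appear more than once are potential duplicates.
    return {key: found for key, found in groups.items() if len(found) > 1}


library = [
    ("Fleetwood Mac", "You make lovin' fun"),
    ("Fleetwood Mac", "You make loving fun (Radio edit)"),
    ("Madonna", "Holiday"),
]
dups = potential_dups(library)
```

Both Fleetwood Mac entries land under the same key, so they would be the ones tagged for review. A short `width` catches more variants but also produces more false positives, which is why the human review step still matters.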

The plug in would offer an option to skip tracks where the nominated field is not blank.  This way you can change the tag to something like 'Play', 'Don't play' or 'NoDup' once you've reviewed it.

Does this make sense?

Now then, are there any coders out there that fancy a blast at this (ki..cough..cough.....ng).  Like I said, I think it would become a 'must have' utility for MC users, especially those with large libraries.

Title: Re:Duplicates Finder PlugIn
Post by: nila on January 12, 2004, 09:40:07 am
There's already a duplicate finder plugin out there?
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 12, 2004, 01:24:18 pm
Quote
There's already a duplicate finder plugin out there?

Where?  I can't find it.
Title: Re:Duplicates Finder PlugIn
Post by: nila on January 12, 2004, 02:57:26 pm
http://www.musicex.com/cgi-bin/downloads/plugins.pl?type=10&start=0&end=10&page=1

There you go.

The .reg file needs to be edited.
Create your own with this in it:

Windows Registry Editor Version 5.00

[HKEY_CURRENT_USER\Software\JRiver\Media Jukebox\Plugins\Interface\DubFinder]
"IVersion"=dword:00000001
"Company"="2via Beratung"
"Version"="1.0.01"
"URL"="http://www.2via.de"
"Copyright"="(C) Copyright 2002 by 2via Beratung"
"PluginMode"=dword:00000000
"ProdID"="MJDubFinder.DubFinder"

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\Software\JRiver\Media Jukebox\Plugins\Interface\DubFinder]
"IVersion"=dword:00000001
"Company"="2via Beratung"
"Version"="1.0.01"
"URL"="http://www.2via.de"
"Copyright"="(C) Copyright 2002 by 2via Beratung"
"PluginMode"=dword:00000000
"ProdID"="MJDubFinder.DubFinder"


That should make it work.

Couple of bugs I ran into but it seemed ok.
The guy's e-mail is there too so you can e-mail him.
Maybe he'll provide the source code.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 12, 2004, 04:45:52 pm
Thanks Nila.  I just finished reinstalling and setting up MC again due to some problems I had, which I think were due to plugins, so I'm a bit nervous about installing them (especially when it says for V8!).  Can you tell me:

- Have you used it with 10 yet?
- What were the bugs?
- Does it do a similar job to what I described?

Thanks for your help and for the regedit.
Title: Re:Duplicates Finder PlugIn
Post by: mooseman on January 13, 2004, 12:55:52 am
I get a "plugin cannot be found or created" error.

Where is this internal dup finder people keep talking about?  I can't find it.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 13, 2004, 02:11:56 am
It's a rule (under Modifier) that you apply to a smartlist or view scheme.  You can tell MC to list any tracks where for example [artist, name] is held more than once in the library.  Because it relies on an exact match it's not very effective.
Title: Re:Duplicates Finder PlugIn
Post by: nila on January 13, 2004, 02:47:45 am
JLee - what errors were you getting that you reckon are from plugins?

To be honest I REALLY doubt your problems were from plugins.
As far as I'm aware none of them are powerful or low level enough to have any real effect on your system at all. The only thing they MIGHT affect is MC and even then that could be fixed by uninstalling them.

They are an easy scapegoat I know but I'm pretty sure they're the wrong scapegoat.
I've installed it on v10 and it didn't work too great for me.
It used to work great though back with v9 when I had it.
The guy provides his e-mail address, so he MIGHT be willing to update it to make sure it works for v10. Or, if you asked him and he still had it, he'd probably be willing to give you the source code so that one of the other plugin makers could update it.

Also, it saying for v8 doesn't mean a whole lot.
The SDK has not been modified much since then and even the few changes that were made were just enhancements, nothing disappeared so all functionality from then should still work.



Mooseman - did you create the reg file with the info I gave above and then install the reg file?

Also, if that doesn't work then browse to the install dir and type:
regsvr32 <plugin_name.ocx> in a DOS prompt.
That should fix it.

Make sure you have the VB runtimes installed.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 13, 2004, 08:41:20 am
This duplicates plugin is a good idea. I would like to add another twist to it.

Wouldn't it be cool to find dupes given a list of tracks?

That way, if one knows what's on a compilation (the list could be in txt form from, say, the web or elsewhere), one could feed the list in and see how many of the tracks were already in the library.

Getting the track name and artist in the right order in the list won't be too much of an issue, as it could search for both orders and display whatever it found.

Currently it's possible to find dupes by entering strings in the search bar, but that's slow, tedious and only practical on a track by track basis.
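The list lookup described above could be sketched like this. It is only an illustration: the " - " separator, the function names and the in-memory library are assumptions, and matching is exact after lower-casing and whitespace clean-up.

```python
def normalise(text):
    """Lower-case and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def find_in_library(list_lines, library, separator=" - "):
    """Map each 'artist - title' (or 'title - artist') line to library matches.

    `library` is an iterable of (artist, title) pairs.
    """
    index = {}
    for artist, title in library:
        key = (normalise(artist), normalise(title))
        index.setdefault(key, []).append((artist, title))

    results = {}
    for line in list_lines:
        left, found_sep, right = line.partition(separator)
        if not found_sep:
            results[line] = []          # line has no separator, nothing to look up
            continue
        a, b = normalise(left), normalise(right)
        matches = list(index.get((a, b), []))
        if a != b:
            # Try the swapped order too, so the list can be written either way round.
            matches += index.get((b, a), [])
        results[line] = matches
    return results


library = [("Fleetwood Mac", "You Make Loving Fun"), ("Madonna", "Holiday")]
compilation = [
    "You make loving fun - Fleetwood Mac",
    "Madonna - Holiday",
    "Otis Redding - Respect",
]
found = find_in_library(compilation, library)
```

Lines with an empty result are the tracks not yet in the library.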
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 13, 2004, 11:34:37 am
I have been thinking about this

Last night i was doing some reading also since i am here in atlanta (using the hotel computer right now)

let's say I find out where the media data starts, i.e. where the ID3v2 tag ends

then let's say I take 128 bytes off the end (the ID3v1 tag) and do an MD5 fingerprint on the media data that is left

or maybe only do an MD5 fingerprint on the first 1024 bytes of the media data (after the ID3v2 tag). then, no matter if the file was cut off, I should get the same fingerprint as long as the file was encoded with the same encoder: a file that was cut off halfway through would have the same fingerprint as a file that was fully intact.

the ID3v2 header carries a size field that tells you where the tag ends, and the ID3v1 tag is always 128 bytes at the end of the mp3. the problem comes in with other formats, they are not always the same.

if I could get this info added to the SDK (where does the tag start\end) it could be done. or it could be done just for MP3s at this point since I have a better understanding of how they work.
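A rough sketch of that idea for MP3 only, assuming the whole file is already in memory. The helper names `audio_span` and `audio_md5` are made up for illustration. The ID3v2 size is read from the 10-byte header as a 4-byte "synchsafe" number (7 bits per byte), and an ID3v1 tag is recognised by "TAG" at 128 bytes from the end.

```python
import hashlib


def audio_span(data):
    """Return (start, end) offsets of the data between the ID3v2 and ID3v1 tags."""
    start, end = 0, len(data)
    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, 4 synchsafe size bytes.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:              # ID3v2.4 footer flag: 10 more bytes
            start += 10
    # ID3v1 tag: the last 128 bytes, beginning with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return min(start, end), end


def audio_md5(data, limit=None):
    """MD5 of the untagged data; `limit` hashes only the first N bytes of it."""
    start, end = audio_span(data)
    if limit is not None:
        end = min(end, start + limit)
    return hashlib.md5(data[start:end]).hexdigest().upper()
```

With `limit=1024` a file that was cut off later than the first 1024 bytes gives the same fingerprint as the intact one, at the cost of more accidental matches between files that merely start the same way.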

any input on this?
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 13, 2004, 12:02:28 pm
best thing is to try it out.. but i am skeptical cos of the inability to detect the same track for diff encoders.

To get it to work will require some sort of pattern recognition, you know, like when they get those tapes of that mad terrorist with the long beard and do a speech analysis on it and say it's really him.

you get what i am saying king... the software needs to be able to somehow recognise the audio and say yah... that's a dupe !!
Title: Re:Duplicates Finder PlugIn
Post by: nila on January 13, 2004, 12:17:40 pm
lol.

There is some shareware that can supposedly do CRC's for just the audio - they must know how to do what you're talking about.

I reckon though it all sounds too complex.

Just do it based on name and artist name and try different permutations of it etc.


Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums - then match THESE against all the other files - cuts down on the amount of work having to be done.

Then you can just break the name up into strings separated with " " and see how many parts of the name match.
Then return probability of it being a match.
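One way the word-matching step could look, as a sketch only. The score here is the share of distinct words the two names have in common, which is a simple stand-in for the "probability" mentioned above and not a true probability; the function names are made up.

```python
def name_parts(name):
    """Split a name on single spaces, ignoring case and empty pieces."""
    return set(name.lower().split(" ")) - {""}


def match_score(name_a, name_b):
    """Share of distinct words common to both names, from 0.0 to 1.0."""
    a, b = name_parts(name_a), name_parts(name_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


score = match_score("You make loving fun", "You Make Loving Fun (Radio edit)")
```

Here four of the six distinct words match, so the score is about 0.67. Anything above a chosen threshold could be put forward for review.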
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 13, 2004, 03:01:22 pm
Quote
Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums - then match THESE against all the other files - cuts down on the amount of work having to be done.

Then you can just break the name up into strings separated with " " and see how many parts of the name match.
Then return probability of it being a match.

That sounds like a practical way of doing it.

Or if king is really hell bent on trying out his idea, he should try to find a hashing algorithm that produces similar values for similar input, and then use the concept of probability to say how close a track is to the original.

Problem is, I think the CRCs in use tend to give values that are very far apart even when the files are nearly the same, so it's hard to say with any probability how similar tracks might be.

In other words, CRCs are good for showing that tracks don't match identically, but may not be so good at showing how similar they are.
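A small experiment along these lines, as a sketch: flip one bit in a block of data and see how many bits of the CRC32 and MD5 values change. The data is a made-up stand-in for audio, and `bits_changed` is a hypothetical helper.

```python
import hashlib
import zlib


def bits_changed(x, y):
    """Count differing bits between two equal-length byte strings."""
    return sum(bin(a ^ b).count("1") for a, b in zip(x, y))


original = bytes(range(256)) * 64          # stand-in for audio data
altered = bytearray(original)
altered[1000] ^= 0x01                      # flip a single bit
altered = bytes(altered)

crc_a = zlib.crc32(original).to_bytes(4, "big")
crc_b = zlib.crc32(altered).to_bytes(4, "big")
md5_a = hashlib.md5(original).digest()
md5_b = hashlib.md5(altered).digest()

print("input bits changed:", bits_changed(original, altered))
print("CRC32 bits changed:", bits_changed(crc_a, crc_b), "of 32")
print("MD5 bits changed:  ", bits_changed(md5_a, md5_b), "of 128")
```

The check values differ even though the inputs are almost identical, and the size of that difference says nothing useful about how close the two inputs were. That is the reason a checksum can answer "identical or not" but not "how similar".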
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 13, 2004, 04:49:31 pm
Quote
Presume All single artist albums AREN'T duplicates - so ONLY check the files that aren't single artist complete albums

Please No!  This would defeat the object of having the dups finder.  One of the main advantages is to be able to keep greatest hits albums in your library but tag the duplicated tracks as dups so you can exclude them from other non album-based view schemes.  E.g. If I list all tracks by Madonna, despite the fact that many of her tracks appear on original albums and greatest hits albums and compilations, when I look in the artist view I never see the same track listed twice.  That's because in my library this view scheme is set up to exclude the tag [Duplicate]='Don't Play'.

I just need a slicker way of identifying the tracks that need this [Duplicate] tag updating.

King.  It's really great that you're even thinking about helping us.  I agree with Nila that something that compares [Artist],[Name] using some clever rules will be more robust.
Title: Re:Duplicates Finder PlugIn
Post by: nila on January 14, 2004, 02:55:50 am
And also, crc checking would take AGES per file as it'd mean a LOT of physical disk reading.
Times that by 10,000 songs and that's a SLOW way of checking.

With the names you'd just need to do one call to get the filename and then from there on the rest would just be processing it so it'd all be running in RAM. A LOT faster.
Title: Re:Duplicates Finder PlugIn
Post by: urlybird on January 14, 2004, 11:44:06 am
I have used Dpeg for Video and jpg files. Haven't tried it on Audio files but it is supposed to work. Check it out via trial at http://www.somewareonthe.net/gotdupes/.

It does duplicate checking through a variety of comparisons (eg filenames, CRC, MD5 etc).

Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 17, 2004, 07:53:46 am
Quote
I have used Dpeg for Video and jpg files. Haven't tried it on Audio files but it is supposed to work

Care to test it out for us on audio ?

...since u have it, try removing up to 30s from the beginning & end and see whether it detects similar tracks.

Or better encode using two diff encoders and see whether it can tell the diff (or not) between the same track versions.

I looked at the page and they tend to emphasize images over audio leading me to suspect it does not do audio as well.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 17, 2004, 08:15:11 am
I will Test this out later, it looks like it could be something we have been looking for.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 17, 2004, 04:10:54 pm
It works.
It has problems
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 18, 2004, 04:02:46 am
Care to be more specific..... King !

When does it work and where does it not ?
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 06:33:48 am
out of 36,000 files it found 1 match using CRC Or MD5 and there are many more it could have matched on.

so if there is 1 byte that is changed on the other file it does not match.

it does have a by name match, but MC10 has that and it is much faster in MC10

It does not Exclude the Id3v2 tags it counts them into the checksum

if a file is cut off it is not the same file

to list the matches a different way after all the files have been scanned, you must rescan all the files again, and it takes some time to scan 36,000 files.

"Cursory Scan" speeds up the scan, but also makes it crash.

Update:

I have talked to the programmer, Bradley Davidson, and am attempting to get him to make some changes to make the program work better.

He has a problem with removing the tags and doing a crc check, but in my last e-mail I told him he could just make a copy of the MP3, strip the non-mpeg data, then do the crc\md5 check and life will be good.

I also dropped him the address of this thread.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 18, 2004, 11:51:39 am
Good Job King !!  :)

i follow you with the stripping tags info  & stuff but as i said in the past, u need a way to tell how similar stuff is. Maybe the fuzzy logic part of his program could work here.

crcs are good for telling whether stuff is an identical match or not. A way needs to be found to tell how similar stuff is with a probability rating.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 12:26:01 pm
well i was wondering if you took maybe the first 10 seconds (selectable) of a show and made a fingerprint of the waveform. all that data could be recorded and used as a fingerprint, and then this data could also be used to maybe make a match.

this is the reason I would like to find a way to get VU data, as for a VU meter. this VU data could be used to make dupe checks.

once you had that data a CRC and or MD5 fingerprint could be done on that data (to make it shorter)
Title: Re:Duplicates Finder PlugIn
Post by: midknyte on January 18, 2004, 01:10:17 pm
Ok, it is time that I jump into this thread so that what I say does not get lost in translation...

CRC/MD5 Calculation, so far, is not a *problem*, but rather just a difference of opinion and implementation which I am more than willing to change to meet the needs of you the users.

I am also interested in exploring more meaningful ways of drawing matches from things such as substring matching of ID tags that make sense to this audience and perhaps meta tags based on waveform analysis.  Fuzzy logic similar to what I have done with images...

It is true that the program serves images as a primary focus - that is what spawned its creation. However, it has been requested over the last year or so that I extend the matching to other filetypes - music in particular.  Hence its ability now to search for other file type families (even custom).  It is a very flexible and extensible program and right now you only need to teach me how you want to use it so that it can be made to suit your needs so that I can add more Search Modes.

Play nice.  Don't gang up on me.  You have my interest and attention.

Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 02:12:55 pm
Bradley Davidson Aka: midknyte

Is The Programmer Of Dpeg

"d'peg! - The Duplicate Media Manager"

His Link:
http://www.somewareonthe.net/gotdupes/

Quote
Don't gang up on me.

I am sure that will not happen

If Anyone has anything to say Now Would Be A Good Time.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 18, 2004, 03:08:40 pm
Thanks King and midknyte for taking the time to look at this.  I can't add much to what I said before as you lost me when you started talking about CRC and MD5.

As mentioned before, a 'clever' comparison of the artist name and track title would do the job for me, and would probably run much more quickly too.  I'll keep watching this thread with interest though.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 03:51:54 pm
Quote
you lost me when you started talking about CRC and MD5

MD5, SHA-1 and SHA-384 are hashes. they are used to verify data integrity, but can also be used to compare data: if the hash is the same then it is the same file.

CRC = Cyclic Redundancy Check, and it was one of the first methods used when transferring data by a high speed 300 baud modem with X-Modem and X-Modem CRC. plain Xmodem sent the data in 128 byte blocks and just added up the values of the bytes to get a simple checksum. Xmodem CRC computed a 16-bit CRC over the block instead. either way the check value was sent with the block, and when it got to the other side the receiving computer would compute it again from the data it received and compare the two. if they were the same then data good, save it, get next block.

so basically it is just a way to check the data.
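To make the difference concrete, here is a small sketch of the two block checks. The function names are made up; the CRC uses the usual XMODEM parameters (polynomial 0x1021, starting value 0), and the 128-byte blocks are invented test data.

```python
def xmodem_checksum(block):
    """Plain XMODEM check: byte values added up, keeping the low 8 bits."""
    return sum(block) & 0xFF


def xmodem_crc16(block):
    """XMODEM-CRC check: 16-bit CRC, polynomial 0x1021, starting value 0."""
    crc = 0
    for byte in block:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


block = b"AB" + bytes(126)        # a 128-byte block
swapped = b"BA" + bytes(126)      # same bytes, two of them swapped

same_sum = xmodem_checksum(block) == xmodem_checksum(swapped)
same_crc = xmodem_crc16(block) == xmodem_crc16(swapped)
```

The simple sum cannot see that two bytes changed places, while the CRC can. Neither tells you anything about how similar two blocks are, only whether the check values agree.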

as midknyte said

Quote
meta tags based on waveform analysis

maybe another way and that would be to compare a waveform of the sounds in the media file and then compare them with others.

I am however not an expert in this like midknyte is
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 18, 2004, 04:23:04 pm
Yes - but what if the track is basically the same but different.  That didn't come out very well!

I mean what if I have a version from an album and another version from a compilation (like the radio edit).  One might be slightly shorter than the other but not different enough for me to justify keeping 2 copies.  That's why I thought that artist name / track title comparison would still be best.  The rules used could be configurable e.g include comparison of duration too and specify +/- n seconds as a parameter.
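A sketch of the configurable duration rule, assuming durations are stored as text like "24:27" and that tracks are simple dictionaries. Both are assumptions for illustration, as are the function names.

```python
def to_seconds(duration):
    """Convert "m:ss" or "h:mm:ss" text to a number of seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def probable_dup(track_a, track_b, tolerance=10):
    """Same artist and title (ignoring case) and durations within +/- tolerance seconds."""
    same_name = (
        track_a["artist"].lower() == track_b["artist"].lower()
        and track_a["title"].lower() == track_b["title"].lower()
    )
    gap = abs(to_seconds(track_a["duration"]) - to_seconds(track_b["duration"]))
    return same_name and gap <= tolerance


album = {"artist": "Madonna", "title": "Holiday", "duration": "4:05"}
compilation = {"artist": "Madonna", "title": "holiday", "duration": "3:58"}
```

With the default tolerance of 10 seconds these two count as a probable dup. With `tolerance=5` they do not, so the parameter decides how different a radio edit may be before it is treated as a separate track.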
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 05:47:52 pm
Quote
That's why I thought that artist name / track title comparison would still be best.

Well that's well and good but you're assuming that the names\fields are correct.

what happens if you have

(Sittin' On) The Dock Of The Bay - Otis Redding

but the file is really

(Sittin' On) The Dock Of The Bay - Sergio Mendes And Brasil '66

or even something like

The Who-My Generation - The Very Best Of-19-Who Are You.mp3

and mislabeled. by using tags you could be deleting a file you do not want to delete, when what you really need is to fix the tags.

by using something that has a bit more intelligence you can then find that your mislabeled file
Quote
The Who-My Generation - The Very Best Of-19-Who Are You.mp3
matches another song byte for byte. you might then find this file has been tagged wrong by someone and you can then correct the tags or again delete it.

a tag based system has its place but another system for finding dups is needed. this becomes even more important for users who have many files and may have downloaded some of them from the net.

there are some P2P programs that use this method to rack up users under one file name. the file is the same but the tags may change or be wrong, but if one user has a totally different song name than 30 other users the chances are that that one user is wrong.

Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 18, 2004, 05:55:18 pm
I can see the merit in both approaches.  

In my case it's unlikely that the files would be mistagged as badly as you describe as i've done a pretty careful job of tagging on the whole.

It's more likely a typo in the name or some extra data that's causing the mismatch - typically:
- extra stuff in brackets
- apostrophes (Cant / Can't)
- & / And
- Dr. / Dr
etc.

If I could weed these out I'd find a LOT of duplicates and be very happy.
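The clean-up rules in that list could be sketched as a normalising step applied before names are compared. This is only an illustration: the function names are made up and the rules are limited to the ones listed (bracketed extras, apostrophes, & / And, trailing dots).

```python
import re


def normalise(text):
    """Reduce a name to a comparison form using the rules listed above."""
    text = text.lower()
    text = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", text)   # drop extra stuff in brackets
    text = text.replace("&", " and ")                    # & / And
    text = re.sub(r"['`]", "", text)                     # Cant / Can't
    text = re.sub(r"[^a-z0-9 ]", " ", text)              # Dr. / Dr and other punctuation
    return " ".join(text.split())


def dup_key(artist, title):
    """Key that is equal for tracks differing only by the rules above."""
    return (normalise(artist), normalise(title))


key_a = dup_key("Dr. Hook & The Medicine Show", "Can't Stop (Radio Edit)")
key_b = dup_key("Dr Hook And The Medicine Show", "Cant Stop")
```

The two keys come out equal, so the pair would be flagged. Two caveats: dropping brackets also removes the first part of a title like "(Sittin' On) The Dock Of The Bay", and the last rule turns accented letters into spaces, so both would need more care in a real tool.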
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 18, 2004, 06:48:19 pm
Quote
In my case it's unlikely that the files would be mistagged as badly as you describe

yes this is true if you take it direct from CD.

with more and more ways to download media files (and it will increase in the future), a better way of weeding out the dups is needed, or at least more power to the user if needed.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 18, 2004, 09:00:56 pm
Quote
there are some P2P programs that use this method to rack up users under one file name. the file is the same but the tags may change or be wrong, but if one user has a totally different song name than 30 other users the chances are that that one user is wrong.

Does this work if the same track is encoded differently ?

If you look at programs like WaveLab, when a wav file is opened it creates another (internal) file called a peaks file. This is a representation of the actual waveform, and it allows WaveLab to re-open the file faster at a later date. Not sure if Cool Edit has the same method too.

If a similar way can be found to generate a peaks file (call it whatever u want), and then a system was found to create a fingerprint of it, and there was a way to compare 2 diff fingerprints and say to what extent they matched or not, we might have what you are looking for ....King.

Using CRC is fine, but given two CRC values of the supposedly same file (diff encoded tho), can you divine how similar those 2 files are ?

..The answer to that question will determine how well you can find a dupe blindly.
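A toy sketch of the peaks idea, assuming the audio has already been decoded into a list of sample values (decoding MP3 is outside this example). The bucket count, the function names and the use of cosine similarity are all assumptions chosen for illustration; this is not how WaveLab or any real fingerprinting product works.

```python
import math


def peaks(samples, buckets=64):
    """Largest absolute sample value in each of `buckets` equal slices."""
    size = max(1, len(samples) // buckets)
    result = []
    for start in range(0, size * buckets, size):
        chunk = samples[start:start + size]
        if chunk:
            result.append(max(abs(s) for s in chunk))
    return result


def similarity(peaks_a, peaks_b):
    """Cosine similarity of two peak envelopes: 1.0 means the same shape."""
    pairs = list(zip(peaks_a, peaks_b))
    dot = sum(x * y for x, y in pairs)
    norm = math.sqrt(sum(x * x for x, _ in pairs)) * math.sqrt(sum(y * y for _, y in pairs))
    return dot / norm if norm else 0.0


# Made-up "track": a tone that gets louder over time.
track = [math.sin(i / 50) * (i / 4000) for i in range(4000)]
# A quieter, slightly disturbed copy stands in for a different encode.
reencoded = [0.8 * s + 0.01 * math.sin(i * 0.7) for i, s in enumerate(track)]
# The same samples backwards stand in for a different track.
other = list(reversed(track))

close = similarity(peaks(track), peaks(reencoded))
far = similarity(peaks(track), peaks(other))
```

Unlike a CRC, the score degrades gradually: the copy scores close to 1.0 and the unrelated shape scores lower, which gives the kind of "how similar" rating being asked for.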
Title: Re:Duplicates Finder PlugIn
Post by: midknyte on January 18, 2004, 11:07:09 pm
You're getting ahead of yourselves.

"What if the file is not byte for byte the same?"  Same for images, hence, d'peg! does not delete on its' own unless it can confirm that they are indeed duplicate files.  So whatever we come up with, it will be safe.

And what if they are close?  Human review phase of the program...

Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 19, 2004, 05:00:44 am
simple test

- encode track A to 160 kbps
- encode track A to 128 kbps

both encodes are the same track.

Is there a software program out there that can detect that the two are the same or similar using a mechanism based on their content rather than tags ?

a blind check so to speak. Found a cpl of papers while googling..

Finding similar things quickly in large collections (http://research.microsoft.com/research/sv/PageTurner/similarity.htm)

Measurement of Similarity in Music (http://www.music-cog.ohio-state.edu/Huron/Publications/orpen.similarity.text.html)

Not to slight JLee's suggestions on munging track names; this is by far the easier way of dupe checking.

If we can't find a solution to the former, might just have to do it his way.

Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 19, 2004, 11:57:56 am
Today I sat down and created a Plug-In for Media Center that will basically do the same thing Dpeg does.

By that I mean it will scan an mpeg and come up with an MD5 fingerprint. It sticks this info in a field the user wants (I created a field called MD5Fingerprint) and saves it to the ID3v2 tag.

Sample MD5: 8A8728890FCC22D46164925027F81C8C
Name: Faded Rose, The
Artist: Adventures_Of_The_Falcon_-_1945_To_1952
Duration: 24:27

(http://www2.spartasoft.com:8080/images/md5.jpg)

(http://www2.spartasoft.com:8080/images/md5(2).jpg)

this is all well and good, but since saving the tag just changed the file, the MD5 is off a bit.

what needs to happen is to find a way to leave the Id3v2 and Id3v1 tags out of the scan since they will change depending on what side of the bed the user gets up on.

since the SDK does not support deleting Id3v1 and id3v2 tags i may need to find a way around this.

My idea is to copy the MP3 to Temp\temp.mp3

then remove the Id3v2 and id3v1 tags, then scan the file

since this will be the same file without the tags, use that MD5 and save it to the original.

=============================================

Comments?
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 19, 2004, 12:40:43 pm
proof of concept, does it pass the simple test i mentioned.

You can worry about the other details afterwards :)
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 19, 2004, 12:58:24 pm
No, and the only way it would is if there was a waveform type system

I think thats how J Rivers Fingerprint works, the problem is we do not have access to the Fingerprint data, and i would not know how to create such a program.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 19, 2004, 01:40:39 pm
Quote
No, and the only way it would is if there was a waveform type system

Right.

Quote
by using something that has a bit more intelligence you can then find that your mislabeled file

What kind of intelligence would MD5 give ?  I'm lost here king.

you would have a unique ID, but then how would this help to find dupes.

Quote
I think thats how J Rivers Fingerprint works,

JRiver Fingerprint ?? did not know they had an app like that.


Quote
the problem is we do not have access to the Fingerprint data, and i would not know how to create such a program.

It's tricky for sure. i was googling on this topic and came up with interesting stuff.

The ideal thing we want is this (http://www.research.philips.com/InformationCenter/Global/FArticleDetail.asp?lArticleId=2416&lNodeId=842&channel=842&channelId=N842A2416)

If you live in the UK, you can dial up a number and let your phone listen to a track and it will ID it for you here (http://www.shazam.com)

Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 19, 2004, 01:54:54 pm
Quote
What kind of intelligence would MD5 give

well It will compare the Mpeg data only if it is a match then it is an exact copy no matter what the tags say.

what i can also do with this data now is i could create my own database on my server. and maybe save tags that could be pulled up by users.

BTW I just solved the ID3v1\Id3v2 problem.

My program now copies the file to a temp folder, then removes the Id3v1 and ID3v2 tag and the Mpeg is evaluated (this should never change) and in the tests i just made it works well.

this however will work with mp3, but will not work on other file types.

this as we talked about will not evaluate between bit rates, encoders that generated the mp3 etc..

Quote
JRiver Fingerprint ?? did not know they had an app like that.

this is built into MC9 and MC10. if you select files, right click and use the option to submit to YADB, it will evaluate the media file and create a fingerprint. this will then be submitted to YADB and is used for By File lookups.

Most people do not know this option is there, and it has nothing to do with CD submitting. i also think they could\should allow a user to auto submit this info to YADB after a file is encoded. at one point the comment was "Good Idea" but I never saw it happen in MC9.

also the fingerprint data was available at one time in the database but was removed. i guess they figured no one would need it or they did not want others to get the data that was generated (Pick one).
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 19, 2004, 02:22:45 pm
Quote
It will compare the Mpeg data only if it is a match then it is an exact copy no matter what the tags say.

Correct

..but is it likely that you have tracks that are exactly the same ?  ( given same enc rate etc, encoders)

i'm waiting for you to test this out on your MASSIVE library and see your comments.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 19, 2004, 02:54:10 pm
Quote
..but is it likely that you have tracks that are exactly the same ?

Yes, people who download files often do.

I for one have many OTR files with the same title, the problem is it is part 1 of 22 and the part 1 of 22 is missing in the tag. so it is possible that i could have dups of the shows.

it is also possible users as i said could have the wrong tags in the file.

by using the MD5Fingerprint field and using MC9's\MC10's ~dup option to find the duplicate files i could then make sure i have no dups and if the file is really tagged correctly.

and i am running like 5,000 files thru it now, none of them should have dups if there are it should be just a few.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 19, 2004, 04:23:52 pm
Test Results:

After 550 files

6 dups found

1 file found that has the wrong tags, and is not even the same artist but was a match with another file with the correct tags.
Title: Re:Duplicates Finder PlugIn
Post by: midknyte on January 19, 2004, 06:12:02 pm
The aforementioned problem with the exe has been fixed and a new version posted.  It would crash if one ran a cursory scan where a folder had a single quote in it (which prematurely terminated an SQL statement)...
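For anyone curious about that kind of bug, here is a general illustration using Python's built-in sqlite3 module. It is not d'peg!'s actual code or database; the table, the folder name and the file name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (folder TEXT, name TEXT)")

folder = "C:\\Music\\Guns N' Roses"      # folder name containing a single quote

# Safe: the value is passed as a parameter, so the quote is just data.
conn.execute("INSERT INTO files (folder, name) VALUES (?, ?)", (folder, "track01.mp3"))
rows = conn.execute("SELECT name FROM files WHERE folder = ?", (folder,)).fetchall()

# Unsafe: pasting the value into the statement lets the quote end the string early.
broken = False
try:
    conn.execute("SELECT name FROM files WHERE folder = '" + folder + "'")
except sqlite3.OperationalError:
    broken = True
```

The parameterised query finds the file, while the pasted-together statement fails with a syntax error. Doubling the quote before pasting also works, but passing parameters avoids having to remember to do it.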

I will be changing the checksum routines to strip the ID tags.  A very good observation.

Those interested, please come by and download a copy and play with the ID info string and substring matching that the program has already and then suggest to me what semi-intelligent sub matching of this info would be helpful in your real world situations.

http://www.GotDupes.com

Does anyone know of any dll's or ocx's that do or would facilitate plotting of the VU info of a song file?

Thanks...

Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 20, 2004, 02:47:43 am
Great progress King.  Can't wait to give it a try.

This bit worries me a bit though
Quote
this as we talked about will not evaluate between bit rates, encoders that generated the mp3 etc..

....'cos I think many of my dups come from different sources e.g. a track on a compilation is likely to have been encoded differently to a track on the artist's album.  Also I have a lot of NOW! albums and these tend to have slightly different versions of the tracks than what is on the album.

Is there any chance you could make 2 versions of the plug in?  One that compares artist/trackname (smartly) and one that compares fingerprints.  That way we could run both one after the other.  Then we could start the human review stage after running both and get an increased chance of success.

I'm gonna go away and try to think of some smart rules (excel formula style).

Still I'm well impressed with what you've achieved so far.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 20, 2004, 05:58:34 am
Quote
One that compares artist/trackname (smartly) and one that compares fingerprints.

the program i made does not compare the files you can do that like i said in MC10 using the Duplicate function, and using the field that has the MD5 checksum.

I have found a few files that came up with the same MD5 and are not the same file. this can be overcome by including duration in your ~Duplicate search in MC10. i was also thinking that i could add an option (normally on) that would include the duration when computing the checksum. this should give fewer false matches (maybe).

midknyte's program will allow you to pick a second match option not sure if duration is one of them but it might be a good option for him to include.

midknyte

About the VU, been looking I had the same thought I guess.

the file may need to be played for a few seconds (set by user?) with min\max settings
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 20, 2004, 07:57:10 am
Duration will not help for me for the same reasons I stated above e.g:
Quote
Also I have a lot of NOW! albums and these tend to have slightly different versions of the tracks than what is on the album.

I respect your views on this and the fact that it is you guys who have the knowledge and are prepared to put in the work on this.  However, if this plug in goes in the 'fingerprint' direction then it will not work for me so I'll respectfully bow out here.

I'll export my library to excel again and come up with some quick sexy formulas and macros that will trawl through, find the potential dups and offer a dup by dup on screen review.  Then I'll print out the ones I mark as dups and manually tag them.

For info I've come up with a formula that seems to be working really well with artist and track title.  First it strips out unwanted text, then it does 4 passes to find dups, and if any one of these four finds another track with the same value it's flagged for review.

Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 20, 2004, 09:53:57 am
JLee

That's what we are talking about, and midknyte is trying to think of a way that what you're saying can be done thru doing a waveform sample of the music.

if you have not tried his program you should give it a shot since it does have a lot of the things you're asking for short of the waveform sample.

maybe if he can figure out how to do waveform sampling he could include it with other options you requested.

I am sure if you give midknyte some time and some input in what users need and why, with samples he might add it to his program.

I however am not trying to replace his program, and my program is only meant to match Exact data from the media file, that is not left up to User input error like what you want.

Like i said his program does some of the things you requested.

========================================================

On another note scanning 50,000+ mp3's 3087 of them match so there is about 1,543 dups taking in account the fingerprint and the duration of the file.

if you take the duration out there are 3135 files that match, so just using the checksum has an error rate on the matches.
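
For anyone following along, matching on checksum plus duration can be sketched roughly like this in Python. This is only an illustration, not the plug-in's code: the tuple layout, the function name `group_matches` and the sample data are all made up.

```python
from collections import defaultdict

def group_matches(tracks, use_duration=True):
    """Group tracks that share a key; return only groups with 2+ members.

    tracks: iterable of (path, checksum, duration_seconds) tuples
            (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for path, checksum, duration in tracks:
        # With use_duration the key is stricter, so fewer files match.
        key = (checksum, duration) if use_duration else (checksum,)
        groups[key].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

# Made-up sample library: c.mp3 shares a checksum but not a duration.
tracks = [
    ("a.mp3", "abc123", 201),
    ("b.mp3", "abc123", 201),
    ("c.mp3", "abc123", 187),
    ("d.mp3", "ffee00", 187),
]

strict = group_matches(tracks)                      # checksum + duration
loose = group_matches(tracks, use_duration=False)   # checksum only
```

Comparing the two results shows which matches depend on the checksum alone, which is the set worth a second look.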
Title: Re:Duplicates Finder PlugIn
Post by: Zarius on January 20, 2004, 07:26:20 pm
KingSparta:

I like that plugin :)  Any chance of grabbing it for a test?  Also, I thought MD5 was pretty accurate... guess I was wrong :)

I can see advantages in both forms of dupe checking and look forward to the development of midknyte's program too :)
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 20, 2004, 07:45:16 pm
When I Get Done Playing With It.

The Pictures Above Have Been Updated.
Title: Re:Duplicates Finder PlugIn
Post by: Zarius on January 20, 2004, 09:53:29 pm
Okay, looking forward to giving it a whirl through my library :)
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 21, 2004, 03:49:52 pm
JLee

One other thing i was thinking about, and had about an hour today to work on it is to take the artist name and song name strip it down and create a MD5 Hash for that.

Take a look at the updated pictures.

Basically All Non A-Z Chrs Are Stripped.

Song Names that are Like: "867-5309" would be stripped down to Nothing, in this case the string reverts to the original string.

"867-5309/jenny" Charted At 02 In 1982

Listening to: '867-5309/jenny' from 'Sounds Of The Eighties (1982)' by 'Tommy Tutone' on Media Center 10
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 22, 2004, 02:52:14 pm
Quote
Concatenate the first 5 characters (or less) of the artist name with the first 5 characters (or less) of the track title.

e.g. Fleetwood Mac - You make lovin' fun becomes FleetYouma and
Fleetwood Mac - You make loving fun (Radio edit) becomes FleetYouma

this sounds good, but how about we make this user selectable

0 = all of artist name and Song Name
5 = upto 5 chrs from each
6 = upto 6 chrs from each
etc....
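
A rough Python sketch of the user-selectable check name, for illustration only. It assumes spaces and punctuation are dropped before the characters are counted (that is what the FleetYouma example implies), and the function name `check_name` is hypothetical.

```python
import re

def check_name(artist, title, length=5):
    """Build a check name from the start of the artist and title.

    length=0 means use all of both fields, otherwise up to `length`
    characters are taken from each.
    """
    def head(text):
        # Assumption: only letters and digits count towards the limit.
        cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
        return cleaned if length == 0 else cleaned[:length]
    return head(artist) + head(title)

a = check_name("Fleetwood Mac", "You make lovin' fun")
b = check_name("Fleetwood Mac", "You make loving fun (Radio edit)")
# Both come out as "FleetYouma", so the two tracks are flagged together.
```

A shorter length gives more possible matches to review, a longer one gives fewer.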
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 22, 2004, 04:45:16 pm
Quote
Concatenate the first 5 characters (or less) of the artist name with the first 5 characters (or less) of the track title.

e.g. Fleetwood Mac - You make lovin' fun becomes FleetYouma and
Fleetwood Mac - You make loving fun (Radio edit) becomes FleetYouma

this sounds good, but how about we make this user selectable

0 = all of artist name and Song Name
5 = upto 5 chrs from each
6 = upto 6 chrs from each
etc....
Great minds think alike hey King!  I made an excel macro that does a really good job of finding Dups.  It uses 5 char of each but I was just think about adding options like:
- High accuracy / fewer possible matches (this would use 7 char)
- Balanced (5 char)
- Low Accuracy / more possible matches (3 char).

Because I don't have your expertise I'm not able to actually update the tags from my program so I tile it with MC and use a specially constructed view scheme in tagging mode to manually work through and correct the dups.

My previous effort with Dups found about 1.5K out of 23K.  This macro instantly found a possible 3.7K duplicates!

It has a BIG drawback in that I cannot review and tag directly from the program.  I tried to use the program with a batch file and mjextman.exe commands to add the appropriate tracks to playing now where I could tag them (clumsy I know!).  This wouldn't work for me as any attempt to paste filename data into excel crashed MC due to the number of bytes being copied (my filenames are very long!).

However, it does prove that the algorithm works - for me anyway and the number of potential duplicates found was staggering.

I'll paste a link to it here in a few mins.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 22, 2004, 05:38:39 pm
Here it is (http://mysite.freeserve.com/jleerigby/otherfiles/dupfind.zip)
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 22, 2004, 06:06:16 pm
Ok Picture Is Again Updated.

Min Is 5 Chrs, Max 256 Chrs Per Field (Artist name And Song Name)

I think This Is About It For Now, I May Play With It For A Day Or So And Then Put Out A Build For Everyone To Play With.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 23, 2004, 02:31:23 am
Quote
Ok Picture Is Again Updated.

Min Is 5 Chrs, Max 256 Chrs Per Field (Artist name And Song Name)

I think This Is About It For Now, I May Play With It For A Day Or So And Then Put Out A Build For Everyone To Play With.


King.  Are you planning to do the 4-pass thing I mentioned earlier.  Through running my macro I've found a lot of instances where this has been useful.  As a reminder it concatenates:
- First 5 artist with Last 5 track name
- Last 5 with First 5
- First 5 with First 5
- Last 5 with Last 5

This helps as often there are extra characters added to the beginning or end of artist / song names e.g.

George Michael & Aretha Frankin - I knew you were waiting
George Michael - I knew you were waiting for me
Aretha Frankin & George Michael - I knew you were waiting for me
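
The 4-pass idea can be sketched in Python along these lines. It is a guess at the logic rather than the actual Excel macro: the helper names are hypothetical, and lowercasing plus stripping everything except letters and digits is an assumption about the clean-up step.

```python
import re
from collections import defaultdict

def clean(text):
    # Assumption: compare on lowercase letters and digits only.
    return re.sub(r"[^a-z0-9]", "", text.lower())

def pass_keys(artist, title, n=5):
    """Return the four check names, one per pass."""
    a, t = clean(artist), clean(title)
    return {
        "first+last": a[:n] + t[-n:],
        "last+first": a[-n:] + t[:n],
        "first+first": a[:n] + t[:n],
        "last+last": a[-n:] + t[-n:],
    }

def flag_for_review(tracks, n=5):
    """Return indexes of tracks that share a key with another track in any pass."""
    seen = defaultdict(list)  # (pass name, key) -> track indexes
    for i, (artist, title) in enumerate(tracks):
        for name, key in pass_keys(artist, title, n).items():
            seen[(name, key)].append(i)
    flagged = set()
    for indexes in seen.values():
        if len(indexes) > 1:
            flagged.update(indexes)
    return sorted(flagged)

tracks = [
    ("George Michael & Aretha Frankin", "I knew you were waiting"),
    ("George Michael", "I knew you were waiting for me"),
    ("Aretha Frankin & George Michael", "I knew you were waiting for me"),
    ("Rose Royce", "Car Wash"),
]
flagged = flag_for_review(tracks)
```

In this sample the first two tracks meet on the first+first pass and the last two on the last+first and last+last passes, so all three George Michael / Aretha tracks are flagged while the Rose Royce track is not.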

Title: Re:Duplicates Finder PlugIn
Post by: Zarius on January 23, 2004, 08:07:09 am
Quote
I have found a few files that came up with the same MD5 and are not the same file. this can be overcome by including duration in on your ~Duplicate search in MC10. i was also thinking that i could add an option (normaly on) that would include the duration when computing the Checksum, this should get less of a false match (maybe).

Hmm... upon doing some reading on MD5 it seems incredibly unlikely to have any files with the same MD5 hash... (eg: this url (http://www.forensics-intl.com/art12.html) or searching google for MD5,hash,clash and/or duplicate).... just wondering if I'm misunderstanding how or what you are doing the MD5 on. (I'm assuming you did a 128bit MD5 on the whole mp3 file minus the tag header.)
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 23, 2004, 08:48:12 am
Quote
One other thing i was thinking about, and had about an hour today to work on it is to take the artist name and song name strip it down and create a MD5 Hash for that.

What diff will

making a MD5 Hash on the name do ? vs just using the filename/song name ?

Seems like an unnecessary extra step to me.
Title: Re:Duplicates Finder PlugIn
Post by: Zarius on January 23, 2004, 09:06:37 am
Quote
What diff will making a MD5 Hash on the name do ? vs just using the filename/song name? Seems like an unnecessary extra step to me.

Doing a MD5 on the name will allow you to find files with the same name, but the data may be different... then it's up to the user to determine whether they are the same.

This differs from MC's duplicate name checker in that KingSparta's MD5'ing strips non A-Z chars from the name before making the MD5... that's as much as I know at the moment.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 23, 2004, 12:56:01 pm
Quote
Doing a MD5 on the name will allow you to find files with the same name, but the data may be different... then it's up to the user to determine whether they are the same.

Still not convinced... why not just compare the file names or whatever instead of creating a hash and then doing the compare of the hash ...

King...care to clue us in here ?
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 23, 2004, 01:10:52 pm
Quote
This differs from MC's duplicate name checker in that KingSparta's MD5'ing strips non A-Z chars from the name before making the MD5... that's as much as I know at the moment.

and also you can limit the search field Length.

So if you had

Car Wash ('98 (Remix)' by Rose Royce
and
Car Wash (2) by Rose Royce

and limited the string to lets say (10 chrs) you would have

Rose Royce Car Wash (

since we then strip spaces, and non alpha chrs and convert all to lowercase

roseroycecarwash

and also convert all "&" chrs To "and"

the chances may be better in matching.

the need for this to be a MD5 string is nothing more than I Can, it matters not then if it was left in text form or not.
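
As a rough illustration of the steps above (limit each field, convert "&" to "and", lowercase, strip everything that is not a-z, then hash), here is a minimal Python sketch. It is not the plug-in's source: the function names are hypothetical, and the fallback to the original string when nothing is left is applied per field as an assumption.

```python
import hashlib
import re

def clean_field(text, limit=10):
    """Cut the field to `limit` chars, then reduce it to lowercase a-z."""
    cut = text[:limit].replace("&", "and").lower()
    stripped = re.sub(r"[^a-z]", "", cut)
    # Assumption: if nothing survives (e.g. "867-5309"), keep the original.
    return stripped or text

def text_key(artist, title, limit=10):
    return clean_field(artist, limit) + clean_field(title, limit)

def text_hash(artist, title, limit=10):
    """MD5 of the text key; the plain key would match just as well."""
    return hashlib.md5(text_key(artist, title, limit).encode("utf-8")).hexdigest()

key_remix = text_key("Rose Royce", "Car Wash ('98 (Remix)")
key_two = text_key("Rose Royce", "Car Wash (2)")
# Both keys are "roseroycecarwash", so both tracks get the same hash.
```

Since equal keys always give equal hashes, the MD5 step changes nothing about which tracks match; it only changes what is stored in the field.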

Quote
Hmm... upon doing some reading on MD5 it seems incredibly unlikely to have any files with the same MD5 hash...

well it is unlikely but it happens, and i have looked into this and there are files that come up with the same MD5 hash. as a matter of fact when talking to someone from J river this was an issue when they were making their fingerprinting system so they also use some other elements from what i could understand.

as a sample from MusicBrainz

TRM Id: aa141094-b06b-4c2a-8925-3fbe55866974

is a song from Alan Jackson - Drive For Daddy, And Also A Song From Incubus

sure this could be made into a 64bit or 128 bit hash but that may be going a bit overboard.


Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 23, 2004, 01:57:28 pm
Quote
So if you had

Car Wash ('98 (Remix)' by Rose Royce
and
Car Wash (2) by Rose Royce

and limited the string to lets say (10 chrs) you would have

Rose Royce Car Wash (

since we then strip spaces, and non alpha chrs and convert all to lowercase

roseroycecarwash

and also convert all "&" chrs To "and"

the chances may be better in matching.

I think my algorithm is a bit more aggressive King so will find more matches.  I do the stripping out first in this order:

1. Replace x with y (From a configurable list)
2. Get rid of anything inside brackets
3. Non a-z / 0-9 (I think 0-9 needs to stay as it's relevant in many artist names)

Only after this is done do I strip down to 5 characters.
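
The order described above (replace list, then brackets, then non a-z / 0-9, and only then cut to 5 characters) might look something like this in Python. This is a sketch under assumptions, not the macro itself: the function name is hypothetical, and the replace list is treated as whole words only.

```python
import re

def strip_down(text, replacements=(), n=5):
    """Clean a field first, then keep the first n characters."""
    text = text.lower()
    # 1. Replace x with y from a configurable list (whole words, assumption).
    for old, new in replacements:
        text = re.sub(r"\b" + re.escape(old.lower()) + r"\b", new, text)
    # 2. Get rid of anything inside round or square brackets.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", text)
    # 3. Drop everything that is not a-z or 0-9 (digits stay).
    text = re.sub(r"[^a-z0-9]", "", text)
    # Only now cut down to n characters.
    return text[:n]

short_a = strip_down("Car Wash ('98 (Remix)", n=10)
short_b = strip_down("Car Wash (2)", n=10)
```

Because the brackets are removed before the cut, they never use up any of the characters that are kept.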

Is there any way you can accommodate the first 5 / last 5 thing I mentioned earlier using the George Michael & Aretha example?  I can see how this would make things more complicated but I really think that this is what's made the difference when I've tested it with my macro.

Any views?

[Edit - just read PM - but I don't understand what this hash thing is? Aren't we just talking about adding a tag to a field in MC that we can filter in panes and view schemes?  

This needs a bit of thought when I'm sober but I'm thinking something like....We could have a separate field for pass1, pass2, pass3, pass4 etc.  You can get MC to check for dups on each field in turn.  So if you don't catch it on the review of pass1 matches you will get it on the review of pass2 etc.]
Title: Re:Duplicates Finder PlugIn
Post by: midknyte on January 23, 2004, 02:47:45 pm
I have updated d'peg! (as a beta - download instructions below) to calculate the CRC and MD5 tags w/o the IDTags.

Also, it already does some amount of special character and numeral ignoring in filenames by way of match modes called Basename and Basename (SubString).   Matching against IDTags is available too.

You can play the files from within the matching interface for comparisons.  Once registered, it allows you to do scans against offline files (on CDs) without reloading them.

Get the skinny here.

http://www.GotDupes.com (http://www.GotDupes.com)

Download the full install right now here

http://www.somewareonthe.net/anonftp/installdpeg610a.exe (http://www.somewareonthe.net/anonftp/installdpeg610a.exe)

Note - if the above link does not work, it means that I have posted a new version and the filename has changed.  Go to site download page instead.

Once you have it installed, here is an exe with the changes to the CRC and MD5 calculations.  I am waiting to post them into the next version until after I have a chance to look around some more for the ability to do some waveform analysis.

http://www.somewareonthe.net/anonftp/beta/dpeg.zip (http://www.somewareonthe.net/anonftp/beta/dpeg.zip)


Title: Re:Duplicates Finder PlugIn
Post by: midknyte on January 23, 2004, 04:38:39 pm
Good news.  I am on the cusp of an audio fingerprinting solution.  But this raises a question.

The program, as it scans your music files, would likely have to load and play a few seconds of each file in order to sample it and generate the fingerprint.  Obviously time consuming (moreso than loading a picture and generating the fingerprint that it already does), though it is a task that you would leave the machine to do all by itself while you leave to do other things with your time.

My question is - is this acceptable?
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 23, 2004, 05:58:36 pm
>> I have updated d'peg!
Way Cool

>> My question is - is this acceptable?
i think that depends on how long it is i guess, but i would think the answer is yes.

BTW: Both are outstanding news for your program!

Mark
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 23, 2004, 09:00:05 pm
I just Posted a beta build of my plug-in on my FTP Server

At IP Address: 66.57.193.58
User: anonymous
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 24, 2004, 05:16:09 am
Quote
the need for this to be a MD5 string is nothing more than I Can, it matters not then if it was left in text form or not.

My point precisely !!!

i would strongly recommend following JLee's ideas..King.

It would be interesting to see what your reaction to them is when you try it out on your library. I'm betting it will reveal lots of new dupes.

Fingerprint method requires more research to be effective. Fingerprints based on hashes are not very useful to 99% of ppl that would need dupe checking.

Fuzzy fingerprinting looks to be more promising, based on the audio properties. It's an interesting topic, something i hope to learn more about.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 05:27:57 am
Quote
Fingerpint method requires more research to be effective.

I don't agree

Only when you fingerprint the file name would i agree

If you used the file name hash with the duration it will find all the dups and is 100% accurate (so far as i have found going thru my 50,000+ files)

Quote
Fuzzy fingerprinting looks to be more promising, based on the audio properties.
Yes it would be nice, but also comes with it's own problems.

Quote
i would strongly recomend following JLee's ideas.
don't see it happening just yet, the program makes Hashes, that's all it is meant to do. and they can be used in Media Center to find possible dups.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 05:30:24 am
If anyone has installed it, did it work?

I may need to change something in the install package if not.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 24, 2004, 06:44:46 am
Quote
Fingerpint method requires more research to be effective.   

I don't agree

Only when you fingerprint the file name would i agree

when i say "effective" i refer you to my simple test: encode 2 files with diff bit rates. If the program can tell they are the same, which they are, then it is working. By "program" i am referring to any program, not yours specifically ...King.

your 100% method will work for a very small % of files, or very well in your case, which is quite unique. You say these files are downloaded off the net; in my experience such files are often incomplete, not accurately tagged or, more commonly, encoded using diff encoders. Maybe the OTR world is more standardised.

The problem is i have lots of dupes that are not 100% exact ( which i suspect is a common occurrence). I need a way to be able to tell that files are similar.


Quote
Fuzzy fingerprinting looks to be more promising, based on the audio properties.   
Yes it would be nice, but also comes with it's own problems.

Sure, its quite challenging. I saw a cpl of papers the other day that went in to the gory details. The theory itself is quite complex.

But if JLee's method works for the majority of files, the incentive to develop a "real" fingerprint checker is moot. Maybe i should not use the term fingerprint as it means unique. I am referring more to a program that can detect similarities by sound.

Question is how effective is JLee's method ? im guessing pretty good. It seems to address the shortcomings of the dupe checker built into MC. What are they again...JLee ?

I just wish JRiver would sort this out themselves.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 08:50:35 am
Quote
Question is how effective is JLee's method ?

Maybe good if you ripped the files yourself.

not too good if they are downloaded and could have any tags.

============================================

I am still trying to figure out a way to do an Audio fingerprint that could be used to compare the files when compressed at different bit rates.

I have sent a few e-mails to a few companies, i will see what i get back, if i get something i will add it to the program.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 10:14:31 am
I just installed it on my wife's computer, found a few mistakes in the install program and fixed them.

I created a web page for it now and you can try it at this link. it has pictures of how to setup Media Center And Use the Duplicate function along with some directions about the program.

http://www.spartasoft.com (http://www.spartasoft.com)
Title: Re:Duplicates Finder PlugIn
Post by: KeystoneCop on January 24, 2004, 10:50:50 am
Thanks.. The key was deleting the folder. LOOKS FINE NOW.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 11:03:31 am
glad it's working now.

let me know how the matches turn out etc....
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 24, 2004, 01:09:03 pm
Quote
Question is how effective is JLee's method ? im guessing pretty good. It seems to address the shortcomings of the dupe checker built into MC. What are they again...JLee ?

If you have excel installed try it out for yourself from the link above.  You've got nothing to lose as it doesn't touch your files.  You just cut and paste your library list from MC into the excel sheet and hit the various buttons.  You'll see whether the algorithm works.  

For me it finds a lot of dups but it also finds a lot that are not dups.  That's not an issue as I just click one button and it skips to the next.

Try it out first on a smallish number of files.  When I run the macro that initially analyses the filenames on my 2.2 Ghz machine against 27000 files it takes about 30 mins and uses 100% CPU.  I just go off and do something else while it's doing this bit.  When it's done you can review each dup individually.  I resize excel and tile it with MC so I can see both.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 24, 2004, 06:44:21 pm
Quote
glad it's working now.

let me know how the matches turn out etc....

Just installed it and did a random test of 500 and the results looked really promising.  The text hash found a possible 65 duplicates whilst the file hash found none.

My algorithm found 81 duplicates on the same files.  The ones that the plugin didn't find, which is as expected given our previous discussions, were those where the differences were at the start of the artist / name rather than the end.

I'm now waiting for it to run through my whole library (just the text search).  This will be a real timesaver King.  Thank you.
Title: Re:Duplicates Finder PlugIn
Post by: c1c9k72 on January 24, 2004, 07:54:22 pm
Just downloaded King's plug-in, and it's great.  If this is just the first version, I can't wait to see what other features he'll be adding in later.

One thing which I'm curious to see if it gets implemented is to use the MD5 Hash to error check existing songs.  Would it be possible to have it scan a song with an existing MD5 hash to see if it's been damaged?  I'm not sure how they are created, but would a single-bit error cause a shift in the Hash?

Anyway, a great new addition to an already great program.  Thanks.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 24, 2004, 08:07:49 pm
Quote
Would it be possible to have it scan a song with an existing MD5 hash to see if it's been damaged?
Yes it could be done.

not sure how you would want to be notified that the hash has changed. I would hate for the batch to stop just to tell you the hash has changed. Maybe a Verify Field with "OK" if it passed the check.

Quote
I'm not sure how they are created, but would a single-bit error cause a shift in the Hash?
yes, it would change the hash
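
A small Python illustration of the point, using the standard `hashlib` module. The `verify` helper and the sample bytes are made up for this sketch and say nothing about how the plug-in will actually report a changed hash.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def verify(data: bytes, stored_hash: str) -> str:
    """Compare a fresh hash with the stored one."""
    return "OK" if md5_hex(data) == stored_hash else "Changed"

# Stand-in for the audio data of a song.
original = bytes(range(256)) * 4
stored = md5_hex(original)

# Flip a single bit somewhere in the middle.
damaged = bytearray(original)
damaged[100] ^= 0x01

status_original = verify(original, stored)
status_damaged = verify(bytes(damaged), stored)
```

The untouched copy verifies as "OK" while the copy with one flipped bit comes back as "Changed".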
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 25, 2004, 01:51:04 am
Quote
One thing which I'm curious to see if it gets implimented is to use the MD5 Hash to error check existing songs.  Would it be possible to have it scan a song with an existing MD5 hash to see if it's been damaged?  I'm not sure how they are created, but would a single-bit error cause a shift in the Hash?

Do you mean sfv instead of MD5 ?

If you make any modifications to the file tagging etc, then sfv won't match.

If you really mean MD5, I don't know if there are any programs out there that will create a MD5 hash of just the audio content of the program ( ignoring the tagging part). Which could then be tested for errors ?

Upto now i have been using a program called mp3bookhelper (http://mp3bookhelper.sourceforge.net) that creates an sfv of the audio portion, the author of mp3bookhelper calls it sv.
Title: Re:Duplicates Finder PlugIn
Post by: Zarius on January 25, 2004, 04:41:34 am
Quote
If you really mean MD5, I don't know if there are any programs out there that will create a MD5 hash of just the audio content of the program ( ignoring the tagging part). Which could then be tested for errors ?

Er......... (from this same thread)
 
Quote
My program now copies the file to a temp folder, then removes the Id3v1 and ID3v2 tag and the Mpeg is evaluated (this should never change) and in the tests i just made it works well.

this however will work with mp3, but will not work on other file types.

this as we talked about will not evaluate between bit rates, encoders that generated the mp3 etc..

Quote
I have updated d'peg! (as a beta - download instructions below) to calculate the CRC and MD5 tags w/o the IDTags.
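
For the curious, hashing only the audio part of an mp3 can be sketched in Python roughly as follows. This is not the code of either program mentioned in this thread: it works on bytes in memory instead of a temp copy, the function names are hypothetical, and it only handles a leading ID3v2 tag and a trailing ID3v1 tag.

```python
import hashlib

def strip_id3(data: bytes) -> bytes:
    """Remove a leading ID3v2 tag and a trailing ID3v1 tag, if present."""
    # ID3v2: 10-byte header = "ID3", version (2 bytes), flags (1 byte),
    # size (4 "syncsafe" bytes, 7 useful bits each).
    if len(data) >= 10 and data[:3] == b"ID3":
        size = ((data[6] & 0x7F) << 21 | (data[7] & 0x7F) << 14 |
                (data[8] & 0x7F) << 7 | (data[9] & 0x7F))
        footer = 10 if data[5] & 0x10 else 0  # optional footer
        data = data[10 + size + footer:]
    # ID3v1: the last 128 bytes, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        data = data[:-128]
    return data

def audio_md5(data: bytes) -> str:
    return hashlib.md5(strip_id3(data)).hexdigest()

# Made-up test data: the same "audio" with and without tags.
audio = b"\xff\xfb" + bytes(500)
id3v2 = b"ID3\x03\x00\x00" + bytes([0, 0, 0, 20]) + bytes(20)
id3v1 = b"TAG" + bytes(125)
tagged = id3v2 + audio + id3v1
```

With this, retagging a file leaves the hash alone, but a different bit rate or encoder still produces different audio bytes and therefore a different hash.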
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 25, 2004, 05:02:34 am
King, Some initial suggestions having reviewed part of my library:

- Please strip out the unwanted characters first and then compare what's left with however many characters we choose (5 for me).  This way unwanted characters are not using up our 5.
- Please strip out 'The'
- Don't replace '&' with 'AND'. Just strip both out of the matching process as they add little value.
- Or better still... Offer an option where we can specify that we want to replace x with y (where y could also be blank if we choose)  I would use this to replace 'Featuring', 'Feat' and 'Ft'.
- Offer an option to remove text inside brackets.

Let me know what you think as I'd rather wait for a new build if there'll be one before running through the whole library as the human review part is time consuming.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 25, 2004, 08:55:32 am
Quote
Please strip out the unwanted characters first and then compare what's left with however many characters we choose (5 for me).  This way unwanted characters are not using up our 5.
Ok I Can See This, Working On It Now

Quote
Please strip out 'The'
OK, Working On It

Quote
Or better still... Offer an option where we can specify that we want to replace x with y (where y could also be blank if we choose)  I would use this to replace 'Featuring', 'Feat' and 'Ft'.
- Offer an option to remove text inside brackets.
Ok

Quote
- Don't replace '&' with 'AND'. Just strip both out of the matching process as they add little value.

I Don't See Why This Matters; Removing Both Is The Same As Having Both.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 25, 2004, 10:05:50 am
How About This, This Should Do Everything you Ask

(http://www2.spartasoft.com:8080/images/md5(6).jpg)

(http://www2.spartasoft.com:8080/images/md5(7).jpg)

Two Things To Remember What You Type In Will:

1. Not Change Any Tags

2. If You Type In Remove "The" It Will Remove "The" From All Words That Have "The" In It. To Over Come This " The " Will Remove Only "The"

Samples: For Delete " The " Using "The Beatles At The Tower"

After: "Beatles At Tower"

Samples: For Replace Using "&" Replace With "And" Using "The Beatles & Jerry Springer"

After: "The Beatles And Jerry Springer"
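
A rough Python sketch of why " The " with spaces behaves differently from plain "The". It is only a guess at the behaviour shown in the samples: the function name is hypothetical, and padding the text with a space on each side is an assumption made so that a leading "The" also matches " The ".

```python
def apply_rules(text, rules):
    """Apply (old, new) replacements to a copy of the text.

    The tags themselves are never touched; only the string used
    for matching is changed.
    """
    padded = " " + text + " "  # assumption: lets " The " match at the ends
    for old, new in rules:
        padded = padded.replace(old, new)
    return padded.strip()

# Deleting " The " (written here as replace-with-one-space).
whole_word = apply_rules("The Beatles At The Tower", [(" The ", " ")])

# Deleting plain "The" also eats into longer words.
inside_word = apply_rules("Theme From Shaft", [("The", "")])

# Replacing "&" with "And".
ampersand = apply_rules("The Beatles & Jerry Springer", [("&", "And")])
```

The first call gives "Beatles At Tower", the second leaves "me From Shaft" and the third gives "The Beatles And Jerry Springer".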
Title: Re:Duplicates Finder PlugIn
Post by: KeystoneCop on January 25, 2004, 10:17:45 am
KING..  THIS IS FANTASTIC..  I am only playing with the File Hash right now, but it is working 100% so far for me.  Boy did I find mislabeled songs.  Only small thing is if you do duplicates on MD5FileHash, and no duplicates on name, artist you miss some of the bad ones. Any easy way to only show the files where md5filehash is the same but name artist are not the same? (I don't want to use MD5TextHash, as I am trying to get the names to be the same.)

Not a big deal, Cause THIS IS GREAT STUFF..
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 25, 2004, 10:23:59 am
Quote
any easy way to only show the files where md5filehash is the same

no Clue If you can do this by adding Duplicates option or "no Duplicates" or a combo of both

you might try Adding Modifier "duplicates" with Hash Field, And Add Modifier "No Duplicates" Using Title Or something
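
Outside of MC, the check being asked for (same file hash, different name or artist) could be sketched in Python like this, for example on an exported library list. The dictionary keys and the function name are hypothetical and do not correspond to MC field names.

```python
from collections import defaultdict

def mislabeled_groups(tracks):
    """Return groups of tracks with the same file hash but different labels."""
    by_hash = defaultdict(list)
    for track in tracks:
        by_hash[track["file_hash"]].append(track)
    result = []
    for group in by_hash.values():
        # Compare artist/name without regard to case.
        labels = {(t["artist"].lower(), t["name"].lower()) for t in group}
        if len(group) > 1 and len(labels) > 1:
            result.append(group)
    return result

# Made-up sample data.
library = [
    {"file_hash": "aa11", "artist": "Rose Royce", "name": "Car Wash"},
    {"file_hash": "aa11", "artist": "Rose Royce", "name": "Carwash"},
    {"file_hash": "bb22", "artist": "Tommy Tutone", "name": "867-5309/Jenny"},
    {"file_hash": "bb22", "artist": "Tommy Tutone", "name": "867-5309/jenny"},
]
groups = mislabeled_groups(library)
```

Only the Rose Royce pair is returned here, since the two Tommy Tutone entries differ in letter case only.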
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 25, 2004, 11:07:52 am
King - You are awesome!  King of all Plug Ins.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 25, 2004, 11:26:01 am
You Can Download It Now. Version 0.0.2

It However does not have verify in it yet.

Version 0.0.3 Will Save To A Verify Field Like

FileHash=OK TextHash=Changed Verified On: 1/25/2004 1:23:39 PM

FileHash=OK TextHash=OK Verified On: 1/25/2004 1:23:39 PM



Title: Re:Duplicates Finder PlugIn
Post by: KeystoneCop on January 25, 2004, 05:56:36 pm
This has Been GREAT.  I got rid of over 500 duplicates that had different names using the FILEHASH.  I never found a file that was not the same.

Now Playing with the TEXTHASH.  Having cleaned a lot with the FILEHASH this was not as straightforward. I noticed if I had a BLANK artist, I did not get a code.  (no I don't want one when it is missing).  what I think would be nice would be a hash code for name only, or name + Artist.  I am sure you are busy getting back lots of disk space, so just not a major thing.



Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 25, 2004, 06:58:00 pm
Quote
name + Artist
it does this

Name Only Not Sure Thats Wise

===============================

New Option Coming...

Binary Read Hash

A User Can Set The Number Of bytes To Read from The File, Then The Program Will Do A Hash On That Info.

Since File Hash Does It On The Whole File What Happens if The Dup File Is Not The Same Size but Was Cut A Second? Well The Hash Will Change.

What This Will Do Is Show You Possible Dups, And Allow You To Verify, And Select The Longer File Of The Two.

I think I may Need A Few Days with this To Read All My Files Again, and some more testing, but I have a working copy.
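
The binary read idea can be illustrated with a short Python sketch. It is not the plug-in's code: the function names and the default of 16384 bytes are arbitrary choices for the example.

```python
import hashlib

def bin_hash(data: bytes, n_bytes: int = 16384) -> str:
    """MD5 of only the first n_bytes of the data."""
    return hashlib.md5(data[:n_bytes]).hexdigest()

def bin_hash_file(path: str, n_bytes: int = 16384) -> str:
    """Same idea, reading just the start of a file from disk."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(n_bytes)).hexdigest()

# Made-up data: a "song" and a copy that was cut short at the end.
full = bytes(i % 251 for i in range(50000))
cut = full[:40000]

same_start = bin_hash(full) == bin_hash(cut)
same_whole = hashlib.md5(full).hexdigest() == hashlib.md5(cut).hexdigest()
```

The start-of-file hashes agree even though the whole-file hashes do not, so the pair shows up as a possible dup and the longer file can be kept. Note this only helps when the cut is at the end and the first bytes (including any tag at the front) are identical.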
Title: Re:Duplicates Finder PlugIn
Post by: KeystoneCop on January 25, 2004, 10:43:59 pm
Quote
Name Only Not Sure Thats Wise
Not one of my best qualities, but I listen to the music before delete unless I am sure. much like the first dup program..

Anyway.. This Is GREAT.  

Rather than number of bytes, any way to deal with duration  + or - ? (I got this idea from some old KINGSPARTA requests)
Title: Re:Duplicates Finder PlugIn
Post by: c1c9k72 on January 27, 2004, 10:28:09 am
King,

Still loving this plug-in, but I'm having occasional Application errors from Media Jukebox when I run it.  It doesn't always do it, but when it does, it refers to an instruction referencing memory at different locations, then forces the program to shut down.  Is anyone else having this trouble, or could it be just me?
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 27, 2004, 10:38:44 am
by the way i updated the plug-in.

it now will do a binary read of the beginning of the file and create a hash.

so in case one file is a dupe and it has been cut off it may still match where the whole file hash would not.

===============================================

about the crash, I sometimes get that when MC is just sitting there for no reason.

Not sure why, it may or may not have something to do with the plug-in, but i think it doesn't, other than the fact it may be telling MC to save the tags and MC craps out on it. But like i said i have had times when MC starts to save the database and crashes like that when no plug-ins are running.
Title: Re:Duplicates Finder PlugIn
Post by: c1c9k72 on January 27, 2004, 10:56:41 am
King,

Thanks for the update.  I've found that it's certain songs that trigger the error, though I can't figure out what they have in common.  I'm planning on reencoding them and seeing if the new copies cause the same trouble.

A request, if it's not too much trouble: In the MD5Verify attribute, would it be possible to have it print them in such a way as not to have '=.'  I'm trying to create smartlists for altered FileHashes and now BinHashes, and while I'm not completely sure, I don't think smartlists deal well with equals-signs in the equations.

Thanks again.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 27, 2004, 11:12:19 am
Quote
way as not to have '=.'

sure, what would be good?
Title: Re:Duplicates Finder PlugIn
Post by: c1c9k72 on January 27, 2004, 11:23:23 am
Just off the top of my head, maybe shift

BinHash=OK FileHash=Changed TextHash=OK Verified On: 1/27/2004 12:10:44 PM

to

BinHash0 FileHash1 TextHash0 1/27/2004 12:10:44 PM

Oh, and just for your information, after re-encoding those songs at the same bitrate, I haven't had the error again.  So, it's not the plug-in, but some aspect of the song.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 27, 2004, 11:33:18 am
Seems Kind Of Encoded

How about i just change "=" To ":"

about the other problem.

if the plug-in crashed you would see diagonal lines along the program with a message box telling you what the error was.

when Media Center crashes it is a totally different error message
Title: Re:Duplicates Finder PlugIn
Post by: c1c9k72 on January 27, 2004, 11:57:08 am
I remember seeing plug-in errors before, and getting that diagonal effect.  I've also gotten a smartlist to work with the present system, so unless you really like another method, there's no reason to change it.  At least, none I can think of.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 27, 2004, 12:15:27 pm
ok
Title: Re:Duplicates Finder PlugIn
Post by: KeystoneCop on January 27, 2004, 05:46:06 pm
King, each version gets better and better.  The ability to knock out special characters in the text hash really made it find a lot of matches for me.  THANKS

Now, you knew this was coming..  I still think it would be good to have separate hash fields for name and artist.  Then I could find all duplicate artist hashes and clean them up.. and the same for name.

 :D :D :D BUT I STILL THINK THIS IS FANTASTIC :D :D :D



I have not played much with the partial filehash  too slow to do in the daytime, maybe I will try it tonight..  what are you finding to be a good setting ?
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 27, 2004, 06:04:24 pm
Quote
what are you finding to be a good setting ?

> 11,000+

I think that may be min soon.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 29, 2004, 05:30:21 pm
King.  I just finished my manual review of all my files using Text hash with v0.02.  I thought it would be a good idea to tag all the files that your plug in found and then run my macro against everything that was left to see what I managed to find that your plug in didn't.

Your plugin found 1651 possible duplicates ( I don't know how many were actual confirmed duplicates after the human review - probably about 10-15% at a guess which is good).

My macro, which looks at the combination of artist start, artist end,  name start,  name end,  found a further 1324 duplicates which the plug in didn't find.  Of these the human review confirmed that 177 were in fact duplicates.

I'm sorting through the results to see what the main reasons were but an obvious one that cropped up many times was that items in brackets need to be stripped out.  e.g.

Elvis - Teddy Bear
Elvis - (Let me be your) Teddy Bear

I'll give you some more examples when I've been through them.
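
A rough sketch of the bracket-stripping step in Python, assuming the goal is simply to drop anything inside (), [] or {} before titles are compared. The function name is hypothetical, and nested brackets are not handled.

```python
import re

# Matches one bracketed run of any of the three bracket styles.
_BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}")


def strip_bracketed(text):
    """Remove bracketed text and collapse the leftover whitespace."""
    without = _BRACKETED.sub(" ", text)
    return " ".join(without.split())
```

With this, "(Let me be your) Teddy Bear" and "Teddy Bear" reduce to the same string, so they would end up with the same text hash.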

Thanks again for saving me all this time.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 29, 2004, 05:52:36 pm
Quote
brackets need to be stripped out

I might work on this

Quote
files using Text hash with v0.02.

the binary read hash may render more hits in 0.0.4
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 12:48:50 am
OK tried the plugin.

It stopped working when it found a bad file. This was a bit upsetting as i left it running over night and found it only got through 20% of the total.

Could you make the plugin keep track of bad files so they can be reviewed later but STILL continue running with the remaining tracks in playing now ? You can add this in the verify section if you want.
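
Something along these lines, sketched in Python rather than the plug-in's own language. `hash_all` and `hash_one` are hypothetical names, and the exception types caught are an assumption about what a "bad file" would raise.

```python
def hash_all(paths, hash_one):
    """Hash every path; record failures instead of stopping the batch.

    hash_one is any function that takes a path and returns a hash string.
    Returns (hashes, bad_files), where bad_files holds (path, reason) pairs.
    """
    hashes = {}
    bad_files = []
    for path in paths:
        try:
            hashes[path] = hash_one(path)
        except (OSError, ValueError) as err:
            # Keep going; the failure is stored so it can be reviewed later.
            bad_files.append((path, str(err)))
    return hashes, bad_files
```

An overnight run would then finish the whole list and leave `bad_files` to look at in the morning.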

Other thoughts were what are the optimal # of bytes to use for text hash, the default is 30 bytes. if this were to change then i need to re-run against all the files again !!!  Why the additional steps ?

It would be preferable to dump this text hash way of doing things for the above reason.

Ideally it would not need to create hashes for doing a text compare, it would be able to take whatever you threw at it in playing now and do a super dupe check like find duplicates does. Following whatever rules were specified in text replace/text remove.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 30, 2004, 01:15:30 am
Quote
Other thoughts were what are the optimal # of bytes to use for text hash, the default is 30 bytes. if this were to change then i need to re-run against all the files again !!!  Why the additional steps ?
I use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.
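
To show why the short prefix matters, here is a small Python sketch. It assumes the check name is built from the first few letters and digits of the artist followed by the first few of the track name, in the spirit of the FleetYouma example from the first post; how the plug-in really builds its text hash is not stated in the thread, and both function names are made up.

```python
import hashlib


def check_name(artist, name, max_chars=5):
    """Upper-case, keep letters/digits only, take a prefix of each field."""
    def head(text):
        kept = "".join(ch for ch in text.upper() if ch.isalnum())
        return kept[:max_chars]
    return head(artist) + head(name)


def text_hash(artist, name, max_chars=5):
    """MD5 of the check name, so equal check names give equal hashes."""
    key = check_name(artist, name, max_chars)
    return hashlib.md5(key.encode("utf-8")).hexdigest().upper()
```

With `max_chars=5`, "You make lovin' fun" and "You make loving fun (Radio edit)" both reduce to FLEETYOUMA. With `max_chars=30` the two titles differ before the cut-off, so the hashes no longer match.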
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 01:18:21 am
Quote
use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.

Noted.

Another thing King...remembering what rules or settings between invocations would be nice.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 30, 2004, 05:30:33 am
Quote
use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.

Noted.

Another thing King...remembering what rules or settings between invocations would be nice.

Mine does

what ones don't it remember?

When In The Plug-in don't exit MC, exit the plug-in back to MC then exit MC
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 08:22:32 am
Another question King ?

Why are you creating a temp file ?

Can't you just access the mp3 portion from the original file, save that internal to your program and create a hash off it? Memory is faster than I/O.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 30, 2004, 10:14:01 am
Quote
Can't you just access the mp3 portion from the original file

If I was Matt I Could

With the tools I have access to I can't, and I do not want to touch the original file and get blamed for something if something goes wrong.

I am not Matt and now you made me feel really bad about myself.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 12:13:54 pm
Quote
With the tools I have access to I can't, and I do not want to touch the original file and get blamed for something if something goes wrong.

This is a very good reason. I suppose better safe & slow than sorry.

However you would be reading from the original file, not modifying it. I got the idea on watching how fast sfv checkers work, your plugin is doing the same thing.
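
For what it's worth, this is roughly how an SFV-style check can read a file without a temp copy, sketched in Python. The function name is hypothetical, and skipping the tag area to reach only "the mp3 portion" is not attempted here.

```python
import zlib


def crc32_of_file(path, chunk_size=65536):
    """CRC32 of a file, streamed from disk; the file is never written to."""
    crc = 0
    with open(path, "rb") as f:  # "rb" = read-only, binary
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)  # fold each chunk into the running CRC
    return format(crc & 0xFFFFFFFF, "08X")
```

Only one chunk is held in memory at a time, and because the file is opened read-only there is no way for this code to alter it.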

When i created the custom columns ..i unset 'Store in file tags'..only writing to the library.

Quote
I am not Matt and now you made me feel really bad about my self.

...yah king  ;)    ..i think you're tougher than that.

Everything i meant so far to be taken as constructive criticism.

On another note...i can't seem to get MD5BinHash to take, the field is empty for some reason. the spelling is right.

i got dupes with MD5text hash & duration.  I have not turned it loose on my whole library yet.

MD5MP3Hash gave weird results, pairing it up with duration yielded 1800 dupes...i must be doing something wrong..just can't see what..?

Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 30, 2004, 12:41:53 pm
Quote
On another note...i can't seem to get MD5BinHash to take, the field is empty for some reason.

I Broke Something

About the other, i will look at it to see if i broke anything else.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 02:12:42 pm
more observations.

kings_playlist = 1900 files

three smartlists

1. Task -- possible duplicates = playlistid==kings_playlist ~dup=[artist],[name] ~sort=[artist],[name]

gives 157 files ( that's our baseline ie what MC can do)


2. Kings_MDTextHash_#1 =playlistid==kings_playlist  ~dup=MD5TextHash,Duration ~sort=[artist],[name]

gives 20 files for dupes.

3. Kings_MDTextHash_#2=playlistid==kings_playlist  ~dup=MD5TextHash ~sort=[artist],[name]

gives 184 files for dupes


King's plugin found 27 more files :) ( if you leave out Duration for dupe modifier)
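
Why adding Duration shrinks the list so much can be seen with a small Python sketch. It only imitates the idea of a duplicates modifier (group on the chosen fields, keep groups with more than one member); it is not how MC implements ~dup, and the field values are made up.

```python
from collections import defaultdict


def find_dups(tracks, fields):
    """Return every track whose values for `fields` occur more than once."""
    groups = defaultdict(list)
    for track in tracks:
        key = tuple(track[f] for f in fields)
        groups[key].append(track)
    return [t for group in groups.values() if len(group) > 1 for t in group]


tracks = [
    {"Name": "album version", "MD5TextHash": "X", "Duration": 200},
    {"Name": "radio edit", "MD5TextHash": "X", "Duration": 245},
    {"Name": "other song", "MD5TextHash": "Y", "Duration": 200},
]
```

Grouping on `["MD5TextHash"]` alone flags the first two tracks; grouping on `["MD5TextHash", "Duration"]` flags nothing, because every extra field has to match as well.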

Well done !!

Now if it could find those dupes w/o making any text hashes, that would be killer. Hey didn't you say you did not want to touch the files...King ;D


For Text Hash Use 5 Bytes Max was used




 
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 30, 2004, 02:47:17 pm
if anyone has 0.0.4 download 0.0.6

it seems sometimes when reading the mp3 it was not closing the file so the next file was not getting copied. so the file hash was the same.

I will look for more bugs.
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on January 30, 2004, 03:02:42 pm
Tried JLee's macro too on that sample list.

However im not sure how to list the total number of dupes it found.
This was after a review of the whole list. There were a few false positives but not many.


The status bar says 326 records found. If this is true, King's got some catching up to do :)

Wish there was a way to sort on artist,name then i could say for certain.

King is it just a simple uninstall/install to upgrade to a newer version of your plugin ?

Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 30, 2004, 03:45:17 pm
hit_ny.  Just click Show unresolved to show all items that I think are possible dups but that you've not reviewed.  If you want to look at them all click show unresolved then change the filter in the checked status column to All.

When you click reviewed, no dup or skip it just updates the checked status column and moves to the next one that has not been statused.  You can go back and review the ones you've marked by filtering on the status column.

I think King's plug in will catch most if he makes a change to strip out characters in brackets which is what I do.  I also check the ends of the names ignoring the beginning which catches a few more but King's plugin will get 90% of what my macro finds.

To install the latest version of his plug in just do an uninstall before installing the new one.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 30, 2004, 04:27:43 pm
Quote
I think King's plug in will catch most if he makes a change to strip out characters in brackets which is what I do.

I will.
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 31, 2004, 02:54:26 am
e-mail to King from me:
Quote
I've attached the list of duplicates that I found which the plug in didn't so you can get some ideas of what might improve the plug in.  I haven't tried v0.04 yet but these files did not get the same text hash with v0.02 and I don't understand why:

F6FAA6513B202C8F8F62CEA76C7A0518 Faith Hill - Breathe
2C7D76B9F6EF2BF4C96E4497051577E6 Faith Hill - Breathe (Tin Tin Out Radio Mix)

Not sure what happened with these 2 files that should have been given the same text hash but King as ever is working on it.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on January 31, 2004, 05:45:21 am
Quote
Faith Hill - Breathe (Tin Tin Out Radio Mix)

Have Not gotten to this yet, maybe later today
Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on January 31, 2004, 07:27:48 am
No worries king.  All my dupes are tagged now.   :)
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on February 01, 2004, 02:48:42 am
OK here's a review of 0.0.8

This build seems quite stable , i fed it 1908 files and it worked well, no crashes nothing. switching  between plugin view & other view schemes stops the plugin but on returning to the plugin view and clicking on batch start resumes where it left off.

Settings
----------
replace Text (Tab) ( find & and replace with And)
Remove text (artist field & song field ,remove text between)
used "(" and ")" in the two red boxes

These settings are important and are recommended.

Results with 1908 files
----------------------------
MC's finds 157 duplicates on its own.

using just MD5TextHash in Duplicates found 294 files.

JLee's macro found 303.

So Kings plugin is pretty close. Some tweaking in the replace tabs might fix this.

(Not surprisingly. just tried this out to see what came up)
- using MD5MP3Hash in duplicates found 4 files.
- using MD5BinHash in duplicates found 8 files.

The above 2 hashes are useful when tagging is inaccurate or non-existent, BinHash being more useful as it only samples a few kb from the beginning rather than the whole file.

Otherwise texthash does the trick pretty well, for extra speed, the other hashes could be unchecked in the plugin setup.

Suggestions
----------------
- i found the MD5TextHash to be the most useful, it found nearly twice as many dupes as MC on its own, but the downside is the waiting for the text hashes to be computed. Maybe i am impatient because i don't know the ideal settings to use for the replace text, etc. tabs and am unwilling to wait to see the results of any experiments. It took 4 hrs to do all the hashes for 1908 files on a 700 MHz P3.

I suppose once the ideal settings are known, the text hash computation is a one time operation.

It would be ideal if it did not require this extra step and performed the dupe checking just given a playlist.

But i'm not sure how to display the results then in the familiar fashion using the Duplicates modifier. Something needs to be written to a tag to be able to use this modifier.

I'm not aware of how much the SDK exposes ( and whether its feasible) but a possible solution might be to do the text compares internal to the plugin, store the results and then pass them to a pre-defined smartlist.


- Suggestion for a new feature, given a track list for an album, find all existing files. This could be helpful when looking at new albums on the web to see how many of the album tracks were already in the library. User would just copy+paste a track list of the web into a tabbed window and the plugin would display any matches.


Well done King and thanks for the effort !


Title: Re:Duplicates Finder PlugIn
Post by: jleerigby on February 01, 2004, 06:10:28 am
Just one more suggestion King:

Change the name to something more meaningful.  I don't think many MC newbies will know what MD5 or text hash or file hash means.  I suggest something like Advanced Duplicates Finder.

If you agree I'd change the subject heading too on the link in the main MC9 forum.

This is such a useful addition to MC that I hope lots of others will use it too.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on February 01, 2004, 06:37:10 am
Quote
Change the name to something more meaningful.  I don't think many MC newbies will know what MD5 or text hash or file hash means.

It means something to me

Quote
I suggest something like Advanced Duplicates Finder.

it's not a duplicate finder, it makes hashes

MC does the dup finder and better than a VB program ever will, VB is just too slow for this.
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on February 01, 2004, 07:35:46 am
this is a list i use with some of my programs to match better

A List Of Replacements To Think About for the Artist name (most include spaces before and after).

    " / " = " AND "
    " \ " = " AND "
    " - " = " AND "
    "-N-" = " AND "
    "(AND " = " AND "
    "[AND " = " AND "
    "{AND " = " AND "
    "(DUET WITH" = " AND "
    "DUET WITH" = " AND "
    "(WITH " = " AND "
    "(WIT " = " AND "
    "[WITH " = " AND "
    "[WIT " = " AND "
    "{WITH " = " AND "
    "{WIT " = " AND "
    " WITH " = " AND "
    " WIT " = " AND "
    "(WTH " = " AND "
    "(&" = " AND "
    "&" = " AND "
    "(F/" = " AND "
    "F/" = " AND "
    "(W/" = " AND "
    " W/" = " AND "
    "(VS. " = " AND "
    "(VS " = " AND "
    "VS." = " AND "
    " VS " = " AND "
    "*VS*" = " AND "
    "INTRODUCING" = " AND "
    "(FEATURING" = " AND "
    "FEATURING" = " AND "
    "(FEAT." = " AND "
    "FEAT." = " AND "
    "(FEAT" = " AND "
    "FEAT" = " AND "
    "(FT." = " AND "
    "FT." = " AND "
    "(FT " = " AND "
    " FT " = " AND "
    " F." = " AND "
    "(F." = " AND "
    " F " = " AND "
    "MEETS" = " AND "
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on February 01, 2004, 07:52:05 am
replacements for the title

"WANT A" = "WANNA"
"WANNA" = "WANT TO"
"I WILL" = "I'LL"
" DA " = " THE "
"NOIZE" = "NOISE"
"BOYZ" = "BOYS"
"DAYZ" = "DAYS"
"COLOUR" = "COLOR"
"ING" = "IN'"
" CUM " = " COME "
" MAMA " = " MAMMA "
" MOMMA " = " MAMMA "
"WOMEN" = "WOMAN"
"WOMAN" = "WOMEN"
" YEA " = " YEAH "
"LITES" = "LIGHTS"
" LUV " = " LOVE "
"THANG" = "THING"
" YA " = " YOU "
"NITE" = "NIGHT"
"GOTTA" = "GOT TO"
"I AM" = "I'M"
"GONNA" = "GOING TO"
"GIMME" = "GIVE ME"
" DAT " = " THAT "
"'N'" = "AND"
"-N-" = "AND"
"WHATTA" = "WHAT A"
"WANNABE" = "WANT TO BE"
" & " = "AND"
" TILL " = " TIL "
" LIL " = " LITTLE "
" THA " = " THE "
"HIPPY" = "HIPPIE"
"THERE WILL" = "THERE'LL"
" DAMN " = " DAM "
"SHE IS" = "SHE'S"
"BREAK" = "BRAKE"
"SHOULD HAVE" = "SHOULD'VE"
"SHOULDA" = "SHOULD'VE"
"COULD HAVE" = "COULD'VE"
"WE ARE" = "WE'RE"
"YOUR" = "YOU'RE"
"YOU ARE" = "YOU'RE"
"UNTILL" = "UNTIL"
" U " = " YOU "
" N' " = " AND "
" N " = " AND "
" R " = " ARE "
Title: Re:Duplicates Finder PlugIn
Post by: hit_ny on February 01, 2004, 12:17:21 pm
Thnks for the tips King.

Do they have to be ALL CAPS ? Cos i used mixed case.

btw...i redid the texthashes only with the new settings for 1908 files, i was amazed to see them all regenerated in under a min !!!!
Title: Re:Duplicates Finder PlugIn
Post by: KingSparta on February 01, 2004, 12:24:22 pm
Quote
btw...i redid the texthashes only with the new settings for 1908 files, i was amazed to see them all regenerated in under a min !!!!

if you don't have the binary and by file on it should not take too long.

you also need to watch, the option on the setup page if it already has a hash a new one will not be generated and it will be bypassed, you should turn that off when trying to regenerate the hash.

Quote
Do they have to be ALL CAPS ?

no, that's how i had it already typed. the program converts both the "artist name" and "Name" temp strings to Ucase along with the search string the user types in, this way it will match no matter what case the user types.
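
The upper-casing plus find/replace step can be sketched like this in Python. This is an illustration only, not the plug-in's VB code: `normalize` is a made-up name, only four rules from the artist list are shown, and collapsing the doubled spaces afterwards is an extra assumption.

```python
# A few of the artist rules from the list above, as (find, replace) pairs.
ARTIST_RULES = [
    ("FEATURING", " AND "),
    ("FEAT.", " AND "),
    (" WITH ", " AND "),
    ("&", " AND "),
]


def normalize(text, rules):
    """Upper-case the text and the rules, apply each rule in order,
    then collapse runs of whitespace left behind by the replacements."""
    result = text.upper()
    for find, replace in rules:
        result = result.replace(find.upper(), replace.upper())
    return " ".join(result.split())
```

Because both sides are upper-cased first, "feat.", "Feat." and "FEAT." are all caught by one rule. Order matters: longer patterns such as "FEATURING" should come before shorter ones such as "FEAT", or the short rule will eat part of the long word.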
Title: Re:Duplicates Finder PlugIn
Post by: bspachman on February 04, 2004, 10:42:10 am
I've been following this thread for while with some interest. I stumbled on some interesting information that may or may not be useful:

From http://www.macintouch.com

Quote
[Rick Lesniak] A friend of a friend tracked this down: CDDB uses a "waveform recognition" algorithm: Gracenote MusicID

and

Quote
[Mark Rogstad] I learned this only a few months ago, but CDDB and others use a technology called "audio fingerprinting" to identify songs, not any digital bits.

        * Audio Fingerprinting Technology (http://www.icasit.org/ecommerce/audio_fingerprint.htm)
        * Just Hum a Few Bars (http://mixonline.com/ar/audio_hum_few_bars/) [Mix]

and

Quote
[Kees Huyser] Gracenote has a patent  (http://www.gracenote.com/corporate/patents.html)[that] describes

        "a fuzzy comparison algorithm suitable for determining whether two audio CDs are exactly or approximately the same. The fuzzy comparison algorithm proceeds as follows. For each of the two audio CDs to be compared, one determines the lengths of all the tracks in the recordings in milliseconds. One then shifts all track lengths to the right by eight bits, in effect performing a truncating division by 2^8 =256. One then goes through both of the recordings track by track, accumulating as one proceeds two numbers, the match total and the match error. These numbers are both initialized to zero at the start of the comparison. For each of the tracks, one increments the match total by the shifted length of that track in the first CD to be compared, and one increments the match error by the absolute value of the difference between the shifted lengths of the track in the two CDs. When one gets to the last track in the CD with the fewer number of tracks, one continues with the tracks in the other CD, incrementing both the match total and the match error by the shifted lengths of those tracks. Following these steps of going through the tracks, the algorithm then divides the match error by the match number, subtracts the resulting quotient from 1, and converts the difference to a percentage which is indicative of how well the two CDs match."

Interesting stuff.... (Not least because CDDB gets to come up again around here! :) )

Brad
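
The fuzzy comparison described in the patent excerpt above can be sketched in Python as follows. This follows the quoted wording step by step; the function name is made up, "match number" in the final step is read as the match total, and returning 0.0 for two empty discs is an arbitrary choice since the excerpt does not cover that case.

```python
def cd_match_percent(lengths_a_ms, lengths_b_ms):
    """Percentage match between two CDs given track lengths in milliseconds."""
    # Shift right by eight bits: a truncating division by 256.
    a = [ms >> 8 for ms in lengths_a_ms]
    b = [ms >> 8 for ms in lengths_b_ms]

    match_total = 0
    match_error = 0
    shared = min(len(a), len(b))

    # Tracks present on both CDs.
    for i in range(shared):
        match_total += a[i]
        match_error += abs(a[i] - b[i])

    # Leftover tracks on the longer CD count fully toward total and error.
    longer = a if len(a) > len(b) else b
    for extra in longer[shared:]:
        match_total += extra
        match_error += extra

    if match_total == 0:
        return 0.0  # assumption: nothing to compare
    return (1 - match_error / match_total) * 100
```

Note that this compares whole discs by track length only; it says nothing about whether two individual files with different tags are the same recording.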
Title: Re:Duplicates Finder PlugIn
Post by: zak326 on April 09, 2004, 10:05:12 am
I have been using a program by the name of Dpeg, which I bought a couple of years ago and never used, then found again in a Google search. It has all kinds of options for finding and deleting dupes.

this is the link.  http://www.somewareonthe.net. I usually don't recommend things like this, but this seems to be flexible enough to work in most cases. they have both a freeware and paid version. give it a shot.

btw I have no interest in this company, etc. Just trying to be helpful. ::) ;D