INTERACT FORUM

Author Topic: Duplicates Finder PlugIn  (Read 22858 times)

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #100 on: January 29, 2004, 05:30:21 pm »

King.  I just finished my manual review of all my files using Text hash with v0.02.  I thought it would be a good idea to tag all the files that your plug in found and then run my macro against everything that was left to see what I managed to find that your plug in didn't.

Your plugin found 1651 possible duplicates ( I don't know how many were actual confirmed duplicates after the human review - probably about 10-15% at a guess which is good).

My macro, which looks at the combination of artist start, artist end,  name start,  name end,  found a further 1324 duplicates which the plug in didn't find.  Of these the human review confirmed that 177 were in fact duplicates.

I'm sorting through the results to see what the main reasons were but an obvious one that cropped up many times was that items in brackets need to be stripped out.  e.g.

Elvis - Teddy Bear
Elvis - (Let me be your) Teddy Bear

I'll give you some more examples when I've been through them.

Thanks again for saving me all this time.
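
For illustration, stripping bracketed text before comparing titles could look something like the Python sketch below. This is not the plug-in's or the macro's actual code; the function name strip_bracketed and the decision to treat (), [] and {} alike are assumptions made for the example. Nested brackets are not handled.

```python
import re

# Matches one bracketed run: (...), [...] or {...}. Nested brackets are not handled.
_BRACKETED = re.compile(r"\([^)]*\)|\[[^\]]*\]|\{[^}]*\}")

def strip_bracketed(text):
    """Remove bracketed runs, then collapse the leftover whitespace."""
    without = _BRACKETED.sub(" ", text)
    return " ".join(without.split())
```

With the Elvis example above, both "Teddy Bear" and "(Let me be your) Teddy Bear" reduce to "Teddy Bear", so they would compare as equal.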
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #101 on: January 29, 2004, 05:52:36 pm »

Quote
brackets need to be stripped out

I might work on this

Quote
files using Text hash with v0.02.

the binary read hash may render more hits in 0.0.4
Logged
Retired Military, Airborne, Air Assault, And Flight Wings.
Model Trains, Internet, Ham Radio
https://MyAAGrapevines.com
Fayetteville, NC, USA

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #102 on: January 30, 2004, 12:48:50 am »

OK tried the plugin.

It stopped working when it found a bad file. This was a bit upsetting, as I left it running overnight and found it only got through 20% of the total.

Could you make the plugin keep track of bad files so they can be reviewed later, but STILL continue running with the remaining tracks in Playing Now? You can add this in the verify section if you want.
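
A rough Python sketch of the "log it and keep going" behaviour being asked for. The plug-in itself is written in VB, so this only shows the pattern; run_batch and process_one are hypothetical names, not anything from the plug-in.

```python
def run_batch(paths, process_one):
    """Process every path; collect failures instead of stopping at the first one."""
    done = []
    bad = []  # (path, error message) pairs to review later
    for path in paths:
        try:
            process_one(path)
        except Exception as exc:  # broad on purpose: one bad file must not end the run
            bad.append((path, str(exc)))
        else:
            done.append(path)
    return done, bad
```

The caller gets back both lists, so the bad files can be shown for review after the whole batch has finished.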

Another thought: what is the optimal number of bytes to use for the text hash? The default is 30 bytes. If this were to change, I would need to re-run against all the files again! Why the additional steps?

It would be preferable to dump this text hash way of doing things for the above reason.

Ideally it would not need to create hashes to do a text compare; it would take whatever you threw at it in Playing Now and do a super dupe check like find duplicates does, following whatever rules were specified in text replace/text remove.
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #103 on: January 30, 2004, 01:15:30 am »

Quote
Another thought: what is the optimal number of bytes to use for the text hash? The default is 30 bytes. If this were to change, I would need to re-run against all the files again! Why the additional steps?
I use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.
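
To show why a short prefix catches more than a long one, here is a small Python sketch. How the plug-in actually builds its text hash is not documented in this thread, so the key layout (upper-cased artist prefix, a separator, upper-cased name prefix) and the name text_hash are assumptions.

```python
import hashlib

def text_hash(artist, name, max_chars=5):
    """MD5 of the first max_chars characters of artist and name (upper-cased)."""
    key = artist.strip().upper()[:max_chars] + "|" + name.strip().upper()[:max_chars]
    return hashlib.md5(key.encode("utf-8")).hexdigest().upper()
```

With max_chars=5, "Some Song" and "Some Song (Live)" by the same artist get the same hash, because only "SOME " is looked at. With max_chars=30 the whole titles are hashed and the two no longer match.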
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #104 on: January 30, 2004, 01:18:21 am »

Quote
use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.

Noted.

Another thing King... remembering the rules or settings between invocations would be nice.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #105 on: January 30, 2004, 05:30:33 am »

Quote
use just 5 characters for the compare, i.e. only the first 5 characters are used to determine whether or not it's a duplicate.  This works pretty well.  Using 30 characters kind of defeats the object.

Noted.

Another thing King... remembering the rules or settings between invocations would be nice.

Mine does

Which ones doesn't it remember?

When in the plug-in, don't exit MC directly; exit the plug-in back to MC, then exit MC.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #106 on: January 30, 2004, 08:22:32 am »

Another question King ?

Why are you creating a temp file ?

Can't you just access the mp3 portion from the original file, keep that internal to your program, and create a hash off it? Memory is faster than I/O.
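
As a sketch of what hashing without a temp copy looks like: the file is opened read-only and fed to MD5 in chunks, so nothing is written anywhere. This is Python, not the VB tools King is working with, and it hashes the whole file; picking out just the mp3 audio portion (skipping tags) is not shown. The name md5_of_file and the chunk size are assumptions.

```python
import hashlib

def md5_of_file(path, chunk_size=65536):
    """Hash a file in place: opened read-only, read in chunks, never modified."""
    digest = hashlib.md5()
    with open(path, "rb") as f:  # "rb" = read-only binary; closed when the block ends
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```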
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #107 on: January 30, 2004, 10:14:01 am »

Quote
Can't you just access the mp3 portion from the original file

If I was Matt I Could

With the tools I have access to, I can't, and I do not want to touch the original file and get blamed if something goes wrong.

I am not Matt, and now you made me feel really bad about myself.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #108 on: January 30, 2004, 12:13:54 pm »

Quote
With the tools I have access to, I can't, and I do not want to touch the original file and get blamed if something goes wrong.

This is a very good reason. I suppose better safe & slow than sorry.

However, you would be reading from the original file, not modifying it. I got the idea from watching how fast SFV checkers work; your plugin is doing the same thing.

When I created the custom columns, I unset 'Store in file tags', so the hashes are only written to the library.

Quote
I am not Matt, and now you made me feel really bad about myself.

...yah king  ;)    ..i think you're tougher than that.

Everything I've said so far is meant to be taken as constructive criticism.

On another note... I can't seem to get MD5BinHash to take; the field is empty for some reason. The spelling is right.

I got dupes with MD5TextHash & Duration.  I have not turned it loose on my whole library yet.

MD5MP3Hash gave weird results; pairing it up with Duration yielded 1800 dupes... I must be doing something wrong, I just can't see what.

Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #109 on: January 30, 2004, 12:41:53 pm »

Quote
On another note...i can't seem to get MD5BinHash to take, the field is empty for some reason.

I Broke Something

About the other, i will look at it to see if i broke anything else.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #110 on: January 30, 2004, 02:12:42 pm »

more observations.

kings_playlist = 1900 files

three smartlists

1. Task -- possible duplicates = playlistid==kings_playlist ~dup=[artist],[name] ~sort=[artist],[name]

gives 157 files ( that's our baseline ie what MC can do)


2. Kings_MDTextHash_#1 =playlistid==kings_playlist  ~dup=MD5TextHash,Duration ~sort=[artist],[name]

gives 20 files for dupes.

3. Kings_MDTextHash_#2=playlistid==kings_playlist  ~dup=MD5TextHash ~sort=[artist],[name]

gives 184 files for dupes


King's plugin found 27 more files :) ( if you leave out Duration for dupe modifier)

Well done !!

Now if it could find those dupes w/o making any text hashes, that would be killer. Hey didn't you say you did not want to touch the files...King ;D


For Text Hash Use 5 Bytes Max was used
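
The difference between smartlists #2 and #3 above comes down to which fields form the grouping key. A Python sketch of that idea follows; it only approximates what the ~dup= modifier does (MC's internals are not described in this thread), and duplicate_files plus the dictionary layout are assumptions.

```python
from collections import defaultdict

def duplicate_files(tracks, fields):
    """Return every track whose values for `fields` are shared with another track."""
    groups = defaultdict(list)
    for track in tracks:
        key = tuple(track[f] for f in fields)
        groups[key].append(track)
    # Keep only groups with more than one member, flattened into a single list.
    return [t for group in groups.values() if len(group) > 1 for t in group]
```

Adding Duration to the key makes the match stricter: two files with the same text hash but different lengths stop counting as dupes, which would explain why the list with Duration is so much shorter than the one without.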




 
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #111 on: January 30, 2004, 02:47:17 pm »

if anyone has 0.0.4 download 0.0.6

It seems that sometimes, when reading the mp3, it was not closing the file, so the next file was not getting copied and the file hash came out the same.

I will look for more bugs.
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #112 on: January 30, 2004, 03:02:42 pm »

Tried JLee's macro too on that sample list.

However, I'm not sure how to list the total number of dupes it found.
This was after a review of the whole list. There were a few false positives, but not many.


The status bar says 326 records found. If this is true, King's got some catching up to do :)

Wish there was a way to sort on artist,name then i could say for certain.

King is it just a simple uninstall/install to upgrade to a newer version of your plugin ?

Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #113 on: January 30, 2004, 03:45:17 pm »

hit_ny.  Just click Show unresolved to show all items that I think are possible dups but that you've not reviewed.  If you want to look at them all click show unresolved then change the filter in the checked status column to All.

When you click reviewed, no dup or skip it just updates the checked status column and moves to the next one that has not been statused.  You can go back and review the ones you've marked by filtering on the status column.

I think King's plug in will catch most if he makes a change to strip out characters in brackets which is what I do.  I also check the ends of the names ignoring the beginning which catches a few more but King's plugin will get 90% of what my macro finds.

To install the latest version of his plug in just do an uninstall before installing the new one.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #114 on: January 30, 2004, 04:27:43 pm »

Quote
I think King's plug in will catch most if he makes a change to strip out characters in brackets which is what I do.

I will.
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #115 on: January 31, 2004, 02:54:26 am »

e-mail to King from me:
Quote
I've attached the list of duplicates that I found which the plug in didn't so you can get some ideas of what might improve the plug in.  I haven't tried v0.04 yet but these files did not get the same text hash with v0.02 and I don't understand why:

F6FAA6513B202C8F8F62CEA76C7A0518 Faith Hill - Breathe
2C7D76B9F6EF2BF4C96E4497051577E6 Faith Hill - Breathe (Tin Tin Out Radio Mix)

Not sure what happened with these 2 files that should have been given the same text hash but King as ever is working on it.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #116 on: January 31, 2004, 05:45:21 am »

Quote
Faith Hill - Breathe (Tin Tin Out Radio Mix)

Have Not gotten to this yet, maybe later today
Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #117 on: January 31, 2004, 07:27:48 am »

No worries king.  All my dupes are tagged now.   :)
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #118 on: February 01, 2004, 02:48:42 am »

OK here's a review of 0.0.8

This build seems quite stable; I fed it 1908 files and it worked well, with no crashes. Switching between the plugin view and other view schemes stops the plugin, but returning to the plugin view and clicking on batch start resumes where it left off.

Settings
----------
Replace Text (tab): find & and replace with And
Remove Text (tab): artist field & song field, remove text between
( and ) entered in the two red boxes

These settings are important and are recommended.

Results with 1908 files
----------------------------
MC finds 157 duplicates on its own.

Using just MD5TextHash in Duplicates found 294 files.

JLee's macro found 303.

So King's plugin is pretty close. Some tweaking in the replace tabs might fix this.

(Not surprisingly. just tried this out to see what came up)
- using MD5MP3Hash in duplicates found 4 files.
- using MD5BinHash in duplicates found 8 files.

The above 2 hashes are useful when tagging is inaccurate or non-existent, BinHash being more useful as it only samples a few kb from the beginning rather than the whole file.
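
Sampling only the start of a file, as described for BinHash, might look like the Python sketch below. The 8 KB sample size and the name md5_of_head are assumptions; the thread only says "a few kb".

```python
import hashlib

def md5_of_head(path, sample_bytes=8 * 1024):
    """MD5 of only the first sample_bytes of a file (less if the file is shorter)."""
    with open(path, "rb") as f:
        head = f.read(sample_bytes)
    return hashlib.md5(head).hexdigest()
```

The trade-off is speed against certainty: two different files that happen to share the same leading bytes get the same hash.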

Otherwise texthash does the trick pretty well, for extra speed, the other hashes could be unchecked in the plugin setup.

Suggestions
----------------
- I found the MD5TextHash to be the most useful; it found nearly twice as many dupes as MC on its own, but the downside is waiting for the text hashes to be computed. Maybe I am impatient because I don't know the ideal settings to use for the replace text, etc. tabs and am unwilling to wait to see the results of any experiments. It took 4 hrs to do all the hashes for 1908 files on a 700 MHz P3.

I suppose once the ideal settings are known, the text hash computation is a one time operation.

It would be ideal if it did not require this extra step and performed the dupe checking just given a playlist.

But i'm not sure how to display the results then in the familiar fashion using the Duplicates modifier. Something needs to be written to a tag to be able to use this modifier.

I'm not aware of how much the SDK exposes (and whether it's feasible), but a possible solution might be to do the text compares internal to the plugin, store the results and then pass them to a pre-defined smartlist.


- Suggestion for a new feature: given a track list for an album, find all existing files. This could be helpful when looking at new albums on the web to see how many of the album tracks are already in the library. The user would just copy+paste a track list off the web into a tabbed window and the plugin would display any matches.
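
A minimal Python sketch of the track-list idea, assuming the pasted text has one title per line and that matching is done on upper-cased, whitespace-collapsed titles. The function names are hypothetical, and a real version would presumably reuse the replace/remove rules as well.

```python
def normalize_title(title):
    """Upper-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(title.upper().split())

def match_track_list(pasted_text, library_titles):
    """Return (pasted title, found in library?) for each non-empty pasted line."""
    library = {normalize_title(t) for t in library_titles}
    results = []
    for line in pasted_text.splitlines():
        line = line.strip()
        if line:
            results.append((line, normalize_title(line) in library))
    return results
```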


Well done King and thanks for the effort !


Logged

jleerigby

  • Guest
Re:Duplicates Finder PlugIn
« Reply #119 on: February 01, 2004, 06:10:28 am »

Just one more suggestion King:

Change the name to something more meaningful.  I don't think many MC newbies will know what MD5 or text hash or file hash means.  I suggest something like Advanced Duplicates Finder.

If you agree, I'd change the subject heading too on the link in the main MC9 forum.

This is such a useful addition to MC that I hope lots of others will use it too.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #120 on: February 01, 2004, 06:37:10 am »

Quote
Change the name to something more meaningful.  I don't think many MC newbies will know what MD5 or text hash or file hash means.

It means something to me

Quote
I suggest something like Advanced Duplicates Finder.

it's not a duplicate finder, it makes hashes

MC does the duplicate finding, and does it better than a VB program ever will; VB is just too slow for this.
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #121 on: February 01, 2004, 07:35:46 am »

this is a list i use with some of my programs to match better

A List Of Replacements To Think About for the Artist name (most include spaces before and after).

    " / " = " AND "
    " \ " = " AND "
    " - " = " AND "
    "-N-" = " AND "
    "(AND " = " AND "
    "[AND " = " AND "
    "{AND " = " AND "
    "(DUET WITH" = " AND "
    "DUET WITH" = " AND "
    "(WITH " = " AND "
    "(WIT " = " AND "
    "[WITH " = " AND "
    "[WIT " = " AND "
    "{WITH " = " AND "
    "{WIT " = " AND "
    " WITH " = " AND "
    " WIT " = " AND "
    "(WTH " = " AND "
    "(&" = " AND "
    "&" = " AND "
    "(F/" = " AND "
    "F/" = " AND "
    "(W/" = " AND "
    " W/" = " AND "
    "(VS. " = " AND "
    "(VS " = " AND "
    "VS." = " AND "
    " VS " = " AND "
    "*VS*" = " AND "
    "INTRODUCING" = " AND "
    "(FEATURING" = " AND "
    "FEATURING" = " AND "
    "(FEAT." = " AND "
    "FEAT." = " AND "
    "(FEAT" = " AND "
    "FEAT" = " AND "
    "(FT." = " AND "
    "FT." = " AND "
    "(FT " = " AND "
    " FT " = " AND "
    " F." = " AND "
    "(F." = " AND "
    " F " = " AND "
    "MEETS" = " AND "
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #122 on: February 01, 2004, 07:52:05 am »

Replacements for the title:

"WANT A" = "WANNA"
"WANNA" = "WANT TO"
"I WILL" = "I'LL"
" DA " = " THE "
"NOIZE" = "NOISE"
"BOYZ" = "BOYS"
"DAYZ" = "DAYS"
"COLOUR" = "COLOR"
"ING" = "IN'"
" CUM " = " COME "
" MAMA " = " MAMMA "
" MOMMA " = " MAMMA "
"WOMEN" = "WOMAN"
"WOMAN" = "WOMEN"
" YEA " = " YEAH "
"LITES" = "LIGHTS"
" LUV " = " LOVE "
"THANG" = "THING"
" YA " = " YOU "
"NITE" = "NIGHT"
"GOTTA" = "GOT TO"
"I AM" = "I'M"
"GONNA" = "GOING TO"
"GIMME" = "GIVE ME"
" DAT " = " THAT "
"'N'" = "AND"
"-N-" = "AND"
"WHATTA" = "WHAT A"
"WANNABE" = "WANT TO BE"
" & " = "AND"
" TILL " = " TIL "
" LIL " = " LITTLE "
" THA " = " THE "
"HIPPY" = "HIPPIE"
"THERE WILL" = "THERE'LL"
" DAMN " = " DAM "
"SHE IS" = "SHE'S"
"BREAK" = "BRAKE"
"SHOULD HAVE" = "SHOULD'VE"
"SHOULDA" = "SHOULD'VE"
"COULD HAVE" = "COULD'VE"
"WE ARE" = "WE'RE"
"YOUR" = "YOU'RE"
"YOU ARE" = "YOU'RE"
"UNTILL" = "UNTIL"
" U " = " YOU "
" N' " = " AND "
" N " = " AND "
" R " = " ARE "
Logged

hit_ny

  • Citizen of the Universe
  • *****
  • Posts: 3310
  • nothing more to say...
Re:Duplicates Finder PlugIn
« Reply #123 on: February 01, 2004, 12:17:21 pm »

Thanks for the tips King.

Do they have to be ALL CAPS ? Cos I used mixed case.

btw... I redid the texthashes only with the new settings for 1908 files, and I was amazed to see them all regenerated in under a min !!!!
Logged

KingSparta

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 20048
Re:Duplicates Finder PlugIn
« Reply #124 on: February 01, 2004, 12:24:22 pm »

Quote
btw... I redid the texthashes only with the new settings for 1908 files, and I was amazed to see them all regenerated in under a min !!!!

If you don't have the binary and by-file hashes turned on, it should not take too long.

You also need to watch the option on the setup page: if a file already has a hash, a new one will not be generated and the file will be bypassed. You should turn that off when trying to regenerate the hash.

Quote
Do they have to be ALL CAPS ?

No, that's how I had it already typed. The program converts both the "artist name" and "Name" temp strings to UCase, along with the search string the user types in; this way it will match no matter what case the user types.
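
A Python sketch of the upper-case-then-replace approach just described. The plug-in is VB, so this only mirrors the idea; the function name is hypothetical and the table is a short sample taken from the artist replacement list posted earlier in this thread. Longer patterns are listed before their shorter forms so that "(FEATURING" is tried before "FEATURING".

```python
# A few (find, replace) pairs from the artist list; longer patterns come first.
ARTIST_REPLACEMENTS = [
    ("(FEATURING", " AND "),
    ("FEATURING", " AND "),
    ("(FEAT.", " AND "),
    ("FEAT.", " AND "),
    ("&", " AND "),
]

def normalize_artist(artist, replacements=ARTIST_REPLACEMENTS):
    """Upper-case the artist, apply each replacement in order, collapse spaces."""
    text = artist.upper()
    for find, replace_with in replacements:
        # The search string is upper-cased too, so the case the user typed does not matter.
        text = text.replace(find.upper(), replace_with)
    return " ".join(text.split())
```

With this table, "Artist A feat. Artist B", "Artist A Featuring Artist B" and "artist a & artist b" all come out as "ARTIST A AND ARTIST B".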
Logged

bspachman

  • MC Beta Team
  • Citizen of the Universe
  • *****
  • Posts: 888
Re:Duplicates Finder PlugIn
« Reply #125 on: February 04, 2004, 10:42:10 am »

I've been following this thread for a while with some interest. I stumbled on some interesting information that may or may not be useful:

From http://www.macintouch.com

Quote
[Rick Lesniak] A friend of a friend tracked this down: CDDB uses a "waveform recognition" algorithm: Gracenote MusicID

and

Quote
[Mark Rogstad] I learned this only a few months ago, but CDDB and others use a technology called "audio fingerprinting" to identify songs, not any digital bits.

        * Audio Fingerprinting Technology
        * Just Hum a Few Bars [Mix]

and

Quote
[Kees Huyser] Gracenote has a patent [that] describes

        "a fuzzy comparison algorithm suitable for determining whether two audio CDs are exactly or approximately the same. The fuzzy comparison algorithm proceeds as follows. For each of the two audio CDs to be compared, one determines the lengths of all the tracks in the recordings in milliseconds. One then shifts all track lengths to the right by eight bits, in effect performing a truncating division by 2^8 =256. One then goes through both of the recordings track by track, accumulating as one proceeds two numbers, the match total and the match error. These numbers are both initialized to zero at the start of the comparison. For each of the tracks, one increments the match total by the shifted length of that track in the first CD to be compared, and one increments the match error by the absolute value of the difference between the shifted lengths of the track in the two CDs. When one gets to the last track in the CD with the fewer number of tracks, one continues with the tracks in the other CD, incrementing both the match total and the match error by the shifted lengths of those tracks. Following these steps of going through the tracks, the algorithm then divides the match error by the match number, subtracts the resulting quotient from 1, and converts the difference to a percentage which is indicative of how well the two CDs match."
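
The steps in that quote can be written out as a short Python sketch. It follows the quoted wording only, not any actual Gracenote code; the function name is made up, and "match number" in the final step is read here as the match total, which is an assumption.

```python
def cd_match_percent(tracks_a_ms, tracks_b_ms):
    """Fuzzy match between two CDs, given their track lengths in milliseconds."""
    # Shift right by eight bits: a truncating division by 2**8 = 256.
    a = [t >> 8 for t in tracks_a_ms]
    b = [t >> 8 for t in tracks_b_ms]

    match_total = 0
    match_error = 0

    # Tracks present on both CDs: total grows by the first CD's shifted length,
    # error grows by the absolute difference between the two shifted lengths.
    shared = min(len(a), len(b))
    for i in range(shared):
        match_total += a[i]
        match_error += abs(a[i] - b[i])

    # Leftover tracks on the longer CD count fully toward both total and error.
    longer = a if len(a) > len(b) else b
    for extra in longer[shared:]:
        match_total += extra
        match_error += extra

    if match_total == 0:
        return 0.0  # assumption: with nothing to compare, report no match

    # Can go below zero when the track lengths differ greatly; the quote does not clamp.
    return (1 - match_error / match_total) * 100
```

Two CDs with identical track lengths score 100. If the second CD is missing a track, that track's whole shifted length is added to the error, so the score drops in proportion to how long the missing track is.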

Interesting stuff.... (Not least because CDDB gets to come up again around here! :) )

Brad
Logged

zak326

  • Regular Member
  • Recent member
  • *
  • Posts: 49
  • consultant
Re:Duplicates Finder PlugIn
« Reply #126 on: April 09, 2004, 10:05:12 am »

I have been using a program by the name of Dpeg, which I bought a couple of years ago and never used, then found again in a Google search. It has all kinds of options for finding and deleting dupes.

This is the link: http://www.somewareonthe.net. I usually don't recommend things like this, but this seems to be flexible enough to work in most cases. They have both a freeware and a paid version. Give it a shot.

BTW, I have no interest in this company, etc. Just trying to be helpful. ::) ;D
Logged