Interesting point about the embedded thumbnails--especially the technique of adding them back...
In my quest to filter out dupes, I have attempted to strip out non-essential tags, including all images and the old v1 tags at the end of the files. With MC12, my ripped library grew to about 300,000 tunes, and I can manipulate files by the thousands, retagging and filtering with DoubleKiller, which, with its numerous options, has far greater capability than any other de-duping app. By tweaking those options over the last 2 years, with MC12 and my old dual-core CPUs, I have reduced my library to about 250,000. By my guess, that still leaves about 30,000 sound-alike dupes which I don't want and which no known app can filter out.
An app from an apparently defunct startup, "Sloud Music Content Inspector", held the greatest promise for me, but the beta crashes both of my XP desktops on test runs. Searching on "sloud" will turn up their website and their prospectus, which details their proposed technical approach to de-duping:
"...the core system is written in portable C++ programming language and does not depend on any particular operating system. The client interface is written in C++ for Microsoft(R) Windows(TM) family of operating systems. Real-time recognition or voice pitch, based on algorithms without use of Fast Fourier Transform (FFT). The technology can work on systems with severely limited resources, such as smart mobile phones. Algorithms of pitch correction of recognized sounds with respect to psycho-physiological characteristics of human audio sensory system. The algorithms improve the quality of sound recognition. Calculation of sound duration and composition of MIDI score based on the recognized sound. Algorithms of automatic correction of MIDI score in order to remove false scores and improve overall recognition of the music."
I have noticed over the years that de-duping is not a big concern here on the Forums, especially for libraries the size of mine. I don't claim to know much about the technical side of the mp3 format, but I notice that in the MC database (calculated) columns there is frequently very little correlation between "duration" and "file size", which doesn't make sense to me. If "duration" were in fact a calculation of only the audio content, less the blank space at the beginning and end, it would go a long way in de-duping. Of course, it still wouldn't account for discrepancies in the 'content' that come from encoding or converting various formats into mp3. Thus my attempts to "clean" my tags are an incomplete and vain attempt to delete "sound-alike" duplicates. Said another way, it is my perception that narrowing "duration" down to actual audio content only could help in the final screening for dupes. Is this possible? (Re-reading this tells me that the silent space would have to be removed somehow to arrive at a true duration.)
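To make the "true duration" idea concrete, here is a minimal Python sketch of what I have in mind. It assumes each mp3 has already been decoded to mono 16-bit PCM samples by some other tool (the decoding step is not shown), and the names trimmed_duration and likely_same_length, the silence threshold of 200, and the half-second tolerance are all my own illustrative guesses, not anything MC or DoubleKiller actually does.

```python
from typing import Optional, Sequence


def trimmed_duration(samples: Sequence[int], sample_rate: int,
                     threshold: int = 200) -> float:
    """Return the length in seconds between the first and last sample
    whose magnitude exceeds `threshold` (leading/trailing silence ignored).

    `threshold` is an assumed cutoff for "silence" in 16-bit PCM units.
    """
    # Index of the first non-silent sample, or None if the track is all silence.
    first: Optional[int] = next(
        (i for i, s in enumerate(samples) if abs(s) > threshold), None
    )
    if first is None:
        return 0.0
    # Index of the last non-silent sample, scanning backwards from the end.
    last = next(
        i for i in range(len(samples) - 1, -1, -1) if abs(samples[i]) > threshold
    )
    return (last - first + 1) / sample_rate


def likely_same_length(a: float, b: float, tolerance: float = 0.5) -> bool:
    """Flag two trimmed durations (in seconds) as a possible match.

    `tolerance` is an assumed slack to absorb encoder padding differences.
    """
    return abs(a - b) <= tolerance


# Two hypothetical rips of the same tune with different amounts of padding.
rate = 1000  # toy sample rate to keep the example small
rip_a = [0] * 1000 + [1000] * 2000 + [0] * 500
rip_b = [0] * 3000 + [-1000] * 2000 + [0] * 100
print(likely_same_length(trimmed_duration(rip_a, rate),
                         trimmed_duration(rip_b, rate)))
```

A match on trimmed duration would only mark two files as candidates for a closer look. Different tunes can share a length, and different encodes of the same tune may still differ slightly, so this could narrow the final screening but not replace it.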
Today's strategy is to top off my MC12 library (where virtually no files have thumbnails), then import the library into MC14 (shouldn't have to 'add new files'), then add my next large batch of rips, which won't have images removed, and check the 'add' rate. Does this make sense?