Topic: Calculate Audio Checksum (Read 7359 times)

RD James · « **on:** June 15, 2016, 01:45:14 pm »

Edit: This was split from the main topic.
The feature request is for Media Center to calculate the Audio Checksum of a file in the import/analysis process and store it in the library.

Some form of file integrity checking built into the import/analysis process.
As I posted in another section of the forum, I just had Windows mark one of my hard drives as corrupt.
I've run file recovery software and got many of my files back, but this leaves me with some big problems:

1. Many of the files "recovered" are only partial or otherwise corrupt files and only the filename/folder structure is intact.
2. Many of these files are named incorrectly and have lost all metadata since JRiver sees them as other files, but they appear to work.
3. Many of these files appear to have recovered successfully.

But I have no way of knowing which of these files are actually good.
So far I've tried rebuilding thumbnails - anything which won't build a thumbnail is bad.
And clearing the tags then analyzing audio - some of the files either won't analyze, or return junk values. (like -400dB!)
So that's a way of removing many of the corrupted files.

But I see no way of actually verifying that the rest of the files are good without actually watching them from start to finish. That is 41.5 days of video footage to go through according to JRiver.

If there was some sort of fingerprinting process for audio and video files, it would solve a lot of these problems.
1. I would be able to see if the checksums differ at all. (bad file)
2. It would be able to identify files based on their checksum. So a file that has been renamed/moved could be re-imported into the library and the metadata restored because JRiver knows that its checksum matches a file in the library.

This would also be useful for periodic scans of the library to check for bit-rot.
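
To make the request concrete, here is a minimal Python sketch of the two uses listed above. It is only an illustration, not how Media Center works: the function names and the `library` dict (path mapped to metadata) are hypothetical, and it hashes whole files with SHA-256 for simplicity.

```python
import hashlib


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(library):
    """Map checksum -> metadata for a hypothetical library dict.

    `library` maps a file path to its metadata (a plain dict here).
    This has to run while the files are still known to be good.
    """
    return {file_checksum(path): meta for path, meta in library.items()}


def match_recovered(index, recovered_paths):
    """Split recovered files into (matched, unknown).

    matched: path -> metadata for files whose checksum is in the index,
             even if the file was renamed or moved.
    unknown: paths with no match (corrupt, or never in the library).
    """
    matched, unknown = {}, []
    for path in recovered_paths:
        meta = index.get(file_checksum(path))
        if meta is None:
            unknown.append(path)
        else:
            matched[path] = meta
    return matched, unknown
```

A recovered file that lands in `matched` is byte-identical to what was imported and can get its metadata back; anything in `unknown` still needs a manual check.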

blgentry · « **Reply #1 on:** June 15, 2016, 03:41:25 pm »

Quote from: RD James on June 15, 2016, 01:45:14 pm

As I posted in another section of the forum, I just had Windows mark one of my hard drives as corrupt.
I've run file recovery software and got many of my files back, but this leaves me with some big problems:

My sympathies on the drive crash. It always sucks to lose data. Which is why my backup process is now more mature than it was a few years ago.

Quote

If there was some sort of fingerprinting process for audio and video files, it would solve a lot of these problems.
1. I would be able to see if the checksums differ at all. (bad file)
2. It would be able to identify files based on their checksum. So a file that has been renamed/moved could be re-imported into the library and the metadata restored because JRiver knows that its checksum matches a file in the library.

This would also be useful for periodic scans of the library to check for bit-rot.

You don't know this, but I respect your knowledge of Audio/Video and MC. But the suggestions above aren't appropriate for media software. They are functions for backup software or drive scrubbing software. Again, my sympathies and general respect. But I think this is shooting in the wrong direction.

Brian.

RD James · « **Reply #2 on:** June 15, 2016, 04:49:26 pm »

Quote from: blgentry on June 15, 2016, 03:41:25 pm

You don't know this, but I respect your knowledge of Audio/Video and MC. But the suggestions above aren't appropriate for media software. They are functions for backup software or drive scrubbing software. Again, my sympathies and general respect. But I think this is shooting in the wrong direction.

I will freely admit that I am not very knowledgeable in this area.
Am I wrong in thinking that a file can be corrupted for whatever reason, which would change its checksum, and that your backup software would then copy it to all your drives in place of the good file? It only sees an external change - it doesn't know whether that external change is corruption or intentional editing of the file.
If you catch it in time, you should have multiple copies of a file to recover from - but only if you catch it in time.

Independent verification that the file is intact (via JRiver) seemed like a good way to handle that.

As an example, here is what happens with two different programs when I make a copy of the same MP3 file.
1. The first copy only has the file renamed.
2. The second copy is renamed and has the tags removed.

HashCheck calculates the same checksums for the original and the copy, but different checksums for the file where the tags were changed.
dBpoweramp's "Audio CRC" test calculates the same checksums for all three files.

I believe that what dBpoweramp does is decode the files to WAV and then calculates the CRC of only the audio data.
But dBpoweramp only spits out a text file when you run the Audio CRC check, it doesn't store that in a database and allow you to verify that it is correct at a later date.
It seemed like JRiver would be able to do this as part of the audio analysis stage, as it has to decode the audio for that already.
I was hoping it might also be possible to do the same thing for video files - or at least store a checksum of the audio data contained within when run through the analyzer - that's probably sufficient.

Since JRiver is also wanting to improve things like metadata, I wondered if calculating the checksums might also be useful for that.
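
The difference between the two results can be sketched in a few lines of Python. This is an illustration under stated assumptions, not dBpoweramp's actual method: it only handles WAV input through the standard `wave` module (an MP3 would need a real decoder), and the function names are made up for the example.

```python
import hashlib
import wave
import zlib


def file_hash(path):
    """Hash every byte of the file, tags and headers included."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def audio_crc(path, frames_per_read=65536):
    """CRC-32 of the PCM frames only (WAV input, for simplicity)."""
    crc = 0
    with wave.open(path, "rb") as wav:
        while True:
            frames = wav.readframes(frames_per_read)
            if not frames:
                break
            # Feed the running CRC so the whole file never sits in memory.
            crc = zlib.crc32(frames, crc)
    return format(crc, "08X")
```

Renaming a file changes neither value. Editing its tags changes `file_hash` but leaves `audio_crc` alone, which matches the behaviour described for the two programs.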

mwillems · « **Reply #3 on:** June 15, 2016, 06:49:34 pm »

Likewise, my sympathies on the disk failure.

Quote from: RD James on June 15, 2016, 04:49:26 pm

I will freely admit that I am not very knowledgeable in this area.
Am I wrong in thinking that a file can be corrupted for whatever reason, which would change its checksum, and that your backup software would then copy it to all your drives in place of the good file? It only sees an external change - it doesn't know whether that external change is corruption or intentional editing of the file.
If you catch it in time, you should have multiple copies of a file to recover from - but only if you catch it in time.

What you're describing ought to be the job of the filesystem, and is something that next generation filesystems (zfs, btrfs, etc.) already do. Those filesystems take checksums of all files and provide various ways to discover when the file no longer matches its checksum, safeguarding against bitrot. The backup issue is only an issue if you're not doing incremental "forever" backups. With certain types of incremental backup software (like some of the neat modern de-duplicating solutions), you can effectively keep backing up indefinitely, which would give you the ability to roll back to an arbitrary point in time. Even conventional incremental backups can give you some ability to go back in time if necessary.

I run two scrubs a month (overnight) on my btrfs array; I have incremental backups going back months that would allow me to retrieve an older version of any file that had rotted between scrubs; I can't imagine JRiver being able to offer something equivalent as it's so far outside the core mission of a media player, but the folks at JRiver have surprised me before.

Quote

Independent verification that the file is intact (via JRiver) seemed like a good way to handle that.

As an example, here is what happens with two different programs when I make a copy of the same MP3 file.
1. The first copy only has the file renamed.
2. The second copy is renamed and has the tags removed.

HashCheck calculates the same checksums for the original and the copy, but different checksums for the file where the tags were changed.
dBpoweramp's "Audio CRC" test calculates the same checksums for all three files.

I believe that what dBpoweramp does is decode the files to WAV and then calculates the CRC of only the audio data.
But dBpoweramp only spits out a text file when you run the Audio CRC check, it doesn't store that in a database and allow you to verify that it is correct at a later date.
It seemed like JRiver would be able to do this as part of the audio analysis stage, as it has to decode the audio for that already.
I was hoping it might also be possible to do the same thing for video files - or at least store a checksum of the audio data contained within when run through the analyzer - that's probably sufficient.

Since JRiver is also wanting to improve things like metadata, I wondered if calculating the checksums might also be useful for that.

Fingerprinting for metadata analysis on a onetime basis might be a good idea, but hashing a large media collection regularly for integrity checking is pretty computationally expensive. For perspective, an operation that checks all my media file checksums (a "scrub") takes about 8-10 hours with full disk utilization, and that's with the filesystem's native scrub utility on local drives! It would likely be even more "expensive" for a program like JRiver to do that kind of scrub, especially over a network.

And you can't just monitor files that report changes; bitrot and other silent corruption don't update the file's timestamp or really even "touch" the file in a way that would trigger a notification event. So the only way to spot real corruption (that doesn't result from direct user error) is to scrub all the checksums as described above.

In 10 years, hopefully, this will be in all the major filesystems and people will be able to forget about silent data corruption (and focus on the distressing enough "loud" data corruption).
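
Outside a checksumming filesystem, a scrub amounts to re-reading every file and comparing it with a checksum recorded earlier. The Python sketch below shows the idea only; it is not what btrfs or zfs do internally, and the JSON manifest layout and function names are assumptions made for the example. The manifest should live outside the scanned folder.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(root, manifest_path):
    """Record a checksum for every file under root (run while files are good)."""
    root = Path(root)
    manifest = {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def scrub(root, manifest_path):
    """Re-read every file in the manifest; return names missing or changed."""
    root = Path(root)
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for name, expected in manifest.items():
        path = root / name
        # Timestamps are ignored on purpose: silent corruption leaves them alone.
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    return problems
```

Because every byte is read back, the run time is bounded by disk throughput, which is why a full scrub of a large collection takes hours.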

RD James · « **Reply #4 on:** June 15, 2016, 07:47:00 pm »

Quote from: mwillems on June 15, 2016, 06:49:34 pm

What you're describing ought to be the job of the filesystem, and is something that next generation filesystems (zfs, btrfs, etc.) already do. Those filesystems take checksums of all files and provide various ways to discover when the file no longer matches its checksum, safeguarding against bitrot.

Well ReFS was supposed to bring that functionality over to Windows but that's the first time I've ever had an entire drive just suddenly report that all files are corrupt.
In my previous experience, disk failure usually means a handful of corrupt files at most, providing that you have a tool monitoring the drives in your system and set it to either shut down immediately or initiate a backup automatically.
This seems like it was a filesystem failure, not a disk failure.

Quote from: mwillems on June 15, 2016, 06:49:34 pm

The backup issue is only an issue if you're not doing incremental "forever" backups. With certain types of incremental backup software (like some of the neat modern de-duplicating solutions), you can effectively keep backing up indefinitely, which would give you the ability to roll back to an arbitrary point in time. Even conventional incremental backups can give you some ability to go back in time if necessary.

Yes, I use incremental backups (on the drives that have full backups...) but not with deduplication as I'm not running a server version of Windows, which means that I have a limited amount of time/changes that I can roll back.
But you need to have a way of knowing that a file is corrupt for that to be useful.
The filesystem should handle this, but that's clearly not something that can be completely relied upon.

Quote from: mwillems on June 15, 2016, 06:49:34 pm

I run two scrubs a month (overnight) on my btrfs array; I have incremental backups going back months that would allow me to retrieve an older version of any file that had rotted between scrubs; I can't imagine JRiver being able to offer something equivalent as it's so far outside the core mission of a media player, but the folks at JRiver have surprised me before.

Well that's why I was thinking it could be done on import as part of the audio analysis process.
If I run dBpoweramp's Audio CRC check on a WAV file instead of a compressed audio file it completes in a couple of seconds, and I'm not typically importing hundreds of files at once so that calculation time would not be expensive.
It was because it's specifically an audio CRC being calculated and not a file CRC that it seemed appropriate for media center.

Quote from: mwillems on June 15, 2016, 06:49:34 pm

Fingerprinting for metadata analysis on a onetime basis might be a good idea, but hashing a large media collection regularly for integrity checking is pretty computationally expensive.

Well it doesn't have to be a regularly scheduled thing.
Make it a manual thing that can be run on sets of files, and perhaps have an option to check files during playback once they have been decoded for example.
I'm not expecting it to replace what these advanced file systems are supposed to be doing, just as a way to confirm that your files are good as they're played, and having some reference that files can be compared against in the event that you have a failure to deal with.

It's more that this would have been very useful in sorting through the wreckage of 4TB/~150,000 files instead of what I'm having to do now.
I'd be less concerned if I knew what was bad, so that I know what needs to be replaced. Instead I have a lot of files that appear to be fine, but I know that many of them are not.
Many of the files are good, which is fortunate, because I'm now finding that a lot of them are no longer available due to closures (e.g. GameFront shutting down at the end of April, videos on YouTube channels being deleted or the channels shut down, Bandcamp pages or other independent sites being shut down, etc.). But I've also had a four-hour video that passed all of the tests above and then cut off about 57 minutes into it, for example.

The good thing is that, at least in JRiver I still have thumbnails for all the videos, and filenames/other details for these files in my library. That alone is proving to be a very useful resource - at least I have a record of all the media that was on the drive, even if I don't have a way to check which files are good.

blgentry · « **Reply #5 on:** June 16, 2016, 08:12:19 am »

Apologies for continuing to derail this thread. If an admin wants to split this, that's probably the right thing to do.

Quote from: RD James on June 15, 2016, 07:47:00 pm

Well ReFS was supposed to bring that functionality over to Windows but that's the first time I've ever had an entire drive just suddenly report that all files are corrupt.

Then you're lucky. Most drive failures are pretty much everything at once. At least in my limited experience.

Quote

Yes, I use incremental backups (on the drives that have full backups...) but not with deduplication as I'm not running a server version of Windows, which means that I have a limited amount of time/changes that I can roll back.

So why are you going through this pain if you have backups? This is exactly why you have backups; so you can restore them in the event of failure.

Quote

But you need to have a way of knowing that a file is corrupt for that to be useful.

When a drive fails, you pretty much assume everything is gone. Anything you can recover is a bonus if you don't have backups. If you do have backups, you just ignore the drive. It's not worth the effort to sort good from bad.

Brian.

rec head · « **Reply #6 on:** June 16, 2016, 08:57:44 am »

TLDR: Have you looked at Spinrite?
https://www.grc.com/cs/prepurch.htm

RD James · « **Reply #7 on:** June 16, 2016, 11:46:07 am »

Quote from: blgentry on June 16, 2016, 08:12:19 am

Then you're lucky. Most drive failures are pretty much everything at once. At least in my limited experience.

In my experience that tends to happen if you aren't monitoring the health of the drive.
If you have something constantly monitoring the drive and immediately take action as soon as its health drops below 100%, it's rare to lose much if anything unless it's a catastrophic failure of the drive.

Quote from: blgentry on June 16, 2016, 08:12:19 am

So why are you going through this pain if you have backups? This is exactly why you have backups; so you can restore them in the event of failure.

I have backups for "important files", which means my primary music library (because CD libraries take so long to re-rip), important documents, etc.
I don't have backups for everything because then I would need an additional 40TB which is very expensive.
The majority of data on my drives is ripped media, which I still have locked up in storage.
So my attitude has been that it's OK to lose, because I'm used to the failure state being only a handful of files at most, and I can always re-rip it even if the entire drive dies.

The problem is that this drive that I've lost data from was the one which also contained all of my (legally) downloaded media. (and other downloaded files like game patches/mods etc.)
Again, my attitude with that has been that if I lose it, I can always download it again.
The problem is that I'm now finding that sites/services have closed down, YouTube/Vimeo/etc. videos/channels have been deleted, YouTube has re-encoded it and it's now lower quality than before, accounts on other sites have been closed, sites have changed their back-end and no longer have records older than X number of years etc. and a lot of that media is no longer available.

So I will have to get backups put in place for that now, but that doesn't help this current situation.
And it turns out that finding the sources and downloading everything again is very time consuming compared to pulling a stack of discs out of storage and starting to re-rip them.

Quote from: blgentry on June 16, 2016, 08:12:19 am

When a drive fails, you pretty much assume everything is gone. Anything you can recover is a bonus if you don't have backups. If you do have backups, you just ignore the drive. It's not worth the effort to sort good from bad.

Right, but I just thought that if Media Center also calculated and stored the Audio CRC (not file CRC) as part of the import/analysis process - since it's already doing the work to decode the files - that it would be useful to have.
I'm just presenting one scenario where it would be saving me a lot of time and effort if I had access to this.
The fact that I still have filenames in JRiver's library is already immensely helpful in trying to source these files again, because who really remembers every individual file in their libraries until they actually want to play it and find that it isn't there.

Having an Audio Checksum stored in the library would remove a ton of manual work in sorting through these files.
Yes, everyone should have every bit of data on three separate hard drives and one of them should be stored offsite at all times and so on, but most people's backup systems are far from perfect and this seemed like something that would not add much computation time to the import/analysis process and would be invaluable if you ever have to recover files.

ferday · « **Reply #8 on:** June 16, 2016, 01:35:32 pm »

i use hashcalc
http://www.geeksengine.com/program_331.html
but it can't batch calc the hash and doesn't offer an easy way to get the hash out

there are also perl programs that will auto-save the hash into a tag, i haven't tried it
http://snipplr.com/view/4025/mp3-checksum-in-id3-tag/

i certainly don't disagree that having the hash available through MC could be useful

JimH · « **Reply #9 on:** June 16, 2016, 02:55:32 pm »

Wouldn't writing the hash tag (or any other tag) alter the hash?

blgentry · « **Reply #10 on:** June 16, 2016, 03:21:45 pm »

Quote from: JimH on June 16, 2016, 02:55:32 pm

Wouldn't writing the hash tag (or any other tag) alter the hash?

If the hash is of the entire file, then yes. Astute observation.

RDJ is asking for a hash of the audio data. That is, a hash (or CRC) of the raw, uncompressed audio data. I think he's asking for that for two reasons:

1. The hash of the uncompressed data should never change, even if you write lots of tags, because the tags are separate from the audio data.
2. Presumably this uncompressed audio data is being accessed, as a whole, during audio analysis. That would make the process of calculating the hash "cheap" (computationally) because the data is already in memory and being operated upon.

Whether or not a hash of the audio data is good enough to guarantee that a *video* file (with mixed video and audio streams) is still good is an open question. I think you'd need some meaningful hash of more of the data to be sure, but that's just my gut feeling.

Brian.
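
The "cheap" part of point 2 can be sketched as a single pass that feeds the same decoded bytes to both the analysis and the checksum. This is a hypothetical Python illustration, not Media Center's analyzer: it assumes 16-bit little-endian mono PCM delivered in chunks with an even number of bytes, and a simple peak measurement stands in for the real analysis.

```python
import math
import struct
import zlib


def analyze_and_checksum(pcm_chunks):
    """Single pass over 16-bit little-endian mono PCM chunks.

    Returns (peak_dbfs, crc32_hex). The peak measurement stands in for
    whatever analysis the player already performs on the decoded audio.
    """
    peak = 0
    crc = 0
    for chunk in pcm_chunks:
        # The checksum reuses the bytes that were decoded for analysis.
        crc = zlib.crc32(chunk, crc)
        samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
        peak = max(peak, max((abs(s) for s in samples), default=0))
    peak_dbfs = 20 * math.log10(peak / 32768) if peak else float("-inf")
    return peak_dbfs, format(crc, "08X")
```

The extra cost over analysis alone is one `zlib.crc32` call per chunk; no second read of the file is needed.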

blgentry · « **Reply #11 on:** June 16, 2016, 03:26:10 pm »

Quote from: RD James on June 16, 2016, 11:46:07 am

I don't have backups for everything because then I would need an additional 40TB which is very expensive.
The majority of data on my drives is ripped media, which I still have locked up in storage.
[...]
The problem is that this drive that I've lost data from was the one which also contained all of my (legally) downloaded media. (and other downloaded files like game patches/mods etc.)
Again, my attitude with that has been that if I lose it, I can always download it again.
The problem is that I'm now finding that sites/services have closed down, YouTube/Vimeo/etc. videos/channels have been deleted, YouTube has re-encoded it and it's now lower quality than before, accounts on other sites have been closed, sites have changed their back-end and no longer have records older than X number of years etc. and a lot of that media is no longer available.

Ouch. A miscalculation that is biting you now.

Quote

So I will have to get backups put in place for that now, but that doesn't help this current situation.
[...]
Right, but I just thought that if Media Center also calculated and stored the Audio CRC (not file CRC) as part of the import/analysis process - since it's already doing the work to decode the files - that it would be useful to have.
I'm just presenting one scenario where it would be saving me a lot of time and effort if I had access to this.

Yeah, I get it. I still don't really agree that this would be easy to implement, or that it's strictly a good function for library software/media player. But I get it and I understand where you are coming from. It's not a crazy thing to suggest. Just not something I fully agree with.

Quote

The fact that I still have filenames in JRiver's library is already immensely helpful in trying to source these files again, because who really remembers every individual file in their libraries until they actually want to play it and find that it isn't there.

That's a really great point actually! The last time I lost media in a crash, I was able to pull a file listing from the dying drive, so that helped. But even file names aren't nearly as good as real metadata. This crash was pre-Media-Center for me.

Good luck with your recovery.

Brian.

mattkhan · « **Reply #12 on:** June 16, 2016, 05:05:12 pm »

Quote from: JimH on June 16, 2016, 02:55:32 pm

Wouldn't writing the hash tag (or any other tag) alter the hash?

such a value is commonly stored in an extended attribute so no

RD James · « **Reply #13 on:** June 16, 2016, 05:28:23 pm »

Quote from: blgentry on June 16, 2016, 03:21:45 pm

Whether or not a hash of the audio data is good enough to guarantee that a *video* file (with mixed video and audio streams) is still good is an open question. I think you'd need some meaningful hash of more of the data to be sure, but that's just my gut feeling.

I agree, but I think that hashing video files would be computationally expensive compared to only hashing the audio as part of the import/analysis process.
Hashing only the audio would, however, provide a way to create a list of "known bad" files, rather than "known good".

jmone · « **Reply #14 on:** June 16, 2016, 06:18:35 pm »

I have two pools (main and backup) and also looked at doing BitSum, but given that they are over 40TB each I gave up due to the very long time such volumes take to calculate. I read "All ReFS metadata have 64-bit checksums which are stored independently."... but I don't know how to access this metadata.

mwillems · « **Reply #15 on:** June 16, 2016, 08:01:38 pm »

Quote from: jmone on June 16, 2016, 06:18:35 pm

I have two pools (main and backup) and also looked at doing BitSum, but given that they are over 40TB each I gave up due to the very long time such volumes take to calculate. I read "All ReFS metadata have 64-bit checksums which are stored independently."... but I don't know how to access this metadata.

My understanding is that you can't, and you can't manually initiate a scrub either, it's all automagic in the background a bit at a time. That (from my perspective) makes it not particularly useful as there's no way to set your risk/fault tolerance through scheduling scrubs, and you're less likely to find small faults before they become large ones. The technet threads are not very encouraging about data recovery with ReFS after failure either (it looks like completely filling up a ReFS drive could lead to a particularly rough failure mode).

But, if I'm honest, risk of catastrophic failure is common to all of the next gen filesystems (btrfs and zfs too) at this point; it's why the conventional wisdom is to have full backups before using any of them.
