Man.... I think you guys jinxed me or something. I think I'm getting an Ivy sooner than I'd thought, but replacing the "wrong" machine.
My server machine died last night. I'm pretty sure it is the CPU...
[Saga]
It has been acting up on me for the past month or so, but it seemed quite random. It would be fine all day doing its thing, recording TV and acting as a server. But when I would come home and use it at night, I was having some instability. It wouldn't bluescreen, but the video was just going black suddenly and the system's fans would all run full tilt. Nothing of relevance was showing in the system logs. But, it typically only manifested when I was using a full screen VM, in MC's Theater View, or playing a 3D game. But it also seemed to be "getting worse". A month or two ago, the first time I saw it happen, it happened once and then not again for a week or two. This week, though, it has been pretty much any time I did anything "demanding" with the machine (even so, it did manage to survive a 2.5 hour run of Prime95 last weekend). And then, Wednesday night it did it when I wasn't doing anything fancy, just at the desktop in Firefox (on Interact, I suspect).
So for a while, I was convinced it was either my video card or my power supply. After this week and the really bad behavior, I was more suspicious of the power supply, but also having a bit of worry about the CPU. I actually brought home a replacement video card this weekend, and I was planning to swap both and test. I'd fiddled with the drivers enough that I was pretty darn sure they weren't at fault.
Then, last night I got home a bit late and put the baby to bed and came down to the living room to watch TV. I got about 5 minutes into an episode of The Daily Show, and it just froze. MC was still responding (at first), I could seek and switch out of Full Screen, but the show stopped playing and eventually MC itself locked up. Walked downstairs and the server had a Kernel Stack Inpage Error (blue screen). Ooh, hadn't seen any of those before. Rebooted, it came back up okay, and then my wife came home and wanted to watch a show. So I didn't have time to mess with it. We got almost through one hour-long episode of something, and it happened again (different bluescreen this time). So we managed to finish the show and when I put her to bed, I came down to figure out what was going on.
Now, a little background. When I first got this machine, I had it overclocked to 4.0GHz. It was rock-solid stable at that speed (could Prime95 for 24 hours straight, and run a wide array of GPU tests/game benchmarks), but it was slightly over-volted. It ran that way fine for about a year. Then, I started to notice some slight oddness, and I ran additional stability tests and found problems. So, I pulled out my notes on the system and found the speed I'd tested at stock voltage that was rock-solid stable and dropped back to that (3.75GHz) and dropped back to stock vcore. Unfortunately, I found I couldn't get it stable at that speed and vcore. In fact, I found I had to go way back to 3.0-3.1GHz at stock vcore to get it as stable as I like. But, with (as before) my slight voltage bump, 3.7-3.8GHz was perfectly fine as best as I could tell. Worrying, but whatever, so I did that.
And, until about a month or two ago, it had been fine at those settings (I'd been testing it fairly regularly).
So, until last night, I had still been more suspicious of the GPU than anything else. The server doesn't do a lot of gaming or anything, so it was an older AMD 5770, and they ran hot, so it seemed like I might be getting up against the end of its useful life. And that whole "screens going black, no error in the logs" thing seemed like a GPU crash or a driver crash (maybe power supply supplying flaky power to the GPU). But these new bluescreens, and the crash just at the desktop with Firefox open, were much more worrying. So, the first thing I did was run Prime95 on it to just kind of beat on it a bit while I said goodnight to my wife and cleaned up the disaster zone the child creates.
I came back down to check on it after about an hour, and she was locked right up solid. No mouse movement, and the Process Explorer CPU history graph on screen was locked right up. Bummer. Shut her off and rebooted. I decided to go into the BIOS and just reset back to the defaults. I'd already (when the trouble really started) pulled the CPU back to 3.2GHz, but I didn't mess with anything else. I figured I'd shoot back to the defaults, and then tweak things one at a time. I wanted to test with the CPU back at stock everything before I started testing the GPU and the power supply and everything. So, load optimized defaults, go in and turn off the stupid graphics overlay and enable the right settings for my hard drives (or else Windows won't boot) and reboot.
And that was all she wrote. Never came back alive, never passed POST again. I tried, of course, everything from resetting the CMOS/pulling the battery, to removing all the add-in cards, trying not one but two different GPUs and Power Supplies. Trying different RAM sticks and no RAM sticks at all (trying to get a no-RAM POST error beep code). Nada. Bupkus.
[/Saga]
So here's what I think happened... My CPU was burning out, slowly, probably from the over-volting, but maybe from heat? I don't remember the exact amount I bumped up on the vcore now without looking, but I was well below what people regularly report online, only a tiny couple of notches. I'm pretty conservative. But, I don't monitor the heat as well as I should. I have a very nice cooler, but the temps in the summertime in the basement can sometimes get pretty oppressive and I don't check it well then (and I think it was hot when I had those first problems way back in the day). In any case, here near the end of the death process, I'm thinking it was my over-volting that was even allowing it to POST. Once I reset the board back to defaults, it was no longer over-volting (except the little bit the board does in Auto-Mode), and so it won't POST (so I can't get it up long enough to get into the BIOS to tell it to over-volt again). And, of course, we don't have DIP switches on the boards anymore, or complex jumper settings.
But... It could be, almost as likely, part of the power supply circuitry on the motherboard. I don't think so. I looked at it for busted caps and things like that, and found nothing. No burning smell, and the BIOS had acted fine when I had used it the past few times. And the system seems to be trying to boot, and trying different settings after I reset the CMOS as it should, but it can't. But I have no way to tell without another socket 1156 CPU on-hand, which I don't have. I called around to all of the local computer shops and none of them have any on hand either.
I could buy one, but since they're discontinued, if you want a new one, you're dealing with sketchy vendors or overpriced markup. I can get a Core i7 870 on Amazon for $340, not Prime, but... That's more than a brand new Ivy Bridge costs. And that's a big risk to take if I don't know if my board is good or bad. I could buy a cheap one, but the best I found for non-sketchy new 1156 CPUs was a crappy dual-core "Pentium" on Newegg for $110. I don't want to use that CPU long-term, it would just be a test, and $110 is a lot for something going on the junk pile.
So, the only choice would be to go with something I'd want to use long-term, and that would (for the server) really need to be the Core i7 870, especially if I'm not going to overclock it.
So, I think I'm getting an Ivy. I think what I'll do is grab a P8Z77-V Deluxe (since it is the server board, suddenly I care about that stuff a lot) and a Core i7 3770. I'm not going with the 3770K because I'm not going to overclock this one again, and they're otherwise essentially the same, so why pay the extra $40? Plus, the 3770K is sold out everywhere now.
Then, I'm going to watch for a cheap Core i5 750 or 760, on eBay or something at my leisure and maybe I can get this thing going again lower-risk with a new CPU and use it to replace the HTPC (which I still want to do, but I can't afford to do both at the same time). Playing with overclocking the "toy" HTPC is one thing... If it dies, I can hook up my laptop to the TV and carry on in a pinch. But playing with the server is a bad idea. I've not liked it for a long while, but I was cheap back when I bought the Lynnfield and the i5 was a bit slow if it wasn't overclocked (but it did that like a champ). I don't think we'll do that on the server anymore...