From: stormreaver
Written: 2009-04-26 12:16:22.229289
Subject: Bad RAM

A few weeks ago, my computer was behaving extremely erratically. It would power down spontaneously and leave no hint of a reason in the system log. Eventually, it got to the point where it would shut down two or three times a week, which was completely unacceptable.

Being competent at diagnosing system problems, and knowing full well that Linux just doesn't behave like this (Windows, yes. Linux, no), I decided that there had to be a hardware problem. Coincidentally, the first spontaneous shutdown was on the same day I had bought my new UPS, so I immediately assumed that the UPS or the UPS software must somehow be causing this.

I deactivated the UPS software and waited a few days. No spontaneous shutdown. I had also bought UPSes for my other computers, and they were behaving just fine, so I just couldn't understand how the UPS or the software could cause these problems. While I was at work that day, the computer spontaneously powered down, so I concluded that the UPS and the software weren't the cause.

My second conclusion was that the power supply was probably the culprit, so I went to Best Buy and bought a new power supply to replace the one I had installed just a few months ago, on the assumption that it had gone bad (though I always unplugged my computers at the slightest hint of bad weather). I put in the new, expensive power supply and waited a few days without incident. When I tried to connect to my computer from work, I found that it had once again powered down spontaneously. The weather was good, there were no reports of power outages anywhere, and there was no external reason for this to be happening.

When I got home, I saw that the lights on the case were on, but there was no video display. I replaced the video card with one that was installed in a functional computer, but still had no video. In fact, the computer wouldn't even go through the POST process. This led me to believe that the motherboard was fried -- the new motherboard and processor I had just bought a few weeks prior (damn!). Fortunately, I had already ordered a new motherboard and processor to upgrade my computer anyway, and the parts arrived from Newegg that same day (whew!).

I wasn't willing to concede the motherboard just yet, but at this point, I was running out of ideas, so I decided to start checking the things attached to the motherboard. I disconnected just about everything that was attached to the motherboard, including all the RAM, and started putting them back in one piece at a time. Each time I put a piece back in, I powered up to test the computer. Here's how it went:

Install one RAM chip and the video card, and power up. The video came back, and the computer POST'ed. Manually power off, and move on.

Connect the first hard drive, and power up. The video came back, and the computer POST'ed, and Linux started its boot sequence. Manually power off, and move on.

Insert the second RAM chip, and power up. The video came back, and the computer POST'ed, and Linux started its boot sequence. Manually power off, and move on.

Insert the third RAM chip, and power up. The computer would not POST!

Remove the third RAM chip, and power up. The computer POST'ed. Re-insert, no POST. Remove, POST. Reinsert, no POST. Bingo!
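For anyone who wants the procedure above in a more general form, here is a rough Python sketch of the same idea: add one part at a time, test after each addition, and when a test fails, pull the part and re-insert it a couple of times to confirm the result. The function name, the part labels, and the simulated_boot stand-in are all made up for illustration; in real life the "boot test" is you pressing the power button and watching for POST.

```python
from typing import Callable, Iterable, List, Optional


def find_faulty_component(
    components: Iterable[str],
    boots: Callable[[List[str]], bool],
    confirmations: int = 2,
) -> Optional[str]:
    """Add parts one at a time and return the first part that reliably breaks boot.

    `boots` is a hypothetical test that reports whether the machine POSTs
    with the given parts installed. Returns None if every part passes, or
    if a failure cannot be reproduced by removing and re-inserting the part.
    """
    installed: List[str] = []
    for part in components:
        installed.append(part)
        if boots(installed):
            continue  # this part is fine so far; move on to the next one

        # The machine failed right after adding `part`. Confirm by toggling it:
        # it should work without the part and fail again with it.
        for _ in range(confirmations):
            installed.remove(part)
            if not boots(installed):
                return None  # still failing without the part: inconclusive
            installed.append(part)
            if boots(installed):
                return None  # failure did not repeat: inconclusive
        return part
    return None


# Simulated stand-in for powering up the machine (an assumption for the demo):
# the machine fails to POST whenever the third RAM chip is installed.
def simulated_boot(installed: List[str]) -> bool:
    return "RAM 3" not in installed


order = ["RAM 1", "video card", "hard drive 1", "RAM 2", "RAM 3", "RAM 4"]
print(find_faulty_component(order, simulated_boot))  # prints RAM 3
```

The confirmation loop is the important part. A single failed boot could be a badly seated part, so the sketch only blames a component once the failure follows it in and out of the machine more than once.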

I put the computer back together, but left that one RAM chip (out of four) out. My computer has been working perfectly ever since.

Despite the amount of time and money (I bought many of the major parts for a new computer) this took, I remember how much worse stuff like this was when I still ran Windows. With Windows, such behavior was common with virus infections and Windows' massive general instability. There was just no way to know where to begin looking, and I almost always ended up just wiping everything and reinstalling from scratch. And that was just to eliminate a possible failure point. Sometimes it helped, and sometimes it was a hardware problem after all. If I had been running Windows, I probably would have reinstalled, only to scream in agony after finding out that it was a hardware problem.

With Linux, though, such malfunctions are either clearly software or clearly hardware. There is no confusion, as Linux has proven itself to be extremely reliable and stable. It just turns out that defective RAM mimics a host of different hardware problems, and can be very deceptive to diagnose.

Many thanks to all the terrific Linux developers who make my computing life easier at every turn.