Thursday, March 5, 2009

DMA error headache: power supply died silently

After 23 days of uptime, the server started to throw DMA errors on different hard drives of the RAID 5 array.
Checking the drives with the Seagate tools showed no errors, and all IDE cables had already been swapped for new ones. I first suspected the PCI IDE controller was defective, but the onboard IDE showed the same symptoms:

Under load, any of the drives would produce DMA errors after some time and then cause an IDE bus reset.
...
[1847037.014479] hdh: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[1847037.014486] hdh: dma_intr: error=0x84 { DriveStatusError BadCRC }
[1847037.014489] ide: failed opcode was: unknown
[1847037.060018] ide3: reset: success
...
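
Since the errors jumped between drives, it helps to tally them per drive before blaming a single disk. Here is a minimal sketch in Python that does this for log lines in the format shown above. The function name count_dma_errors and the regular expression are my own illustration, and the pattern assumes the old IDE driver's "hdX: dma_intr: error=..." message layout; other drivers log differently.

```python
import re
from collections import Counter

# Matches lines such as:
# [1847037.014486] hdh: dma_intr: error=0x84 { DriveStatusError BadCRC }
# Assumption: the legacy IDE driver message layout shown in this post.
DMA_ERROR = re.compile(
    r"\]\s*(hd[a-z]+): dma_intr: error=(0x[0-9a-fA-F]+) \{ ([^}]*)\}"
)


def count_dma_errors(log_text):
    """Return a Counter mapping (drive, error flag) to its number of occurrences."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DMA_ERROR.search(line)
        if match:
            drive, _code, flags = match.groups()
            for flag in flags.split():
                counts[(drive, flag)] += 1
    return counts


# The excerpt from this post, used as sample input.
sample = """\
[1847037.014479] hdh: dma_intr: status=0x51 { DriveReady SeekComplete Error }
[1847037.014486] hdh: dma_intr: error=0x84 { DriveStatusError BadCRC }
[1847037.014489] ide: failed opcode was: unknown
[1847037.060018] ide3: reset: success
"""

counts = count_dma_errors(sample)
for (drive, flag), n in sorted(counts.items()):
    print(drive, flag, n)
```

If BadCRC shows up on several drives and on more than one controller, something they all share (cabling, controller, or power) is a more likely suspect than any individual disk.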

I finally found the issue: the power supply had failed in an unexpected way. It worked fine until it was put under load, and then the voltage dropped, which caused the hard drives to spin down, which in turn made the running DMA transfers time out. After replacing it, everything is up and running fine again!
Phew!
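
For anyone chasing a similar problem, a voltage sag like this can be checked against the ATX tolerance of ±5% on the +3.3 V, +5 V and +12 V rails. The sketch below is only an illustration: the function names and the sample readings are hypothetical, not measurements from this server, and it assumes you already have rail voltages from a multimeter or a sensor tool, taken while the machine is under load.

```python
# ATX allows +/-5% on the positive +3.3 V, +5 V and +12 V rails.
TOLERANCE = 0.05


def out_of_tolerance(nominal, measured, tolerance=TOLERANCE):
    """Return True if the measured voltage deviates more than the tolerance."""
    return abs(measured - nominal) > nominal * tolerance


def sagging_samples(nominal, samples):
    """Return the readings that fall outside the allowed band, in order."""
    return [v for v in samples if out_of_tolerance(nominal, v)]


# Hypothetical +12 V readings: the first two at idle, the last two under load.
readings_12v = [12.1, 11.9, 11.2, 10.8]
print(sagging_samples(12.0, readings_12v))
```

Readings that are fine at idle and fall outside the band only under load match the behaviour described above.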