20050217

A board has died, long live the new board!

Our group has a Sun V1280 entry-level midrange server, which we use for computational chemistry, software development and experiments.

It has twelve 900 MHz UltraSPARC III Cu processors and 24 GB of RAM (3 CPU boards with 4 CPUs and 8 GB each). It's a pretty cool machine.

Last Friday night, it rebooted without warning (and without crashing). It seems that something went wrong with one of the CPU boards, so the system controller forced a reset. It came up with only 8 CPUs and 16 GB: one of the CPU boards failed POST.

Thankfully we have Sun support, so they (eventually) diagnosed the problem and shipped us a replacement board (along with a replacement power supply for one whose fan was making funny noises).

I was hoping to be able to use the V1280's (and Solaris') hot-swappability when replacing the CPU board. Unfortunately the dead board was so broken that the system controller (the lights-out manager, which appears to be an UltraSPARC IIe) couldn't power off just the dead board. So I had to bring the whole machine down in order to replace it.

The replacement system board came with four 900 MHz CPUs installed (the CPUs aren't user-serviceable components), but no RAM, so I had to move 32 sticks of RAM (256 MB each, 8 GB in total) from the dead board to the new board. Ouch thumbs.
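As a quick sanity check on the numbers above, here's a minimal sketch of the arithmetic. The constant and function names are made up for illustration, and nothing here talks to the actual hardware; it just confirms that 32 sticks of 256 MB per board matches the 8 GB per board and 24 GB total quoted earlier.

```python
# Figures taken from the post; names are hypothetical, for illustration only.
MB_PER_DIMM = 256       # each stick of RAM
DIMMS_PER_BOARD = 32    # sticks moved from the dead board
CPUS_PER_BOARD = 4      # CPUs per CPU board


def capacity(boards):
    """Return (cpu_count, ram_gb) for a given number of working CPU boards."""
    ram_gb = boards * DIMMS_PER_BOARD * MB_PER_DIMM // 1024
    return boards * CPUS_PER_BOARD, ram_gb


print(capacity(3))  # all three boards working
print(capacity(2))  # one board failed POST
```

With all three boards this gives 12 CPUs and 24 GB, and with one board out it gives the 8 CPUs and 16 GB the machine came up with after the reset.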

Pictures to follow
