Nodes mysteriously dying?

So I’ve got 3 nodes that have randomly decided to crash over the past few days, and it seems to be a daily occurrence now: I wake up to 70 emails in my inbox (I have Nagios email me every 5 minutes while there’s a problem, until it’s fixed). Now I just need to find the time to actually sit down and diagnose these things. I’m half-tempted to just re-image everything, since there should be nothing on the computational nodes themselves, just the NFS share of /home, which obviously won’t be active when I PXE boot them.

I’d like to figure out what exactly is going on with them before I just nuke everything, although for time’s sake, I might end up doing just that.

Oh well, it’s probably just someone’s crappy code breaking each node. At least it’s only been one at a time, and it doesn’t look like there have been any jobs submitted in a while.

I’ll probably end up pulling that apart tonight (if I find time between editing my first and second posts), although the biggest pain is the fact that I almost have to be there to see what exactly is going on.

::shrugs:: time for some learnin!


Stopping in to check on the nodes, I saw the following two screens for dmesg | more:


It looks as though eth0 is having a hardware fault and gets stuck with MAC_TX_MODE=fffffff, although I’m not sure what that status means exactly. I’ll figure it out though, as this problem is what’s been affecting all of my nodes, and if people can’t connect to them, obviously they can’t be used for computations :(
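Since paging through dmesg | more by hand on every node gets old fast, here is a rough Python sketch that pulls out just the lines mentioning the NIC. The search keywords (tg3, eth0, MAC_TX_MODE) are simply the names from this post, and the helper names filter_nic_lines and read_dmesg are made up for illustration, so treat it as a starting point rather than a proper diagnostic tool.

```python
import re
import subprocess

# Assumed keywords: the interface, driver, and register named in this post.
PATTERN = re.compile(r"tg3|eth0|MAC_TX_MODE", re.IGNORECASE)


def filter_nic_lines(log_text, pattern=PATTERN):
    """Return the lines of log_text that match the pattern."""
    return [line for line in log_text.splitlines() if pattern.search(line)]


def read_dmesg():
    """Run dmesg and return its output as text (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Print only the kernel messages that mention the NIC or its driver.
    for line in filter_nic_lines(read_dmesg()):
        print(line)
```

Run on a node (or over ssh while the node is still reachable), this prints only the NIC-related kernel messages, which makes it easier to compare what each node logged before it dropped off the network.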


I’ll post more when I actually know what’s going on. Open to suggestions though!

Hopefully I’ll be able to figure this out without wiping all my nodes clean. That would be kind of a pain in the butt.

This is a pretty bitchin rendition of panic.c (kernel panic)

So it seems like a bug in the tg3 driver, which drives the NIC. My money is still on a hardware failure that is tiny but there, though. I sincerely hope I’m wrong, but I’ve diagnosed a lot of failures like this in the past, and it’s generally turned out to be something that is just going to get worse :(.
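To double-check which driver is actually bound to an interface, one option is to follow the driver symlink that Linux exposes under /sys/class/net. This is a small sketch, assuming a kernel with sysfs mounted at /sys; the function name nic_driver and its sysfs_root parameter are my own, added so the lookup can be pointed at a different directory.

```python
import os


def nic_driver(iface="eth0", sysfs_root="/sys/class/net"):
    """Return the kernel driver name bound to iface, or None if unknown."""
    # On Linux, <sysfs_root>/<iface>/device/driver is a symlink to the driver directory.
    link = os.path.join(sysfs_root, iface, "device", "driver")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    print(nic_driver("eth0"))
```

If this prints tg3 for eth0 on the affected nodes, that at least confirms the driver in question before digging into whether the fault is in software or hardware.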


