Nodes mysteriously dying?

So I’ve got 3 nodes that have randomly decided to crash over the past few days, and it seems to be a daily occurrence now: I wake up to 70 emails in my inbox (I have Nagios email me every 5 minutes while there’s a problem, until it’s fixed). Now I just need to find the time to actually sit down and diagnose these things. I am half-tempted to just re-image everything, as there should be nothing on the computational nodes themselves, just the NFS share of /home, which obviously won’t be active when I PXE boot them.

I’d like to figure out what exactly is going on with them before I just nuke everything, although for time’s sake, I might end up doing just that.

Oh well, it’s probably just someone’s crappy code breaking each node. At least it’s only been one at a time, and it doesn’t look like any jobs have been submitted in a while.

I’ll probably end up pulling that apart tonight (if I find time between editing my first and second posts), although the biggest pain is the fact that I almost have to be there to see what exactly is going on.

::shrugs:: time for some learnin!

nagios

Stopping in to check on the nodes, I saw the following two screens from dmesg | more:

IMG_0202

It looks as though eth0 is having a hardware fault and gets stuck with MAC_TX_MODE=fffffff. I’m not sure what that status means exactly, but I’ll figure it out, as this problem is what’s been affecting all of my nodes, and if people can’t connect to them, obviously they can’t be used for computations :(
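Paging through dmesg | more by hand gets old fast, so here is a rough Python sketch of how the NIC-related lines could be pulled out of the kernel log instead. The names nic_lines and read_dmesg are made up for illustration, and the pattern (anything mentioning eth0 or tg3) is just my guess at what is worth looking at; treat it as a starting point, not a proper diagnostic tool.

```python
import re
import subprocess

# Assumption: only lines naming the interface (eth0) or the driver (tg3)
# are interesting. Widen the pattern if that turns out to be too narrow.
NIC_PATTERN = re.compile(r"eth0|tg3", re.IGNORECASE)


def nic_lines(log_text, pattern=NIC_PATTERN):
    """Return the kernel log lines that match the NIC pattern, in order."""
    return [line for line in log_text.splitlines() if pattern.search(line)]


def read_dmesg():
    """Capture the output of `dmesg` as text (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a node, calling nic_lines(read_dmesg()) and printing each entry should give roughly what the screens showed, minus the unrelated boot chatter.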

IMG_0203

I’ll post more when I actually know what’s going on. Open to suggestions though!

Hopefully I’ll be able to figure this out without wiping all my nodes clean. That would be kind of a pain in the butt.

This is a pretty bitchin rendition of panic.c (kernel panic)

So it seems like a bug in the tg3 driver, which drives the NIC, although my money is still on a hardware failure that is tiny but there. I sincerely hope I’m wrong, but I’ve diagnosed a lot of failures like this in the past, and it has generally turned out to be something that is just going to get worse :(.
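Before blaming tg3 on every node, it is worth confirming which driver is actually bound to eth0 on each one. On Linux, /sys/class/net/<iface>/device/driver is normally a symlink to the driver’s directory, so the last path component is the driver name. A minimal sketch, assuming that sysfs layout; bound_driver and the sysfs_root parameter are hypothetical names of mine:

```python
import os


def bound_driver(iface, sysfs_root="/sys"):
    """Return the name of the kernel driver bound to iface, or None.

    Assumes the usual sysfs layout, where
    <sysfs_root>/class/net/<iface>/device/driver is a symlink
    pointing at the driver's directory.
    """
    link = os.path.join(sysfs_root, "class", "net", iface, "device", "driver")
    if not os.path.islink(link):
        # No such interface, or a virtual device with no driver link.
        return None
    # The symlink target ends in the driver name.
    return os.path.basename(os.readlink(link))
```

If bound_driver("eth0") comes back as tg3 on all three of the dying nodes, that at least narrows things down to the same driver and NIC family, whether the root cause ends up being software or hardware.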
