This is the most annoying hardware problem ever.


I have an Ultra 10 at work which handles mail for a small group of users who haven’t moved onto Notes for whatever reason. Lately it’s been hanging over the weekend: the console reports that /var is full and / is out of inodes, and a hard reboot brings it back up without a full /var or an inode-full /. Last weekend I managed to have a console actually connected when it failed, and saw some additional IDE errors. OK, for some reason there’s a scheduled reboot on Saturdays, and I guess it doesn’t like that, because a hard power cycle fixes things. Comment out the crontab entry and away we go.

And then it failed again this weekend, and this time I thought ahead a bit and tried a probe-ide at the console. No hard drive. It just forgets that anything’s connected to the first IDE interface. Alright, that explains why it fails the way it does. Came back after a power cycle again, but now it was bugging me, so I started digging through syslog to get some idea of the timing.

So now I know that I have a machine which hangs solid at 4:27 PM on Saturday. Every week. The scheduled reboot was set for later than that, so it was never actually running. There’s nothing in anyone’s crontab at 4:27. It shares a rack with a handful of other boxes, but nothing that requires weekend intervention like a tape drive. The other U10s in that rack are unmolested. It’s very underloaded, and there’s no significant mail traffic around that time.
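
For anyone who wants to do the same kind of digging, here is a rough sketch of one way to pull hang times out of syslog. It is not the exact procedure used here, just an illustration: it assumes traditional syslog lines that start with a "Mon DD HH:MM:SS" timestamp, and it treats any silence longer than an hour as a probable hang. The function names, the one-hour threshold, and the year argument are all made up for the example (classic syslog lines carry no year, so one has to be supplied).

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a traditional syslog line.

    The year is an assumption passed in by the caller, since the
    classic syslog format does not record it.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # not a timestamped line (continuation, garbage, etc.)


def find_gaps(lines, year, min_gap=timedelta(hours=1)):
    """Return (last_entry_before_gap, first_entry_after_gap) pairs.

    A gap is any stretch of at least min_gap with no log entries at all;
    the first element of each pair approximates when the box went quiet.
    """
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_timestamp(line, year)
        if stamp is None:
            continue
        if previous is not None and stamp - previous >= min_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Feed it the lines of the log file (for example `find_gaps(open(path), year)` with whatever path your syslog writes to) and look at the first timestamp of each pair. If they all land at the same time of day, you have the same problem I do.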

What the hell?


10 responses to “This is the most annoying hardware problem ever.”

  1. Is it always at exactly 4:27? Might it be the cleaning people hitting some kind of interference with the vacuum or something?

    *ponder*

    Is there anything in the crontab of the machines racked with it that might cause the machine above/below it to wig out at that time and cause radio interference?

  2. I found out this morning, looking back through my support ticket history, that at 4am on the 4th of every month, my colocated machine fails. Nothing in syslog, it just *dies*.

    I knew it had happened a couple times, but I only noticed the timing today. Quite annoying, really. Makes me wonder if it’s something that’s not actually on my end. Left a non-urgent ticket in their support queue, so we’ll see what they say.

    Just felt like sharing a similar anecdote.

  3. Exactly. Doubtful — if the cleaning people were that consistent then sometimes they’d empty the garbage at my desk when it needs it. None of the other machines’ crontabs have anything odd (:27? That’d be weird.) It’s in a data centre, not a back room, so power ought to be stable, or unstable power ought to be setting off alarms. It does have a lot of network interfaces (pair of qfes) so there are a lot of possible dependencies, but that’s the whole problem, figuring out what dependency is doing it :-)

  4. I loved that story, until a former manager liked it enough to tell it lots, having forgotten every time that he’d told it before. :-)

  5. Duuuuuuuude, it’s so obvious, maaaaaan.

    Your Ultra 10’s stoned. I mean, how long after 4:20 would it take you to just say “aw, fuck it” and stop working?

    :-)

  6. One place that I worked at had a pretty crappy server room with crappy building wiring. Every Thursday night at roughly the same time, several servers in the rack would force-reboot (without saving anything, possibly tweaking the non-journaled filesystem). Not all of the servers did this, just a few. It took a very long time to realize that one of the outlets in the room (which, I think, was actually a broom closet) was linked to a mysterious light switch two rooms away that nobody ever thought to mess with (“it’s broken; it doesn’t connect to anything.”) When the cleaning people came in on Thursdays after everyone left, they would jiggle the switch, thinking it went to the overhead lights, causing the servers to switch off and on.