I need some ideas on how to debug a Windows hang that we’ve seen twice
in the last 45 days.
The error:
When the error occurs, the system doesn’t appear to do anything in
response to user input (mouse or keyboard).
You can move the mouse and the pointer follows immediately, but when
you click, nothing happens on screen. If you wait long enough, say
5 minutes, you might see a window move or start to respond. Note that
when a window does finally react, the action itself happens instantly;
it just happens about 5 minutes after you hit the ‘minimize’ box.
Ctrl-Alt-Delete has no effect from the keyboard. The Num Lock and
Caps Lock LEDs do properly (and instantly) follow the key presses.
Our apps aren’t responding, nor does the system seem to respond to TCP
traffic of any kind, or even to ping from the local network. I didn’t
have the MAC address or anything else on hand to try fooling with ARP.
Two things of note:
- We have a custom PCI board of our own in the system, and our driver
  is running it. Obviously we have to suspect our own hardware first in
  these cases; however, we’ve had no other trouble from the driver in at
  least 3 months.
- The system is a Tyan S2892 motherboard with nVidia RAID on the
  board. We’re using the RAID, and after both of these failures I’ve
  seen the RAID controller get confused. It seems to believe that it’s
  got two separate degraded arrays and doesn’t rebuild on its own.
  Instead, I have to manually delete the second array and force the
  extra disk into the first so it will rebuild. This behaviour leads me
  to also suspect the RAID system.
Windows is XP Pro 32-bit (x86) SP2.
Debugging:
So far I’ve tried turning on keyboard crash dumping, but due to
testing requirements (the boss wants us testing on exactly the
shipping configuration), I’m unable to leave it turned on all the
time, so it wasn’t enabled during the latest hang. The same goes for
the PCI Dump board that I have. Similarly, I can’t leave debugger
support enabled all the time.
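For anyone unfamiliar with the keyboard crash dump setting mentioned
above, here is a minimal Python sketch that generates the .reg text for
turning it on or off. It assumes a PS/2 keyboard handled by the i8042prt
driver, which is an assumption about the setup rather than something
stated above; the function name is made up for illustration. The value
only takes effect after a reboot, and the dump is then triggered by
holding the right Ctrl key and pressing Scroll Lock twice.

```python
# Assumption: PS/2 keyboard, so the setting lives under the i8042prt driver.
I8042_KEY = (
    "HKEY_LOCAL_MACHINE\\SYSTEM\\CurrentControlSet"
    "\\Services\\i8042prt\\Parameters"
)


def crash_on_ctrl_scroll_reg(enabled):
    """Return REGEDIT4-format text setting CrashOnCtrlScroll to 1 or 0.

    The function name is hypothetical; only the key path and value name
    come from the Windows keyboard crash dump feature itself.
    """
    value = 1 if enabled else 0
    lines = [
        "REGEDIT4",                               # ANSI .reg file header
        "",
        "[" + I8042_KEY + "]",
        '"CrashOnCtrlScroll"=dword:%08x' % value,  # REG_DWORD, 1 = enabled
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # Print the "enable" variant; save it as a .reg file and import it
    # with regedit, then reboot for the change to take effect.
    print(crash_on_ctrl_scroll_reg(True))
```

Generating both the "on" and "off" variants ahead of time would at least
make it quick to switch a test box in and out of the shipping
configuration.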
Anyone have any other good suggestions on how to gain information when
this occurs? I’d love to know what the heck the system is doing, so
that I can either fix or absolve my driver.
Frankly, if anyone can come up with a good idea on how I can
exacerbate the problem so that I can get it to die more than once
every month and a half, I’d love to hear it.
Thanks!
--
Michael Kohne
xxxxx@kohne.org