Debugging Windows 'hang'

I need some ideas on how to debug a Windows hang that we’ve seen twice
in the last 45 days.

The error:
When the error occurs, the system doesn’t seem to do anything in
response to user input (mouse or keyboard).
You can move the mouse and the pointer moves immediately, but when
you click, nothing happens on screen. (If you wait long enough, say
5 minutes, you might see a window move or start to respond. Note that
when a window does respond, the action completes instantly; it just
happens 5 minutes after you hit the ‘minimize’ box.)

Ctrl-Alt-Delete has no effect from the keyboard. The Num Lock and
Caps Lock LEDs do properly (and instantly) follow the key presses.

Our apps aren’t responding, nor does the system seem to respond to TCP
traffic of any kind (including ping from the local network). I didn’t
have the mac address or anything to try fooling with ARP.

Two things of note:

  1. We have a custom PCI board of our own in the system, and our driver
    is running it. Obviously we have to suspect our own hardware first in
    these cases; however, we’ve had no other trouble from the driver in at
    least 3 months.

  2. The system is a Tyan S2892 motherboard with nVidia RAID on the
    board. We’re using the RAID, and after both of these failures I’ve
    seen the RAID controller get confused. It seems to believe that it’s
    got two separate degraded arrays and doesn’t rebuild on its own.
    Instead, I have to manually delete the second array and force the
    extra disk into the first so it will rebuild. This behaviour leads me
    to also suspect the RAID system.

Windows is XP Pro SP2, 32-bit.

Debugging:
So far I’ve tried turning on keyboard crash dumping, but due to
testing requirements (the boss wants us testing on exactly the
shipping configuration), I’m unable to leave it turned on all the
time, so it wasn’t enabled when the latest hang occurred. The same
goes for the PCI Dump board that I have. Similarly, I can’t leave
debugger support enabled all the time.

Anyone have any other good suggestions on how to gain information when
this occurs? I’d love to know what the heck the system is doing, so
that I can either fix or absolve my driver.

Frankly, if anyone can come up with a good idea on how I can
exacerbate the problem so that I can get it to die more than once
every month and a half, I’d love to hear it.

Thanks!


Michael Kohne
xxxxx@kohne.org

This may be a stupid suggestion, but nevertheless: open Task Manager,
keep it running in the foreground, and display the “Processes” tab.
When the trouble happens, you will see which process is using all the
CPU time. If no process is misbehaving, switch to the “Performance”
tab (and select “Show Kernel Times” from the View menu) and wait until
the trouble happens again. In my experience, such things happen when
the system has to use the swap file intensively because one program or
another allocates (and does not free) a huge amount of memory or other
resources such as handles.
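
Task Manager only shows the state at the moment someone looks at it. If handle counts are noted down at regular intervals (by hand from the Handles column, or by any logging tool), a leak of the kind described above shows up as a count that only ever grows. The following is a minimal sketch of that check; the process names and numbers are made up for illustration, and how the samples are collected is left open.

```python
def steadily_growing(counts, min_samples=4):
    """True if the series never decreases and ends higher than it starts."""
    if len(counts) < min_samples:
        return False  # too few samples to call it a trend
    non_decreasing = all(b >= a for a, b in zip(counts, counts[1:]))
    return non_decreasing and counts[-1] > counts[0]


def suspects(samples):
    """samples maps a process name to handle counts taken at regular intervals.

    Returns the names whose counts grow steadily, sorted alphabetically.
    """
    return sorted(name for name, counts in samples.items()
                  if steadily_growing(counts))


# Hypothetical samples, one reading per hour.
example = {
    "app.exe": [100, 180, 260, 400],       # only ever grows: worth a look
    "explorer.exe": [500, 480, 510, 495],  # fluctuates: probably fine
    "svc.exe": [50, 50, 50, 50],           # flat
}
print(suspects(example))
```

A steadily growing count is only a hint, not proof of a leak; a process that is legitimately still warming up looks the same over a short window.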

C.

----- Original Message -----
From: “Michael Kohne”
To: “Windows System Software Devs Interest List”
Sent: Wednesday, January 17, 2007 4:16 PM
Subject: [ntdev] Debugging Windows ‘hang’

> Questions? First check the Kernel Driver FAQ at http://www.osronline.com/article.cfm?id=256
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Take a look at the event log after this happens and see if there are any
errors reported from the storage system.

There are any number of things that can cause the system to lock up like
that, but one of them can be if the storage system hangs up. Paging I/O
backs up and while the mouse moves (because it’s done in the video card)
nothing visual can happen because the code to make it happen inevitably
hits a page fault and blocks.
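
One cheap way to test the storage-stall theory without changing the system configuration is a small user-mode probe that times a forced write to disk at intervals. The sketch below is an assumption about how such a probe might look, not something from the shipping system; the function names, the 4 KB payload, and the one-second threshold are all arbitrary choices.

```python
import os
import time


def timed_write(path, payload=b"x" * 4096):
    """Write payload to path, force it to disk, and return the elapsed seconds."""
    start = time.monotonic()
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    return time.monotonic() - start


def probe(path, samples=3, interval=0.0, slow_threshold=1.0):
    """Take several timed writes; return (all latencies, the slow ones)."""
    latencies = []
    for _ in range(samples):
        latencies.append(timed_write(path))
        time.sleep(interval)
    slow = [t for t in latencies if t > slow_threshold]
    return latencies, slow
```

Run in a loop with an interval of a few seconds, a probe like this would show write latencies climbing before a stall. If the storage stack hangs completely, the probe itself will block inside the write, so the last value it managed to report marks roughly when the stall began.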

Given your description of the nvidia controller getting confused
afterwards it could very easily be a disk problem which causes the
controller to stall while it tries to recover.

Do you have another machine without that controller which you can use to
test with?

Alternately I’d check the log files, and if you see anything storage
related see if you can convince your boss that his resistance to letting
you change the machine config to track down a problem (particularly a
minimal change like enabling keyboard bugcheck) means that your product
will ship with a known bug.
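
To show how small that change is: on a machine with a PS/2 keyboard, the keyboard bugcheck is controlled by a single DWORD value, CrashOnCtrlScroll, under the i8042prt service key. The sketch below only generates the text of a .reg file for it; it assumes a PS/2 keyboard (USB keyboards are handled by a different driver and may not support this on XP), and the helper name is made up.

```python
# Registry key used by the PS/2 keyboard port driver (i8042prt).
REG_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters"


def crash_on_ctrl_scroll_reg(enabled=True):
    """Return .reg file text that turns the keyboard-initiated crash on or off."""
    value = 1 if enabled else 0
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{REG_KEY}]\n"
        f'"CrashOnCtrlScroll"=dword:{value:08x}\n'
    )


print(crash_on_ctrl_scroll_reg())
```

After importing the file and rebooting, holding the right Ctrl key and pressing Scroll Lock twice forces a bugcheck. What kind of dump gets written still depends on the Startup and Recovery settings.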

-p

Michael Kohne wrote:

> So far I’ve tried turning on keyboard crash dumping, but due to
> testing requirements (the boss wants us testing on exactly the
> shipping configuration), I’m unable to leave it turned on all the
> time, thus it wasn’t present on the latest crash.

This sounds like a management problem, not a development problem (at
least not yet). Clearly, this system isn’t ready for final burn-in
testing yet. Here are your choices:

  • debugging support is left enabled, so this problem gets examined if
    it happens again
    or
  • the system is tested without debugging support even though you have
    known failure cases

Cheers,

/ h+

OK, I should have thought of that. Thanks. I’m sure I can convince the
powers that be that we can safely leave the bloody task manager open.
It might tell us something.

Thank you.

On 1/17/07, Christiaan Ghijselinck wrote:
>
> This may be a stupid suggestion, but nevertheless: open Task Manager,
> keep it running in the foreground, and display the “Processes” tab.
> When the trouble happens, you will see which process is using all the
> CPU time. If no process is misbehaving, switch to the “Performance”
> tab (and select “Show Kernel Times” from the View menu) and wait until
> the trouble happens again. In my experience, such things happen when
> the system has to use the swap file intensively because one program or
> another allocates (and does not free) a huge amount of memory or other
> resources such as handles.
>
> C.
>


Michael Kohne
xxxxx@kohne.org

On 1/17/07, Peter Wieland wrote:
> Take a look at the event log after this happens and see if there are any
> errors reported from the storage system.

I get no errors. The only way I know when the trouble occurred is that
I look at the last event log entries from our apps.
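
Inferring the time of the hang from the last application log entry can be made more precise with a dedicated heartbeat: a tiny process that appends a timestamp to a file every second or so, with the gaps located afterwards. This is a minimal sketch under that assumption; the file format, function names, and the factor of 5 are illustrative choices, not part of the existing apps.

```python
import time


def append_heartbeat(path, now=None):
    """Append one timestamp (seconds since the epoch) as a line of text."""
    stamp = time.time() if now is None else now
    with open(path, "a") as f:
        f.write(f"{stamp:.3f}\n")


def read_heartbeats(path):
    """Read the timestamps back as floats, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def find_gaps(stamps, expected=1.0, factor=5.0):
    """Return (previous, next) pairs whose spacing exceeds expected * factor."""
    limit = expected * factor
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b - a > limit]
```

With a one-second heartbeat, a five-minute stall like the one described would appear as a single pair roughly 300 seconds apart. If the heartbeat file lives on the RAID volume and the storage stack is what hangs, the writes stall along with everything else, which is itself a useful data point.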

> There are any number of things that can cause the system to lock up like
> that, but one of them can be if the storage system hangs up. Paging I/O
> backs up and while the mouse moves (because it’s done in the video card)
> nothing visual can happen because the code to make it happen inevitably
> hits a page fault and blocks.

Another engineer and I came to this possibility earlier in the day as
well. Now it sounds even more likely to me. I’m going to abuse the
on-board RAID next week sometime. It may tell me something.

> Given your description of the nvidia controller getting confused
> afterwards it could very easily be a disk problem which causes the
> controller to stall while it tries to recover.
>
> Do you have another machine without that controller which you can use to
> test with?

Sadly, perhaps not. Due to penny pinching, we are somewhat resource
constrained. I’ll have to try for it though. If I can get the
resources, I’ll disable RAID on one unit (they are mirrored, so this
is pretty easy). Unfortunately, due to the length of time needed to
reproduce (once every 45 days or so), we may have trouble convincing
ourselves that this is the problem.

> Alternately I’d check the log files, and if you see anything storage
> related see if you can convince your boss that his resistance to letting
> you change the machine config to track down a problem (particularly a
> minimal change like enabling keyboard bugcheck) means that your product
> will ship with a known bug.
>
> -p

Sadly, nothing useful in the event logs. I’m going to bring up the
fact that we have already shipped with this problem (customer #1
already has his system), so we really need to get to the bottom of it.

Unfortunately, we tend to be resource constrained, and spend way too
much of our time allocating units to various development efforts. I’m
going to have to find some way to justify him giving me a couple of
systems for 2-3 months or let me have the test folk turn on keyboard
crash dump support on all test units. Perhaps I’ll at least get the
dumps. Of course, if it is the RAID controller, then I’m not going to
get the dump, am I? That should be interesting.

Thanks for all your help.


Michael Kohne
xxxxx@kohne.org

Michael Kohne wrote:

>> Given your description of the nvidia controller getting confused
>> afterwards it could very easily be a disk problem which causes the
>> controller to stall while it tries to recover.
>>
>> Do you have another machine without that controller which you can use to
>> test with?
>
> Sadly, perhaps not. Due to penny pinching, we are somewhat resource
> constrained. I’ll have to try for it though. If I can get the
> resources, I’ll disable RAID on one unit (they are mirrored, so this
> is pretty easy).

Who says the RAID controller gets confused AFTER the problem? Perhaps
the problem is the controller getting confused, causing the hangs?

Spending tens of thousands of engineer dollars to avoid spending a few
hundred dollars on hardware is a well-known problem, given that
engineers are already budgeted and “paid for,” but it’s still not a good
sign. If you can spend maybe $29, then getting a PCI IDE controller and
hooking up a disk to that, instead of using the NVIDIA controller, could
help.

Another thing to try is to ask the NVIDIA developer support guys. They
are usually reasonably responsive, and you can sign up to their
developer support program for free on their web site (www.nvidia.com).
They perhaps know of some issue with the specific controller and long
up-times.

Cheers,

/ h+

> see a process that acts in the bad way, let appear “Performance”
> (and select “Do Show Kernel times”) and wait again till the

No. ICMP responses are generated from DPC context, which does not
depend on thread scheduling or paging. So, if the machine does not
respond to pings, then something really serious (maybe hardware) has
occurred.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com