System hang where even NMI won't work?

I have a system hang which has been very intermittent. We’ve had it happen
about 5 times now on 5 different machines (out of about 40) over the past
few months.

My usual step when I get a system hang is to short the ISA bus signal
IOCHCHK with ground so as to generate an NMI, which then pops me into
SoftICE. That has always worked fairly well.

Unfortunately, in this case, even that NMI does not pop me into SoftICE (or
generate a bluescreen), so the system seems to be hung in such a way that
even NMI is being ignored.

To me, this points to something wacky with the hardware (although note that
it has occurred on multiple systems), or the system is stuck at some point in
the kernel where it isn’t processing interrupts.

How do I go about debugging something like this? Recall that it is very
intermittent, so I can’t do something like put a processor emulator
(hardware ICE) on each machine and hope for it to happen again. I really
have to debug it post-mortem or while it is occurring. (Although I could,
of course, install some software on each system.)

Any ideas would be appreciated!


You are currently subscribed to ntdev as: $subst(‘Recip.EmailAddr’)
To unsubscribe send a blank email to leave-ntdev-$subst(‘Recip.MemberIDChar’)@lists.osr.com

This might not be relevant, but were those machines SMP machines with
more than one CPU enabled (local APIC architecture …)?

----- Original Message -----
From: “Taed Nelson”
To: “NT Developers Interest List”
Sent: Tuesday, May 15, 2001 1:10 AM
Subject: [ntdev] System hang where even NMI won’t work?


I have sometimes seen this kind of behavior when the PCI bus went out to
lunch for one reason or another. You can get a PCI analyzer, either an
instrument or a board that plugs into the bus and exports signals to a serial
port (with analyzer software on a second machine), or even an attachment to
your oscilloscope if you’re lucky enough to have one that supports it.
Sometimes even a Port 80 wire-wrap may help: you sprinkle your code with outs
to port 80. A logic analyzer may also help, especially one with a big buffer:
you look at bus transactions, and when the system hangs, you scan the history
buffer in the logic analyzer for anything out of the ordinary. I remember
once when we had a graphics chip bug where a new rev of the chip wouldn’t
accept two back-to-back IACK cycles, and it was a pig to find! In any case,
this may be a time when some hardware instrumentation can help you.
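
For the port 80 idea, here is a rough sketch of what "sprinkling outs" could
look like in C. It is only an illustration: the checkpoint names and values,
the NT_KERNEL_BUILD flag, and the fake_isr routine are all made up for this
example. In a real NT driver the write would go through WRITE_PORT_UCHAR from
the DDK; the user-mode stand-in below just records the last value so the
sketch can be compiled and tried anywhere.

```c
#include <stdint.h>

/* Hypothetical checkpoint codes; names and values are illustrative only. */
enum {
    CP_ISR_ENTER = 0x10,
    CP_ISR_EXIT  = 0x11,
    CP_DPC_ENTER = 0x20,
    CP_DPC_EXIT  = 0x21
};

#ifdef NT_KERNEL_BUILD /* hypothetical build flag, not a real DDK symbol */
#include <ntddk.h>
/* Kernel build: write the code to I/O port 0x80, where a POST card or a
   wire-wrapped latch can display it. */
#define CHECKPOINT(code) WRITE_PORT_UCHAR((PUCHAR)0x80, (UCHAR)(code))
#else
/* User-mode stand-in: remember the last code instead of touching hardware. */
static uint8_t g_port80;
#define CHECKPOINT(code) (g_port80 = (uint8_t)(code))
#endif

/* Hypothetical routine standing in for an interrupt service routine.
   If 'hang' is nonzero it returns early, imitating a path that never
   reaches the exit checkpoint. */
static int fake_isr(int hang)
{
    CHECKPOINT(CP_ISR_ENTER);
    if (hang) {
        return -1; /* last code left on port 80 is CP_ISR_ENTER */
    }
    CHECKPOINT(CP_ISR_EXIT);
    return 0;
}
```

After a hang, the last value latched on port 80 tells you which checkpoint
was reached most recently, which narrows down where to look. Writes to port
80 are cheap, but with only one byte visible you have to pick the codes so
that each one identifies a single place in the code.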

Hope this helps,

Alberto.

-----Original Message-----
From: Taed Nelson [mailto:xxxxx@vertical.com]
Sent: Monday, May 14, 2001 6:10 PM
To: NT Developers Interest List
Subject: [ntdev] System hang where even NMI won’t work?
