How does windbg break into the system?

Hi,

We are currently investigating a problem where our test machine completely
freezes and becomes inaccessible even with Windbg.

We know that a hardware device that continuously responds with retries on
the PCI bus will completely freeze the system. Now we’d like to better
understand the possible software situations that can prevent Windbg from
breaking in. When we break into the system, we sometimes break in on a CPU
running at a DIRQL. However, we did a test whereby we intentionally
deadlocked a CPU in an infinite while loop in our ISR. The system was then
inaccessible with Windbg, even on a dual-processor machine. Could it be that
our ISR was running at a higher DIRQL than the COMPORT interrupt? What
mechanism does the debugger use to break in?

The very same question was asked in this forum in August 2002, but no answer
came in.

Thanks,

Patrick

DIRQL is more of a theoretical concept than a hardware concept, at least
on the x86 platforms. In fact, while the processor is inside your ISR,
interrupts are disabled. You OWN that CPU, and nothing short of an NMI
can break in, including the serial port (or 1394) interrupt for the
debugger, or the debugger’s feeble attempts to freeze the other
processor. That’s why hardware-assist crowbars use NMI to break into a
debugger.

> From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of Tim Roberts
> Sent: Monday, February 07, 2005 4:26 PM
> To: Kernel Debugging Interest List
> Subject: Re: [windbg] How does windbg break into the system?
>
> DIRQL is more of a theoretical concept than a hardware concept, at least
> on the x86 platforms. In fact, while the processor is inside your ISR,
> interrupts are disabled. You OWN that CPU, and nothing short of an NMI
> can break in, including the serial port (or 1394) interrupt for the
> debugger, or the debugger’s feeble attempts to freeze the other
> processor.

That is incorrect: DIRQLs are mapped almost directly to
hardware (the PIC or local APIC on x86).

Yes, when an interrupt handler is called by the CPU, all
normal interrupts are disabled, but on Windows this first-stage
ISR enables interrupts on the CPU itself, masks all
interrupts with the current or lower priority in the interrupt controller,
and then calls the corresponding driver’s ISR.
That means that a device with a higher DIRQL can interrupt
the ISR of a device with a lower DIRQL.

BTW, on x64 (AMD64) SMP systems the Windows kernel uses
NMI IPIs to freeze other CPUs.

Dmitriy Budko, VMware
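
To make the priority rule in the reply above concrete, here is a toy Python model (not kernel code). It assumes the usual rule that an interrupt is taken only when its IRQL is strictly higher than the level the CPU is currently running at. The function name `deliver` and the DIRQL values 5 and 9 are made up for illustration.

```python
def deliver(irql_stack, incoming_irql):
    """Model one interrupt arriving at a CPU.

    irql_stack holds the IRQLs of the nested ISRs currently running,
    with the CPU's current IRQL on top. Returns True if the interrupt
    preempts the running code, False if it stays masked (pending).
    """
    if incoming_irql > irql_stack[-1]:
        irql_stack.append(incoming_irql)  # nested ISR now runs at this level
        return True
    return False


if __name__ == "__main__":
    stack = [0]               # 0 = PASSIVE_LEVEL, no ISR running
    print(deliver(stack, 5))  # hypothetical low-priority device ISR starts
    print(deliver(stack, 9))  # hypothetical higher-DIRQL device nests on top
    print(deliver(stack, 5))  # same or lower level stays masked
    print(stack)              # levels of the ISRs that are nested right now
```

In this model an ISR spinning forever at some DIRQL only shuts out interrupts at that level and below; anything assigned a higher level can still get in.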

This is a serial port; if the interrupts are masked at an IRQL that
causes the corresponding serial port interrupt to be masked, it will
never run. Synch level, for example, blocks most hardware interrupts;
the dispatcher database lock runs at synch level. The clock
interrupt is higher, but only just.

To debug, you could try using a dump switch (basically it generates an
NMI on the PCI bus) when you get into a situation like this. A dump is
generated if the appropriate registry parameter was set (read at boot
time, of course, not at system crash time).

Regards,

Tony

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com

Thanks. We will look into the dump switch option.

I would have imagined that the COMPORT interrupt would have been
reprogrammed as an NMI or at least a high-priority IRQL. Already, with a
DIRQL of 7 (which corresponds to a TPR value of 8x on my APIC x86 machine),
the debugger can’t break in (and interrupts are not masked!). Also, on a
dual-CPU machine, it doesn’t seem to help that one CPU is mostly idle.

Patrick

Note that the windbg interface does NOT use interrupts. It’s a polling
interface where the processor checks the COM port on a regular basis; this
is done after every clock interrupt, for example. Given that clock
interrupts run at a higher IRQL than any device, a break-in request should
be serviced even if your ISR is spinning at DIRQL.

As part of servicing the break-in, the kd code will send IPI-FREEZE to every
other processor and wait for them to respond. Again, the IPI interrupt runs
at a higher IRQL than almost everything else, so these should get through.
Even if the other processor is completely hung (or has interrupts disabled
globally), the KD code will eventually time out (~2-3 mins) and you will drop
into windbg, BUT you won’t be able to debug the other processors (and any
attempt to change the current processor will hang windbg).

So, even if one processor is spinning with interrupts disabled, you ought to
be able to get into windbg on the other processor eventually (but you really
have to wait a looooong time). Assuming you did wait long enough, I would
say the chances are that the PCI bus is hung, which will back up through the
front side bus and stop everything. Even an NMI can’t break through this;
you have to get the PCI bus and FSB unwedged before you can make any
progress.

/simgr

(FWIW, what you need in that case is some sort of hot-plug controller in the
path to the cause of the hang that can be kicked to master abort the
outstanding transactions. Does this sound far-fetched? Perhaps, but I use
one every day ;-)
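
The break-in sequence described above can be sketched as a small Python simulation. This is only a toy model under stated assumptions, not the kd implementation: the `Cpu` class, its fields and `try_break_in` are hypothetical names, the levels 28 and 29 are the x86 CLOCK1_LEVEL and IPI_LEVEL values quoted elsewhere in this thread, and it assumes that any CPU still taking clock ticks can do the polling.

```python
from dataclasses import dataclass

CLOCK_LEVEL = 28  # x86 CLOCK1_LEVEL as quoted in this thread
IPI_LEVEL = 29    # x86 IPI_LEVEL as quoted in this thread


@dataclass
class Cpu:
    irql: int                        # level the CPU is currently stuck or running at
    interrupts_enabled: bool = True  # False models interrupts disabled globally
    bus_wedged: bool = False         # True models a hung PCI bus / FSB transaction

    def accepts(self, level: int) -> bool:
        """Can an interrupt at `level` get through to this CPU?"""
        return (not self.bus_wedged
                and self.interrupts_enabled
                and level > self.irql)


def try_break_in(cpus):
    """Return (polling_cpu, frozen_cpus, unresponsive_cpus), or None.

    None means no CPU takes a clock tick, so the break-in request is
    never polled. Unresponsive CPUs are the ones the debugger would
    give up on after the freeze timeout.
    """
    poller = next((i for i, c in enumerate(cpus) if c.accepts(CLOCK_LEVEL)), None)
    if poller is None:
        return None
    others = [i for i in range(len(cpus)) if i != poller]
    frozen = [i for i in others if cpus[i].accepts(IPI_LEVEL)]
    unresponsive = [i for i in others if not cpus[i].accepts(IPI_LEVEL)]
    return poller, frozen, unresponsive


if __name__ == "__main__":
    # ISR spinning at DIRQL 7 on CPU 0, CPU 1 idle.
    print(try_break_in([Cpu(irql=7), Cpu(irql=0)]))
    # CPU 0 spinning with interrupts disabled: break-in lands on CPU 1.
    print(try_break_in([Cpu(irql=7, interrupts_enabled=False), Cpu(irql=0)]))
    # Bus wedged for everyone: nothing gets through.
    print(try_break_in([Cpu(irql=7, bus_wedged=True), Cpu(irql=0, bus_wedged=True)]))
```

The third case is the one that matches a machine that never drops into the debugger no matter how long you wait.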

Thanks for the correction, Simon; when I went looking for the polled
break-in code I couldn’t find it (I can see in the serial port driver
where it detects the use by the debugger and several other cases). It
has been a long time since I’ve looked at this path, and of course once
you pointed to the clock interrupt it was easy to find.

For the record: the IRQL for CLOCK1_LEVEL is 28 on the x86. Thus only
IPI_LEVEL (29), POWER_LEVEL (30) and HIGH_LEVEL (31) would prevent a
break-in from the kernel debugger. The clock tick code does in fact
check to see if a break has been requested from the kernel debugger and
if so it invokes the debugger directly.

Sounds like a useful piece of hardware there. Where do I get one? ;-)

Tony Mason
Consulting Partner
OSR Open Systems Resources, Inc.
http://www.osr.com
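
As a quick check of the numbers in the message above, here is a minimal Python sketch. The four level values are the x86 ones quoted in the message; the `Irql` enum and `clock_tick_masked` are hypothetical names, and the sketch assumes an interrupt is masked whenever the CPU is already at its level or higher.

```python
from enum import IntEnum


class Irql(IntEnum):
    # x86 values as quoted in the message above
    CLOCK1_LEVEL = 28
    IPI_LEVEL = 29
    POWER_LEVEL = 30
    HIGH_LEVEL = 31


def clock_tick_masked(current_irql: int) -> bool:
    """True if a CPU running at current_irql cannot take the clock tick.

    Assumption: an interrupt is masked at its own level and above, so
    CLOCK1_LEVEL itself also counts. The message above lists only the
    levels strictly above the clock.
    """
    return current_irql >= Irql.CLOCK1_LEVEL


if __name__ == "__main__":
    for level in Irql:
        print(level.name, clock_tick_masked(level))
    # A hypothetical device ISR at DIRQL 7 does not mask the clock tick.
    print("DIRQL 7", clock_tick_masked(7))
```

Under that assumption, no ordinary device DIRQL can keep the clock tick, and therefore the break-in poll, from running.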

Thanks Simon! Since my first posting a couple of days ago, we have
investigated further and we had come to the same conclusion.

Patrick
