Why windbg cannot break into target

Tim, what do you mean by a healthy system? You mean my computer before I freeze it?

I’m moving this discussion to the windbg list.

I have a laptop with an older AMD CPU, one that also has the special IBS feature. I’ll run my app there and see if it also freezes. I suspect it won’t, because my driver used to work fine on another computer I used to own, that had a third model AMD, also with IBS. However, I’ve changed the driver a lot since then, so I may have messed something up. I’ll ask around my network of friends to see if anyone has this model (It’s the AMD Bulldozer/Steamroller/etc. family) and would be willing to try my app on their machine. Any of you readers have one and would like to try it?

Regarding the NMI, I’ve asked already on WINDBG list if I can get the system to respond to NMI in some other way than doing a hard reboot, such as calling a bugcheck, or anything that would allow windbg to break in. Any answers for that here?

Thanks.

> Regarding the NMI, I’ve asked already on WINDBG list if I can get the system to respond to NMI in some other way than doing a hard reboot, such as calling a bugcheck, or anything that would allow windbg to break in. Any answers for that here?

There is a registry option you can set that will cause the OS to crash dump if it gets an NMI, details at

https://support.microsoft.com/en-us/help/927069/how-to-generate-a-complete-crash-dump-file-or-a-kernel-crash-dump-file

You also can’t just hook up a wire to a CPU pin; your motherboard has to have NMI generation support. Many or most server motherboards support this, but a lot of desktop boards don’t. For older PCI (not PCIe) slot systems, there used to be an add-on card you could get that would trigger an NMI by raising the correct PCI bus pin for a few microseconds. PCIe doesn’t use physical signals like this; I believe a device can still generate an error indication, but it has to generate the correct PCIe TLP. A lot of bigger servers just have admin/IPMI interface ways of triggering an NMI. Some of the desktop/small server motherboards with an Intel vPro management processor can, I believe, also force an NMI, likely through an incredibly obtuse remote API (it’s one of the power control commands, like power on/power off/reboot/NMI).

Some motherboards that support NMI generation don’t have a switch connected to the header pins, and the easy no soldering solution is a switch+wire+header plug from Amazon https://www.amazon.com/gp/product/B00E6NFL8I

The kernel debugger will not respond unless the target stub gets control of the CPU. If the system is spinning at, say, HIGH_LEVEL IRQL, this will never happen even though the CPU is not hung. I didn’t write the kernel debugger, but I also think it’s likely that if one core gets control but can’t gain control of all the other cores, for example via an IPI (interprocessor interrupt), the debugger will not be responsive.

Another way to get CPU control: set a breakpoint early in an interrupt handler (dig through the vector table) and tell the debugger to continue running any time this breakpoint trips. You could also do this for the NMI interrupt or, if you’re fooling with the performance counters, for some counter that will eventually, but slowly, trigger. These kinds of auto-continued breakpoints must not trigger too often.

Another thing you might do is run your driver in a parent partition VM and attach the debugger to the hypervisor. The parent partition generally has all the hardware resources passed through. I’ve never done this, so I can’t say if this is useful or just uselessly painful. I believe there was a message on this list just a few weeks ago about attaching the debugger to the hypervisor and getting a crash dump of a VM. There are also “thin” bare-metal hypervisors around (look on GitHub) that are more for system examination than running multiple VMs. I’ve never done this either, although I have always thought wrapping the OS in a hypervisor for kernel debugging would be really useful.

If you try HARD and don’t get the debugger to respond, the question that needs answering: is the cpu really hung (not executing instructions) or is the cpu still executing instructions but the target debugger stub just never gets control. Truly hung cpus are HARD to debug, because you can’t get any data after the hang. Unless I’m working on new or unusual hardware, a hung cpu is not high on my probability list. If this really is a cpu hang, everything you try to get the debugger to be responsive will fail.

A lot of people have reported that Windows kernel debugging over a legacy serial port is more reliable under difficult conditions, like debugging power transitions. It’s possible whatever is happening in your case would also be easier to debug over a serial debug transport. The kernel debugger transport connections range from “just barely works sometimes” for a USB2 transport to “almost always positively has control”, which is more often the case with serial and exactly the correct 1394 card transports. Ethernet transports tend to fall in the “works pretty well, except when they don’t” category. I read that 1394 debugging was being dropped from the latest Win 10 release (boo), and legacy serial ports are becoming rare on newer chipsets.

Jan

xxxxx@rolle.name wrote:

> Tim, what do you mean by a healthy system? You mean my computer before I freeze it?

Right.  In my view, it’s easier to debug if you can follow/trace/step
through the healthy system until it freezes, rather than try to conduct
a post-mortem analysis when useful information might already be gone. 
Sprinkle debug prints liberally throughout the code, and use the last
thing you get to refine the search.

> I’m moving this discussion to the windbg list.

I’m not sure that’s the best move.  You don’t have questions about
WinDbg, you have questions about kernel debugging.

> Regarding the NMI, I’ve asked already on WINDBG list if I can get the system to respond to NMI in some other way than doing a hard reboot, such as calling a bugcheck, or anything that would allow windbg to break in. Any answers for that here?

Yes, you can configure Windows to BSOD on NMI.
    https://support.microsoft.com/en-us/help/927069


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
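
The "use the last thing you get to refine the search" advice above can be partly automated on the host side. The following is only a sketch, assuming the driver emits an ordered series of trace markers and that the debugger output has been saved as lines of text; the function name and the marker strings are hypothetical and do not come from the thread.

```python
def last_checkpoint(expected_markers, captured_lines):
    """Return (last_seen, first_missing) for an ordered list of trace markers.

    expected_markers: markers in the order the driver is expected to print them.
    captured_lines:   debug output captured on the host before the freeze.
    """
    last_seen = None
    first_missing = None
    for marker in expected_markers:
        # A marker counts as reached if it appears anywhere in the captured output.
        if any(marker in line for line in captured_lines):
            last_seen = marker
        else:
            first_missing = marker
            break
    return last_seen, first_missing


if __name__ == "__main__":
    # Hypothetical markers and log, for illustration only.
    markers = ["IBS: enter", "IBS: config written", "IBS: dpc queued"]
    log = ["[0.10] IBS: enter", "[0.20] IBS: config written"]
    print(last_checkpoint(markers, log))
```

The freeze then lies somewhere between the two returned markers, which is where the next round of debug prints would go.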

I rigged some jumper wires to an external header from the MB reset pins,
then I can short the two pins on the header with a screwdriver.  If I do
this while I have my system hang, it causes an immediate reboot.

Can I assume that this sends an NMI to the CPU, or does the BIOS just
shut off the CPU and do the reboot?

I did this while running windbg on the host, and windbg didn’t report
anything special other than reestablishing connection after the reboot
took place.

Is it possible to use windbg to set a breakpoint when NMI is received,
and that might allow windbg to send commands?

On 3/7/2018 11:35 AM, xxxxx@probo.com wrote:

> > Do you think it would help if I wire an external switch to the NMI input on my CPU?
> That can break in to SOME kinds of failures, but not all.  You’d still
> have a pretty big forensic analysis job.  I think you’d be better served
> to start from a healthy system and figure out where it goes awry.


Michael Rolle
xxxxx@rolle.name
408-313-8149

Great, Tim, thanks.  And I’ll keep this discussion on NTDEV.  If I have
anything specifically about windbg, I’ll use WINDBG list for that.

On 3/8/2018 11:39 AM, xxxxx@probo.com wrote:

xxxxx@rolle.name wrote:
> Tim, what do you mean by a healthy system? You mean my computer before I freeze it?
Right.  In my view, it’s easier to debug if you can follow/trace/step
through the healthy system until it freezes, rather than try to conduct
a post-mortem analysis when useful information might already be gone.
Sprinkle debug prints liberally throughout the code, and use the last
thing you get to refine the search.

> I’m moving this discussion to the windbg list.
I’m not sure that’s the best move.  You don’t have questions about
WinDbg, you have questions about kernel debugging.

> Regarding the NMI, I’ve asked already on WINDBG list if I can get the system to respond to NMI in some other way than doing a hard reboot, such as calling a bugcheck, or anything that would allow windbg to break in. Any answers for that here?
Yes, you can configure Windows to BSOD on NMI.
    https://support.microsoft.com/en-us/help/927069


Michael Rolle
xxxxx@rolle.name
408-313-8149

I looked at that link at

https://support.microsoft.com/en-us/help/927069

The Registry instructions didn’t quite match my experience. However, I added a value to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl called NMICrashDump with a value of 1. I trust this is correct, please tell me if it isn’t. I’m going to reboot now and see if my reset button generates the crash dump.


Michael Rolle
xxxxx@rolle.name
408-313-8149
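
The registry change described above can also be written as a .reg file, so it is easy to reapply on another test machine. This is a minimal sketch, assuming the value is a REG_DWORD (the message gives the key, the name, and the value 1, but not the type); the helper name is hypothetical. It only builds the file text and does not touch the registry.

```python
def nmi_crash_dump_reg(enabled=True):
    """Build the text of a .reg file that sets NMICrashDump under CrashControl."""
    value = 1 if enabled else 0
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]",
        # Assumption: the value is stored as a DWORD.
        '"NMICrashDump"=dword:%08x' % value,
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(nmi_crash_dump_reg())
```

The output can be saved to a file and imported with regedit on the target. As later replies in the thread point out, the setting only matters if the board has some way to assert an NMI.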

xxxxx@rolle.name wrote:

> I rigged some jumper wires to an external header from the MB reset
> pins, then I can short the two pins on the header with a screwdriver.
> If I do this while I have my system hang, it causes an immediate reboot.
>
> Can I assume that this sends an NMI to the CPU, or does the BIOS just
> shut off the CPU and do the reboot?

No, it doesn’t send an NMI, it does a reset.  The CPU state is cleared,
and it restarts at the reset vector.  That’s unrecoverable.

> Is it possible to use windbg to set a breakpoint when NMI is received,
> and that might allow windbg to send commands?

You can configure Windows to fire a BSOD upon receiving an NMI.  I sent
the KB link earlier today.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> I rigged some jumper wires to an external header from the MB reset pins,
> then I can short the two pins on the header with a screwdriver. If I do
> this while I have my system hang, it causes an immediate reboot.
>
> Can I assume that this sends an NMI to the CPU, or does the BIOS just
> shut off the CPU and do the reboot?

No that’s not asserting NMI. The BIOS isn’t causing the reboot either. This is a physical signal that causes the CPU to reset. It will definitely “unhang” your CPU, but it’s not going to break you into WinDbg.

My MB BIOS doesn’t have any setting relating to NMI, so it looks like there’s no way to generate one. I contacted support at ASUS for my B350M-A MB asking if there was a way to do that.

Anyone know a way to physically assert NMI with this MB? Without it, setting Windows to do a crash dump on NMI isn’t useful.

Thanks.

Do you need this for Instruction-Based Sampling (IBS), like you mentioned before?

//Daniel

If you are on Windows 10 RS2 or above, you could try enabling Hyper-V on your test machine and running your scenario on the root partition, then seeing whether you observe a HYPERVISOR_WATCHDOG_TIMEOUT bugcheck a few minutes after the system has hung. If so, then there may be something useful that may be discoverable from the resultant crash dump (it is also possible that this might bring KD back to life if KD is connected, but that will depend on the state of the machine).

RS2 and above implement a synthetic watchdog in the hypervisor, that can catch *some* system hangs in kernel mode code (e.g. where all processors are wedged with interrupts disabled in kernel mode code), where the underlying Hyper-V hypervisor is still running.

This does rely on the hypervisor still being functional which depends on the state of the machine. Typically, “purely software” bugs in the root partition would not commonly block the hypervisor from continuing to execute, but hardware issues where a processor is physically hung, e.g. if someone is not responding to a bus transaction and hangs the system up etc., may not be usefully caught by this mechanism.

Note: Some processor features might not be exposed by Hyper-V. I’m not sure offhand if AMD IBS is exposed or masked. If not, then this would not work for your driver, but it is a general technique that may be applicable to other problems, in lieu of hardware with a physical NMI injection capability.

  • S (Msft)
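
The watchdog idea described above can be illustrated with a toy model. This is a sketch of the general technique only, not of how Hyper-V implements it: the class name, the explicit clock values, and the 120-second timeout are all assumptions made for illustration. The monitored side checks in periodically, and an independent monitor declares a hang when check-ins stop.

```python
class Watchdog:
    """Toy watchdog: the monitored side must check in within timeout_s seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_checkin = 0.0

    def checkin(self, now):
        # Called by the monitored system while it is still making progress.
        self.last_checkin = now

    def expired(self, now):
        # Called by the independent monitor; True means "treat as hung".
        return (now - self.last_checkin) > self.timeout_s


if __name__ == "__main__":
    wd = Watchdog(timeout_s=120)
    wd.checkin(now=10.0)
    print(wd.expired(now=100.0))  # still within the timeout
    print(wd.expired(now=131.0))  # no check-in for more than 120 seconds
```

The model also shows the limitation noted above: it only helps while the monitor itself can still run.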

-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of xxxxx@rolle.name
Sent: Thursday, March 08, 2018 2:51 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Why windbg cannot break into target

My MB BIOS doesn’t have any setting relating to NMI, so it looks like there’s no way to generate one. I contacted support at ASUS for my B350M-A MB asking if there was a way to do that.

Anyone know a way to physically assert NMI with this MB? Without it, setting Windows to do a crash dump on NMI isn’t useful.

Thanks.


NTDEV is sponsored by OSR

Visit the list online at: https:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at https:

To unsubscribe, visit the List Server section of OSR Online at https:

An NMI is TOTALLY different from a reset. Rebooting is the correct response to a reset, even if you enable NMI crash dumps. You would need to short across the NMI pins, if they exist on your motherboard.

Jan

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@rolle.name
Sent: Thursday, March 8, 2018 1:34 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Why windbg cannot break into target

I rigged some jumper wires to an external header from the MB reset pins, then I can short the two pins on the header with a screwdriver. If I do this while I have my system hang, it causes an immediate reboot.

Can I assume that this sends an NMI to the CPU, or does the BIOS just shut off the CPU and do the reboot?

I did this while running windbg on the host, and windbg didn’t report anything special other than reestablishing connection after the reboot took place.

Is it possible to use windbg to set a breakpoint when NMI is received, and that might allow windbg to send commands?

On 3/7/2018 11:35 AM, xxxxx@probo.com wrote:
> Do you think it would help if I wire an external switch to the NMI input on my CPU?
> That can break in to SOME kinds of failures, but not all. You’d still
> have a pretty big forensic analysis job. I think you’d be better served
> to start from a healthy system and figure out where it goes awry.
>


Michael Rolle
xxxxx@rolle.name
408-313-8149



As was mentioned by Jan Bottorff, there is at least one NMI add-in card available. I’ve never used it myself, as I convinced my hardware folks to include the feature in our PCI card, but it’s certainly the same feature that I use fairly often, so I know it works great as long as the registry setting NMICrashDump is set to 1.

Here’s the card that I know of for PCI:
http://connecttech.com/product/pci-dump-switch-card/

And for PCIe:
http://connecttech.com/product/pci-express-dump-switch-card/

I do not know how much they cost.

Yeah, I mentioned the NMI board as well, although it might not work given
the situation. Still, unless it costs a ridiculous amount, it’s worth a try.

Mark Roddy

On Fri, Mar 9, 2018 at 2:44 PM, xxxxx@vertical.com
wrote:

> As was mentioned by Jan Bottorff, there is at least one NMI add-in card
> available. I’ve never used it myself, as I convinced my hardware folks to
> include the feature in our PCI card, but it’s certainly the same feature
> that I use fairly often, so I know it works great as long as the registry
> setting NMICrashDump is set to 1.
>
> Here’s the card that I know of for PCI:
> http://connecttech.com/product/pci-dump-switch-card/
>
> And for PCIe:
> http://connecttech.com/product/pci-express-dump-switch-card/
>
> I do not know how much they cost.

xxxxx@vertical.com wrote:

> As was mentioned by Jan Bottorff, there is at least one NMI add-in card available. I’ve never used it myself, as I convinced my hardware folks to include the feature in our PCI card, but it’s certainly the same feature that I use fairly often, so I know it works great as long as the registry setting NMICrashDump is set to 1.

Although this will work to break through CPU lock-ups, if the problem is
actually a bus freeze, it won’t help.

In the meantime, are you adding in debug traces, so you can at least see
how far it gets before it locks up?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

What a lot of information here. Let me respond to various items…

  1. Yes, this is for the AMD IBS on a Ryzen 3.

  2. I’m in the process of adding debug traces.

  3. Before that, I am going to run my test app and driver on my laptop, which has IBS hardware too, but it is an older (AMD Bobcat family 14h) model. I want to see if I get the hangup there too. And I have a friend who has a Ryzen 7, to try it out there. If I don’t get the hangup there, that would point the finger of suspicion at the Ryzen 3, which might possibly have a hardware bug that freezes the CPU.

  4. Interesting note that it only takes one CPU that doesn’t respond to an inter processor interrupt to keep the debug stub from answering a break command from the host. I had been wondering how the IBS hardware or my driver could possibly hang up all of the CPUs since it uses the hardware, and runs DPCs, only on one of them. Of course, if only one CPU is getting hung, could that also explain the entire system hanging?

  5. I’ll shop around locally for a PCIe card to generate the NMI. Thanks for the tip, I didn’t realize that NMI could be asserted over the PCIe bus.

  6. Regarding using a different debug transport other than Ethernet, I might try that if all else fails, provided that somebody can say that a different debug stub for that transport would be better at breaking into the hung system. I would think that the main difference between the stubs would be which IRQL they use on the transport device. But wouldn’t any debug transport run its interrupt vector at the highest possible IRQL so that it couldn’t get locked out?

On Mar 9, 2018, at 6:37 PM, xxxxx@rolle.name wrote:
>
> 6. Regarding using a different debug transport other than Ethernet, I might try that if all else fails, provided that somebody can say that a different debug stub for that transport would be better at breaking into the hung system.

No, it would not.

> I would think that the main difference between the stubs would be which IRQL they use on the transport device. But wouldn’t any debug transport run its interrupt vector at the highest possible IRQL so that it couldn’t get locked out?

Correct. The transport will not affect your issue.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

IPCs are used for “processor rendez-vous”. If everyone doesn’t show up, the
ones that do just hang out at the corral waiting. Forever. But not at a
level that would mask NMIs.

Mark Roddy

On Fri, Mar 9, 2018 at 9:37 PM, xxxxx@rolle.name wrote:

> What a lot of information here. Let me respond to various items…
>
> 1. Yes, this is for the AMD IBS on a Ryzen 3.
>
> 2. I’m in the process of adding debug traces.
>
> 3. Before that, I am going to run my test app and driver on my laptop,
> which has IBS hardware too, but it is an older (AMD Bobcat family 14h)
> model. I want to see if I get the hangup there too. And I have a friend
> who has a Ryzen 7, to try it out there. If I don’t get the hangup there,
> that would point the finger of suspicion at the Ryzen 3, which might
> possibly have a hardware bug that freezes the CPU.
>
> 4. Interesting note that it only takes one CPU that doesn’t respond to an
> inter processor interrupt to keep the debug stub from answering a break
> command from the host. I had been wondering how the IBS hardware or my
> driver could possibly hang up all of the CPUs since it uses the hardware,
> and runs DPCs, only on one of them. Of course, if only one CPU is getting
> hung, could that also explain the entire system hanging?
>
> 5. I’ll shop around locally for a PCIe card to generate the NMI. Thanks
> for the tip, I didn’t realize that NMI could be asserted over the PCIe bus.
>
> 6. Regarding using a different debug transport other than Ethernet, I
> might try that if all else fails, provided that somebody can say that a
> different debug stub for that transport would be better at breaking into
> the hung system. I would think that the main difference between the stubs
> would be which IRQL they use on the transport device. But wouldn’t any
> debug transport run its interrupt vector at the highest possible IRQL so
> that it couldn’t get locked out?

Thanks, Tim. Like Thomas Edison, I know two more ways not to make a light bulb.

Mark, I don’t understand your message. What is an IPC? Inter process communication? Are you explaining that if any one processor doesn’t respond to the debug stub, then the stub cannot break in? And why would NMI be any different? Wouldn’t the debug stub send an NMI to all the other processors to stop them?

By the way, the link to connecttech.com for the PCIe dump switch card led me to a page where I could email them for a quote. In response, they sent me a link to their distribution partner. Here it is: http://www.wdlsystems.com/Box-PC/?search=adg018. The card sells for $133.00.

I’m inclined to pass up the opportunity at that price. But I’m putting it here for other more serious hardware developers to see.

If your thing is preventing “InterProcessor Communication” from completing by
keeping one or more processors out of the rendez-vous, the system will be
hung but it would respond to an NMI. The NMI only has to get to one CPU.
IPC is interprocessor communication, triggered by an IPI (interprocessor interrupt).

Mark Roddy
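
The rendez-vous behaviour described above can be modelled in user mode with threads standing in for processors. This is a sketch under stated assumptions, not kernel code: the function name is hypothetical, a `threading.Barrier` stands in for the rendez-vous point, and a timeout is used so the demonstration finishes, whereas the processors in the description above would wait forever.

```python
import threading


def rendezvous_completes(num_cpus, stuck_cpus, timeout=0.2):
    """Return True if every simulated CPU reaches the rendezvous point."""
    barrier = threading.Barrier(num_cpus)
    arrived = []
    lock = threading.Lock()

    def cpu(index):
        if index in stuck_cpus:
            # Models a CPU that never takes the IPI, so it never shows up.
            return
        try:
            barrier.wait(timeout=timeout)
            with lock:
                arrived.append(index)
        except threading.BrokenBarrierError:
            # Stand-in for "waiting at the corral forever".
            pass

    threads = [threading.Thread(target=cpu, args=(i,)) for i in range(num_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(arrived) == num_cpus


if __name__ == "__main__":
    print(rendezvous_completes(4, stuck_cpus=set()))
    print(rendezvous_completes(4, stuck_cpus={2}))
```

One missing participant is enough to stall all the others. An NMI differs in that it only needs to be taken by one processor, which is why it can still get through in this situation.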

On Sat, Mar 10, 2018 at 10:03 PM, xxxxx@rolle.name wrote:

> Thanks, Tim. Like Thomas Edison, I know two more ways not to make a light
> bulb.
>
> Mark, I don’t understand your message. What is an IPC? Inter process
> communication? Are you explaining that if any one processor doesn’t
> respond to the debug stub, then the stub cannot break in? And why would
> NMI be any different? Wouldn’t the debug stub send an NMI to all the other
> processors to stop them?
>
> By the way, the link to connecttech.com for the PCIe dump switch card led
> me to a page where I could email them for a quote. In response, they sent
> me a link to their distribution partner. Here it is:
> http://www.wdlsystems.com/Box-PC/?search=adg018. The card sells for
> $133.00.
>
> I’m inclined to pass up the opportunity at that price. But I’m putting it
> here for other more serious hardware developers to see.