>Regarding the NMI, I’ve asked already on WINDBG list if I can get the system to respond to NMI in some other way than
>doing a hard reboot, such as calling a bugcheck, or anything that would allow windbg to break in. Any answers for that here?
There is a registry option (NMICrashDump, under HKLM\SYSTEM\CurrentControlSet\Control\CrashControl) that makes the OS take a crash dump when it gets an NMI; details at
https://support.microsoft.com/en-us/help/927069/how-to-generate-a-complete-crash-dump-file-or-a-kernel-crash-dump-file
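For reference, a minimal sketch of setting that option from Python. It assumes the value the linked article describes, NMICrashDump as a REG_DWORD set to 1 under the CrashControl key; the function names are my own, the direct write needs an elevated prompt, and the setting only takes effect after a reboot.

```python
import sys

# Key and value name as given in the linked KB article (927069).
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"
NMI_VALUE_NAME = "NMICrashDump"


def reg_file_text(enabled=True):
    """Return .reg file text that sets NMICrashDump (importable with regedit)."""
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[HKEY_LOCAL_MACHINE\\{CRASH_CONTROL_KEY}]\n"
        f'"{NMI_VALUE_NAME}"=dword:{int(enabled):08x}\n'
    )


def set_nmi_crash_dump(enabled=True):
    """Write the value directly. Windows only, needs administrator rights."""
    import winreg  # only exists on Windows, so import it lazily

    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, NMI_VALUE_NAME, 0, winreg.REG_DWORD, int(enabled))


if __name__ == "__main__":
    if sys.platform == "win32" and "--apply" in sys.argv:
        set_nmi_crash_dump(True)  # reboot afterwards
    else:
        # Default: just show the .reg text so nothing is changed by accident.
        print(reg_file_text(True))
```

Run without arguments it only prints the .reg text; the registry is written only on Windows with --apply.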
You also can’t just hook up some wire to a CPU pin; your motherboard has to have NMI generation support. Many/most server motherboards support this, a lot of desktop boards don’t. For older PCI (not PCIe) slot systems, there used to be an add-on card you could get that would trigger an NMI by driving the correct PCI bus pin for a few microseconds. PCIe doesn’t use physical signals like this; I believe a device can still generate an error indication, but it has to send the correct PCIe TLP. A lot of bigger servers just have admin/IPMI interface ways of triggering an NMI. Some of the desktop/small server motherboards with the Intel vPro management processor can, I believe, also force an NMI, though likely through an incredibly obtuse remote API (it’s one of the power control commands, like power on/power off/reboot/NMI).
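On servers with a BMC, the IPMI route can be as short as one ipmitool call. A hedged sketch, assuming ipmitool is installed on the admin machine, the BMC accepts IPMI-over-LAN, and the password is in the IPMI_PASSWORD environment variable; the host name and user below are placeholders, and the function names are my own. "chassis power diag" is ipmitool's diagnostic-interrupt command, which a BMC normally delivers to the CPUs as an NMI.

```python
import subprocess


def nmi_command(host, user):
    """Build the ipmitool argv that asks the BMC to pulse a diagnostic interrupt (NMI)."""
    return [
        "ipmitool",
        "-I", "lanplus",   # IPMI v2.0 over LAN
        "-H", host,        # BMC address, not the address of the OS being debugged
        "-U", user,
        "-E",              # read the password from $IPMI_PASSWORD
        "chassis", "power", "diag",
    ]


def send_nmi(host, user):
    """Run the command; raises CalledProcessError if ipmitool reports failure."""
    subprocess.run(nmi_command(host, user), check=True)


if __name__ == "__main__":
    # Placeholder values; print rather than send so nothing is triggered by accident.
    print(" ".join(nmi_command("bmc.example.test", "admin")))
```

Whether the target then bugchecks, breaks into the debugger, or just reboots still depends on how the OS is configured to handle the NMI.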
Some motherboards that support NMI generation don’t have a switch connected to the header pins, and the easy no-soldering solution is a switch+wire+header plug from Amazon: https://www.amazon.com/gp/product/B00E6NFL8I
The kernel debugger will not respond unless the target stub gets control of the CPU. If the system is spinning at, say, HIGH_LEVEL IRQL, this will never happen even though the CPU is not hung. I didn’t write the kernel debugger, but I also think it’s likely that if one core gets control but can’t gain control of all the other cores, e.g. via an IPI (inter-processor interrupt), the debugger will not be responsive.
Other ways to get CPU control: set a breakpoint early in an interrupt handler (dig through the vector table; !idt will dump it) and tell the debugger to continue running any time this breakpoint trips. You could also do this for the NMI handler or, if you’re fooling with the performance counters, for the interrupt from some counter set up to trigger eventually, but slowly. These kinds of auto-continued breakpoints can’t be allowed to trigger “too” often.
Another thing you might do is run your driver in a parent partition VM, and attach the debugger to the hypervisor. The parent partition generally has all the hardware resources passed through. I’ve never done this, so I can’t say if this is useful or just uselessly painful. I believe there was a message on this list just a few weeks ago about attaching the debugger to the hypervisor and getting a crash dump of a VM. There are also “thin” bare-metal hypervisors around (look on GitHub) that are more for system examination than for running multiple VMs. I’ve never done this either, although I have always thought wrapping the OS in a hypervisor for kernel debugging would be really useful.
If you try HARD and still don’t get the debugger to respond, the question that needs answering is: is the CPU really hung (not executing instructions), or is the CPU still executing instructions while the target debugger stub just never gets control? Truly hung CPUs are HARD to debug, because you can’t get any data after the hang. Unless I’m working on new or unusual hardware, a hung CPU is not high on my probability list. If this really is a CPU hang, everything you try to get the debugger to be responsive will fail.
A lot of people have reported that Windows kernel debugging over a legacy serial port is more reliable under difficult conditions, like debugging power transitions. It’s possible that whatever is happening in your case would also be more debuggable over a serial debug transport. The kernel debugger transports range from “just barely works sometimes” for a USB2 transport to “almost always positively has control”, which is more often the case with serial and with 1394 on exactly the correct card. Ethernet transports tend to fall in the “works pretty well, except when they don’t” category. I read that 1394 debugging is being dropped from the latest Win 10 release (boo), and legacy serial ports are becoming rare on newer chipsets.
Jan