Jan
-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of Burrr
Sent: Friday, June 1, 2018 11:03 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] System Lockup - NMI Crash Dump
Hi,
I am debugging a rather nasty issue that locks up the system (no response from the keyboard or mouse) and happens on some systems but not others. This is a Windows 10 system with several of our drivers, both PCIe and USB. I’ve tried several experiments to isolate the issue without much success. One of the methods we landed on to narrow down the issue is an NMI jumper available on the motherboard, which can force a crash dump: https://blogs.technet.microsoft.com/askperf/2009/01/23/two-minute-drill-nmi/
We can create a crash dump by asserting the NMI when the system is running; however, I can’t seem to create one when the system locks up. I did try to force one of my drivers to lock up by creating a pseudo condition, and I was able to create a crash dump using the NMI in that scenario.
My questions are:
1. What would cause a lockup where the NMI does not respond?
2. Would a driver be able to cause a lockup that would block the NMI from responding to the OS?
> Only the IPC thing there probably was a typo, it should be IPI (inter-processor interrupt)
Of course it was very obviously a typo, but I was literally floored by the OP’s reaction to Mark’s statement, particularly by the part concerning “debug stub sending NMI to all other processors”. Look what he said…
On my system I am already able to generate an NMI when the system is
working normally. However when the system locks up, it does not work.
I have attached a PCIe analyzer to see if there are any weird things
going on, but did not find anything useful.
There are 7 MSI interrupts and DMA transactions that are being used on
my PCIe driver. There are several other USB drivers that are used on
prior systems and redeployed here.
I am floored as to the reasons why the OS would not respond to the NMI.
Also what can cause such an event?
Rogue DMA and Bus Freeze are likely going to resolve to “Bus Freeze”. If
your resources include access to a pci(e) bus analyzer that is the best
path forward in my opinion.
Although plain old debug console logging can also be fruitful and is way
less expensive.
> From the thread, I gather possible choices for a freeze where the NMI
> doesn’t respond, are:
>
> Bus Freeze
> Rogue DMA request
> Interrupt Storm
>
> Is that right?
>
> Burrr
>
> On 6/2/2018 2:27 AM, Jan Bottorff wrote:
> > I’d suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112 It has some new and really interesting
> > strategies to debug hard lockups.
> >
> > Jan
>
> --
> NTDEV is sponsored by OSR
>
> Visit the list online at: <http://www.osronline.com/showlists.cfm?list=ntdev>
>
> MONTHLY seminars on crash dump analysis, WDF, Windows internals and
> software drivers!
> Details at http:
>
> To unsubscribe, visit the List Server section of OSR Online at <http://www.osronline.com/page.cfm?name=ListServer>
Or OS so corrupted by overwrite that it can’t handle the NMI
* Bob
Bob Ammerman
xxxxx@ramsystems.biz
716.864.8337
138 Liston St
Buffalo, NY 14223
www.ramsystems.biz
-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of xxxxx@outlook.com
Sent: Monday, June 4, 2018 8:11 AM
To: Windows System Software Devs Interest List
Subject: Re:[ntdev] System Lockup - NMI Crash Dump
From the thread, I gather possible choices for a freeze where the NMI doesn’t respond, are:
Bus Freeze
Rogue DMA request
Interrupt Storm
Is that right?
Burrr
On 6/2/2018 2:27 AM, Jan Bottorff wrote:
> I’d suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112 It has some new and really interesting strategies to debug hard lockups.
>
> Jan
+1 for what Mr. Roddy said, above. It is *exactly* what I was going to post.
Nothing substitutes for a bus analyzer, which often will let you root-cause a complex problem like this in minutes (or, you know, at least kick it back to the FPGA guys)… but it is amazing how very much you can discern using DbgPrint.
Another avenue, if you can work WITH your FPGA people, is to have them help with ChipScope or SignalTap… A good FPGA guy can get almost as much out of this as a proper bus analyzer.
Does the problem reproduce pretty quickly/easily? Also, do you have a kernel debugger attached?
If you add enough DbgPrints you should be able to figure out the last things your driver(s) did before the hang. Pay particular attention to the DMA transfers that you performed prior to the hang (particularly offsets, lengths, and physical addresses). I’ve definitely debugged problems like this that way, usually it ends up being a particular set of arguments that triggers an edge condition in my code or the hardware.
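As a rough illustration of the logging idea above, here is a minimal user-mode C sketch of a fixed-size trace ring that records the offset, length, and physical address of each DMA transfer, so the last few transfers before a hang can be inspected. The names (`dma_trace`, `dma_trace_record`, `dma_trace_dump`) and the record layout are hypothetical and not taken from any driver in this thread. In a real driver the output would presumably go through DbgPrint, and the index update would have to be made safe for concurrent callers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TRACE_SLOTS 8u  /* power of two, so the slot index can be masked */

/* One record per DMA transfer the driver programs (hypothetical layout). */
typedef struct {
    uint64_t seq;        /* monotonically increasing sequence number */
    uint64_t phys_addr;  /* physical (bus) address handed to the device */
    uint32_t offset;     /* offset within the buffer */
    uint32_t length;     /* transfer length in bytes */
} dma_trace_entry;

typedef struct {
    dma_trace_entry slots[TRACE_SLOTS];
    uint64_t next_seq;   /* total number of records written so far */
} dma_trace;

/* Record one transfer, overwriting the oldest entry once the ring is full.
 * Not thread-safe: this sketch assumes a single caller. */
static void dma_trace_record(dma_trace *t, uint64_t phys,
                             uint32_t offset, uint32_t length)
{
    assert(t != NULL);
    dma_trace_entry *e = &t->slots[t->next_seq & (TRACE_SLOTS - 1u)];
    e->seq = t->next_seq++;
    e->phys_addr = phys;
    e->offset = offset;
    e->length = length;
}

/* Print the surviving records, oldest first, and return how many there are. */
static unsigned dma_trace_dump(const dma_trace *t)
{
    uint64_t count = t->next_seq < TRACE_SLOTS ? t->next_seq : TRACE_SLOTS;
    uint64_t first = t->next_seq - count;

    for (uint64_t s = first; s < t->next_seq; s++) {
        const dma_trace_entry *e = &t->slots[s & (TRACE_SLOTS - 1u)];
        printf("dma #%llu phys=0x%llx off=%u len=%u\n",
               (unsigned long long)e->seq,
               (unsigned long long)e->phys_addr,
               (unsigned)e->offset, (unsigned)e->length);
    }
    return (unsigned)count;
}
```

Keeping the records in memory and printing them only on demand is one way to limit the timing impact of logging, which matters when the hang takes hours to reproduce.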
The problem is hard to create and takes anywhere from 2 hrs to 18 hrs to
create. I don’t have a kernel debugger attached. The reason is: If I do
anything to slow down the operation of the system, the problem takes
several days to occur.
Also I’ve noticed that the problem occurs sometimes when no DMA
operation is going on.
>The problem is hard to create and takes anywhere from 2 hrs to 18 hrs
Ugh. My condolences.
>I don’t have a kernel debugger attached
Regardless, I would recommend you test your driver with Driver Verifier DMA verification enabled. IF you have a problem with the DMA APIs, this will usually catch it quickly.
>Also I’ve noticed that the problem occurs sometimes
>when no DMA operation is going on.
Yes, but that doesn’t rule out the DMA. As Mr. Ammerman suggested, this could be a corrupted DMA operation resulting in an overwrite.
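One cheap guard against the kind of DMA overwrite discussed above is to check every address/length pair against the region the device is allowed to touch before programming the hardware. The sketch below is plain C under stated assumptions: `dma_window` and `dma_range_ok` are hypothetical names, and the idea of a single contiguous window is a simplification that may not match a scatter/gather design.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the region the device may read or write. */
typedef struct {
    uint64_t base;  /* first valid bus address */
    uint64_t size;  /* window size in bytes */
} dma_window;

/* Return true only if [addr, addr + len) lies entirely inside the window. */
static bool dma_range_ok(const dma_window *w, uint64_t addr, uint64_t len)
{
    if (len == 0 || len > w->size) {
        return false;
    }
    if (addr < w->base) {
        return false;
    }
    /* Compare offsets rather than end addresses so addr + len cannot wrap. */
    return (addr - w->base) <= (w->size - len);
}
```

A failed check could be logged or turned into an assertion on a test build, so a bad offset or length shows up at the point where it is computed and not hours later as a frozen machine.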
And debug console logging, even if it slows down reproduction to have the
debugger attached, would at least give you clues about what your driver was
doing around the time of the failure.
I’d dedicate a test system just to running with the debugger attached and
your driver logging its operations. Meanwhile pursue other paths.
> >The problem is hard to create and takes anywhere from 2 hrs to 18 hrs
>
> Ugh. My condolences.
>
> >I don’t have a kernel debugger attached
>
> Regardless, I would recommend you test your driver with Driver Verifier
> DMA verification enabled. IF you have a problem with the DMA APIs, this
> will usually catch it quickly.
>
> >Also I’ve noticed that the problem occurs sometimes
> >when no DMA operation is going on.
>
> Yes, but that doesn’t rule out the DMA. As Mr. Ammerman suggested, this
> could be a corrupted DMA operation resulting in an overwrite.
>
> Peter
> OSR
> @OSRDrivers
Before I saw this message I had started a test with the debugger attached to a system to see if I would be able to break into the debugger when the lockup occurred.
A lockup did occur, but I was unable to break into the debugger.
I restarted the test with some minimal logging from my main driver with the debugger attached.
Burrr
On 6/5/2018 5:57 PM, xxxxx@gmail.com wrote:
> And debug console logging, even if it slows down reproduction to have the debugger attached, would at least give you clues about what your driver was doing around the time of the failure.
>
> I’d dedicate a test system just to running with the debugger attached and your driver logging its operations. Meanwhile pursue other paths.
>
> Mark Roddy
>
> On Tue, Jun 5, 2018 at 2:23 PM xxxxx@osr.com wrote:
>> >The problem is hard to create and takes anywhere from 2 hrs to 18 hrs
>>
>> Ugh. My condolences.
>>
>> >I don’t have a kernel debugger attached
>>
>> Regardless, I would recommend you test your driver with Driver Verifier DMA verification enabled. IF you have a problem with the DMA APIs, this will usually catch it quickly.
>>
>> >Also I’ve noticed that the problem occurs sometimes
>> >when no DMA operation is going on.
>>
>> Yes, but that doesn’t rule out the DMA. As Mr. Ammerman suggested, this could be a corrupted DMA operation resulting in an overwrite.
I’ve tried all these, but it did not yield any red flags.
Any other ideas?
Burrr
> Would a driver be able to cause a lockup that would block the NMI from responding to the OS?
Of course. For example, consider what happens if it somehow corrupts the memory region that is occupied by the IDTs. In that case the NMI, just like any other interrupt, is out of luck completely.
IIRC, every CPU has its own IDT under Windows, but these IDTs are, apparently, still located in the same memory region. For example, all theoretically possible IDTs (256 possible IDTs * 256 IDT entries * 16 bytes per entry on a 64-bit system) would occupy only 1 MB, i.e. fit in the same large page. Therefore, a single relatively large write to the target area is going to screw up all of them in one go, and at that point you get exactly the scenario that you are describing, i.e. a sudden freeze of the system that cannot be resolved even by an NMI.
A driver can do this either directly, through a CPU write if the target area is not marked read-only in its PTE, or indirectly, through an errant DMA operation.
In general, I would suggest taking the “Occam’s razor” approach and investigating the most likely reasons and simplest theories before proceeding to more complex ones. In this particular case I would start with the theory of IDT corruption (first direct, then indirect) before proceeding to more complex scenarios (like a hardware-caused lockup which is, in turn, caused by a driver incorrectly programming its device).
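The size estimate in the post above can be checked with a few lines of C. This only reproduces the arithmetic; the figure of 256 IDTs is the poster's upper bound, and whether Windows really places all per-CPU IDTs contiguously is his assumption, not something this sketch verifies. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IDTS         256u  /* upper bound assumed in the post: one IDT per CPU */
#define IDT_ENTRIES      256u  /* interrupt vectors per IDT */
#define IDT_ENTRY_BYTES  16u   /* size of one x64 IDT gate descriptor */
#define LARGE_PAGE_BYTES (2u * 1024u * 1024u)  /* x64 large page: 2 MiB */

/* Total bytes needed if every possible IDT were packed back to back. */
static uint64_t idt_region_bytes(void)
{
    return (uint64_t)MAX_IDTS * IDT_ENTRIES * IDT_ENTRY_BYTES;
}

/* True if a region of the given size fits inside a single large page. */
static bool fits_in_large_page(uint64_t bytes)
{
    return bytes <= LARGE_PAGE_BYTES;
}
```

256 * 256 * 16 is 1,048,576 bytes, i.e. 1 MiB, so under those assumptions the whole set would indeed fit in one 2 MiB large page.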