System Lockup - NMI Crash Dump

I’d suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112 which has some new and really interesting strategies for debugging hard lockups.

Jan
-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of Burrr
Sent: Friday, June 1, 2018 11:03 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] System Lockup - NMI Crash Dump

Hi,

I am debugging a rather nasty issue that locks up the system (no response from the keyboard or mouse) and happens on some systems but not others. This is a Windows 10 system with several of our drivers, both PCIe and USB. I’ve tried several experiments to isolate the issue without much success. One method we landed on to narrow it down is an NMI jumper on the motherboard, which can force a crash dump.
https://blogs.technet.microsoft.com/askperf/2009/01/23/two-minute-drill-nmi/

We can create a crash dump by asserting the NMI while the system is running normally; however, I can’t seem to create one once the system has locked up.
I did try forcing one of my drivers to lock up by creating an artificial condition, and I was able to create a crash dump using the NMI in that scenario.

My questions are:
1. What would cause a lockup where the NMI does not respond?
2. Would a driver be able to cause a lockup that would block the NMI from responding to the OS?

Any insights would be appreciated,

Thanks,
Burrr

> I’d suggest reading the thread at
> http://www.osronline.com/showThread.CFM?link=288112 It has some new and really
> interesting strategies to debug hard lockups.

Only the IPC thing there was probably a typo; it should be IPI (inter-processor interrupt)?

– pa

> Only the IPC thing there was probably a typo; it should be IPI (inter-processor interrupt)

Of course it was very obviously a typo, but I was literally floored by the OP’s reaction to Mark’s statement, particularly by the part concerning “debug stub sending NMI to all other processors”. Look what he said…

Anton Bassov

Thanks much for the information.

On my system I am already able to generate an NMI when the system is
working normally. However, when the system locks up, it does not work.

I have attached a PCIe analyzer to see if there are any weird things
going on, but did not find anything useful.
There are 7 MSI interrupts and DMA transactions that are being used on
my PCIe driver. There are several other USB drivers that are used on
prior systems and redeployed here.

I am floored as to the reasons why the OS would not respond to the NMI.
Also what can cause such an event?

Burrr


From the thread, I gather possible choices for a freeze where the NMI
doesn’t respond, are:

Bus Freeze
Rogue DMA request
Interrupt Storm

Is that right?

Burrr


Rogue DMA and Bus Freeze are likely going to resolve to “Bus Freeze”. If
your resources include access to a PCI(e) bus analyzer, that is the best
path forward in my opinion.

Although plain old debug console logging can also be fruitful and is way
less expensive.

Mark Roddy

--
NTDEV is sponsored by OSR

Visit the list online at: http://www.osronline.com/showlists.cfm?list=ntdev

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Or OS so corrupted by overwrite that it can’t handle the NMI

* Bob

  Bob Ammerman
  xxxxx@ramsystems.biz
  716.864.8337

138 Liston St
Buffalo, NY 14223
www.ramsystems.biz


+1 for what Mr. Roddy said, above. It is *exactly* what I was going to post.

Nothing substitutes for a bus analyzer, which often will let you root-cause a complex problem like this in minutes (or, you know, at least kick it back to the FPGA guys)… but it is amazing how very much you can discern using DbgPrint.

Another avenue, if you can work WITH your FPGA people, is to have them help with ChipScope or SignalTap… A good FPGA guy can get almost as much out of this as a proper bus analyzer.

Peter
OSR
@OSRDrivers

We’ve captured a few PCIe Analyzer traces but none of them point to
anything specific or bus level errors.

We’ve also captured lots of traces with the FPGA with ChipScope with
nothing specific or apparent that points to the issue.

Burrr


Does the problem reproduce pretty quickly/easily? Also, do you have a kernel debugger attached?

If you add enough DbgPrints you should be able to figure out the last things your driver(s) did before the hang. Pay particular attention to the DMA transfers that you performed prior to the hang (particularly offsets, lengths, and physical addresses). I’ve definitely debugged problems like this that way; usually it ends up being a particular set of arguments that triggers an edge condition in my code or the hardware.

-scott
OSR
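
As a rough illustration of the per-transfer logging suggested above, here is a minimal user-mode sketch in C that bounds-checks one transfer and formats a log line with its physical address, offset, and length. The `dma_xfer` struct, its fields, and both helper names are hypothetical and not taken from anyone's driver in this thread; in a kernel driver the formatted line would presumably be sent out with DbgPrint/DbgPrintEx rather than built with snprintf.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical description of one DMA transfer; all field names are assumptions. */
struct dma_xfer {
    uint64_t phys_base; /* physical address of the start of the DMA buffer */
    uint64_t buf_len;   /* size of that buffer in bytes */
    uint64_t offset;    /* offset of this transfer within the buffer */
    uint64_t length;    /* number of bytes the device is told to move */
};

/* True when [offset, offset + length) lies inside the buffer.
 * Written with a subtraction so a huge length cannot overflow the check. */
static bool dma_xfer_in_bounds(const struct dma_xfer *x)
{
    if (x->length == 0 || x->offset > x->buf_len)
        return false;
    return x->length <= x->buf_len - x->offset;
}

/* Format one log line for the transfer into out; returns the snprintf result. */
static int dma_xfer_format(const struct dma_xfer *x, unsigned seq,
                           char *out, size_t out_len)
{
    return snprintf(out, out_len,
                    "dma #%u pa=0x%llx off=0x%llx len=0x%llx %s",
                    seq,
                    (unsigned long long)(x->phys_base + x->offset),
                    (unsigned long long)x->offset,
                    (unsigned long long)x->length,
                    dma_xfer_in_bounds(x) ? "ok" : "OUT-OF-BOUNDS");
}
```

Logging the line before the hardware is started means the last line on the debug console describes the transfer that was in flight when the system hung. A transfer of 0x800 bytes at offset 0x800 in a 0x1000-byte buffer is reported as "ok"; making it one byte longer flags it as out of bounds.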

And to add to what Mr. Noone said… If this problem is DMA related, enable DMA Verification in Driver Verifier, as well.

Peter
OSR
@OSRDrivers

The problem is hard to create and takes anywhere from 2 hrs to 18 hrs to
create. I don’t have a kernel debugger attached. The reason is: If I do
anything to slow down the operation of the system, the problem takes
several days to occur.
Also I’ve noticed that the problem occurs sometimes when no DMA
operation is going on.

Burrr


> The problem is hard to create and takes anywhere from 2 hrs to 18 hrs

Ugh. My condolences.

> I don’t have a kernel debugger attached

Regardless, I would recommend you test your driver with Driver Verifier DMA verification enabled. IF you have a problem with the DMA APIs, this will usually catch it quickly.

> Also I’ve noticed that the problem occurs sometimes
> when no DMA operation is going on.

Yes, but that doesn’t rule out the DMA. As Mr. Ammerman suggested, this could be a corrupted DMA operation resulting in an overwrite.

Peter
OSR
@OSRDrivers

And debug console logging, even if it slows down reproduction to have the
debugger attached, would at least give you clues about what your driver was
doing around the time of the failure.

I’d dedicate a test system just to running with the debugger attached and
your driver logging its operations. Meanwhile pursue other paths.

Mark Roddy


Before I saw this message I had started a test with the debugger attached to a system to see if I would be able to break into the debugger when the lockup occurred.

A lockup did occur, but I was unable to break into the debugger.

I restarted the test with some minimal logging from my main driver with the debugger attached.

Burrr


I’ve tried all of these, but they did not yield any red flags.

Any other ideas?

Burrr


> Would a driver be able to cause a lockup that would block the NMI from responding to the OS?

Of course. For example, consider what happens if it somehow corrupts the memory region that is occupied by IDTs - in such case NMI, just like any other interrupt, seems to be out of luck completely.

IIRC, every CPU has its own IDT under Windows, but these IDTs are apparently still located in the same memory region. For example, all theoretically possible IDTs (256 possible IDTs * 256 IDT entries * 16 bytes per entry on a 64-bit system) would occupy only 1 MB, i.e. fit in the same large page. Therefore, a single relatively large write to the target area is going to screw up all of them in one go, and at that point you get exactly the scenario you are describing: a sudden freeze of the system that cannot be resolved even by an NMI.

A driver can do this either directly from the CPU, if the target area is not marked read-only in its PTE, or indirectly through a wrong DMA operation.

In general, I would suggest taking the Occam’s razor approach and investigating the most likely reasons and simplest theories before proceeding to more complex ones. In this particular case I would start with the theory of IDT corruption (first direct, then indirect) before proceeding to more complex scenarios (like a hardware-caused lockup which, in turn, is caused by a driver incorrectly programming its device).

Anton Bassov
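
The size estimate in the post above can be checked with a few lines of C. This is only a sketch of the arithmetic: the three counts come straight from the post, while the 2 MiB figure for an x64 large page and both helper names are assumptions added for illustration. It says nothing about where Windows actually places its IDTs.

```c
#include <stdint.h>

/* Figures from the post: 256 possible IDTs, 256 entries each,
 * 16 bytes per entry on a 64-bit system. */
enum { IDT_COUNT = 256, IDT_ENTRIES = 256, IDT_ENTRY_BYTES = 16 };

/* Assumed x64 large page size (2 MiB); not stated in the thread. */
#define LARGE_PAGE_BYTES (2ULL * 1024 * 1024)

/* Total bytes needed if every possible IDT existed. */
static uint64_t idt_total_bytes(void)
{
    return (uint64_t)IDT_COUNT * IDT_ENTRIES * IDT_ENTRY_BYTES;
}

/* Nonzero when a region of the given size could fit inside one large page
 * (ignores alignment and whether the IDTs are actually contiguous). */
static int fits_in_one_large_page(uint64_t bytes)
{
    return bytes <= LARGE_PAGE_BYTES;
}
```

The product is 1,048,576 bytes (1 MiB), half of a 2 MiB large page, which is consistent with the claim that one sufficiently large stray write could cover every IDT.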

Thanks for the explanation


More logging to console.

Mark Roddy
