System Lockup - NMI Crash Dump

Burrr Posts: 9
Hi,

I am debugging a rather nasty issue that locks up the system (no
response from the keyboard or mouse) and happens on some systems but not
others. This is a Windows 10 system with several of our drivers, both
PCIe and USB. I've tried several experiments to isolate the issue
without much success. One of the methods we landed on to narrow down
the issue is an NMI jumper on the motherboard, which can force a
crash dump.
https://blogs.technet.microsoft.com/askperf/2009/01/23/two-minute-drill-nmi/

We can create a crash dump by asserting the NMI while the system is
running normally, but I can't seem to create one once the system
has locked up.
I did try forcing one of my drivers to lock up by creating an artificial
condition, and I was able to create a crash dump using the NMI in that
scenario.

My questions are:
1. What would cause a lockup in which the NMI gets no response?
2. Would a driver be able to cause a lockup that would block the NMI
from responding to the OS?

Any insights would be appreciated,

Thanks,
Burrr

Comments

  • Jan_Bottorff Posts: 465
    I'd suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112, which has some new and really interesting strategies for debugging hard lockups.

    Jan

  • Pavel_A Posts: 2,643
    > I'd suggest reading the thread at
    > http://www.osronline.com/showThread.CFM?link=288112 It has some new and really
    > interesting strategies to debug hard lockups.

    Only the "IPC" there was probably a typo; shouldn't it be IPI (inter-processor interrupt)?

    -- pa
  • anton_bassov Posts: 4,792
    > Only the IPC thing there probably was a typo, it should be IPI (inter-processor interrupt)


    Of course it was very obviously a typo, but I was literally floored by the OP's reaction to Mark's statement, particularly by the part concerning "debug stub sending NMI to all other processors". Look at what he said:


    <quote>

    Mark, I don't understand your message. What is an IPC? Inter process communication?
    Are you explaining that if any one processor doesn't respond to the debug stub, then the stub cannot break in? And why would NMI be any different? Wouldn't the debug stub send an NMI to all the other processors to stop them?

    </quote>



    Anton Bassov
  • Burrr Posts: 9
    Thanks much for the information.

    On my system I am already able to generate an NMI when the system is
    working normally. However, when the system locks up, it does not work.

    I have attached a PCIe analyzer to see if anything odd is going on,
    but did not find anything useful.
    My PCIe driver uses 7 MSI interrupts and DMA transactions. Several
    USB drivers that were used on prior systems are also redeployed
    here.

    I am floored as to why the OS would not respond to the NMI.
    What can cause such an event?

    Burrr

  • Burrr Posts: 9
    From the thread, I gather that the possible causes of a freeze where
    the NMI gets no response are:

    Bus Freeze
    Rogue DMA request
    Interrupt Storm

    Is that right?

    Burrr

  • Mark_Roddy Posts: 4,269
    Rogue DMA and Bus Freeze are likely going to resolve to "Bus Freeze". If
    your resources include access to a PCI(e) bus analyzer, that is the best
    path forward in my opinion.

    That said, plain old debug console logging can also be fruitful and is way
    less expensive.

    Mark Roddy

  • Or the OS is so corrupted by an overwrite that it can't handle the NMI.

    Bob Ammerman


    ---
    NTDEV is sponsored by OSR

    Visit the list online at: <http://www.osronline.com/showlists.cfm?list=ntdev>

    MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
    Details at <http://www.osr.com/seminars>

    To unsubscribe, visit the List Server section of OSR Online at <http://www.osronline.com/page.cfm?name=ListServer>
  • +1 for what Mr. Roddy said, above. It is *exactly* what I was going to post.

    Nothing substitutes for a bus analyzer, which often will let you root-cause a complex problem like this in minutes (or, you know, at least kick it back to the FPGA guys)... but it is amazing how very much you can discern using DbgPrint.

    Another avenue, if you can work WITH your FPGA people, is to have them help with ChipScope or SignalTap... A good FPGA guy can get almost as much out of this as a proper bus analyzer.

    Peter Viscarola
    OSR
    @OSRDrivers

  • Burrr Posts: 9
    We've captured a few PCIe analyzer traces, but none of them point to
    anything specific or show bus-level errors.

    We've also captured lots of FPGA traces with ChipScope, and nothing
    in them obviously points to the issue.

    Burrr

  • Scott_Noone_(OSR) Posts: 3,004
    Does the problem reproduce pretty quickly/easily? Also, do you have a kernel debugger attached?

    If you add enough DbgPrints you should be able to figure out the last things your driver(s) did before the hang. Pay particular attention to the DMA transfers that you performed prior to the hang (particularly offsets, lengths, and physical addresses). I've definitely debugged problems like this that way; usually it ends up being a particular set of arguments that triggers an edge condition in my code or the hardware.

    -scott
    OSR
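
As a rough illustration of the logging approach described above, the following Python sketch post-processes debugger output that has been saved to a text file. It is not driver code. It assumes, hypothetically, that the driver prints one DbgPrint line per DMA transfer in the form `DMA seq=<n> pa=<hex> off=<hex> len=<n>`; that format, the 4 KB page size, and the "suspicious" heuristics are all assumptions made for the example, and a flagged transfer is only a candidate for a closer look, not proof of a bug.

```python
import re
from collections import deque

# Hypothetical log line format; adjust the pattern to whatever your driver prints.
LINE_RE = re.compile(
    r"DMA seq=(?P<seq>\d+) pa=(?P<pa>0x[0-9a-fA-F]+) "
    r"off=(?P<off>0x[0-9a-fA-F]+) len=(?P<len>\d+)"
)
PAGE_SIZE = 4096  # assumed page size


def parse_dma_lines(lines):
    """Yield one dict per log line that matches the hypothetical format."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield {
                "seq": int(m["seq"]),
                "pa": int(m["pa"], 16),
                "off": int(m["off"], 16),
                "len": int(m["len"]),
            }


def suspicious(xfer):
    """Return a list of edge conditions worth a second look (heuristics only)."""
    reasons = []
    start = xfer["pa"] + xfer["off"]
    if xfer["len"] == 0:
        reasons.append("zero length")
    elif start // PAGE_SIZE != (start + xfer["len"] - 1) // PAGE_SIZE:
        reasons.append("crosses page boundary")
    if start % 4:
        reasons.append("not 4-byte aligned")
    return reasons


def last_transfers(lines, keep=3):
    """Keep only the final `keep` transfers seen before the log stops."""
    return list(deque(parse_dma_lines(lines), maxlen=keep))


# Made-up sample data standing in for captured debugger output.
sample_log = [
    "DMA seq=1 pa=0x1f400000 off=0x0 len=4096",
    "unrelated debug output",
    "DMA seq=2 pa=0x1f400000 off=0xff0 len=64",
    "DMA seq=3 pa=0x1f401002 off=0x0 len=0",
]
for xfer in last_transfers(sample_log):
    print(xfer["seq"], hex(xfer["pa"] + xfer["off"]), xfer["len"], suspicious(xfer))
```

Comparing the last few transfers from several independent hangs is usually more telling than any single log: a parameter combination that shows up every time is a much stronger lead than one that shows up once.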


  • And to add to what Mr. Noone said: if this problem is DMA related, enable DMA Verification in Driver Verifier as well.

    Peter Viscarola
    OSR
    @OSRDrivers

  • Burrr Posts: 9
    The problem is hard to create and takes anywhere from 2 hrs to 18 hrs to
    show up. I don't have a kernel debugger attached, because if I do
    anything that slows down the operation of the system, the problem takes
    several days to occur.
    Also I've noticed that the problem occurs sometimes when no DMA
    operation is going on.

    Burrr

  • >The problem is hard to create and takes anywhere from 2 hrs to 18 hrs

    Ugh. My condolences.

    >I don't have a kernel debugger attached

    Regardless, I would recommend you test your driver with Driver Verifier DMA verification enabled. If you have a problem with the DMA APIs, this will usually catch it quickly.

    >Also I've noticed that the problem occurs sometimes
    >when no DMA operation is going on.

    Yes, but that doesn't rule out the DMA. As Mr. Ammerman suggested, this could be a corrupted DMA operation resulting in an overwrite.

    Peter Viscarola
    OSR
    @OSRDrivers
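
One way a "corrupted DMA operation resulting in an overwrite" can arise is a transfer whose address or length strays outside the buffer the driver set aside for it. The Python sketch below shows the kind of bounds check that could be applied to logged transfer parameters (or mirrored as an assertion in the driver before the device is programmed). The function name, the buffer base address, and the buffer size are all hypothetical values chosen for illustration; nothing here comes from the original poster's driver.

```python
def transfer_within_buffer(buf_base: int, buf_len: int,
                           xfer_addr: int, xfer_len: int) -> bool:
    """True if [xfer_addr, xfer_addr + xfer_len) lies inside the buffer."""
    if xfer_len <= 0 or buf_len <= 0:
        return False  # treat empty or negative sizes as invalid
    return buf_base <= xfer_addr and xfer_addr + xfer_len <= buf_base + buf_len


# Hypothetical 64 KB common buffer at a made-up device-visible address.
BUF_BASE = 0x7A000000
BUF_LEN = 64 * 1024

checks = [
    (BUF_BASE, 4096),                 # fully inside
    (BUF_BASE + BUF_LEN - 16, 32),    # runs 16 bytes past the end
    (BUF_BASE - 8, 64),               # starts before the buffer
]
for addr, length in checks:
    ok = transfer_within_buffer(BUF_BASE, BUF_LEN, addr, length)
    print(hex(addr), length, "ok" if ok else "ESCAPES BUFFER")
```

A check like this only covers what the driver asked the device to do. It cannot detect a device that writes somewhere other than where it was told, which is where a bus analyzer or FPGA-side tracing comes in.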

  • Mark_Roddy Posts: 4,269
    And debug console logging, even if it slows down reproduction to have the
    debugger attached, would at least give you clues about what your driver was
    doing around the time of the failure.

    I'd dedicate a test system just to running with the debugger attached and
    your driver logging its operations. Meanwhile pursue other paths.

    Mark Roddy


  • Burrr Posts: 9
    Before I saw this message I had started a test with the debugger attached to a system to see if I would be able to break into the debugger when the lockup occurred.

    A lockup did occur, but I was unable to break into the debugger.

    I restarted the test with some minimal logging from my main driver with the debugger attached.

    Burrr

  • Burrr Posts: 9
    I've tried all of these, but none of them raised any red flags.

    Any other ideas?

    Burrr

  • anton_bassov Posts: 4,792
    > Would a driver be able to cause a lockup that would block the NMI from responding to the OS?

    Of course. For example, consider what happens if it somehow corrupts the memory region that is occupied by the IDTs. In that case the NMI, just like any other interrupt, is completely out of luck.

    IIRC, every CPU has its own IDT under Windows, but these IDTs must, apparently, still be located in the same memory region. For example, all theoretically possible IDTs (256 possible IDTs * 256 IDT entries * 16 bytes per entry on a 64-bit system) would occupy only 1 MB, i.e. they would fit in a single large page. Therefore, a single relatively large write to the target area would corrupt all of them in one go, and at that point you get exactly the scenario you are describing: a sudden freeze of the system that cannot be resolved even by an NMI.



    A driver can do this either directly, with a CPU write, if the target area is not marked read-only
    in its PTE, or indirectly, through an errant DMA operation.


    In general, I would suggest taking the Occam's razor approach: investigate the most
    likely reasons and simplest theories before proceeding to more complex ones. In this particular case I would start from the theory of IDT corruption (first direct, then indirect) before proceeding
    to more complex scenarios (like a hardware-caused lockup which is, in turn, caused by a driver incorrectly programming its device).



    Anton Bassov
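
The arithmetic in the post above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 256-CPU count is the hypothetical upper bound used in the post, and the 2 MB large-page size is an assumption about x64 that the post does not state. Whether Windows actually places all per-CPU IDTs next to each other is the poster's conjecture, and nothing here verifies it.

```python
# Back-of-the-envelope check of the IDT footprint estimate.
IDT_ENTRIES = 256          # vectors per IDT on x86/x64
GATE_BYTES_X64 = 16        # size of one 64-bit IDT gate descriptor
MAX_CPUS = 256             # upper bound used in the post (hypothetical)
LARGE_PAGE_BYTES = 2 * 1024 * 1024  # assumed x64 large-page size (2 MB)


def idt_footprint(cpus: int = MAX_CPUS) -> int:
    """Bytes needed if every CPU has its own full-size IDT."""
    return cpus * IDT_ENTRIES * GATE_BYTES_X64


total = idt_footprint()
print(f"one IDT: {IDT_ENTRIES * GATE_BYTES_X64} bytes")
print(f"{MAX_CPUS} IDTs: {total} bytes ({total // (1024 * 1024)} MiB)")
print("fits in one large page:", total <= LARGE_PAGE_BYTES)
```

Note that a single IDT works out to 4,096 bytes, so under these assumptions even a stray write of only a few pages could damage the tables of several CPUs if they happen to be adjacent.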
  • Burrr Posts: 9
    Thanks for the explanation

  • Mark_Roddy Posts: 4,269
    More logging to console.

    Mark Roddy

