Please help. PCI driver DMA DPC causes BSOD on Win10 1809

Another fun fact for today that I just learned: the DPC watchdog appears to be disabled if you have the debugger attached. I tried to force a machine to crash by infinitely looping on a spinlock in a DPC and nothing happened. Breaking in and running !dpcwatchdog, I see this:

Current DPC: Nothing!BuggyDpcRoutine (Normal DPC)
Debugger Saved IRQL: 13
Cumulative DPC Time Limit: 0.000 seconds
Current Cumulative DPC Time: 72.234 seconds
Single DPC Time Limit: 0.000 seconds
Current Single DPC Time: 72.234 seconds

Makes sense that this crash might come as a surprise to someone if they did most of their test/validation with a debugger attached.

I’m pretty sure the DPC watchdog limit is a lot longer than 128 uSec, although 20 seconds does seem like an extremely long time. As I remember, it gets checked in the timer ISR/IPI, so a single 16 msec timer tick would already be way longer than 128 uSec. Perhaps it’s in opaque DPC_WATCHDOG units. The APIs to retrieve it will just tell you what fraction of the way you are to a timeout. I usually pick a number like 25% as the threshold for handing work off to a passive-level thread.

Perhaps a request for a documentation update is in order, to clarify what unit the watchdog “ticks” are in.

It would be pretty easy for the OP to write a few lines of code to measure what the limit actually is. Raise IRQL to DISPATCH_LEVEL, loop around incrementing a counter with a 10 uSec stall in each loop, see how many times you loop before getting a DPC_WATCHDOG crash.
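Something like this minimal sketch would do it (hypothetical test code; run it on a physical machine with no debugger attached, since either of those apparently disables the watchdog):

VOID MeasureDpcWatchdogLimit(VOID)
{
    KIRQL oldIrql;
    ULONG64 loops = 0;

    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

    for (;;) {
        KeStallExecutionProcessor(10);      // burn ~10 uSec per iteration
        loops++;
        if ((loops % 100000) == 0) {
            // Roughly once per second of stall time; the last value
            // printed before the 0x133 bugcheck approximates the limit.
            DbgPrint("DPC watchdog probe: %I64u x 10 uSec\n", loops);
        }
    }

    // Never reached: the watchdog bugchecks first.
}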

There also is another watchdog timer, which basically detects when a core is no longer responding to timer ticks. All the cores get timer ticks, and there is code that makes sure all cores are responding to those ticks. A core can sometimes be stuck “outside” OS execution, like trapped back to a hypervisor. I believe the number I’ve seen recently is that if 4 timer ticks happen and a core doesn’t react, a timer watchdog bugcheck happens. If, for example, you have two processor sockets and one socket gets into a fatal error state, half the cores will stop responding to timer ticks. Another case happens if, for example, you do a PCI config read and the device keeps responding with a retry status; that core is then basically locked on that read and not responding to interrupts. A lot of complex I/O devices process PCI config requests in firmware, so if that firmware locks up while the OS is doing a config cycle, what you see is an OS timer watchdog bugcheck.

Even though the watchdog timers can be annoying to developers, I do believe they are a really important feature for detecting hardware and software errors.

Jan

Excellent point!

Jan

There are two different and basically unrelated DPC watchdogs.

The single DPC watchdog measures how long a KDEFERRED_ROUTINE executes. If it’s more than 20 seconds†, you get a bugcheck 133/0.

The cumulative DPC watchdog measures how long a processor stays at DISPATCH_LEVEL (or higher). If it’s more than 120 seconds†, you get a bugcheck 133/1.

Note that the cumulative DPC watchdog doesn’t actually have anything to do with DPCs. You can trigger it from a regular boring thread, just by doing KeRaiseIrql(DISPATCH_LEVEL), then waiting for a couple minutes – no DPC involved. (Another common repro: trying to recursively acquire a spinlock.)
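For instance, a minimal (and deliberately broken) sketch of that second repro:

// Recursively acquiring a spin lock: the second acquire spins forever
// at DISPATCH_LEVEL, and the cumulative DPC watchdog eventually fires
// as bugcheck 133/1. Do not ship anything resembling this.
KSPIN_LOCK lock;
KIRQL irql1, irql2;

KeInitializeSpinLock(&lock);
KeAcquireSpinLock(&lock, &irql1);
KeAcquireSpinLock(&lock, &irql2);   // never returns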

Also note that the cumulative DPC watchdog complains about situations that are not always under your control. If someone ran their crappy code at high IRQL for 119 seconds, then handed the processor to your DPC, you might happen to be the unlucky chap on the processor when the 120 second timer expires. Recent versions of Windows include callstack captures of the last 10 seconds† before a 133/1 is declared, so you can see what other things were going on just prior to the bugcheck.

@“Scott_Noone_(OSR)” said:

Another fun fact for today that I just learned: the DPC watchdog appears to be disabled if you have the debugger attached.

I am not aware of that behavior. However, DPC watchdogs are disabled in most varieties of VM guests†. Use physical hosts, if you need to test against the DPC watchdog.

If you are currently running in a DPC (an actual DPC, not just raised IRQL), you can use KeQueryDpcWatchdogInformation to learn what the DPC watchdog limits are†, and maybe how close you are to exceeding them†. However, using this API correctly requires a deep understanding of the NT kernel and its DPC watchdog. So we generally recommend that you do not use this API, except possibly for debugging / diagnostics. To properly scare you away, let me point out that the returned DpcWatchdogCount has some super strange behavior. (It only makes sense if you are intimately familiar with how the DPC watchdog is implemented. Give me a whiteboard and a beer, and I can explain it in about 20 minutes.) Also, the behavior of DpcWatchdogCount changed somewhere between Windows 8.1 and Windows 10, and is likely to subtly change again.
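To illustrate the shape of the API anyway, here is a sketch of the “hand off at 25%” pattern Jan described earlier, assuming the documented countdown semantics of the fields (which, per the footnote below, are implementation details):

// Hypothetical helper: returns TRUE once more than ~25% of the
// single-DPC budget has been consumed, i.e. time to push remaining
// work to a passive-level work item. Call it from inside a DPC only.
BOOLEAN ShouldYieldFromDpc(VOID)
{
    KDPC_WATCHDOG_INFORMATION info;

    if (!NT_SUCCESS(KeQueryDpcWatchdogInformation(&info))) {
        return TRUE;    // be conservative if the query fails
    }

    if (info.DpcTimeLimit == 0) {
        return FALSE;   // single-DPC watchdog disabled (debugger, VM, ...)
    }

    // DpcTimeCount counts down from DpcTimeLimit in opaque ticks;
    // less than 75% remaining means over 25% of the budget is spent.
    return (info.DpcTimeCount < (info.DpcTimeLimit / 4) * 3);
}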

@Jan_Bottorff said:

Some kinds of drivers will absolutely need DPC_WATCHDOG avoidance code, like probably any driver that can receive unbounded fast data streams. The classic example is a high-speed network card: Ethernet packets just arrive, and you can’t use more and more DPC time as the receive rate goes up.

^^This^^. If your driver has no bound on the number of DPCs that get queued up, then your driver has a 133/1 bugcheck waiting for it.

† Implementation detail, subject to change in any release, even a QFE. In fact, some of these details have already changed since the DPC watchdog was invented.

On Oct 19, 2018, at 5:18 PM, Jan_Bottorff wrote:
>
> I’m pretty sure the DPC watchdog limit is a lot longer than 128 uSec, although 20 seconds does seem like an extremely long time.

I suppose we should do an experiment to find out, but I think it really is 128us. I encountered this bug check a year or two ago, and it was for an operation much less than a millisecond.

Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Thanks very much for all your replies. After two weeks of desperate debugging, unfortunately I still cannot fix this problem, but I still want to give an update here. We have opened a ticket with the Microsoft WDK group and provided two dump files to Microsoft engineers. However, no solution has been found yet.

I really do not think our DPC function is the root cause. The same driver code has been used for almost 10 years, from Win7 through Win10 (1607, 1703, 1709, and 1803), with no problems. Only on the latest Win10 1809 does the BSOD happen consistently (with the same hardware).

My debugging shows the root cause is the “PutScatterGatherList()” call inside our DPC. “PutScatterGatherList()” never finishes and takes forever to run, which I do not understand.

To prove this, I did another test: I disabled all the interrupts in our driver so the DPC never runs. Since “PutScatterGatherList()” is also used in our “CancelActiveRead()” function, I can trigger another BSOD just by canceling the DMA read, without the DPC running.

Our “CancelActiveRead()” function has the code like:

VOID CancelActiveRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    //-----------------------------------------------------------------------------
    PGEMINI_DEVICE_EXT  DevExt = (PGEMINI_DEVICE_EXT)DeviceObject->DeviceExtension;
    KIRQL OldIrql;
    //-----------------------------------------------------------------------------

    DbgPrint("\nCancelActiveRead() enter()\n");

    ......  // Some code here

    //-------------------------------------------------------------------------
    KeAcquireSpinLock(&DevExt->ImageReadStateLock, &OldIrql);

    DbgPrint("\nCancelActiveRead() get ImageReadStateLock\n");

    //
    //  This drives everything practically into reset
    //
    SetPCICardReset(DevExt);

    //
    //  Clear the Transfer length
    //
    ProgramDmaTransferLength (DevExt, 0);

    //
    //  Undo some of the resets
    //
    ClearPCICardReset(DevExt);

    //
    //  Disarm the Card!  - yes we are doing this underneath the
    //  app, sort of, except that either the app told us to CancelIo,
    //  or Windows is killing the thread...
    //
    ClearControlRegisterBits( DevExt, CTRL_BIT_ARM );
    SetControlRegisterBits( DevExt, CTRL_BIT_FLUSH_INPUT );

    //
    //  If these are equal, the DPC hasn't seen it yet
    //
    if (DevExt->CurrentReadIrp == Irp)
        DevExt->CurrentReadIrp = NULL;


    //
    //  Return the scatter/gather list we've been using
    //
    //  Third parameter is FALSE because this is a Read
    //  operation
    //
    DbgPrint("\nCancelActiveRead() Call PutScatterGatherList()\n");
    DevExt->ReadAdapter->DmaOperations->PutScatterGatherList(
                                        DevExt->ReadAdapter,
                                        DevExt->ReadScatterGatherList,
                                        FALSE);
    DbgPrint("\nCancelActiveRead() end PutScatterGatherList()\n");
    //
    //  Make sure we don't have a stale pointer
    //
    DevExt->ReadScatterGatherList = NULL;


    KeReleaseSpinLock(&DevExt->ImageReadStateLock, OldIrql);
    DbgPrint("\nCancelActiveRead() release ImageReadStateLock\n");
    //-------------------------------------------------------------------------

    .......  // other codes here
    .......
}

From WinDbg, I got:
"CancelActiveRead() enter()

CancelActiveRead() get ImageReadStateLock

CancelActiveRead() Call PutScatterGatherList()
"
No output at all appears after this.

I added two DbgPrints, one just before and one just after the “PutScatterGatherList()” call; the first DbgPrint is displayed, but the second never shows up, and the target machine freezes.

In our driver, “PutScatterGatherList()” is only used in “CancelActiveRead()” and the DPC function. Therefore I think “PutScatterGatherList()” taking forever to run is the root problem that caused the BSOD: because it never finishes, it eventually triggers the DPC_WATCHDOG_VIOLATION.

I just do not understand what causes “PutScatterGatherList()” to never finish, and I still have no clue how to fix it.

We finally figured out the reason, so I’m giving an update here; it may be helpful for others. After working with Microsoft and Scott (OSR), we found that Win10 1809 (RS5) has a bug in the Version 2 DMA adapter: the V2 API can loop forever when the MDL is out of range. We changed to the Version 3 DMA adapter and tested it. So far V3 works and we have not found any issues yet. Thanks a lot for everyone’s help.
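For anyone making the same change: the switch is in the DEVICE_DESCRIPTION you pass to IoGetDmaAdapter. A hedged sketch, with illustrative values for a 64-bit scatter/gather bus master (DevExt and PhysicalDeviceObject stand in for the OP’s device extension and PDO):

DEVICE_DESCRIPTION devDesc;
ULONG mapRegisters;

RtlZeroMemory(&devDesc, sizeof(devDesc));
devDesc.Version         = DEVICE_DESCRIPTION_VERSION3;  // was VERSION/VERSION2
devDesc.Master          = TRUE;                         // bus-master DMA
devDesc.ScatterGather   = TRUE;
devDesc.InterfaceType   = PCIBus;
devDesc.MaximumLength   = MAXULONG;                     // or the device's real limit
devDesc.DmaAddressWidth = 64;                           // V3 field, replaces Dma64BitAddresses

DevExt->ReadAdapter = IoGetDmaAdapter(PhysicalDeviceObject,
                                      &devDesc,
                                      &mapRegisters);
// A NULL return can mean V3 isn't available; it requires Windows 8 or later.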

Can I get more information about the bug in the Version 2 DMA adapter somewhere? Was it fixed in the 1903 update? Is there a way to work around it other than forcing DMA Version 3 (which is not easily possible on our side at the moment)?

The reason I am curious is that we are seeing a very similar problem with our PCIe driver on one customer’s PC and nowhere else. They have the 1809 update installed. One set of dump files shows DPC_WATCHDOG_VIOLATION/DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED, and the other set always points to our DPC routine at a specific code line with a division-by-zero bugcheck, although there is no division at that line (for the record: the only division in our DPC is done several lines before that, and the divisor is a cached performance counter frequency value).

I don’t want to start describing our problem in detail, just trying to figure out if this could be the same problem.

Thank you!

Does the call stack point to your driver calling PutScatterGatherList? That was the footprint of the bug we saw in the V2 version of the structure. It was acknowledged as a bug in 1809, but I didn’t follow up to see whether it was fixed.

Hi Scott! (Former student here.) No reference to PutScatterGatherList at all in the stack. We do not actually do any DMA/SG-related work in the DPC. We just create a DMA transaction, fill a common buffer with SG descriptors at the very beginning, and then let the device fill the memory in a circular-buffer fashion, interrupting each time an “image” is DMA’d (it’s an image acquisition device, I mean… camera).
I might be on the wrong path, then, but since the same driver has been working fine for years, the 1809 changes raised some suspicion. I am going to dig deeper, and will perhaps post a separate question if I get completely stuck.

For reference, here are the two kinds of stack traces we are getting from multiple dumps (the customer sees a BSOD about once a day). PMPCIE is our driver.

KMODE_EXCEPTION_NOT_HANDLED
nt!KeBugCheckEx
nt!KiFatalExceptionHandler+0x22
nt!RtlpExecuteHandlerForException+0xf
nt!RtlDispatchException+0x430
nt!KiDispatchException+0x144
nt!KiExceptionDispatch+0xc2
nt!KiDivideErrorFault+0x2ef
PMPCIE!PmEvtInterruptDpc+0xaa [c:\dev\...\pmpcieinterruptsupport.cpp @ 188] 
Wdf01000!FxInterrupt::_InterruptDpcThunk+0x9d [minkernel\wdf\framework\shared\irphandlers\pnp\km\interruptobjectkm.cpp @ 410] 
nt!KiExecuteAllDpcs+0x2e7
nt!KiRetireDpcList+0x1ae
nt!KiIdleLoop+0x5a

DPC_WATCHDOG_VIOLATION
nt!KeBugCheckEx
nt!KeAccumulateTicks+0x144c93
nt!KeClockInterruptNotify+0xcf
hal!HalpTimerClockIpiRoutine+0x21
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
nt!KiInterruptDispatchNoLockNoEtw+0x37
nt!MiResolveProtoPteFault+0x399
nt!MiDispatchFault+0x359
nt!MmAccessFault+0x15a
nt!KiPageFault+0x343

Can you show us the code in PmEvtInterruptDpc?

Hi Tim, appreciate your help but I am not sure if I can post the whole code. Apologies if the digest below doesn’t help. Perhaps I can send the whole thing via email?

Below is a shortened version and explanation why it’s done this way. Maybe you’ll spot a problem right away.

ISR (not shown here)

  1. ISR can get fired very often, in extreme cases it could be 20000x a second (e.g. at 20000 FPS).
  2. HOWEVER: In this particular case the rate would be much lower around 10 ISRs a second.
  3. Since we had issues with skipped frames at very high rates, we implemented “ISR event queueing”, as described here.
  4. ISR is called on every frame written, i.e. the device interrupts: “hey, I just DMA’d a frame into your circ. buffer”
  5. ISR retrieves a preallocated event item (ExInterlockedPopEntrySList), fills the event with a simple frame counter and timestamp from KeQueryPerformanceCounter.
  6. ISR queues the event into a custom ISR-DPC queue (ExInterlockedInsertTailList, max 64 items capacity).
  7. ISR then calls WdfInterruptQueueDpcForIsr (if the same interrupt arrives again before the DPC executes, WdfInterruptQueueDpcForIsr will return FALSE, but the ‘event’ has already been queued).
  8. ISR routine returns (a minimal sketch of this flow follows below).
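A minimal sketch of that ISR flow, using the names from the description above (PM_IRQ_EVENT, the queue fields, and the status-read helper are hypothetical stand-ins):

BOOLEAN PmEvtInterruptIsr(WDFINTERRUPT WdfInterrupt, ULONG MessageID)
{
    PDEVICE_EXTENSION devExt =
        GetDeviceContext(WdfInterruptGetDevice(WdfInterrupt));
    PSLIST_ENTRY entry;
    PPM_IRQ_EVENT irqEvt;

    UNREFERENCED_PARAMETER(MessageID);

    // Step 5: grab a preallocated event item from the lookaside SList.
    entry = ExInterlockedPopEntrySList(&devExt->FreeEventList,
                                       &devExt->FreeEventLock);
    if (entry != NULL) {
        irqEvt = CONTAINING_RECORD(entry, PM_IRQ_EVENT, SListEntry);
        irqEvt->InterruptMask = PmReadAndClearIrqStatus(devExt);    // hypothetical
        irqEvt->EofCount      = ++devExt->FrameCounter;
        irqEvt->TimeStamp     = KeQueryPerformanceCounter(NULL);

        // Step 6: queue the event for the DPC *before* requesting the
        // DPC, so a coalesced DPC still sees every event.
        ExInterlockedInsertTailList(&devExt->IrqEventQueue,
                                    &irqEvt->ListEntry,
                                    &devExt->IrqEventQueueLock);
    }

    // Step 7: request the DPC; this returns FALSE if one is already queued.
    WdfInterruptQueueDpcForIsr(WdfInterrupt);
    return TRUE;
}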

DPC (below)

  1. DPC iterates over the ‘ISR event queue’ (PmIrqEventDequeue, which wraps ExInterlockedRemoveHeadList)
  2. DPC checks the mask for the EOF event. (there might be different flags used for other purposes)
  3. DPC queues a ‘notification’ event into another, different queue, this queue is processed by a thread on passive level that sends notifications to user space.
  4. (in short: there is a queue between ISR and DPC and different queue between DPC and passive level thread)
  5. Meanwhile, the ISR may fire again, pushing another event onto the queue; that’s why we limit the DPC with MAX_IRQ_EVENTS

So far this has been working fine, although I’m not an experienced kernel developer; if there is a problem with the approach itself, I’ll be happy to learn. I am also investigating the possibility that the division by Frequency really is the problem and that Frequency gets corrupted (it’s updated at the start of each acquisition). But again, this has been flawless for years, and it still is on 99% of the computers we tested, even at very high frame rates. Just the one computer at the customer’s site reports these weird bugchecks so far.

Function itself

VOID PmEvtInterruptDpc(IN WDFINTERRUPT WdfInterrupt, IN WDFOBJECT WdfDevice) {
    PDEVICE_EXTENSION DeviceExtension = GetDeviceContext(WdfDevice);
    ULONG ulProcessed = 0;
    PPM_IRQ_EVENT irqEvt = NULL;

    while ((irqEvt = PmIrqEventDequeue(DeviceExtension)) != NULL) {
        if( irqEvt->InterruptMask & PMPCIE_INT_MASK_EOF) {
            // We need to store the last timestamps because these are still used with status IOCTL
            DeviceExtension->TimeStamp = irqEvt->TimeStamp;
            DeviceExtension->TimeStamp.QuadPart -= DeviceExtension->AcquisitionStartTime.QuadPart;
            DeviceExtension->TimeStamp.QuadPart *= RES100US;
            DeviceExtension->TimeStamp.QuadPart /= DeviceExtension->Frequency.QuadPart; //< **** DIVISION HERE! ****

            PM_ACQ_METADATA acqMd;
            acqMd.EofCount = irqEvt->EofCount;
            acqMd.TimeStampEof = DeviceExtension->TimeStamp.QuadPart;
            PmQueueAcqEvent(DeviceExtension,PM_EVENT_DING_NOTIFICATIONS,(void *)NOTIFICATION_EOF, &acqMd); //< **** LINE 188 ****, bugcheck sometimes reported here, division by zero
            KeSetEvent(&DeviceExtension->AcqEvent,0,FALSE);
            irqEvt->InterruptMask &= (ULONG)(~PMPCIE_INT_MASK_EOF); // Clear the flag
        }
        ulProcessed++;
        // Push the event object back for reuse
        PmIrqEventReturnForReuse(DeviceExtension, irqEvt);
        // Since we could theoretically spend unbounded time in this loop
        // (the ISR can keep filling the queue), stop at some point and give up.
        if (ulProcessed > MAX_IRQ_EVENTS) // Max 64 items
        {
            DbgPrint("PMPCIE: Too many IRQs pending in DPC, quitting.");
            break;
        }
    }
    return;
}

You might add ASSERT(DeviceExtension->Frequency.QuadPart); just before the division.

Is RES100US equal to 10,000? Remember that, on some systems, the frequency returned by KeQueryPerformanceCounter is the CPU clock, and in that case it wouldn’t take much of a multiplication to overflow a 64-bit integer.

Have you disassembled the code to see if the exception is really occurring on a div instruction? It could be another case of optimization rearrangement. If you look at the code, it doesn’t actually NEED to compute the division until it passes acqMd to PmQueueAcqEvent.

Can you please show us the code where the divide is being done?

Peter

Peter, the division is done right on this line:
DeviceExtension->TimeStamp.QuadPart /= DeviceExtension->Frequency.QuadPart;

Tim, yes, RES100US is defined as 10000. I’m not sure about the overflow you mentioned; even if the dividend overflowed to zero, it’s the divisor Frequency that worries me (unless I misunderstood completely).

Good point about the optimization! WinDbg is clear in this regard:

FAULTING_IP: 
PMPCIE!PmEvtInterruptDpc+aa [c:\...\pmpcieinterruptsupport.cpp @ 188]
fffff802`80339926 48f7bfe0130000  idiv    rax,qword ptr [rdi+13E0h]

And the disassembly (it may not correspond exactly to the code in the post above):

    fffff802`803398fa 747d           je      PMPCIE!PmEvtInterruptDpc+0xfd (fffff802`80339979)
    fffff802`803398fc 488b4b30       mov     rcx, qword ptr [rbx+30h]
    fffff802`80339900 4c8d4dd0       lea     r9, [rbp-30h]
    fffff802`80339904 482b8fd8130000 sub     rcx, qword ptr [rdi+13D8h]
    fffff802`8033990b 4c8b87f8130000 mov     r8, qword ptr [rdi+13F8h]
    fffff802`80339912 4869c110270000 imul    rax, rcx, 2710h
    fffff802`80339919 4c8987f0130000 mov     qword ptr [rdi+13F0h], r8
    fffff802`80339920 4899           cqo     
    fffff802`80339922 4c8945d8       mov     qword ptr [rbp-28h], r8
    fffff802`80339926 48f7bfe0130000 idiv    rax, qword ptr [rdi+13E0h] // <===
    fffff802`8033992d 8bd0           mov     edx, eax
    fffff802`8033992f 488987e8130000 mov     qword ptr [rdi+13E8h], rax
    fffff802`80339936 412bd0         sub     edx, r8d
    fffff802`80339939 488945e0       mov     qword ptr [rbp-20h], rax
    fffff802`8033993d 899700140000   mov     dword ptr [rdi+1400h], edx
    fffff802`80339943 8b4b28         mov     ecx, dword ptr [rbx+28h]
    fffff802`80339946 894dd0         mov     dword ptr [rbp-30h], ecx
    fffff802`80339949 8b4b2c         mov     ecx, dword ptr [rbx+2Ch]
    fffff802`8033994c 8955e8         mov     dword ptr [rbp-18h], edx
    fffff802`8033994f ba02000000     mov     edx, 2
    fffff802`80339954 894dd4         mov     dword ptr [rbp-2Ch], ecx
    fffff802`80339957 488bcf         mov     rcx, rdi
    fffff802`8033995a 448d4206       lea     r8d, [rdx+6]
    fffff802`8033995e e801efffff     call    PMPCIE!PmQueueAcqEvent (fffff802`80338864)

Now assuming the Frequency is really the culprit - which means we must have a synchronization problem - this raises a question:

Can a div/0 exception in a DPC lead to a DPC_WATCHDOG_VIOLATION? Perhaps the OS is saving the stack trace, handling the exception, and thereby further delaying the DPC, causing the watchdog to fire sooner and essentially hiding the root cause in the other dump file?

In any case, I now cache the Frequency only once, in EvtDeviceAdd (MSDN says it is set at system boot and never changes on the modern OSes we target), and I am waiting for the customer to see whether that helps. If it does, that will confirm the synchronization problem.
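For reference, the change amounts to something like this (a sketch; the names follow the DPC code above):

// Query the frequency once, at device creation, so the DPC's divisor
// can never be observed mid-update. KeQueryPerformanceCounter fills in
// the frequency as a side effect, and it is fixed at boot.
VOID PmCacheCounterFrequency(PDEVICE_EXTENSION DeviceExtension)
{
    (VOID)KeQueryPerformanceCounter(&DeviceExtension->Frequency);
    ASSERT(DeviceExtension->Frequency.QuadPart != 0);
}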

My caution about overflow was not related to your div by zero problem. I’m just pointing out the potential math problems. I don’t think it’s an issue for you, because your multiplier is smallish, but let me describe how this kind of code can cause problems.

On some systems, KeQueryPerformanceCounter is based on the CPU’s cycle counter. That means the frequency is the clock frequency. Say your CPU is 3.6 GHz. That means the counter overflows a 32-bit integer every second. That’s OK, the field is 64-bits wide.

But some people try to convert to Windows’ 100ns units by multiplying the counter by 10,000,000 and dividing by the frequency. That’s a problem, because it only takes about 9 minutes at 3.6 GHz for cycle counter times 10,000,000 to overflow a 64-bit integer. You’ll still get a result, but a truncated one.
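For what it’s worth, one common way to avoid that is to split the counter into whole seconds and a remainder before scaling, so no intermediate product overflows (illustrative helper, not from the driver above):

#define UNITS_PER_SEC 10000000ULL   // Windows 100ns units per second

ULONG64 CounterTo100ns(ULONG64 Counter, ULONG64 Frequency)
{
    ULONG64 seconds   = Counter / Frequency;
    ULONG64 remainder = Counter % Frequency;

    // remainder < Frequency (a few GHz at most), so remainder * 1e7
    // stays far below 2^64, whereas the naive Counter * 1e7 overflows
    // after roughly nine minutes at 3.6 GHz.
    return (seconds * UNITS_PER_SEC) + (remainder * UNITS_PER_SEC) / Frequency;
}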

It’s unlikely that the divide by zero is causing the watchdog. For the DPC_WATCHDOG_VIOLATION, post the output of the following:

!running -ti

Unfortunately !running -ti gives:

6: kd> !running -ti
GetUlongFromAddress: unable to read from fffff80126b602c4
Couldn't find processor info

Probably because this is an “Automatic” small minidump. Also, it seems we actually have only one dump with this watchdog error so far; it might be completely unrelated, since the customer simply sent the whole Minidump folder… So far no reports of further crashes; fingers crossed this was the div/0 problem.

Yeah, that’s a minidump, and it’s not going to be much use for the DPC_WATCHDOG problem. The clipped version of the stack you provided looks reasonably normal, so the problem is probably on another processor. With a minidump you only get the one processor, so it’s a dead end. We’ll all just pretend we didn’t see that one :)

Can a div/0 exception in DPC lead to a DPC_WATCHDOG_VIOLATION?

It’s not necessarily a div/0 exception; it might also be a division overflow (which happens when the quotient doesn’t fit in RAX). And it actually fits the instruction sequence you posted above:

cqo

idiv

If RAX has its MSB set and the divisor is small enough in magnitude, it can lead to the exception.