Please help. PCI driver DMA DPC cause BSOD on Win10 1809 version

Thanks a lot for your reply. We do have “KeReleaseSpinLockFromDpcLevel” in the else part; I just did not show it. As I said, this driver has worked very stably for many years, from Win7 through Win10 1703. The dump file shows the “PutScatterGatherList()” function cannot finish in a short time, which I do not understand.

I think I’ve answered this question off-line… but just for the record:

The crash you’re seeing isn’t unusual… it simply is telling you that you’re spending too much time in your DpcForIsr, as described in your dump. You’re allowed 500 clock ticks, you spent 501 clock ticks… see more info in the DPC_WATCHDOG_GLOBAL_TRIAGE_BLOCK as described in the crash.

The only fix for this is to discover WHY you’re spending so much time in your DpcForIsr, and not do that. Seriously. Do some profiling in your DpcForIsr… figure out exactly what’s going on.

This is the only way to avoid this issue.

How much time are you spending waiting for that lock? Is it highly contended? Do you loop in your DpcForIsr (doing “the grand loop”?)… if so, do you limit the number of iterations you use in your loop?

Windows is becoming increasingly strict in enforcing “proper behavior” in drivers. As you know, when YOU spend lots of time in YOUR DpcForIsr, you risk increasing latency for OTHER drivers in the system… to the point where those drivers can start to fail. The ONLY fix for this is to ensure that your driver is “well behaved”…

So, sorry… no quick fix available for this one!

Peter

(I’m working with the OP, who is a former WDF student of mine, offline… I’ll ask him to post-back here once his problem is rectified.)

As others have said, you need to figure out why your driver spends so much time at DISPATCH_LEVEL. If it were me, I’d add some ETW tracing in the ISR/DPC code and collect data on how much time it spends in different areas. I’d start by sprinkling half a dozen events with some summary data, like bytes transferred, at interesting places, such as the entry/exit of the DPC. ETW events are timestamped with a QPC timer, which is currently often processor clock speed / 1024, so around 0.3 uSec resolution. Some of the ETW trace decoder tools are silly and only show ms resolution, even though the actual timestamp is much better. I’ve seen ETW traces with a million events/sec, but it’s better if you keep the event rate somewhat lower.

You CAN still extract ETW trace data if the system crashes; there are WinDbg commands to write the buffers to a file you can view with your preferred ETW tools. You should also look at non-crashing runs and get a sense of what normal timing for your driver looks like.

The DPC watchdog timer is set REALLY long, 0x500 ticks is about 20 seconds (assuming 16ms ticks). Are you really spending 20 seconds in one DPC? Typically, if you spend 20 seconds in a DPC or ISR you’re stuck on a spinlock or something. There are also two forms of DPC_WATCHDOG timeout: one is a per-DPC limit, and one is for back-to-back DPCs, which includes DPCs from different drivers, so your driver might need to be frugal with time if another driver is greedy, which is why your driver should not be greedy.

It’s REALLY easy to add ETW tracing with the TraceLogging API; see my example on this list from 6 months ago, or follow the basic MSFT doc pages. I often use the PerfView tool on GitHub to look at traces. TraceView also does the trick, especially if you use the ETW name-to-GUID hashing tool (etwguid); then you can just put *MyTraceProvider into TraceView instead of the GUID. Message Analyzer currently seems to be broken (again) for TraceLogging traces.
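To make the “sprinkle a few events” idea concrete, here is a minimal kernel-mode TraceLogging sketch. The provider name, GUID, helper name, and event fields below are made-up placeholders, not anything from the OP’s driver:

#include <wdm.h>
#include <TraceLoggingProvider.h>

// Hypothetical provider; generate your own GUID (ideally hashed from the name
// so the *ProviderName syntax works in TraceView).
TRACELOGGING_DEFINE_PROVIDER(
    g_hDmaTraceProvider,
    "MyCompany.MyDriver.DmaTrace",
    (0xa1b2c3d4, 0x1111, 0x2222, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88, 0x99, 0xaa));

// Register once in DriverEntry and unregister in the unload routine:
//     TraceLoggingRegister(g_hDmaTraceProvider);
//     TraceLoggingUnregister(g_hDmaTraceProvider);

VOID TraceDpcEvent(PCSTR Where, ULONG BytesTransferred)
{
    // ETW stamps each event with the QPC timestamp; we only add the payload.
    TraceLoggingWrite(g_hDmaTraceProvider,
                      "DpcForIsr",
                      TraceLoggingString(Where, "Where"),
                      TraceLoggingUInt32(BytesTransferred, "BytesTransferred"));
}

Calling something like TraceDpcEvent("entry", 0) and TraceDpcEvent("exit", bytesThisPass) at the DPC boundaries gives you per-invocation timing directly from the decoded trace.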

Some kinds of drivers will absolutely need DPC_WATCHDOG avoidance code, like probably any driver that can receive unbounded fast data streams. The classic example is a high-speed network card: Ethernet packets just arrive, and you can’t just use more and more DPC time as the receive rate goes up.

Jan

Jan_Bottorff wrote:

The DPC watchdog timer is set REALLY long, 0x500 ticks is about 20 seconds (assuming 16ms ticks).

I believe the term “ticks” here actually refers to 100ns timing units,
so 0x500 actually means 128µs.

And that would make sense, because [the “official” MSFT guideline](https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/guidelines-for-writing-dpc-routines) is to spend “no more than 100us” in your DPC per invocation.

The OP has suspended work on this issue temporarily, but I have counseled him to do his due diligence in terms of debugging/profiling. There is, I suppose, a vanishingly small chance that he’s discovered a Windows bug (there have been ongoing changes to the DMA code to support DMA remapping)… and I don’t have source code to look at right now to be able to run the issue down for him.

Peter

You can get the DPC watchdog timeout information from a dump using !dpcwatchdog. Looking at a DPC_WATCHDOG_VIOLATION dump (thank you Google and tech support forums for thy bounty) you can see the timeout is 20 seconds for a single DPC and 120 seconds cumulative:

Current DPC: tcpip!TcpPeriodicTimeoutHandler (Normal DPC)
Debugger Saved IRQL: 13
Cumulative DPC Time Limit: 120.000 seconds
Current Cumulative DPC Time: 20.000 seconds
Single DPC Time Limit: 20.000 seconds
Current Single DPC Time: 20.016 seconds

This is a Windows 10 RS3 target, so it’s possible the numbers are smaller on RS4/RS5.

Another fun fact for today that I just learned: the DPC watchdog appears to be disabled if you have the debugger attached. I tried to force a machine to crash by infinitely looping on a spinlock in a DPC and nothing happened. Breaking in and running !dpcwatchdog I see this:

Current DPC: Nothing!BuggyDpcRoutine (Normal DPC)
Debugger Saved IRQL: 13
Cumulative DPC Time Limit: 0.000 seconds
Current Cumulative DPC Time: 72.234 seconds
Single DPC Time Limit: 0.000 seconds
Current Single DPC Time: 72.234 seconds

Makes sense that this crash might come as a surprise to someone if they did most of their test/validation with a debugger attached.

I’m pretty sure the DPC watchdog limit is a lot longer than 128 uSec, although 20 seconds does seem like an extremely long time. As I remember, it gets checked in the timer ISR/IPI, so a single 16msec timer tick would be way longer than 128 uSec. Perhaps it’s in opaque DPC_WATCHDOG units. The APIs to retrieve it will just tell you what fraction of the way you are to a timeout. I usually pick a number like 25% as a threshold of when to hand off work to a passive-level thread.
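As a rough illustration of that fraction check (not code from the OP’s driver, and these fields are lightly documented, so verify the behavior on your target OS), a DPC loop could bail out along these lines:

BOOLEAN ShouldHandOffToPassiveLevel(VOID)
{
    KDPC_WATCHDOG_INFORMATION wdInfo;

    if (!NT_SUCCESS(KeQueryDpcWatchdogInformation(&wdInfo))) {
        return TRUE;                  // can't tell; be conservative
    }
    if (wdInfo.DpcTimeLimit == 0) {
        return FALSE;                 // watchdog disabled (debugger/VM); nothing to budget against
    }
    // Assumption: DpcTimeCount counts down from DpcTimeLimit while the current
    // DPC runs. Verify on your target OS; this behavior has changed between releases.
    return (wdInfo.DpcTimeCount < wdInfo.DpcTimeLimit / 4);   // less than ~25% of the budget left
}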

Perhaps a request for a documentation update is in order, to clarify what unit the watchdog “ticks” are in.

It would be pretty easy for the OP to write a few lines of code to measure what the limit actually is. Raise IRQL to DISPATCH_LEVEL, loop around incrementing a counter with a 10 uSec stall in each loop, see how many times you loop before getting a DPC_WATCHDOG crash.
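A sketch of that experiment (names are made up; this deliberately crashes the box, so only run it on a test machine and read the counter out of the dump):

volatile ULONG g_WatchdogLoopCount = 0;    // inspect in the dump after the 0x133 bugcheck

VOID MeasureDpcWatchdogLimit(VOID)
{
    KIRQL oldIrql;

    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);
    for (;;) {
        KeStallExecutionProcessor(10);     // burn roughly 10 uSec per iteration
        g_WatchdogLoopCount += 1;          // loop count * 10 uSec ~= the enforced limit
    }
    // Never reached: the watchdog bugchecks first. (KeLowerIrql(oldIrql);)
}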

There is also another watchdog timer which basically detects when a core is no longer responding to timer ticks. All the cores get timer ticks, and there is code that makes sure all cores are responding to those ticks. A core can sometimes be stuck “outside” OS execution, like trapped back to a hypervisor. I believe the number I’ve recently seen is that if 4 timer ticks happen and a core doesn’t react, a timer watchdog timeout bugcheck happens. If, for example, you have two processor sockets and one socket gets into a fatal error state, half the cores will stop responding to timer ticks. Another case happens if, for example, you go do a PCI config read and the device keeps responding with a retry status; that core is basically locked on that read and not responding to interrupts. A lot of complex I/O devices process PCI config requests in firmware, so if that firmware locks up while the OS is doing a config cycle, what you see is an OS timer watchdog bugcheck.

Even though the watchdog timers can be annoying to developers, I do believe they are a really important feature for detecting hardware and software errors.

Jan

Excellent point!

Jan

There are two different and basically unrelated DPC watchdogs.

The single DPC watchdog measures how long a KDEFERRED_ROUTINE executes. If it’s more than 20 seconds†, you get a bugcheck 133/0.

The cumulative DPC watchdog measures how long a processor stays at DISPATCH_LEVEL (or higher). If it’s more than 120 seconds†, you get a bugcheck 133/1.

Note that the cumulative DPC watchdog doesn’t actually have anything to do with DPCs. You can trigger it from a regular boring thread, just by doing KeRaiseIrql(DISPATCH_LEVEL), then waiting for a couple minutes – no DPC involved. (Another common repro: trying to recursively acquire a spinlock.)

Also note that the cumulative DPC watchdog complains about situations that are not always under your control. If someone ran their crappy code at high IRQL for 119 seconds, then handed the processor to your DPC, you might happen to be the unlucky chap on the processor when the 120 second timer expires. Recent versions of Windows include callstack captures of the last 10 seconds† before a 133/1 is declared, so you can see what other things were going on just prior to the bugcheck.

@“Scott_Noone_(OSR)” said:

Another fun fact for today that I just learned: the DPC watchdog appears to be disabled if you have the debugger attached.

I am not aware of that behavior. However, DPC watchdogs are disabled in most varieties of VM guests†. Use physical hosts, if you need to test against the DPC watchdog.

If you are currently running in a DPC (an actual DPC, not just raised IRQL), you can use KeQueryDpcWatchdogInformation to learn what the DPC watchdog limits are†, and maybe how close you are to exceeding them†. However, using this API correctly requires a deep understanding of the NT kernel and its DPC watchdog. So we generally recommend that you do not use this API, except possibly for debugging / diagnostics. To properly scare you away, let me point out that the returned DpcWatchdogCount has some super strange behavior. (It only makes sense if you are intimately familiar with how the DPC watchdog is implemented. Give me a whiteboard and a beer, and I can explain it in about 20 minutes.) Also, the behavior of DpcWatchdogCount changed somewhere between Windows 8.1 and Windows 10, and is likely to subtly change again.

@Jan_Bottorff said:

Some kinds of drivers will absolutely need DPC_WATCHDOG avoidance code, like probably any driver that can receive unbounded fast data streams. The classic example is a high-speed network card: Ethernet packets just arrive, and you can’t just use more and more DPC time as the receive rate goes up.

^^This^^. If your driver has no bound on the number of DPCs that get queued up, then your driver has a 133/1 bugcheck waiting for it.

† Implementation detail, subject to change in any release, even a QFE. In fact, some of these details have already changed since the DPC watchdog was invented.

On Oct 19, 2018, at 5:18 PM, Jan_Bottorff wrote:
>
> I’m pretty sure the DPC watchdog limit is a lot longer than 128 uSec, although 20 seconds does seem like an extremely long time.

I suppose we should do an experiment to find out, but I think it really is 128us. I encountered this bug check a year or two ago, and it was for an operation much less than a millisecond.

Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.

Thanks very much for all your replies. After 2 weeks of desperate debugging, unfortunately I still cannot fix this problem, but I still want to give an update here. We have opened a ticket with the Microsoft WDK group and provided 2 dump files to Microsoft engineers. However, no solution has been found yet.

I really do not think our DPC function is the root cause. The same driver code has been used for almost 10 years, from Win7 to Win10 (1607, 1703, 1709 and 1803), with no problems. Only on the latest Win10 1809 does the BSOD happen, and it happens consistently (with the same hardware).

My debugging shows the root cause is the “PutScatterGatherList()” call inside our DPC. “PutScatterGatherList()” cannot finish and takes forever to run, which I do not understand.

To prove this, I did another test: I disabled all the interrupts in our driver so that the DPC function never runs. Since the “PutScatterGatherList()” function is also used in our “CancelActiveRead()” function, I can cause another BSOD just by canceling the DMA read, without the DPC running at all.

Our “CancelActiveRead()” function has the code like:

VOID CancelActiveRead(  PDEVICE_OBJECT DeviceObject,   PIRP Irp)
{
    //-----------------------------------------------------------------------------
    PGEMINI_DEVICE_EXT  DevExt = (PGEMINI_DEVICE_EXT)DeviceObject->DeviceExtension;
    KIRQL OldIrql;
    //-----------------------------------------------------------------------------

    DbgPrint("\nCancelActiveRead() enter()\n");

    ......  // Some code here

    //-------------------------------------------------------------------------
    KeAcquireSpinLock(&DevExt->ImageReadStateLock, &OldIrql);

    DbgPrint("\nCancelActiveRead() get ImageReadStateLock\n");

    //
    //  This drives everything practically into reset
    //
    SetPCICardReset(DevExt);

    //
    //  Clear the Transfer length
    //
    ProgramDmaTransferLength (DevExt, 0);

    //
    //  Undo some of the resets
    //
    ClearPCICardReset(DevExt);

    //
    //  Disarm the Card!  - yes we are doing this underneath the
    //  app sort of, except that either the app told us to CancelIo,
    //  or Windows is killing the thread...
    //
    ClearControlRegisterBits( DevExt, CTRL_BIT_ARM );
    SetControlRegisterBits( DevExt, CTRL_BIT_FLUSH_INPUT );

    //
    //  If these are equal, the DPC hasn't seen it yet
    //
    if (DevExt->CurrentReadIrp == Irp)
        DevExt->CurrentReadIrp = NULL;


    //
    //  Return the scatter/gather list we've been using
    //
    //  Third parameter is FALSE because this is a Read
    //  operation
    //
    DbgPrint("\nCancelActiveRead() Call PutScatterGatherList()\n");
    DevExt->ReadAdapter->DmaOperations->PutScatterGatherList(
                                        DevExt->ReadAdapter,
                                        DevExt->ReadScatterGatherList,
                                        FALSE);
    DbgPrint("\nCancelActiveRead() end PutScatterGatherList()\n");
    //
    //  Make sure we don't have a stale pointer
    //
    DevExt->ReadScatterGatherList = NULL;


    KeReleaseSpinLock(&DevExt->ImageReadStateLock, OldIrql);
	DbgPrint("\nCancelActiveRead() release ImageReadStateLock\n");
    //-------------------------------------------------------------------------

    .......  // other codes here
    .......
}

From WinDbg, I got:
"CancelActiveRead() enter()

CancelActiveRead() get ImageReadStateLock

CancelActiveRead() Call PutScatterGatherList()
"
There was no output at all after this.

Since I added two DbgPrints just before and after “PutScatterGatherList()”, I can see that the first DbgPrint is displayed but the second one never shows up. And the target machine freezes.

In our driver, the “PutScatterGatherList()” function is only used in “CancelActiveRead()” and in the DPC function. Therefore I think “PutScatterGatherList()” taking forever to run is the root problem that caused the BSOD. Because it never finishes, it finally triggers the DPC_WATCHDOG_VIOLATION.

I just do not understand what causes “PutScatterGatherList()” to never finish. And I still have no clue how to fix it.

We finally figured out the reason, and I’m giving an update here since it may be helpful for others. After working with Microsoft and Scott (OSR), we figured out that Win10 1809 (RS5) has a bug in the Version 2 DMA adapter. The V2 API can loop forever when the MDL is out of range. We changed to the Version 3 DMA adapter and tested it. So far V3 works and we have not found any issues yet. Thanks a lot for everyone’s help.
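For anyone else hitting this: the change amounts to asking for the V3 adapter when calling IoGetDmaAdapter. A rough WDM-style sketch; the field values other than Version (and the MAX_TRANSFER_LENGTH constant) are illustrative rather than our actual settings, but the shape of the call is the same:

DEVICE_DESCRIPTION devDesc;
ULONG numberOfMapRegisters;

RtlZeroMemory(&devDesc, sizeof(devDesc));
devDesc.Version         = DEVICE_DESCRIPTION_VERSION3;  // was V1/V2 before the fix
devDesc.Master          = TRUE;                         // bus-master DMA
devDesc.ScatterGather   = TRUE;
devDesc.InterfaceType   = PCIBus;
devDesc.MaximumLength   = MAX_TRANSFER_LENGTH;          // hypothetical constant
devDesc.DmaAddressWidth = 64;   // V3 uses DmaAddressWidth rather than the Dma32/64BitAddresses flags

DevExt->ReadAdapter = IoGetDmaAdapter(DevExt->PhysicalDeviceObject,
                                      &devDesc,
                                      &numberOfMapRegisters);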

Can I get more information about the bug in the “Version 2” DMA adapter somewhere? Was this fixed in the 1903 update? Is there a way to work around it other than forcing DMA Version 3 (which is not easily possible on our side atm)?

The reason I am curious is that we are seeing a very similar problem with our PCIe driver on one customer’s PC and nowhere else. They have the 1809 update installed. One set of dump files shows DPC_WATCHDOG_VIOLATION/DPC_QUEUE_EXECUTION_TIMEOUT_EXCEEDED, and the other set always points to our DPC routine at a specific code line - a division-by-zero bugcheck although there is no division at that line (for the record: the only division in our DPC is done several lines before that, and the divisor is a cached performance counter frequency value).

I don’t want to start describing our problem in detail, just trying to figure out if this could be the same problem.

Thank you!

Does the call stack point to your driver calling PutScatterGatherList? That was the footprint of the bug we saw in the V2 version of the structure. It was acknowledged as a bug in 1809, but I didn’t follow up to see if it was fixed or not.

Hi Scott! (former student here) No reference to PutScatterGatherList at all in the stack. We do not actually do any DMA/SG related stuff in the DPC. We just create a DMA transaction, fill a Common Buffer with SG descriptors at the very beginning, and then let the device fill the memory in a circular-buffer fashion - interrupting each time an “image” is DMA’d (it’s an image acquisition device, I mean… camera).
I might be on the wrong path, then, but since the same driver has been working fine for years, the 1809 changes raised some suspicion. I am going to dig deeper, and will perhaps post a separate question if I get completely stuck.

For reference, here are the two kinds of stack traces we are getting from multiple dumps (the customer sees a BSOD about once a day). PMPCIE is our driver.

KMODE_EXCEPTION_NOT_HANDLED
nt!KeBugCheckEx
nt!KiFatalExceptionHandler+0x22
nt!RtlpExecuteHandlerForException+0xf
nt!RtlDispatchException+0x430
nt!KiDispatchException+0x144
nt!KiExceptionDispatch+0xc2
nt!KiDivideErrorFault+0x2ef
PMPCIE!PmEvtInterruptDpc+0xaa [c:\dev\...\pmpcieinterruptsupport.cpp @ 188] 
Wdf01000!FxInterrupt::_InterruptDpcThunk+0x9d [minkernel\wdf\framework\shared\irphandlers\pnp\km\interruptobjectkm.cpp @ 410] 
nt!KiExecuteAllDpcs+0x2e7
nt!KiRetireDpcList+0x1ae
nt!KiIdleLoop+0x5a

DPC_WATCHDOG_VIOLATION
nt!KeBugCheckEx
nt!KeAccumulateTicks+0x144c93
nt!KeClockInterruptNotify+0xcf
hal!HalpTimerClockIpiRoutine+0x21
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatchNoLockNoEtw+0xfa
nt!KiInterruptDispatchNoLockNoEtw+0x37
nt!MiResolveProtoPteFault+0x399
nt!MiDispatchFault+0x359
nt!MmAccessFault+0x15a
nt!KiPageFault+0x343

Can you show us the code in PmEvtInterruptDpc?

Hi Tim, appreciate your help but I am not sure if I can post the whole code. Apologies if the digest below doesn’t help. Perhaps I can send the whole thing via email?

Below is a shortened version and explanation why it’s done this way. Maybe you’ll spot a problem right away.

ISR (not shown here)

  1. ISR can get fired very often, in extreme cases it could be 20000x a second (e.g. at 20000 FPS).
  2. HOWEVER: In this particular case the rate would be much lower around 10 ISRs a second.
  3. Since we had issues with skipped frames at very high rates we implemented an “ISR event queueing”, as described here.
  4. ISR is called on every frame written, i.e. the device interrupts: “hey, I just DMA’d a frame into your circ. buffer”
  5. ISR retrieves a preallocated event item (ExInterlockedPopEntrySList), fills the event with a simple frame counter and timestamp from KeQueryPerformanceCounter.
  6. ISR queues the event into a custom ISR-DPC queue (ExInterlockedInsertTailList, max 64 items capacity).
  7. ISR then calls WdfInterruptQueueDpcForIsr (if the same interrupt arrives again before the DPC executes, WdfInterruptQueueDpcForIsr should fail, but the ‘event’ is still queued before that); a rough sketch of this pattern follows the list.
  8. ISR routine leaves.
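
A compressed, hypothetical sketch of steps 4 to 7 above (our real types and helpers aren’t shown, so the PM_IRQ_EVENT members, ReadAndAckInterruptStatus, and the list/lock names here are placeholders):

BOOLEAN PmEvtInterruptIsr(WDFINTERRUPT WdfInterrupt, ULONG MessageID)
{
    PDEVICE_EXTENSION devExt = GetDeviceContext(WdfInterruptGetDevice(WdfInterrupt));
    ULONG intStatus = ReadAndAckInterruptStatus(devExt);    // placeholder hardware access

    UNREFERENCED_PARAMETER(MessageID);

    if (intStatus == 0) {
        return FALSE;                                       // not our interrupt
    }

    // Grab a preallocated event record; the fixed-size free list bounds memory use.
    PSLIST_ENTRY entry = ExInterlockedPopEntrySList(&devExt->FreeIrqEventList,
                                                    &devExt->FreeIrqEventLock);
    if (entry != NULL) {
        PPM_IRQ_EVENT irqEvt = CONTAINING_RECORD(entry, PM_IRQ_EVENT, SListEntry);
        irqEvt->InterruptMask = intStatus;
        irqEvt->EofCount      = ++devExt->EofCount;
        irqEvt->TimeStamp     = KeQueryPerformanceCounter(NULL);

        // Hand it to the DPC through the ISR-to-DPC queue.
        ExInterlockedInsertTailList(&devExt->IrqEventQueue,
                                    &irqEvt->ListEntry,
                                    &devExt->IrqEventQueueLock);
    }

    // If a DPC is already queued this returns FALSE, but the event above is
    // still in the queue and will be drained on that pass.
    WdfInterruptQueueDpcForIsr(WdfInterrupt);
    return TRUE;
}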

DPC (below)

  1. DPC iterates over the ‘ISR event queue’ (PmIrqEventDequeueExInterlockedRemoveHeadList)
  2. DPC checks the mask for the EOF event. (there might be different flags used for other purposes)
  3. DPC queues a ‘notification’ event into another, different queue, this queue is processed by a thread on passive level that sends notifications to user space.
  4. (in short: there is a queue between ISR and DPC and different queue between DPC and passive level thread)
  5. Meanwhile, the ISR may get fired again, pushing another event to the queue; that’s why we limit the DPC with MAX_IRQ_EVENTS.

So far this has been working fine, although I’m not an experienced kernel developer - if there is a problem with the approach itself I’ll be happy to learn. I am also investigating the possibility that the division by Frequency really is the problem and the Frequency gets corrupted (it’s updated at the start of the acquisition). But again, this has been flawless for years, and it still is on 99% of the computers we tested, even with very high frame rates. Just the one computer at the customer’s site reports these weird bugchecks so far.

Function itself

VOID PmEvtInterruptDpc(IN WDFINTERRUPT WdfInterrupt, IN WDFOBJECT WdfDevice) {
    PDEVICE_EXTENSION DeviceExtension = GetDeviceContext(WdfDevice);
    ULONG ulProcessed = 0;
    PPM_IRQ_EVENT irqEvt = NULL;

    while ((irqEvt = PmIrqEventDequeue(DeviceExtension)) != NULL) {
        if( irqEvt->InterruptMask & PMPCIE_INT_MASK_EOF) {
            // We need to store the last timestamps because these are still used with status IOCTL
            DeviceExtension->TimeStamp = irqEvt->TimeStamp;
            DeviceExtension->TimeStamp.QuadPart -= DeviceExtension->AcquisitionStartTime.QuadPart;
            DeviceExtension->TimeStamp.QuadPart *= RES100US;
            DeviceExtension->TimeStamp.QuadPart /= DeviceExtension->Frequency.QuadPart; //< **** DIVISION HERE! ****

            PM_ACQ_METADATA acqMd;
            acqMd.EofCount = irqEvt->EofCount;
            acqMd.TimeStampEof = DeviceExtension->TimeStamp.QuadPart;
            PmQueueAcqEvent(DeviceExtension,PM_EVENT_DING_NOTIFICATIONS,(void *)NOTIFICATION_EOF, &acqMd); //< **** LINE 188 ****, bugcheck sometimes reported here, division by zero
            KeSetEvent(&DeviceExtension->AcqEvent,0,FALSE);
            irqEvt->InterruptMask &= (ULONG)(~PMPCIE_INT_MASK_EOF); // Clear the flag
        }
        ulProcessed++;
        // Push the event object back for reuse
        PmIrqEventReturnForReuse(DeviceExtension, irqEvt);
        // Since we could theoretically spend infinite time in this loop (because the ISR
        // can keep filling the queue), we should stop at some point and give up.
        if (ulProcessed > MAX_IRQ_EVENTS) // Max 64 items
        {
            DbgPrint("PMPCIE: Too many IRQs pending in DPC, quitting.");
            break;
        }
    }
    return;
}

You might add ASSERT(DeviceExtension->Frequency.QuadPart); just before the division.

Is RES100US equal to 10,000? Remember that, on some systems, the frequency returned by KeQueryPerformanceCounter is the CPU clock, and in that case it wouldn’t take much of a multiplication to overflow a 64-bit integer.
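For illustration (mirroring the posted DPC code and assuming RES100US is 10,000), the scaling can be split so the intermediate product stays inside 64 bits even when the counter delta covers hours of a multi-GHz clock:

LONGLONG delta = irqEvt->TimeStamp.QuadPart - DeviceExtension->AcquisitionStartTime.QuadPart;
LONGLONG freq  = DeviceExtension->Frequency.QuadPart;

ASSERT(freq != 0);   // the guard suggested above against the divide-by-zero bugcheck

// Whole seconds first, then the sub-second remainder; (delta % freq) < freq,
// so the second multiplication cannot overflow for any sane frequency.
DeviceExtension->TimeStamp.QuadPart = (delta / freq) * RES100US
                                    + ((delta % freq) * RES100US) / freq;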

Have you disassembled the code to see if the exception is really occurring on a div instruction? It could be another case of optimization rearrangement. If you look at the code, it doesn’t actually NEED to compute the division until it passes acqMd to PmQueueAcqEvent.