Please help. PCI driver DMA DPC cause BSOD on Win10 1809 version

Hi Tim, appreciate your help but I am not sure if I can post the whole code. Apologies if the digest below doesn’t help. Perhaps I can send the whole thing via email?

Below is a shortened version and explanation why it’s done this way. Maybe you’ll spot a problem right away.

ISR (not shown here)

  1. ISR can get fired very often, in extreme cases it could be 20000x a second (e.g. at 20000 FPS).
  2. HOWEVER: In this particular case the rate would be much lower around 10 ISRs a second.
  3. Since we had issues with skipped frames at very high rates we implemented an “ISR event queueing”, as described here.
  4. ISR is called on every frame written, i.e. the device interrupts: “hey, I just DMA’d a frame into your circ. buffer”
  5. ISR retrieves a preallocated event item (ExInterlockedPopEntrySList), fills the event with a simple frame counter and timestamp from KeQueryPerformanceCounter.
  6. ISR queues the event into a custom ISR-DPC queue (ExInterlockedInsertTailList, max 64 items capacity).
  7. ISR then calls WdfInterruptQueueDpcForIsr (if the same interrupt arrives again before the DPC is executed the WdfInterruptQueueDpcForIsr should fail but the ‘event’ is still queued before that).
  8. ISR routine leaves.

DPC (below)

  1. DPC iterates over the ‘ISR event queue’ (PmIrqEventDequeueExInterlockedRemoveHeadList)
  2. DPC checks the mask for the EOF event. (there might be different flags used for other purposes)
  3. DPC queues a ‘notification’ event into another, different queue, this queue is processed by a thread on passive level that sends notifications to user space.
  4. (in short: there is a queue between ISR and DPC and different queue between DPC and passive level thread)
  5. Meanwhile, the ISR may get fired again, pushing another event to the queue, that’s why we limit the DPC with MAX_IRQ_EVENTS

So far this has been working fine although I’m not an experience kernel developer - if there is a problem with the approach itself I’ll be happy to learn. I am also investigating a possibility that the division by Frequency is really the problem and the Frequency gets corrupted (it’s updated at the start of the acquisition). But again, this has been flawless for years and it still is on 99% of computers we tested, even with very high frame rates. Just the one computer on customer’s site reports these weird bugchecks so far.

Function itself

VOID PmEvtInterruptDpc(IN WDFINTERRUPT WdfInterrupt, IN WDFOBJECT WdfDevice) {
    PDEVICE_EXTENSION DeviceExtension = GetDeviceContext(WdfDevice);
    ULONG ulProcessed = 0;
    PPM_IRQ_EVENT irqEvt = NULL;

    while ((irqEvt = PmIrqEventDequeue(DeviceExtension)) != NULL) {
        if( irqEvt->InterruptMask & PMPCIE_INT_MASK_EOF) {
            // We need to store the last timestamps because these are still used with status IOCTL
            DeviceExtension->TimeStamp = irqEvt->TimeStamp;
            DeviceExtension->TimeStamp.QuadPart -= DeviceExtension->AcquisitionStartTime.QuadPart;
            DeviceExtension->TimeStamp.QuadPart *= RES100US;
            DeviceExtension->TimeStamp.QuadPart /= DeviceExtension->Frequency.QuadPart; //< **** DIVISION HERE! ****

            PM_ACQ_METADATA acqMd;
            acqMd.EofCount = irqEvt->EofCount;
            acqMd.TimeStampEof = DeviceExtension->TimeStamp.QuadPart;
            PmQueueAcqEvent(DeviceExtension,PM_EVENT_DING_NOTIFICATIONS,(void *)NOTIFICATION_EOF, &acqMd); //< **** LINE 188 ****, bugcheck sometimes reported here, division by zero
            KeSetEvent(&DeviceExtension->AcqEvent,0,FALSE);
            irqEvt->InterruptMask &= (ULONG)(~PMPCIE_INT_MASK_EOF); // Clear the flag
        }
        ulProcessed++;
        // Push the event object back for reuse
        PmIrqEventReturnForReuse(DeviceExtension, irqEvt);
        // Since we could theoretically spend infinite time in this loop (because ISR
        // can keep filling the queue, we should stop at some point and give up)
        if (ulProcessed > MAX_IRQ_EVENTS) // Max 64 items
        {
            DbgPrint("PMPCIE: Too many IRQs pending in DPC, quitting.");
            break;
        }
    }
    return;
}

You might add ASSERT(DeviceExtension->Frequency.QuadPart); just before the division.

Is RES100US equal to 10,000? Remember that, on some systems, the frequency in KeQueryPerformanceCounter returns the CPU clock, and in that case it wouldn’t take much of a multiplication to overflow a 64-bit integer.

Have you disassembled the code to see if the exception is really occurring on a div instruction? It could be another case of optimization rearrangement. If you look at the code, it doesn’t actually NEED to compute the division until it passes acqMd to PmQueueAcqEvent.

Can you please show us the code where the divide is being done?

Peter

Peter, the division is done right on this line:
DeviceExtension->TimeStamp.QuadPart /= DeviceExtension->Frequency.QuadPart;

Tim, yes RES100US is defined as 10000. Not sure about the overflow you mentioned, even if the dividend would overflow to zero it’s the divisor Frequency that worries me (unless I misunderstood completely).

Good point about the optimization! WinDbg is clear in this regard:

FAULTING_IP: 
PMPCIE!PmEvtInterruptDpc+aa [c:\...\pmpcieinterruptsupport.cpp @ 188]
fffff802`80339926 48f7bfe0130000  idiv    rax,qword ptr [rdi+13E0h]

And the disassembly (may not exactly correspond to the code in above post)

    fffff802`803398fa 747d           je      PMPCIE!PmEvtInterruptDpc+0xfd (fffff802`80339979)
    fffff802`803398fc 488b4b30       mov     rcx, qword ptr [rbx+30h]
    fffff802`80339900 4c8d4dd0       lea     r9, [rbp-30h]
    fffff802`80339904 482b8fd8130000 sub     rcx, qword ptr [rdi+13D8h]
    fffff802`8033990b 4c8b87f8130000 mov     r8, qword ptr [rdi+13F8h]
    fffff802`80339912 4869c110270000 imul    rax, rcx, 2710h
    fffff802`80339919 4c8987f0130000 mov     qword ptr [rdi+13F0h], r8
    fffff802`80339920 4899           cqo     
    fffff802`80339922 4c8945d8       mov     qword ptr [rbp-28h], r8
    fffff802`80339926 48f7bfe0130000 idiv    rax, qword ptr [rdi+13E0h] // <===
    fffff802`8033992d 8bd0           mov     edx, eax
    fffff802`8033992f 488987e8130000 mov     qword ptr [rdi+13E8h], rax
    fffff802`80339936 412bd0         sub     edx, r8d
    fffff802`80339939 488945e0       mov     qword ptr [rbp-20h], rax
    fffff802`8033993d 899700140000   mov     dword ptr [rdi+1400h], edx
    fffff802`80339943 8b4b28         mov     ecx, dword ptr [rbx+28h]
    fffff802`80339946 894dd0         mov     dword ptr [rbp-30h], ecx
    fffff802`80339949 8b4b2c         mov     ecx, dword ptr [rbx+2Ch]
    fffff802`8033994c 8955e8         mov     dword ptr [rbp-18h], edx
    fffff802`8033994f ba02000000     mov     edx, 2
    fffff802`80339954 894dd4         mov     dword ptr [rbp-2Ch], ecx
    fffff802`80339957 488bcf         mov     rcx, rdi
    fffff802`8033995a 448d4206       lea     r8d, [rdx+6]
    fffff802`8033995e e801efffff     call    PMPCIE!PmQueueAcqEvent (fffff802`80338864)

Now assuming the Frequency is really the culprit - which means we must have a synchronization problem - this raises a question:

Can a div/0 exception in DPC lead to a DPC_WATCHDOG_VIOLATION? Perhaps the OS is saving the stack trace, handling the exception, further delaying the DPC, causing the watchdog to cause a bugcheck sooner, essentially hiding the root cause in the other dump file?

In any case I now cache the Frequency only once, in EvtDeviceAdd (MSDN says it’s set at system boot and never changes if we target modern OSes) and waiting for the customer to see if that helps. If yes, it will confirm the synchronization problem.

My caution about overflow was not related to your div by zero problem. I’m just pointing out the potential math problems. I don’t think it’s an issue for you, because your multiplier is smallish, but let me describe how this kind of code can cause problems.

On some systems, KeQueryPerformanceCounter is based on the CPU’s cycle counter. That means the frequency is the clock frequency. Say your CPU is 3.6 GHz. That means the counter overflows a 32-bit integer every second. That’s OK, the field is 64-bits wide.

But some people try to convert to Windows’ 100ns units by multiplying the counter by 10,000,000 and dividing by the frequency. That’s a problem, because it only takes 9 minutes at 3.6GHz for cycle counter x 10,000,000 to overflow a 64-bit integer. You’ll still get a result, but truncated.

Unlikely that the divide by zero is causing the watchdog. For the DPC_WATCHDOG_VIOLATION post the output of the following:

!running -ti

Unfortunately !running -ti gives:

6: kd> !running -ti
GetUlongFromAddress: unable to read from fffff80126b602c4
Couldn't find processor info

Probably because this is an “Automatic” small minidump. Also it seems we actually have only one dump with this watchdog error so far. It might be completely unrelated - the customer simply sent the whole Minidump folder… So far no reports on further crashes, fingers crossed this was the div/0 problem.

Yeah, that’s a minidump and it’s not going to be much use for the DPC_WATCHDOG problem. The clipped version of the stack you provided looks reasonably normal, so the problem is probably on another processor. With a minidump you only get the one processor so it’s a dead end. We’ll all just pretend we didn’t see that one :slight_smile:

Can a div/0 exception in DPC lead to a DPC_WATCHDOG_VIOLATION?

It’s not necessary div/0 exception, it also might be division overflow (happens when the result of division doesn’t fit to AX). And it actually fits the instruction sequence you posted above:

cqo

idiv

if rax has MSB set and the divisor is too small, it might lead to the exception.