Inverted IOCTL completion hangs after thousands of successful completions?

In the WDF portion of the driver for my GPU, I’m using the Inverted IOCTL mechanism to send events (framebuffer flips, hardware command completions, etc.) back to the usermode application that currently owns and controls the hardware. On the usermode side, I create a completion port bound to the device, enqueue 16 separate IOCTL requests through it, and have a thread that continually checks whether the completion port has any completed requests. When a request completes, it is immediately re-enqueued. On the kernel driver side, an IO queue is created and the inverted IOCTLs are forwarded to it; when an event occurs, a request is pulled from the queue and completed with the event information. It’s basically the same setup described in the OSR blog post on this topic.
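
For context, here is a minimal sketch of the usermode pump just described. The device path, the IOCTL code (IOCTL_FURYGPU_GET_EVENT), and the payload layout are placeholders, not the driver’s real interface:

#include <windows.h>
#include <winioctl.h>

// Placeholder control code; the real driver defines its own.
#define IOCTL_FURYGPU_GET_EVENT \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

struct EventRequest {
    OVERLAPPED Overlapped;   // one OVERLAPPED per in-flight inverted IOCTL
    BYTE       Payload[64];  // hypothetical event payload
};

int main()
{
    HANDLE device = CreateFileW(L"\\\\.\\FuryGPU", GENERIC_READ | GENERIC_WRITE,
                                0, nullptr, OPEN_EXISTING, FILE_FLAG_OVERLAPPED,
                                nullptr);
    if (device == INVALID_HANDLE_VALUE)
        return 1;

    // Bind the device handle to a new completion port.
    HANDLE port = CreateIoCompletionPort(device, nullptr, 0, 0);

    // Keep 16 inverted IOCTLs parked in the driver at all times.
    EventRequest requests[16] = {};
    for (EventRequest& r : requests) {
        DeviceIoControl(device, IOCTL_FURYGPU_GET_EVENT, nullptr, 0,
                        r.Payload, sizeof(r.Payload), nullptr, &r.Overlapped);
        // FALSE with GetLastError() == ERROR_IO_PENDING is the expected result.
    }

    // Pump: process each completed event, then immediately re-enqueue it.
    for (;;) {
        DWORD bytes = 0; ULONG_PTR key = 0; OVERLAPPED* ov = nullptr;
        if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE) && !ov)
            break;  // the wait itself failed (per-request errors elided here)
        EventRequest* r = CONTAINING_RECORD(ov, EventRequest, Overlapped);
        // ... handle the event in r->Payload ...
        DeviceIoControl(device, IOCTL_FURYGPU_GET_EVENT, nullptr, 0,
                        r->Payload, sizeof(r->Payload), nullptr, &r->Overlapped);
    }
}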

This works very well… for a time. After running a usermode application that ends up getting tens of thousands of these event notifications through this mechanism, the driver seems to hit a deadlock calling WdfRequestCompleteWithInformation:

nt!KiAcquireKobjectLockSafe+0x30
nt!KiExitDispatcher+0x195
nt!KeInsertQueueEx+0x113
nt!IopInsertIrpInCompletionQueue+0x79
nt!IopCompleteIrpInFileObjectList+0x4d
nt!IopfCompleteRequest+0x5f7
nt!IofCompleteRequest+0x17
Wdf01000!FxIrp::CompleteRequest+0x13 [minkernel\wdf\framework\shared\inc\private\km\FxIrpKm.hpp @ 75]
Wdf01000!FxRequest::CompleteInternal+0x23a [minkernel\wdf\framework\shared\core\fxrequest.cpp @ 869]
Wdf01000!FxRequest::Complete+0x31 [minkernel\wdf\framework\shared\inc\private\common\FxRequest.hpp @ 805]
Wdf01000!FxRequest::CompleteWithInformation+0x3c [minkernel\wdf\framework\shared\inc\private\common\FxRequest.hpp @ 820]
Wdf01000!imp_WdfRequestCompleteWithInformation+0xa1 [minkernel\wdf\framework\shared\core\fxrequestapi.cpp @ 571]
FuryGPU_WDF!WdfRequestCompleteWithInformation+0x4f [C:\Program Files (x86)\Windows Kits\10\Include\wdf\kmdf\1.15\wdfrequest.h @ 1062]
FuryGPU_WDF!KmdWdfGlobal::NotifyEventCallback+0x99 [E:\FPGA_Projects\fury_gpu\driver\host\windows\FuryGPU_WDDM\FuryGPU_WDF\WdfDriver.cpp @ 409]
FuryGPU_KMD!FuryKmAdapter::InterruptRoutine+0x19d [E:\FPGA_Projects\fury_gpu\driver\host\windows\FuryGPU_WDDM\FuryGPU_KMD\KmdAdapter.cpp @ 553]
FuryGPU_KMD!KmdDdi::DdiInterruptRoutine+0x2a [E:\FPGA_Projects\fury_gpu\driver\host\windows\FuryGPU_WDDM\FuryGPU_KMD\KmdDdi.cpp @ 92]
dxgkrnl!DpiFdoMessageInterruptRoutine+0x5c
nt!KiInterruptMessageDispatch+0x11
nt!KiCallInterruptServiceRoutine+0xa5
nt!KiInterruptSubDispatch+0x11f
nt!KiInterruptDispatch+0x37

I’ve been unable to figure out why this suddenly happens after working correctly for an extremely large number of events. When it does happen, the entire machine locks up, and I’m unable to investigate in the debugger what the usermode application was trying to do.

Any insight would be greatly appreciated!

On the off-chance it might have been the issue, I changed the interrupt handler to just fire off a DPC that eventually does the WDF request completion. It hasn’t triggered this specific deadlock again with that change (yet), so that may have been the problem.
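
The shape of the change is roughly the following; the names (g_EventDpc, EventDpcRoutine) and the hardware access are illustrative, not the actual driver code:

#include <wdm.h>

KDPC g_EventDpc;  // KeInitializeDpc(&g_EventDpc, EventDpcRoutine, Context) at start

BOOLEAN InterruptRoutine(PVOID Context)
{
    // Runs at DIRQL: read and acknowledge the interrupt status, nothing more.
    // ... AckInterrupt(Context) ...

    // KeInsertQueueDpc returns FALSE if the DPC is already queued; that's
    // fine, because one DPC run drains everything that has accumulated.
    KeInsertQueueDpc(&g_EventDpc, nullptr, nullptr);
    return TRUE;  // the interrupt was ours
}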

A full memory dump will include the usermode context, so you can investigate that side on a hung system.

You should run some Driver Verifier tests.
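
For example, the standard checks can be enabled from an elevated prompt, followed by a reboot (binary names inferred from the stack trace; adjust to the actual .sys names):

verifier /standard /driver FuryGPU_WDF.sys FuryGPU_KMD.sys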

Directly completing a WDF request from an ISR is a problem. Your IRQL is too high. Queueing a DPC is the standard approach. In the DPC you should attempt to complete as many requests as possible, not just the one that triggered your ISR. Not only will you avoid a common bug (hardware-coalesced interrupts), but your performance will improve tangibly for bursts of traffic.
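
A minimal sketch of such a draining DPC, assuming a manual-dispatch queue g_NotifyQueue that holds the parked inverted IOCTLs, plus a hypothetical EVENT_INFO payload and HardwarePopCompletedEvent() accessor:

#include <ntddk.h>
#include <wdf.h>

typedef struct _EVENT_INFO { ULONG Type; ULONG64 Data; } EVENT_INFO;  // hypothetical

extern WDFQUEUE g_NotifyQueue;  // manual queue of parked inverted IOCTLs
BOOLEAN HardwarePopCompletedEvent(PVOID Context, EVENT_INFO* Info);  // hypothetical

VOID EventDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    EVENT_INFO info;

    // Complete every event the hardware has posted, not just one: with
    // coalesced interrupts, a single DPC run may owe several completions.
    while (HardwarePopCompletedEvent(Context, &info)) {
        WDFREQUEST request;
        if (!NT_SUCCESS(WdfIoQueueRetrieveNextRequest(g_NotifyQueue, &request))) {
            // No inverted IOCTL parked right now; stash the event so a later
            // request can deliver it (not shown).
            break;
        }

        PVOID outBuf;
        if (NT_SUCCESS(WdfRequestRetrieveOutputBuffer(request, sizeof(info),
                                                      &outBuf, nullptr))) {
            RtlCopyMemory(outBuf, &info, sizeof(info));
            WdfRequestCompleteWithInformation(request, STATUS_SUCCESS,
                                              sizeof(info));
        } else {
            WdfRequestComplete(request, STATUS_BUFFER_TOO_SMALL);
        }
    }
}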

That makes sense! I’m already sort of doing that, but I still need to modify the hardware to support multiple DMA completions between interrupts. I’ll set up the driver to loop over the completion status until the hardware reports everything is complete, and keep the hardware from raising additional interrupts until that pass finishes.

The standard term for this is interrupt moderation, and it can have a dramatic impact on performance.

The highest-performing devices maintain a queue of pending ‘operations’ in the hardware, queueing in the driver only those operations that exceed the hardware queue’s limits. And use a hybrid interrupt / polling approach for detecting completions.
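
One hypothetical shape of that hybrid, reusing the names from the DPC sketch above: the ISR masks the event interrupt and queues this DPC, which polls until the completion ring goes quiet before re-arming. Every helper here is illustrative:

// Hypothetical helpers: mask/unmask the device's completion interrupt,
// peek the ring, and complete one parked inverted IOCTL.
void DisableEventInterrupt(PVOID Context);
void EnableEventInterrupt(PVOID Context);
BOOLEAN HardwareHasPostedCompletion(PVOID Context);
void CompleteOneInvertedIoctl(PVOID Context, EVENT_INFO* Info);

VOID HybridCompletionDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    // Assumes the ISR already masked the event interrupt before queueing us.
    for (;;) {
        // Poll: complete everything currently posted, interrupts still masked.
        EVENT_INFO info;
        while (HardwarePopCompletedEvent(Context, &info))
            CompleteOneInvertedIoctl(Context, &info);  // as in the DPC above

        // Re-arm the interrupt, then re-check the ring to close the race
        // where a completion landed between the last poll and the re-arm.
        EnableEventInterrupt(Context);
        if (!HardwareHasPostedCompletion(Context))
            break;  // quiet: the next completion will raise an interrupt

        DisableEventInterrupt(Context);  // more arrived; keep polling
    }
}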

> And use a hybrid interrupt / polling approach for detecting completions

Or, even JUST a polling approach.