Memory leak in queues

I need to know what I am doing wrong. We have a legacy driver that takes an IOCTL request and performs a DMA transfer on a single frame.
One optimization that was asked for was quad-frame DMA transfers. The hardware supports this.
After some help from the OSR group, I got that running. Now it seems I have a memory leak in the 4 queues that handle the video DMA transfers.
I take the original request from the IOCTL and create 4 driver-allocated requests.
My allocated requests have an AllocRequest boolean in their context, so I can tell that a request was allocated by the driver.
I reference count the allocated requests, and I know I am not losing track of any of them.
Normally, to complete a request, WdfRequestCompletexxxx is called, but the allocated requests are special: you must call WdfObjectDelete.
This is how I complete the request. This is how I got it running and was able to shut down normally.
WdfRequestStopAcknowledge and WdfRequestCancelSentRequest allow the driver to shut down on power down.
The need to call those 2 functions may be a symptom of the problem.
One of my thoughts is that, since the driver does not call a completion routine and just calls WdfObjectDelete,
maybe WDF never gets the notification it needs to remove the request from the queue.
The memory leak is that all my allocated requests are stuck in the 4 queues, even though I think I have removed them.
The basic scheme is to remove the allocated requests from the DMA queues and place them into a collection to be freed later.

This is the delete request subroutine.

static VOID HdDeleteRequest(PHD_DEVICE_CONTEXT devContext, WDFREQUEST Request_)
{
	PREQUEST_CONTEXT_DMA ReqContext_ = GetRequestContext_DMA( Request_ );
	if ( ( ReqContext_ != NULL ) && ( ReqContext_->AllocRequest ) )
	{
		// Was made by WdfRequestCreate
		//
		WdfRequestStopAcknowledge( Request_, FALSE );	/* Don't requeue. */
		WdfRequestCancelSentRequest( Request_ );
		WdfObjectDelete( Request_ );
	}
	else
	{
		WdfRequestCompleteWithPriorityBoost( Request_, STATUS_SUCCESS, 0 );
	}
}

This is the loop used to go through the delay request completion collection.
The queued data is moved to the collection to be disposed of later.

WDFREQUEST Request_ = (WDFREQUEST) WdfCollectionGetFirstItem( devContext->Collection_DpcDelayRequestComplete );
while(Request_)
{
	WdfCollectionRemoveItem( devContext->Collection_DpcDelayRequestComplete, 0 );
	HdDeleteRequest( devContext, Request_ );
	Request_ = (WDFREQUEST) WdfCollectionGetFirstItem( devContext->Collection_DpcDelayRequestComplete );
}

This is, I think, where the problem of the allocated requests being stuck in the queue first shows up.
Normally, in the FindRequest / RetrieveFoundRequest logic, you need to call WdfObjectDereference after RetrieveFoundRequest.
But if I do that, I get a BSOD. So I have to check for allocated requests after RetrieveFoundRequest and not call WdfObjectDereference for them.
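
A toy model in plain C may help picture the reference rule here. This is not KMDF code: `ToyRequest`, `toy_find`, `toy_retrieve`, and `toy_deref` are hypothetical stand-ins. The model only assumes what the WdfIoQueueFindRequest documentation describes, namely that a find adds a reference which the caller must drop later. Whether driver-created requests in this driver follow the same accounting is exactly the open question, so treat this as a sketch of the bookkeeping, not as an explanation of the BSOD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a framework request object (not a KMDF type). */
typedef struct ToyRequest {
    int  refCount;   /* outstanding references */
    bool queued;     /* still sitting in the toy queue */
    bool destroyed;  /* set when the last reference is dropped */
} ToyRequest;

/* A request starts with one reference, held by its owner. */
static void toy_init(ToyRequest *r)
{
    assert(r != NULL);
    r->refCount  = 1;
    r->queued    = true;
    r->destroyed = false;
}

/* Models a "find": the request stays queued and the caller gets an extra reference. */
static ToyRequest *toy_find(ToyRequest *r)
{
    if (r == NULL || !r->queued) {
        return NULL;
    }
    r->refCount++;
    return r;
}

/* Models a "retrieve": removes the request from the queue, count unchanged. */
static bool toy_retrieve(ToyRequest *r)
{
    if (r == NULL || !r->queued) {
        return false;
    }
    r->queued = false;
    return true;
}

/* Drops one reference. Returns false on an over-release (count already zero). */
static bool toy_deref(ToyRequest *r)
{
    if (r == NULL || r->refCount <= 0) {
        return false;
    }
    if (--r->refCount == 0) {
        r->destroyed = true;
    }
    return true;
}
```

In this model a find followed by a retrieve and one dereference leaves the count where it started. A dereference that has no matching find would drop the owner's reference and destroy the object while it is still in use.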

This is the subroutine to Dereference the request.

static VOID HdObjectDereference(WDFREQUEST Request)
{
	PREQUEST_CONTEXT_DMA ReqContext_ = GetRequestContext_DMA(Request);
	if ((ReqContext_ != NULL) && (ReqContext_->AllocRequest))
	{
	}
	else
	{
		WdfObjectDereference(Request);
	}
}

This is the code block with FindRequest / RetrieveFoundRequest.
Note that it calls the HdObjectDereference subroutine above.
I added code to check for invalid ntStatus_ results; there are only successful removals.

WDFREQUEST		Request_ = NULL;
WDF_REQUEST_PARAMETERS	RequestParams_;
WDF_REQUEST_PARAMETERS_INIT(&RequestParams_);

ntStatus_ = WdfIoQueueFindRequest( Queue_, RequestPrev_, NULL, &RequestParams_, &Request_ );
ntStatus_ = WdfIoQueueRetrieveFoundRequest( Queue_, Request_, &Request );
HdObjectDereference(Request_);

The code that does this comes almost exactly from this article.

https://github.com/MicrosoftDocs/windows-driver-docs-ddi/blob/staging/wdk-ddi-src/content/wdfio/nf-wdfio-wdfioqueuefindrequest.md

To shut down / power down, I had to add a QueueIOStop routine.
The QueueIOStop routine is never called, but if I don’t add the function, the system will not shut down.
This is the empty routine, followed by the startup code that sets up the queues.

static VOID HdIODMAQueueIOStop( _In_ WDFQUEUE Queue, _In_ WDFREQUEST Request, _In_ ULONG ActionFlags )
{
}

This is how I build the DMA queues. The quad video frames are in DMA channels 32 through 35.

// - - - - - - - - - - - - - - - - - - - - - - - - - - - - MANUAL QUEUE DMA
//
for (int Channel_ = 0; Channel_ < MAX_DMA_CHANNELS; Channel_++)
{
	WDF_IO_QUEUE_CONFIG ioQueueConfig;
	WDF_IO_QUEUE_CONFIG_INIT( &ioQueueConfig, WdfIoQueueDispatchManual );
	ioQueueConfig.PowerManaged = WdfTrue;

	if( Channel_ >= 32 && Channel_ <= 35 )
	{
		ioQueueConfig.EvtIoStop = HdIODMAQueueIOStop;
	}
	ntStatus_ = WdfIoQueueCreate ( Device, &ioQueueConfig, WDF_NO_OBJECT_ATTRIBUTES, &devContext_->QueueDMA[Channel_] );
}

This is how I create the delay request collection on startup.

// - - - - - - - - - - - - - - - - - - - - - DPC DELAYED REQUEST COLLECTION
//
WDF_OBJECT_ATTRIBUTES CollectionAttributes;
WDF_OBJECT_ATTRIBUTES_INIT(&CollectionAttributes);

CollectionAttributes.ParentObject = Device;
ntStatus_ = WdfCollectionCreate( &CollectionAttributes, &devContext_->Collection_DpcDelayRequestComplete );

When the IOCTL is called with the original request, the MDL list contains the 4 video frames.
This is the loop that creates the 4 Request_'s from the original Request.

WDFIOTARGET Target_ = WdfDeviceGetIoTarget( devContext->WdfDevice );
for( int Quad_ = 0; NT_SUCCESS( ntStatus_ ) && ( Quad_ < 4 ); Quad_++ )
{
	WDFREQUEST Request_ = NULL;
	ntStatus_ = WdfRequestCreate( WDF_NO_OBJECT_ATTRIBUTES, Target_, &Request_ );

	if( NT_SUCCESS( ntStatus_ ) )
	{
		PREQUEST_CONTEXT_DMA ReqContext_ = NULL;
		{
			WDF_OBJECT_ATTRIBUTES RequestAttributes_;
			WDF_OBJECT_ATTRIBUTES_INIT_CONTEXT_TYPE( &RequestAttributes_, REQUEST_CONTEXT_DMA );

			RequestAttributes_.EvtCleanupCallback = HdRequestCleanup;
			ntStatus_ = WdfObjectAllocateContext( Request_, &RequestAttributes_, (PVOID *) &ReqContext_ );
		}

		if( NT_SUCCESS( ntStatus_ ) )
		{
			ReqContext_->AllocRequest  = true;
			ReqContext_->Channel       = (UCHAR) pHdioDma_->Q[ Quad_ ].Channel;
			....

			WDFDMATRANSACTION DmaTransaction_ = HdAcquireDmaTransaction( devContext, ReqContext_->Channel );
			if( DmaTransaction_ != NULL )
			{
				ReqContext_->DmaTransaction = DmaTransaction_;

				PMDL mdl;
				ntStatus_ = WdfRequestRetrieveOutputWdmMdl( Request, &mdl );
				if( NT_SUCCESS( ntStatus_ ) )
				{
					PVOID virtualAddress = MmGetMdlVirtualAddress( mdl );
					ULONG length         = MmGetMdlByteCount(      mdl );
					ntStatus_ = WdfDmaTransactionInitialize( ReqContext_->DmaTransaction, HdEvtProgramDma_QuadFrame, WdfDmaDirectionWriteToDevice, mdl, virtualAddress, length );
					if( NT_SUCCESS( ntStatus_ ) )
					{
						ntStatus_ = WdfDmaTransactionExecute( ReqContext_->DmaTransaction, (PVOID) ReqContext_ );
						if( NT_SUCCESS( ntStatus_ ) )
						{
							KIRQL oldIrql_;
							KeAcquireSpinLock( &devContext->DpcSpinLock, &oldIrql_ );
							HdFrameBuf_OnAdd( devContext, Request_, ReqContext_ );
							KeReleaseSpinLock( &devContext->DpcSpinLock, oldIrql_ );
						}
					}
				}
			}
		}
	}
}

Or you can create the 4 requests from an IRP. That does not change the behavior of the memory leak at all.

PIRP irp = IoAllocateIrp( IoGetRemainingStackSize()>>3, FALSE );
ntStatus_ = WdfRequestCreateFromIrp( WDF_NO_OBJECT_ATTRIBUTES, irp, TRUE, &Request_);

This is how the original request is completed.
It is normally completed before the quad-request DMA transfers occur.

ULONG CompletionInformation_ = 0;
WdfRequestCompleteWithInformation( Request, ntStatus_, CompletionInformation_ );

When running normally, the driver performs excellently.
The driver does shut down when the system is turned off.
But if you try to replace the driver, release hardware is called and the driver hangs with the buffers-in-flight message.
If you execute !wdfkd.wdfqueue 0xxxxx in the debugger, you get a huge number of requests with the message
"Request is marked cancelled, EvtIoStop may not have been called for this request".

kd> !wdfkd.wdfqueue 0x00002AF937218588

Treating handle as a KMDF handle!

Dumping WDFQUEUE 0x00002af937218588
=========================
Manual, Power-managed, PowerPurgeDriverNotified, Shut down, Cannot accept, Can dispatch, ExecutionLevelDispatch, SynchronizationScopeNone
    Number of driver owned requests: 8212
    Power transition in progress
    Number of waiting requests: 0

Abort: list count of 200 entries exceeded, could be a corrupted list
    Number of requests notified about power change: 201
    !wdfrequest 0x00002af92f7fb648  !irp 0xffffd506bd90b550
                (Request is marked cancelled, EvtIoStop may not have been called for this request)
   ...

 Abort: list count of 20 entries exceeded
Use 0x10 flag to view unlimited number of requests

    EvtIoStop: (0xfffff8048cb893d0) NewTekHD
0: kd> g
Thread 0xFFFFD506BD5A3080 is waiting for all inflight requests to be acknowledged on WDFQUEUE 0x00002AF937218588

A WDFCOLLECTION is not a WDFQUEUE. The rules that apply to a queue do not apply to a collection.

That WinDbg output says you have 8,212 driver-owned requests. Those are requests you have pulled out of the queue but have not completed. That count does not include the requests you created, because you won’t put those in a queue. There must be some path where you finish with a request but don’t complete it.
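
As a rough illustration of how such a count can grow, here is a toy model in plain C. It is not KMDF code: `ToyQueue`, `handle_leaky`, and `handle_single_exit` are hypothetical names, and the counter only loosely mirrors the "driver owned requests" line in the debugger output. Which path leaks in the real driver is not known from the posted code; the early return below is just one common shape of the problem.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for requests the driver holds but has not completed. */
typedef struct ToyQueue {
    int driverOwned;
} ToyQueue;

/* Pulling a request out of the queue makes it driver-owned. */
static void toy_retrieve_request(ToyQueue *q) { q->driverOwned++; }

/* Completing a request ends the driver's ownership. */
static void toy_complete_request(ToyQueue *q) { q->driverOwned--; }

/* Leaky handler: the early return on failure skips completion. */
static bool handle_leaky(ToyQueue *q, bool setupOk)
{
    toy_retrieve_request(q);
    if (!setupOk) {
        return false;            /* request is never completed on this path */
    }
    toy_complete_request(q);
    return true;
}

/* Single-exit handler: every path reaches the completion call. */
static bool handle_single_exit(ToyQueue *q, bool setupOk)
{
    bool ok = setupOk;
    toy_retrieve_request(q);
    /* Real work would run here only when ok is true. */
    toy_complete_request(q);     /* completed whether the work succeeded or not */
    return ok;
}
```

With the leaky handler, each failed setup leaves one more driver-owned request behind. With the single-exit handler the count returns to zero after every call, which is the state the queue needs before it can finish a power transition.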

I have been trying to complete the request for the last week. I have tried everything I could come up with. Something has a reference somehow to the original request, maybe. I have the original request and the 4 quad requests I create for review. Assuming it is the original request, I added the following code on exit of the IOCTL:

WdfRequestStopAcknowledge( Request, FALSE );
WdfRequestCancelSentRequest( Request );
WdfRequestCompleteWithPriorityBoost( Request, ntStatus_, 0 );

Nothing seems to have an effect. I tried copying the output MDL list. That had no effect. I tried making the 4 local requests with and without an IRP. The first block of code I posted is the delete routine for the quad requests.

I need to know where to look. The rest of the driver is legacy and well tested. Basically, the quad requests go into a collection and are prepared for the DMA queue. After DMA completes, the requests are moved to a collection to be deleted later. The only thing I found in all that is that, for the quad requests, I should not call WdfObjectDereference() in the find/found logic. If I call WdfObjectDereference(), the driver BSODs. Any help on where to look will be useful.

I found a solution.
Pre-allocate the requests in a circular queue on startup.
Free the pre-allocated requests on exit.
Just pull a request in the IOCTL when needed.
In the code that deleted the request object, manually call the cleanup routine.
That makes the problem in the queues go away.
No more BSODs from the queues being confused about which references are valid.
Something is going on with the queue items and freeing the requests.
I found that when I crashed, the new request had the same address
as a request I had pulled from the queue, moved to a collection, and freed just a little earlier.
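
The pre-allocation scheme can be sketched roughly as follows, in plain C rather than KMDF. All names here (`RequestPool`, `PoolItem`, `pool_init`, `pool_acquire`, `pool_release`, `POOL_SIZE`) are hypothetical, and `PoolItem` merely stands in for a pre-created request handle. The sketch assumes the caller serializes access, for example with the spin lock the driver already holds around its frame-buffer list.

```c
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 8              /* hypothetical pool depth */

/* Stand-in for one pre-allocated request. */
typedef struct PoolItem {
    int  id;
    bool inUse;                  /* true while handed out to the IOCTL path */
} PoolItem;

/* Fixed storage plus a circular list of the items that are currently free. */
typedef struct RequestPool {
    PoolItem  items[POOL_SIZE];
    PoolItem *ring[POOL_SIZE];
    size_t    head;              /* next free item to hand out */
    size_t    tail;              /* next slot to store a returned item */
    size_t    count;             /* number of free items in the ring */
} RequestPool;

/* Startup: every item is created once and placed in the ring. */
static void pool_init(RequestPool *p)
{
    for (size_t i = 0; i < POOL_SIZE; i++) {
        p->items[i].id    = (int)i;
        p->items[i].inUse = false;
        p->ring[i]        = &p->items[i];
    }
    p->head  = 0;
    p->tail  = 0;
    p->count = POOL_SIZE;
}

/* IOCTL path: take a free item, or NULL when the pool is exhausted. */
static PoolItem *pool_acquire(RequestPool *p)
{
    if (p->count == 0) {
        return NULL;
    }
    PoolItem *item = p->ring[p->head];
    p->head = (p->head + 1) % POOL_SIZE;
    p->count--;
    item->inUse = true;
    return item;
}

/* Completion path: return an item to the ring. Rejects a double release. */
static bool pool_release(RequestPool *p, PoolItem *item)
{
    if (item == NULL || !item->inUse) {
        return false;
    }
    item->inUse = false;
    p->ring[p->tail] = item;
    p->tail = (p->tail + 1) % POOL_SIZE;
    p->count++;
    return true;
}
```

Because nothing is freed until shutdown, an address can never be handed back by the allocator while an older handle to it is still floating around, which fits the same-address observation above. The `inUse` flag also turns a double release into a detectable error rather than silent corruption.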