Help Needed: Troubleshooting Deadlock in UpperFilter DiskDrive

Hello driver experts,

I am currently troubleshooting a locking issue with a rather complex driver, which functions as an UpperFilter for the DiskDrive class. I have narrowed the problem down to the IRP_MJ_READ operation. The deadlock occurs when processing the IRP asynchronously. Here is the sequence of actions leading to the deadlock:

  1. Pending the original IRP.
  2. Adding it to the worker thread queue.
  3. Making a copy of the original IRP.
  4. Sending the new IRP to the lower device.
  5. Waiting for the result.
  6. Completing the original request.

The read operation must come from the pagefile, specifically after the pagefile has been increased in size.

Here is the stack when it is deadlocked:

nt!KiSwapThread+0x500
nt!KiCommitThreadWait+0x14f
nt!KeWaitForSingleObject+0x233
nt!ExfAcquirePushLockExclusiveEx+0x1a0
nt!ExAcquirePushLockExclusiveEx+0x1a2
nt!RtlpHpSegPageRangeShrink+0x423
nt!ExFreeHeapPool+0x6b2
nt!ExFreePool+0x9
Wof!FileProvReadCompressedOnNewStackExtendedCompletion+0x376
Wof!FileProvReadCompressedCompletionWorker+0x75
Wof!FileProvReadCompressedCompletion+0x3e
FLTMGR!FltpPassThroughCompletionWorker+0x48a
FLTMGR!FltpPassThroughCompletion+0xc
nt!IopfCompleteRequest+0x1a5
nt!IofCompleteRequest+0x17

MyDisk!ReadDispatchSync+0x241
MyDisk!ThreadCallbackRead+0xb4
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28

And here is the function which "clones" the original IRP, sends it down, waits for the response, and completes the original IRP.

NTSTATUS
CompletionRoutine(
    _In_ PDEVICE_OBJECT DeviceObject,
    _In_ PIRP Irp,
    _In_ PVOID Context)
{
UNREFERENCED_PARAMETER(DeviceObject);

if (Irp->PendingReturned)
{
    PKEVENT event = (PKEVENT)Context;
    KeSetEvent(event, IO_NO_INCREMENT, FALSE);
}
return STATUS_MORE_PROCESSING_REQUIRED;

}

NTSTATUS
ReadDispatchSync(
    _In_ PDEVICE_OBJECT DeviceObject,
    _Inout_ PIRP Irp)
{
PIO_STACK_LOCATION irpStack = IoGetCurrentIrpStackLocation(Irp);
DEVICE_EXTENSION* deviceExtension = (DEVICE_EXTENSION*)DeviceObject->DeviceExtension;
NTSTATUS status;
KEVENT event;
PIRP newIrp;
PMDL mdl;

// Initialize an event to wait for the completion of the new IRP
KeInitializeEvent(&event, SynchronizationEvent, FALSE);

newIrp = IoAllocateIrp(DeviceObject->StackSize, FALSE);
if (newIrp == NULL)
{
    Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_INSUFFICIENT_RESOURCES;
}

PIO_STACK_LOCATION newStack = IoGetNextIrpStackLocation(newIrp);
newStack->Parameters.Read.ByteOffset.QuadPart = irpStack->Parameters.Read.ByteOffset.QuadPart;
newStack->Parameters.Read.Length = irpStack->Parameters.Read.Length;

newStack->MajorFunction = irpStack->MajorFunction;
newStack->MinorFunction = irpStack->MinorFunction;


// Allocate an MDL for the new IRP
// get buffer from the parent IRP
mdl = IoAllocateMdl(
    MmGetSystemAddressForMdlSafe(Irp->MdlAddress, HighPagePriority), 
    irpStack->Parameters.Read.Length, FALSE, FALSE, newIrp);

if (mdl == NULL)
{
    IoFreeIrp(newIrp);
    Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_INSUFFICIENT_RESOURCES;
}

MmBuildMdlForNonPagedPool(mdl);

newIrp->MdlAddress = mdl;

IoSetCompletionRoutine(newIrp, CompletionRoutine, &event, TRUE, TRUE, TRUE);

newIrp->Flags = Irp->Flags;
newIrp->Tail.Overlay.Thread = Irp->Tail.Overlay.Thread;

// Send the new IRP down the stack
status = IoCallDriver(deviceExtension->LowerDeviceObject, newIrp);
if (status == STATUS_PENDING)
{
    // Wait for the new IRP to complete
    KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);
    status = newIrp->IoStatus.Status;
}

// Complete the original IRP
// Copy status and information
Irp->IoStatus.Status = status;
Irp->IoStatus.Information = newIrp->IoStatus.Information;

// complete original IRP, 
// !!!!!!!! Locks up here
IoCompleteRequest(Irp, IO_NO_INCREMENT);

IoFreeMdl(mdl);

IoFreeIrp(newIrp);

return status;

}

Everything else is pretty much boilerplate code.

Increasing the number of worker threads only delays the deadlock. Using work items with the NormalWorkQueue seems to solve the problem, but I suspect this is not the correct solution and may cause issues later. I believe I need to lock the pages with MmProbeAndLockPages, but my initial attempt at this has failed.

I would appreciate any insights or suggestions

Thank you!

Hello again,

I've done more work to identify the root cause of the deadlock. I have tried processing pended requests using a thread pool (up to 16 threads), using a work item with type DelayedWorkQueue, and using a work item with type NormalWorkQueue.

In the first two cases, the deadlock occurs, but much later compared to using a single worker thread. However, in the third case, when I use a work item with NormalWorkQueue, I cannot reproduce the issue. The system remains stable, and everything works fine even with the driver verifier enabled.

I am trying to understand how the NormalWorkQueue worker threads differ from threads created by PsCreateSystemThread or threads in DelayedWorkQueue, but I haven't found any information on this. Do you have any suggestions or ideas?

Thank you in advance for your help!

Hi,

  • When the deadlock happens, are other pool (worker) threads blocked inside nt!ExfAcquirePushLock* on the same lock?
  • Did you search for any thread(s) waiting on page fault processing, especially for page faults inside ExFreeHeapPool / ExAllocateHeapPool?
  • If an IO (paged or non-paged) happens in a thread-pool context, how is it processed? Will it be queued to the same thread pool? That could cause a deadlock if a page fault happens while a lock is held.

One concern with this design is that it doesn't differentiate between IO issued to the pagefile and IO issued to a regular file. There are no dedicated threads for pagefile IO processing to prevent deadlocks when a page backed by the pagefile is needed to process regular mapped-file paging IO inside a completion routine, which can happen with some file system drivers. It can happen that all pool (worker) threads are busy processing regular-file paging IO and there is no worker thread left for pagefile IO to unblock the regular-file IO waiting for a page backed by the pagefile.

For example, searching the Internet I stumbled across this call stack for Wof!FileProvReadCompressedCompletion, so it can cause page faults from a completion routine, which is unusual but possible with some caution, such as not doing it at elevated IRQL.

[...]
nt!ST_STORE<SM_TRAITS>::StDmSinglePageCopy+0x146
nt!ST_STORE<SM_TRAITS>::StDmSinglePageTransfer+0xa0
nt!ST_STORE<SM_TRAITS>::StDmpSinglePageRetrieve+0x186
nt!ST_STORE<SM_TRAITS>::StDmPageRetrieve+0xc1
nt!SMKM_STORE<SM_TRAITS>::SmStDirectReadIssue+0x85
nt!SMKM_STORE<SM_TRAITS>::SmStDirectReadCallout+0x21
nt!KeExpandKernelStackAndCalloutInternal+0x78
nt!SMKM_STORE<SM_TRAITS>::SmStDirectRead+0xad
nt!SMKM_STORE<SM_TRAITS>::SmStWorkItemQueue+0x1b4
nt!SMKM_STORE_MGR<SM_TRAITS>::SmIoCtxQueueWork+0xce
nt!SMKM_STORE_MGR<SM_TRAITS>::SmPageRead+0x168
nt!SmPageRead+0x2e
nt!MiIssueHardFaultIo+0x11f
nt!MiIssueHardFault+0x3ed
nt!MmAccessFault+0x3ed
nt!KiPageFault+0x343
nt!RtlDecompressBufferXpressHuff+0x19c
nt!RtlDecompressBufferEx+0x60
Wof!FileProvDecompressChunks+0x27f
Wof!FileProvReadCompressedOnNewStackExtendedCompletion+0x237
Wof!FileProvReadCompressedCompletionWorker+0x16e
Wof!FileProvReadCompressedCompletion+0x101
FLTMGR!FltpPassThroughCompletionWorker+0x3c2
FLTMGR!FltpPassThroughCompletion+0xc
nt!IovpLocalCompletionRoutine+0x174
nt!IopfCompleteRequest+0x1cd
nt!IovCompleteRequest+0x1bd
nt!IofCompleteRequest+0x17e28b
[...]

I could imagine a page fault happening inside ExFreeHeapPool / ExAllocateHeapPool in a concurrent thread (any thread, not necessarily from the thread pool) after it has acquired the push lock. The thread experiencing this page fault is blocked on a paging IO read, so marking the IRP pending inside your filter driver doesn't unblock this thread and doesn't release the lock; the page fault needs to be serviced for the thread to continue. That paging IO read hits your driver. At some point all pool threads are blocked waiting for the ExFreeHeapPool / ExAllocateHeapPool lock, while the thread holding the lock is waiting for a paging IO read that is queued inside your driver waiting for an available worker thread, and all worker threads are blocked waiting for the lock.

I think the difference between NormalWorkQueue and DelayedWorkQueue can be explained by different timing patterns. I do not think NormalWorkQueue resolves the deadlock; it just makes it less probable on the system used for testing. On another system NormalWorkQueue might deadlock.


Historically threads in the lower priority queues waited for work with a WaitMode of UserMode, thus making their kernel stacks pageable. Queueing disk I/O to those would eventually deadlock because you'd need to do I/O to queue the I/O.

Not sure if that's the problem here, I haven't looked at the worker thread details in a long time, you'd need to do more spelunking.

Out of curiosity, why are you deferring processing of disk io requests? As you have discovered, deferring paging path read requests has issues.

Hi Slava, thanks for your reply, I really appreciate it. I was also suspecting that there is IO happening in the completion routine, and I thought that 4+ worker threads should help, but it didn't. Based on your recommendation I tried a separate worker thread for the IRPs which have IRP_PAGING_IO in Flags, but it didn't help; that paging-IO thread is stuck with the same thread stack. The Flags for the stuck IRP are IRP_PAGING_IO | IRP_NOCACHE | IRP_CLOSE_OPERATION.
I am a bit confused by IRP_CLOSE_OPERATION: this is an IRP_MJ_READ, so why would it have a close flag?
Do you have any other suggestions I can try? Or could you point me to where I can read about proper async handling for IRPs with IRP_PAGING_IO?

Mark, this is an old driver I developed a while back; it may modify the data being written to the disk.
I take the original IRP, and it can be split into a few different IRPs; once those IRPs are completed (sent to the lower device, with a notification event set in the completion routine), I complete the original IRP. This has been working for many, many years, since Windows 2000 through Windows 10.

Scott, I figured that NormalWorkQueue is not really a solution for this but rather a band-aid fix, and I want to completely understand why this is happening. Any recommendations would be appreciated. Thank you!

Ok, so none of that requires using worker threads. Instead you just need to account for all the partial irp completions before completing the original irp. That can be done entirely by your completion handlers. It is a really simple state machine.

Also if you are building multiple requests using the same source mdl then you should consider using IoBuildPartialMdl. There are mdl related accounting procedures that can get badly messed up and cause memory management disasters otherwise.

There are two usual deadlock scenarios for such designs:

  • IO is issued to process data related to a general file; this can be any IO, paging or non-paging.
  • This general-file IO is enqueued into a thread pool and processed by a worker thread.
  • While processing this IO, a page fault happens when accessing data backed by the pagefile, e.g. paged-pool data or paged code.
  • The Memory Manager issues a paging IO read request to retrieve the page from the pagefile.
  • This paging IO is enqueued into the same thread pool, but all threads in the pool are blocked waiting for page fault processing for data backed by the pagefile.

Or the alternative scenario:

  • IO is issued to process data related to a general file; this can be any IO, paging or non-paging.
  • This general-file IO is enqueued into a thread pool and processed by a worker thread.
  • While processing this IO, the system needs to allocate physical pages.
  • If there are not enough free physical pages, the system starts the modified page writer to move some pages to the pagefile and repurpose them.
  • This pagefile IO is enqueued into the same thread pool as the general-file IO and blocks there indefinitely: all threads in the pool are waiting for the modified page writer to complete its work, while the modified page writer's IO is queued in this thread pool waiting for an available thread.

The Windows kernel breaks this dependency cycle by dedicating some threads to pagefile IO. There are two thread types in the system:

  • The mapped page writer for general-file IO.
  • The modified page writer for pagefile IO.

Mark,

this is a very intriguing idea. I was thinking about something like this a long time ago when I was implementing the worker threads, but never got around to actually doing it.

I think the logic here would be: in the dispatch routine I create all necessary child IRPs, add them to an IRP list, and set a completion routine.
I allocate a context which holds the IRP list as well as the original IRP, and send down the first child IRP.
In the completion routine I check the status and accumulate IoStatus.Information. After the last child IRP completes, I complete the parent IRP with the accumulated IoStatus.Information. And if somewhere in the middle I see that a child IRP's status is a failure, I cancel all outstanding IRPs and complete the parent IRP with that failure status.

Does it make sense? Did I miss anything?

Thank you

I would make the entire operation asynchronous. Don’t wait for anything anywhere. Each completion handler callback can determine if it is the last partial request and compute the status and complete the original irp.