Holding an IO resource across multiple IRP_MJ_DIRECTORY_CONTROL calls

I’ve built a minifilter that redirects all writes to a shadow directory tree on a different volume (done in IRP_MJ_CREATE) and then combines all the written files with the read-only files (done in IRP_MJ_DIRECTORY_CONTROL). I pass all requests down during IRP_MJ_DIRECTORY_CONTROL until I get STATUS_NO_MORE_FILES or STATUS_NO_SUCH_FILE. At that point, I do a FltCreateFileEx to get a FILE_OBJECT for the write directory and put it in a stream handle context. Then on each new IRP_MJ_DIRECTORY_CONTROL I query more items from the FILE_OBJECT to fill up the buffer as much as I can.
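
To make the mechanics concrete, here is a trimmed-down sketch of the open-and-stash step. This is illustrative only: SHADOW_DIR_CONTEXT and OpenShadowDirectory are made-up names, the error handling is stripped, and the real code attaches the context to the original directory's FILE_OBJECT with FltSetStreamHandleContext.

#include <fltKernel.h>

//
// Illustrative sketch only -- the real code stores this in a stream
// handle context on the original directory's FILE_OBJECT.
//
typedef struct _SHADOW_DIR_CONTEXT {
    HANDLE       ShadowDirHandle;   // handle from FltCreateFileEx
    PFILE_OBJECT ShadowDirObject;   // held across IRP_MJ_DIRECTORY_CONTROL calls
} SHADOW_DIR_CONTEXT, *PSHADOW_DIR_CONTEXT;

NTSTATUS
OpenShadowDirectory(
    _In_ PFLT_FILTER Filter,
    _In_ PFLT_INSTANCE ShadowInstance,   // my instance on the write volume
    _In_ PUNICODE_STRING ShadowPath,     // redirected path on the write volume
    _Inout_ PSHADOW_DIR_CONTEXT Ctx
    )
{
    OBJECT_ATTRIBUTES oa;
    IO_STATUS_BLOCK iosb;

    InitializeObjectAttributes(&oa, ShadowPath,
                               OBJ_KERNEL_HANDLE | OBJ_CASE_INSENSITIVE,
                               NULL, NULL);

    // Open the shadow directory for enumeration; the returned FILE_OBJECT
    // is what I keep around for the subsequent directory-control calls.
    return FltCreateFileEx(Filter,
                           ShadowInstance,
                           &Ctx->ShadowDirHandle,
                           &Ctx->ShadowDirObject,
                           FILE_LIST_DIRECTORY | SYNCHRONIZE,
                           &oa,
                           &iosb,
                           NULL,                // AllocationSize
                           FILE_ATTRIBUTE_NORMAL,
                           FILE_SHARE_READ | FILE_SHARE_WRITE,
                           FILE_OPEN,
                           FILE_DIRECTORY_FILE | FILE_SYNCHRONOUS_IO_NONALERT,
                           NULL,                // EaBuffer
                           0,                   // EaLength
                           0);                  // Flags
}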

Is there a deadlock danger in holding on to this FILE_OBJECT (which is on a different volume from the original FILE_OBJECT in the IRP_MJ_DIRECTORY_CONTROL call)? Most of the time this works fine, but sometimes I get a deadlock that Driver Verifier doesn’t catch: several processes that were chugging away and using my filter all drop to 0% CPU and are blocked in NtClose or other IO operations. I could gather all the items from my other FILE_OBJECT at once and then manually copy them into the request buffer over multiple calls, but before I do that I’m wondering whether what I’m doing should work. I’ve been reading threads like http://www.osronline.com/showThread.cfm?link=128120 that make it sound like what I’m doing is right, but I can never find my code on the call stack in the debugger when it deadlocks, so I’m not sure.
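
For completeness, the per-call continuation I described above looks roughly like this (again simplified; I’m assuming FltQueryDirectoryFile here, and the real code copies the entries into the caller’s buffer, fixes up NextEntryOffset, and honors whatever FILE_INFORMATION_CLASS was requested):

// Simplified continuation of the enumeration on the held FILE_OBJECT.
// RestartScan is FALSE, so each call picks up where the last one stopped.
NTSTATUS
FillFromShadowDirectory(
    _In_ PFLT_INSTANCE ShadowInstance,
    _In_ PSHADOW_DIR_CONTEXT Ctx,
    _Out_writes_bytes_(Length) PVOID Buffer,
    _In_ ULONG Length,
    _Out_ PULONG BytesReturned
    )
{
    return FltQueryDirectoryFile(ShadowInstance,
                                 Ctx->ShadowDirObject,
                                 Buffer,
                                 Length,
                                 FileBothDirectoryInformation,
                                 FALSE,    // ReturnSingleEntry
                                 NULL,     // FileName filter
                                 FALSE,    // RestartScan
                                 BytesReturned);
}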

Have you been able to characterize the hang any more than this? ERESOURCE
deadlock, APC deadlock, etc?

Nothing is setting off any serious warning bells yet, so understanding the
hang better would be the first step for me.

-scott


Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

>chugging away and using my filter all go to 0% CPU and are blocked in NtClose or other IO

!process 0 7 is the tool to investigate deadlocks.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Thanks for the advice. I didn’t know about !process 0 7, which was great for showing every stack trace in the system. I also tried !apc, which didn’t come up with anything, but maybe there’s some other way to dig into that.

From !locks:

Resource @ 0xfffffa8007ce65d0    Exclusively owned
    Contention Count = 9
    NumberOfExclusiveWaiters = 2
     Threads: fffffa800764fae0-01<*>
     Threads Waiting On Exclusive Access:
              fffffa8004b65b60       fffffa800425eb60

If I look at thread fffffa800764fae0, my minifilter is calling into FltCreateFileEx to get a handle to a folder on the write volume. Also, from !thread fffffa800764fae0:

THREAD fffffa800764fae0 Cid 08c0.13e4 Teb: 000000007efdb000 Win32Thread: 0000000000000000 WAIT: (WrGuardedMutex) KernelMode Non-Alertable
fffffa8005ea7388 Gate

My thought was I should try to figure out what the guarded mutex is that fffffa800764fae0 is blocked on. I’m really new to kernel debugging, but I tried doing:

0: kd> dt _KGUARDED_MUTEX fffffa8005ea7388
nt!_KGUARDED_MUTEX
   +0x000 Count              : 393479
   +0x008 Owner              : 0xfffffa800764fbe8 _KTHREAD
   +0x010 Contention         : 0x4b28358
   +0x018 Gate               : _KGATE
   +0x030 KernelApcDisable   : 7
   +0x032 SpecialApcDisable  : 0
   +0x030 CombinedApcDisable : 7
0: kd> !thread fffffa800764fbe8
fffffa800764fbe8 is not a thread object, interpreting as stack value…
TYPE mismatch for thread object at fffffa800764fbe8

If it helps at all, fffffa8004b65b60 is calling FltClose with a handle to a directory on the write volume and fffffa800425eb60 is a worker thread in System that got blocked after calling Ntfs!NtfsFspClose.

I got some help on the WinDbg list to dump out the mutex. The mutex is owned by:

THREAD fffffa8004267680 Cid 0004.0020 Teb: 0000000000000000 Win32Thread: 0000000000000000 WAIT: (WrQueue) UserMode Non-Alertable
fffff80002c2b5a0 QueueObject
Child-SP RetAddr : Args to Child : Call Site
fffff8800318ca30 fffff80002a8f052 : fffff8800318cb80 fffffa8004267680 0000000000000002 fffff8000000000e : nt!KiSwapContext+0x7a
fffff8800318cb70 fffff80002a92ac1 : fffff88003a1ad00 fffff8800318cc58 0000000000000000 fffff80000000000 : nt!KiCommitThreadWait+0x1d2
fffff8800318cc00 fffff80002a95139 : fffffa8004268500 fffff80002a7bb98 fffff80002c8d140 fffffa8000000000 : nt!KeRemoveQueueEx+0x301
fffff8800318ccb0 fffff80002d2b166 : fa8008e08c9004c0 fffffa8004267680 0000000000000080 fffffa800424a040 : nt!ExpWorkerThread+0xe9
fffff8800318cd40 fffff80002a66486 : fffff88002f64180 fffffa8004267680 fffff88002f6efc0 fa8008e08c9004c0 : nt!PspSystemThreadStartup+0x5a
fffff8800318cd80 0000000000000000 : fffff8800318d000 fffff88003187000 fffff8800318c1b0 0000000000000000 : nt!KxStartSystemThread+0x16

Every other thread with Ntfs on its stack is waiting on the same mutex and queue, except for two threads trying to close an NTFS handle, which are waiting on a thread trying to open a file. So I’m guessing the threads trying to close the handles would eventually queue something that unblocks the threads trying to open files. Does this mean I need to synchronize my opens against my closes? In this case the close is happening on the same directory another thread is trying to open, so maybe the synchronization needs to happen on a per-file basis.
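
If per-file serialization turns out to be the answer, I’m picturing something like the following. This is purely hypothetical on my part: SHADOW_STREAM_CONTEXT and CloseShadowDirectorySerialized are made-up names, and the push lock would live in a stream context shared by every handle touching that directory.

// Hypothetical sketch: serialize open/close of the shadow directory with a
// push lock kept in a per-stream context (set up with FltInitializePushLock).
typedef struct _SHADOW_STREAM_CONTEXT {
    EX_PUSH_LOCK Lock;
    HANDLE       ShadowDirHandle;
    PFILE_OBJECT ShadowDirObject;
} SHADOW_STREAM_CONTEXT, *PSHADOW_STREAM_CONTEXT;

VOID
CloseShadowDirectorySerialized(
    _Inout_ PSHADOW_STREAM_CONTEXT Ctx
    )
{
    FltAcquirePushLockExclusive(&Ctx->Lock);

    if (Ctx->ShadowDirObject != NULL) {
        ObDereferenceObject(Ctx->ShadowDirObject);
        Ctx->ShadowDirObject = NULL;
    }
    if (Ctx->ShadowDirHandle != NULL) {
        FltClose(Ctx->ShadowDirHandle);
        Ctx->ShadowDirHandle = NULL;
    }

    FltReleasePushLock(&Ctx->Lock);
}

// The open path would take the same lock before calling FltCreateFileEx,
// so an open and a close of the same directory could never overlap.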