referencing one workitem five times?

A filter might spawn off a generic workitem (allocate, queue, free if the queue fails) from pre-create. The workitem does some work (reads from the registry, etc.) and exits. A seemingly important detail: in some circumstances a workitem might trigger an instance detach by calling FltDetachVolume(). An existing code bug leads to avalanches of workitems; that is, for several create operations in a row a workitem is spawned for each one. I’ve observed up to 9 on my dual-core test machine.

Despite the code bug, everything seems just fine; specifically, the workitems clean up after themselves and the system continues running. I am absolutely positive that the workitems clean up after themselves, that is, FltFreeGenericWorkItem() is unconditionally called at the end of each workitem.
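For clarity, here is a minimal sketch of the pattern described above (allocate, queue, free on queue failure, and unconditional cleanup at the end of the workitem routine). MyPreCreate and MyWorkItemRoutine are hypothetical names standing in for the poster's actual routines; the FltMgr calls themselves are the documented APIs.

```c
#include <fltKernel.h>

/* Hypothetical workitem routine: does some work, then unconditionally
   frees the workitem, as the poster describes. */
VOID
MyWorkItemRoutine(
    _In_ PFLT_GENERIC_WORKITEM FltWorkItem,
    _In_ PVOID FltObject,
    _In_opt_ PVOID Context
    )
{
    UNREFERENCED_PARAMETER(FltObject);
    UNREFERENCED_PARAMETER(Context);

    /* ... read from the registry, possibly FltDetachVolume(), etc. ... */

    FltFreeGenericWorkItem(FltWorkItem);   /* unconditional cleanup */
}

/* Hypothetical pre-create callback that spawns the workitem. */
FLT_PREOP_CALLBACK_STATUS
MyPreCreate(
    _Inout_ PFLT_CALLBACK_DATA Data,
    _In_ PCFLT_RELATED_OBJECTS FltObjects,
    _Outptr_result_maybenull_ PVOID *CompletionContext
    )
{
    UNREFERENCED_PARAMETER(Data);
    UNREFERENCED_PARAMETER(CompletionContext);

    PFLT_GENERIC_WORKITEM wi = FltAllocateGenericWorkItem();
    if (wi != NULL) {
        NTSTATUS status = FltQueueGenericWorkItem(wi,
                                                  FltObjects->Filter,
                                                  MyWorkItemRoutine,
                                                  DelayedWorkQueue,
                                                  NULL);
        if (!NT_SUCCESS(status)) {
            FltFreeGenericWorkItem(wi);    /* free if queueing failed */
        }
    }
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}
```

With this shape, every allocated workitem is freed on exactly one path, which is why the five-fold RefCount on a single FLT_GENERIC_WORKITEM below is so surprising.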

However, at least once, workitems were leaked, and in quite a spectacular way.

0: kd> !filter fffffa800189b6a0 8 1

FLT_FILTER: fffffa800189b6a0 "" ""
InstanceList : (fffffa800189b6f8)
Resource (fffffa800189b760) List [fffffa800189b760-fffffa800189b760] rCount=0
Object usage/reference information:
References to FLT_CONTEXT : 0
Allocations of FLT_CALLBACK_DATA : 0
Allocations of FLT_DEFERRED_IO_WORKITEM : 0
Allocations of FLT_GENERIC_WORKITEM : 5 <-- methinks, it means five workitems were leaked
References to FLT_FILE_NAME_INFORMATION : 0
Open files : 0
References to FLT_OBJECT : 0
List of objects used/referenced::
FLT_VERIFIER_OBJECT: fffffa80034f5670
Object: fffffa80018b49e0 Type: FLT_GENERIC_WORKITEM RefCount: 00000005 <-- methinks, it means that the SAME workitem was leaked FIVE times.

0: kd> dt -r1 _FLT_VERIFIER_OBJECT fffffa80034f5670
fltmgr!_FLT_VERIFIER_OBJECT
+0x000 TreeLink : _TREE_NODE
+0x000 Link : _RTL_SPLAY_LINKS
+0x018 TreeRoot : 0xfffffa8003c44410 _TREE_ROOT
+0x020 Key1 : 0xfffffa80018b49e0
+0x028 Key2 : (null)
+0x030 Flags : 0x10000
+0x038 Type : 3
+0x040 Object : 0xfffffa80018b49e0
+0x048 RefCount : 5

0: kd> !pool 0xfffffa80018b49e0
Pool page fffffa80018b49e0 region is Nonpaged pool
*fffffa80018b49d0 size: 70 previous size: a0 (Allocated) *FMwi
Pooltag FMwi : Work item structures, Binary : fltmgr.sys

0: kd> dps 0xfffffa80018b49d0
fffffa80018b49d0 69774d460207000a
fffffa80018b49d8 fffffa80033f5230
fffffa80018b49e0 0000000000000000
fffffa80018b49e8 0000000000000000
fffffa80018b49f0 fffff80001a16610 nt!ExWorkerQueue+0x70
fffffa80018b49f8 fffff8800131eed0 fltmgr!FltpProcessGenericWorkItem
fffffa80018b4a00 fffffa80018b49e0
fffffa80018b4a08 fffffa8000000001
fffffa80018b4a10 fffff8800828bd88 <my workitem routine>
fffffa80018b4a18 0000000000000000
fffffa80018b4a20 ffffffff00000010
fffffa80018b4a28 0000000200000005
fffffa80018b4a30 fffffa800189b6a0
fffffa80018b4a38 0000000000013b23
fffffa80018b4a40 61436d4d020d0007
fffffa80018b4a48 fffffa80033ffbf8

0: kd> db 0xfffffa80018b49d0
fffffa80018b49d0 0a 00 07 02 46 4d 77 69-30 52 3f 03 80 fa ff ff ....FMwi0R?.....
fffffa80018b49e0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
fffffa80018b49f0 10 66 a1 01 00 f8 ff ff-d0 ee 31 01 80 f8 ff ff .f........1.....
fffffa80018b4a00 e0 49 8b 01 80 fa ff ff-01 00 00 00 80 fa ff ff .I..............
fffffa80018b4a10 88 bd 28 08 80 f8 ff ff-00 00 00 00 00 00 00 00 ..(.............
fffffa80018b4a20 10 00 00 00 ff ff ff ff-05 00 00 00 02 00 00 00 ................
fffffa80018b4a30 a0 b6 89 01 80 fa ff ff-23 3b 01 00 00 00 00 00 ........#;......

The only thread that references my code is the one that is doing the unload. !work doesn’t show any pending work.

I fail to comprehend how that can be, no matter what kind of bug I might have in my code (short of corrupting memory that doesn’t belong to me in a sharpshooter way). I tried various ways to inject a bug into my code: forget to clean up a workitem, queue a workitem twice, reference it using FltObjectReference, etc. None of them produced such interesting consequences. Honestly, I don’t have any explanation short of a synchronization bug somewhere in FltpvLinkResourceToFilter().

Interesting issue. I’m not sure I’m persuaded by the information provided that this is an OS bug, but then again, there’s nothing obvious here that suggests it is not.

I would suggest something that we routinely do: use your own work queue. It’s not so difficult to do and has some nice advantages: you can control the size of the queue and the priority of the threads, and you avoid deadlocks with anyone else using the system work queues.

If this really is a bug, it also provides you with a mechanism to avoid it.
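To make the private-work-queue suggestion concrete, here is a user-mode analogue in C with pthreads, purely to illustrate the structure; a kernel-mode version would use a spinlock- or KSEMAPHORE-protected list with PsCreateSystemThread instead. All names here (work_queue, wq_start, wq_queue, wq_stop) are made up for this sketch.

```c
#include <pthread.h>
#include <stdlib.h>

/* A single queued unit of work; the queue owns and frees it. */
typedef struct work_item {
    struct work_item *next;
    void (*routine)(void *ctx);
    void *ctx;
} work_item;

typedef struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    work_item      *head, *tail;
    int             stopping;
    pthread_t      *threads;
    int             nthreads;
} work_queue;

static void *wq_worker(void *arg)
{
    work_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL && !q->stopping)
            pthread_cond_wait(&q->cv, &q->lock);
        work_item *wi = q->head;
        if (wi != NULL) {
            q->head = wi->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        pthread_mutex_unlock(&q->lock);
        if (wi == NULL)          /* stopping and the queue is drained */
            return NULL;
        wi->routine(wi->ctx);
        free(wi);                /* queue frees the item: one cleanup path */
    }
}

int wq_start(work_queue *q, int nthreads)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cv, NULL);
    q->head = q->tail = NULL;
    q->stopping = 0;
    q->nthreads = nthreads;
    q->threads = malloc(sizeof(pthread_t) * (size_t)nthreads);
    if (q->threads == NULL)
        return -1;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&q->threads[i], NULL, wq_worker, q) != 0)
            return -1;           /* sketch: no partial-failure rollback */
    return 0;
}

int wq_queue(work_queue *q, void (*routine)(void *), void *ctx)
{
    work_item *wi = malloc(sizeof *wi);
    if (wi == NULL)
        return -1;
    wi->next = NULL;
    wi->routine = routine;
    wi->ctx = ctx;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = wi;
    else
        q->head = wi;
    q->tail = wi;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

void wq_stop(work_queue *q)      /* drains remaining items, then joins */
{
    pthread_mutex_lock(&q->lock);
    q->stopping = 1;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->lock);
    for (int i = 0; i < q->nthreads; i++)
        pthread_join(q->threads[i], NULL);
    free(q->threads);
    pthread_cond_destroy(&q->cv);
    pthread_mutex_destroy(&q->lock);
}
```

Because the queue itself owns the workitems and frees each one on exactly one path, unload can simply call the drain-and-join routine and know nothing is left outstanding, instead of trusting the system work queues.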

Tony
OSR

Looking forward to seeing everyone at the next File Systems class in Vancouver, BC, October 20-23, 2009.