referencing one workitem five times?

A filter might spawn off a generic workitem (allocate, queue, free if the queue fails) from pre-create. The workitem does some work (reads from the registry, etc.) and exits. A seemingly important detail: in some circumstances a workitem might trigger an instance detach by calling FltDetachVolume(). An existing code bug leads to avalanches of workitems; that is, for several create operations in a row a workitem is spawned for each one. I’ve observed up to 9 on my dual-core test machine.

Despite the code bug, everything seems just fine; specifically, the workitems clean up after themselves and the system continues running. I am absolutely positive that the workitems clean up after themselves, that is, FltFreeGenericWorkItem() is unconditionally called at the end of each workitem.
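For clarity, here is a minimal sketch of the pattern described above (allocate, queue, free on queue failure, and unconditional cleanup at the end of the workitem routine). MyPreCreate and MyWorkItemRoutine are hypothetical names standing in for the poster's actual routines; the FltMgr calls themselves are the documented APIs.

```c
#include <fltKernel.h>

/* Hypothetical workitem routine: does some work, then unconditionally
   frees the workitem, as the poster describes. */
VOID
MyWorkItemRoutine(
    _In_ PFLT_GENERIC_WORKITEM FltWorkItem,
    _In_ PVOID FltObject,
    _In_opt_ PVOID Context
    )
{
    UNREFERENCED_PARAMETER(FltObject);
    UNREFERENCED_PARAMETER(Context);

    /* ... read from the registry, possibly FltDetachVolume(), etc. ... */

    FltFreeGenericWorkItem(FltWorkItem);   /* unconditional cleanup */
}

/* Hypothetical pre-create callback that spawns the workitem. */
FLT_PREOP_CALLBACK_STATUS
MyPreCreate(
    _Inout_ PFLT_CALLBACK_DATA Data,
    _In_ PCFLT_RELATED_OBJECTS FltObjects,
    _Outptr_result_maybenull_ PVOID *CompletionContext
    )
{
    UNREFERENCED_PARAMETER(Data);
    UNREFERENCED_PARAMETER(CompletionContext);

    PFLT_GENERIC_WORKITEM wi = FltAllocateGenericWorkItem();
    if (wi != NULL) {
        NTSTATUS status = FltQueueGenericWorkItem(wi,
                                                  FltObjects->Filter,
                                                  MyWorkItemRoutine,
                                                  DelayedWorkQueue,
                                                  NULL);
        if (!NT_SUCCESS(status)) {
            FltFreeGenericWorkItem(wi);    /* free if queueing failed */
        }
    }
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}
```

With this shape, every allocated workitem is freed on exactly one path, which is why the five-fold RefCount on a single FLT_GENERIC_WORKITEM below is so surprising.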

However, at least once, workitems were leaked, and in quite a spectacular way.

0: kd> !filter fffffa800189b6a0 8 1

FLT_FILTER: fffffa800189b6a0 "" ""
InstanceList : (fffffa800189b6f8)
Resource (fffffa800189b760) List [fffffa800189b760-fffffa800189b760] rCount=0
Object usage/reference information:
References to FLT_CONTEXT : 0
Allocations of FLT_CALLBACK_DATA : 0
Allocations of FLT_DEFERRED_IO_WORKITEM : 0
Allocations of FLT_GENERIC_WORKITEM : 5 <-- methinks, it means five workitems were leaked
References to FLT_FILE_NAME_INFORMATION : 0
Open files : 0
References to FLT_OBJECT : 0
List of objects used/referenced::
FLT_VERIFIER_OBJECT: fffffa80034f5670
Object: fffffa80018b49e0 Type: FLT_GENERIC_WORKITEM RefCount: 00000005 <-- methinks, it means that the SAME workitem was leaked FIVE times.

0: kd> dt -r1 _FLT_VERIFIER_OBJECT fffffa80034f5670
fltmgr!_FLT_VERIFIER_OBJECT
+0x000 TreeLink : _TREE_NODE
+0x000 Link : _RTL_SPLAY_LINKS
+0x018 TreeRoot : 0xfffffa8003c44410 _TREE_ROOT
+0x020 Key1 : 0xfffffa80018b49e0
+0x028 Key2 : (null)
+0x030 Flags : 0x10000
+0x038 Type : 3
+0x040 Object : 0xfffffa80018b49e0
+0x048 RefCount : 5

0: kd> !pool 0xfffffa80018b49e0
Pool page fffffa80018b49e0 region is Nonpaged pool
*fffffa80018b49d0 size: 70 previous size: a0 (Allocated) *FMwi
Pooltag FMwi : Work item structures, Binary : fltmgr.sys

0: kd> dps 0xfffffa80018b49d0
fffffa80018b49d0 69774d460207000a
fffffa80018b49d8 fffffa80033f5230
fffffa80018b49e0 0000000000000000
fffffa80018b49e8 0000000000000000
fffffa80018b49f0 fffff80001a16610 nt!ExWorkerQueue+0x70
fffffa80018b49f8 fffff8800131eed0 fltmgr!FltpProcessGenericWorkItem
fffffa80018b4a00 fffffa80018b49e0
fffffa80018b4a08 fffffa8000000001
fffffa80018b4a10 fffff8800828bd88 <my workitem routine>
fffffa80018b4a18 0000000000000000
fffffa80018b4a20 ffffffff00000010
fffffa80018b4a28 0000000200000005
fffffa80018b4a30 fffffa800189b6a0
fffffa80018b4a38 0000000000013b23
fffffa80018b4a40 61436d4d020d0007
fffffa80018b4a48 fffffa80033ffbf8

0: kd> db 0xfffffa80018b49d0
fffffa80018b49d0 0a 00 07 02 46 4d 77 69-30 52 3f 03 80 fa ff ff ....FMwi0R?.....
fffffa80018b49e0 00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00 ................
fffffa80018b49f0 10 66 a1 01 00 f8 ff ff-d0 ee 31 01 80 f8 ff ff .f........1.....
fffffa80018b4a00 e0 49 8b 01 80 fa ff ff-01 00 00 00 80 fa ff ff .I..............
fffffa80018b4a10 88 bd 28 08 80 f8 ff ff-00 00 00 00 00 00 00 00 ..(.............
fffffa80018b4a20 10 00 00 00 ff ff ff ff-05 00 00 00 02 00 00 00 ................
fffffa80018b4a30 a0 b6 89 01 80 fa ff ff-23 3b 01 00 00 00 00 00 ........#;......

The only thread that references my code is the one that is doing the unload. !work doesn’t show any pending work.

I fail to comprehend how that can be, no matter what kind of bug I might have in my code (short of corrupting memory that doesn’t belong to me in a sharpshooter way). I tried various ways to inject a bug into my code: forget to clean up a workitem, queue a workitem twice, reference it using FltObjectReference, etc. None of them produced such interesting consequences. Honestly, I don’t have any explanation short of a synchronization bug somewhere in FltpvLinkResourceToFilter().

Interesting issue. I’m not sure I’m persuaded by the information provided that this is an OS bug, but then again, there’s nothing obvious here that suggests it is not.

I would suggest something that we routinely do: use your own work queue. It’s not so difficult to do and has some nice advantages: you can control the size of the queue and the priority of the threads, and you avoid deadlocks with anyone else using the system work queues.

If this really is a bug, it also provides you with a mechanism to avoid it.
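To make the private-work-queue suggestion concrete, here is a user-mode analogue in C with pthreads, purely to illustrate the structure; a kernel-mode version would use a spinlock- or KSEMAPHORE-protected list with PsCreateSystemThread instead. All names here (work_queue, wq_start, wq_queue, wq_stop) are made up for this sketch.

```c
#include <pthread.h>
#include <stdlib.h>

/* A single queued unit of work; the queue owns and frees it. */
typedef struct work_item {
    struct work_item *next;
    void (*routine)(void *ctx);
    void *ctx;
} work_item;

typedef struct work_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cv;
    work_item      *head, *tail;
    int             stopping;
    pthread_t      *threads;
    int             nthreads;
} work_queue;

static void *wq_worker(void *arg)
{
    work_queue *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->head == NULL && !q->stopping)
            pthread_cond_wait(&q->cv, &q->lock);
        work_item *wi = q->head;
        if (wi != NULL) {
            q->head = wi->next;
            if (q->head == NULL)
                q->tail = NULL;
        }
        pthread_mutex_unlock(&q->lock);
        if (wi == NULL)          /* stopping and the queue is drained */
            return NULL;
        wi->routine(wi->ctx);
        free(wi);                /* queue frees the item: one cleanup path */
    }
}

int wq_start(work_queue *q, int nthreads)
{
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->cv, NULL);
    q->head = q->tail = NULL;
    q->stopping = 0;
    q->nthreads = nthreads;
    q->threads = malloc(sizeof(pthread_t) * (size_t)nthreads);
    if (q->threads == NULL)
        return -1;
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&q->threads[i], NULL, wq_worker, q) != 0)
            return -1;           /* sketch: no partial-failure rollback */
    return 0;
}

int wq_queue(work_queue *q, void (*routine)(void *), void *ctx)
{
    work_item *wi = malloc(sizeof *wi);
    if (wi == NULL)
        return -1;
    wi->next = NULL;
    wi->routine = routine;
    wi->ctx = ctx;
    pthread_mutex_lock(&q->lock);
    if (q->tail != NULL)
        q->tail->next = wi;
    else
        q->head = wi;
    q->tail = wi;
    pthread_cond_signal(&q->cv);
    pthread_mutex_unlock(&q->lock);
    return 0;
}

void wq_stop(work_queue *q)      /* drains remaining items, then joins */
{
    pthread_mutex_lock(&q->lock);
    q->stopping = 1;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->lock);
    for (int i = 0; i < q->nthreads; i++)
        pthread_join(q->threads[i], NULL);
    free(q->threads);
    pthread_cond_destroy(&q->cv);
    pthread_mutex_destroy(&q->lock);
}
```

Because the queue itself owns the workitems and frees each one on exactly one path, unload can simply call the drain-and-join routine and know nothing is left outstanding, instead of trusting the system work queues.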

Tony
OSR

Looking forward to seeing everyone at the next File Systems class in Vancouver, BC, October 20-23, 2009.