OK, it wasn’t that I didn’t believe you, but I don’t believe anyone. So, I
had to set this up myself and see if I could reproduce the behavior…
I think there’s definitely a bug in PartMgr on build 1703. Can you try an
experiment and see if it solves the problem? I want to try disabling the
routine partmgr!PmSendTelemetry on your target machine.
First enter the assembler:
a partmgr!PmSendTelemetry
Then type:
ret
And hit enter.
If you want to undo the change and revert to the old behavior just run the
following:
!chkimg -f partmgr.sys
If you can no longer reproduce the behavior with that routine disabled,
here’s the long story that sort of explains it (and a mitigation at the
end):
I put an access breakpoint on the device object’s reference count and upon
surprise removal PartMgr went, for lack of a better phrase, completely up
its own ass. The “notification work item” was endlessly trying to “send
telemetry” and thus referencing/dereferencing the top of the device stack:
nt!IoGetAttachedDeviceReference+11dde3
partmgr!PmCreatePartition+9d
partmgr!PmUpdatePartitionsLegacyRemovable+bb
partmgr!memset+9fb
partmgr!PmGetDriveLayoutEx+6c
partmgr!PmCollectTelemetry+e3
partmgr!PmSendTelemetry+67
partmgr!PmNotificationWorkItem+380
nt!IopProcessWorkItem+fb
nt!ExpWorkerThread+d8
nt!PspSystemThreadStartup+47
nt!KiStartSystemThread+16
It would eventually stop, but only after overflowing the object reference
trace buffer.
Doing some stepping through here, in the surprise remove path PartMgr wakes
up the notification work item (partmgr!PmWakeupNotificationWorkItem). The
work item then proceeds to call partmgr!PmSendTelemetry, which does some
work and…queues the work item. It keeps doing this until a list is empty.
Looking at the list, the only thing on it at this point is a reference to
the device object that we’re trying to surprise remove.
I’m running on a single CPU VM, so the re-queuing of the work item at the
end of the work item causes the surprise remove thread to stall and you get
the “up its ass” behavior (i.e. high priority work item continues to queue
itself until the list is empty, but the list won’t be empty until some other
thread gets to run).
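To make the single-CPU stall concrete, here’s a toy model in Python. This is NOT PartMgr’s code, just a sketch of the pattern I described: a high priority work item that re-queues itself until a list is empty, and a lower priority thread that is the only thing able to empty that list. The tick-based scheduler and every name in it (simulate, removal_ticks, max_ticks) are made up for illustration. The real system did eventually stop; the model doesn’t try to explain how and simply gives up after max_ticks.

```python
def simulate(num_cpus, removal_ticks=3, max_ticks=1000):
    """Toy scheduler. Returns (times the work item ran, whether the list drained)."""
    pending = ["device object"]    # the list the work item keeps checking
    work_item_queued = True        # woken by the surprise-remove path
    removal_left = removal_ticks   # CPU ticks the removal thread still needs
    work_item_runs = 0

    for _ in range(max_ticks):
        cpus_free = num_cpus

        # Highest priority first: a queued work item always gets a CPU.
        if work_item_queued and cpus_free > 0:
            cpus_free -= 1
            work_item_runs += 1
            # It re-queues itself for as long as the list is not empty.
            work_item_queued = bool(pending)

        # The removal thread only runs if a CPU is left over this tick.
        if removal_left > 0 and cpus_free > 0:
            removal_left -= 1
            if removal_left == 0:
                pending.clear()    # removal finished, the list is now empty

        if not work_item_queued and removal_left == 0:
            break                  # nothing left to do

    return work_item_runs, not pending


if __name__ == "__main__":
    print("1 CPU :", simulate(1))  # work item hogs the only CPU
    print("2 CPUs:", simulate(2))  # removal thread makes progress in parallel
```

With one CPU the work item runs on every tick and the list never drains within the cap. With a second CPU the removal thread gets to finish and the work item stops after a handful of runs.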
Going back to 1607, this call to partmgr!PmSendTelemetry doesn’t exist and,
given that this doesn’t look quite right, my SWAG is that there’s a race
here causing problems with the reference count on remove. Disabling the
routine seems to cause the system to behave much more sanely on surprise
removal and properly clean up the reference count, hence my suggestion
above.
Of course, this doesn’t actually solve your problem because you can’t go
disabling things on client machines… If this DOES work, then I would suggest
simply removing your Unload entry point and preventing your filter from
unloading. You’ll still get dangling device object references out there, but
(if I’m right) this isn’t your bug and exists even without your filter.
-scott
OSR
@OSRDrivers