OK, it wasn't that I didn't believe you, but I don't believe anyone. So, I
had to set this up myself and see if I could reproduce the behavior...
I think there's definitely a bug in PartMgr on build 1703. Can you try an
experiment and see if it solves the problem? I want to try disabling the
routine partmgr!PmSendTelemetry on your target machine.
First enter the assembler:
a partmgr!PmSendTelemetry
Then type:
ret
And hit enter.
If you want to undo the change and revert to the old behavior just run the
following:
!chkimg -f partmgr.sys
If you can no longer reproduce the behavior with that routine disabled,
here's the long story that sort of explains it (and a mitigation at the
end):
I put an access breakpoint on the device object's reference count and upon
surprise removal PartMgr went, for lack of a better phrase, completely up
its own ass. The "notification work item" was endlessly trying to "send
telemetry" and thus referencing/dereferencing the top of the device stack:
nt!IoGetAttachedDeviceReference+11dde3
partmgr!PmCreatePartition+9d
partmgr!PmUpdatePartitionsLegacyRemovable+bb
partmgr!memset+9fb
partmgr!PmGetDriveLayoutEx+6c
partmgr!PmCollectTelemetry+e3
partmgr!PmSendTelemetry+67
partmgr!PmNotificationWorkItem+380
nt!IopProcessWorkItem+fb
nt!ExpWorkerThread+d8
nt!PspSystemThreadStartup+47
nt!KiStartSystemThread+16
It would eventually stop, but only after overflowing the object reference
trace buffer.
Doing some stepping through here, in the surprise remove path PartMgr wakes
up the notification work item (partmgr!PmWakeupNotificationWorkItem). The
work item then proceeds to call partmgr!PmSendTelemetry, which does some
work and...queues the work item. It keeps doing this until a list is empty.
Looking at the list, the only thing on it at this point is a reference to
the device object that we're trying to surprise remove.
I'm running on a single CPU VM, so the re-queuing of the work item at the
end of the work item causes the surprise remove thread to stall and you get
the "up its ass" behavior (i.e. high priority work item continues to queue
itself until the list is empty, but the list won't be empty until some other
thread gets to run).
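If it helps to picture the stall, here is a toy model of it. This is only a sketch under my own assumptions, not PartMgr code: the function name, the strict "highest priority always wins" scheduler, and the run cap (standing in for whatever eventually breaks the loop on the real system) are all made up for illustration.

```python
def work_item_runs(cpus, max_runs=1000):
    """Count runs of a self-requeuing work item (hypothetical model)."""
    pending = ["device being removed"]  # the only entry on the list
    runs = 0
    while runs < max_runs:
        runs += 1            # CPU 0 always picks the high priority work item
        if cpus > 1:
            pending.clear()  # a second CPU lets the removal thread drain the list
        if not pending:
            break            # list is empty, so the work item stops re-queuing
    return runs


if __name__ == "__main__":
    print("1 CPU:", work_item_runs(1))
    print("2 CPUs:", work_item_runs(2))
```

With one CPU the work item only stops when it hits the cap, because the thread that would empty the list never gets scheduled. With a second CPU in the model the list drains right away and the work item runs once.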
Going back to 1607 this call to partmgr!PmSendTelemetry doesn't exist and,
given that this doesn't look quite right, my SWAG is that there's a race
here causing problems with the reference count on remove. Disabling the
routine seems to cause the system to behave much more sanely on surprise
removal and properly clean up the reference count, hence my suggestion
above.
Of course, this doesn't actually solve your problem because you can't go
disabling things on client machines... If this DOES work then I would suggest
simply removing your Unload entry point and preventing your filter from
unloading. You'll still get dangling device object references out there, but
(if I'm right) this isn't your bug and exists even without your filter.
-scott
OSR
@OSRDrivers