Upper filter not loading on Win10 RS2 (version 1703)

Doron_Holan · July 5, 2017, 10:32am

Do you have AV software installed? Do you get this behavior in other device classes?

Bent from my phone

From: xxxxx@lists.osr.com on behalf of xxxxx@yahoo.com
Sent: Wednesday, July 5, 2017 6:35:52 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Upper filter not loading on Win10 RS2 (version 1703)

Javier,

What does the whole stack look like? Do you install it on a ‘clean’ OS? Is it possible that some user-mode app opens a handle that pins the DO and the DRVO in memory? Also, you could try putting a hardware breakpoint on access to the DO’s ReferenceCount and then running kb on every stop. I guess you won’t get too many stops, so it looks doable and might shed some light on who references you.

—
NTDEV is sponsored by OSR

Visit the list online at: https:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at https:

To unsubscribe, visit the List Server section of OSR Online at https:

Javier_Guerrero · July 5, 2017, 3:43pm

@Mikae, yes, I am testing on a fully clean OS, Windows 10 Enterprise Creators Update (90-day evaluation version).
I also tried putting breakpoints on access to the refcounts; you can check the output earlier in this same thread.
I don’t have my work machine with me at the moment (sorry, different time zones), so I will post the !devstack output next time.

@Doron, no AV software or software of any other kind is installed; the OS is fully clean, with no 3rd party programs involved.
I haven’t tried other device classes, but I suspect it has something to do with the storage stack.

It’s also worth noting that if I set the driver as a LowerFilter, the problem goes away.
It also works OK if I change the order of the UpperFilters values, so that the Toaster filter is first and “partmgr” is second.

Thanks guys for your time!

Jamey_Kirby · July 5, 2017, 4:14pm

Try placing a space in front of the filter driver’s name in
UpperFilters. For some reason, this fixes the odd issue I am seeing.

Javier_Guerrero · July 5, 2017, 4:38pm

@Jamey, just tried that: bugcheck INACCESSIBLE_BOOT_DEVICE (7b) when rebooting.
Note that it’s set as an upper filter for disk devices.

Jamey_Kirby · July 5, 2017, 4:51pm

OK. Mine is a volume filter, and adding a space fixes the issue for my filter.
There is something going on.

Mikae · July 5, 2017, 6:23pm

@Javier

I had a slightly different idea: set the ‘ba w’ breakpoint with ‘kb’ right after the DO is created, and then disconnect the device. That way you monitor all refs/derefs on the device and print the stack for each one. Something like !obtrace but with more control.

Also, you’d better do it on a 32-bit OS, since stack traces work better there. Then paste the output here; it could be interesting to try to figure out what happens there.
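
A minimal sketch of that breakpoint, assuming a 32-bit target and that the filter's device object address is already known. The address 8c5b2b70 is the one from the !devstack output later in this thread and is used here only for illustration; it will differ on every run. The ObjectHeader value in the last command is a placeholder to be replaced with whatever !object prints.

```
$$ Break on writes to DEVICE_OBJECT.ReferenceCount (a LONG at offset +4),
$$ print the stack with kb, then continue with gc
ba w4 8c5b2b70+4 "kb; gc"

$$ The object manager's pointer count lives in the object header instead.
$$ !object prints the ObjectHeader address, and PointerCount is its first field
!object 8c5b2b70

$$ Placeholder: substitute the ObjectHeader address printed by !object
ba w4 <ObjectHeader> "kb; gc"
```

On a 64-bit target the object header's PointerCount is 8 bytes wide, so the second breakpoint would use w8. Hardware breakpoints are limited to four per processor, which is enough for this.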

OSR_Community_User · July 5, 2017, 9:24pm

@Jamey

> OK. Mine is volume filter and adding a space fixes the issue for my filter.
> There is something going on.

Look at the service name in the registry and INF file. Check if it begins with a space.

@Javier

USB sticks must be prepared for safe removal. But with an outstanding reference on the disk upper filter DO, you would not receive the message telling you that the device can be safely unplugged. That is why I think you should not be able to safely remove the device if you have a reference leak.

When you have a USB mass storage device, at least two devices are created: a USB one (enumerated by the USB enumerator) and a disk device enumerated by USBSTOR. When the USB stick is prepared for safe removal, the disk device disappears and the USB device is banged. But in your case, the disk device cannot be deleted.

Javier_Guerrero · July 6, 2017, 4:08am

@Mikae

Tried that. The output is odd because traces keep coming out long after the device has been removed. The funny thing is that after setting the break on memory access, the problem goes away and the driver reloads OK after reinserting the thumb drive. I guess it’s a timing issue or a race condition.

Oh, and the safe removal procedure is always working fine, I get the message telling me I can remove the device.

OTOH, this is the output I get from !devstack command:

(this is right after creating DO):

3: kd> !devstack 0x8c5b2b70
  !DevObj   !DrvObj            !DevExt   ObjectName
  8c5b2b70  \Driver\PSINDvct   8c5b2c28
  883b1bf8  \Driver\partmgr    883b1cb0
  8a16d7c8  \Driver\Disk       8a16d880  DR1
  883357f0  \Driver\USBSTOR    883358a8  0000008b
!DevNode 8d004320 :
  DeviceInst is "USBSTOR\Disk&Ven_General&Prod_UDisk&Rev_5.00\8&19613ac9&0&_&0"
  ServiceName is "disk"

(and this is after safe removal)

2: kd> !devstack 0x8c5b2b70
  !DevObj   !DrvObj            !DevExt   ObjectName
  8c5b2b70  \Driver\PSINDvct   8c5b2c28

(this is the object info after removal)

0: kd> !devobj 8c5b2b70
Device object (8c5b2b70) is for:
\Driver\PSINDvct DriverObject 8a04d870
Current Irp 00000000 RefCount 0 Type 00000007 Flags 00002010
Vpb 8c5016b8 SecurityDescriptor 852749b0 DevExt 8c5b2c28 DevObjExt 8c5b2c98
ExtensionFlags (0x00000812) DOE_DELETE_PENDING, DOE_START_PENDING,
DOE_DEFAULT_SD_PRESENT
Characteristics (0x00040101) FILE_REMOVABLE_MEDIA, FILE_DEVICE_SECURE_OPEN,
FILE_PORTABLE_DEVICE
Device queue is not busy.

I noticed the “DOE_START_PENDING” extension flag. I think it shouldn’t be set at that point, right?

Slava_Imameev · July 6, 2017, 5:59am

I believe this flag is set by the PnP Manager during the IRP_MN_REMOVE_DEVICE phase, in addition to DOE_DELETE_PENDING.

In other cases this flag usually means that IRP_MN_START_DEVICE has not been sent or has failed. If the flag is set, the device is unavailable to user-mode applications, as NtCreateFile returns an error.

Scott_Noone_OSR · July 6, 2017, 11:47am

OK, it wasn’t that I didn’t believe you, but I don’t believe anyone. So I
had to set this up myself and see if I could reproduce the behavior…

I think there’s definitely a bug in PartMgr on build 1703. Can you try an
experiment and see if it solves the problem? I want to try disabling the
routine partmgr!PmSendTelemetry on your target machine.

First enter the assembler:

a partmgr!PmSendTelemetry

Then type:

ret

And hit enter.

If you want to undo the change and revert to the old behavior just run the
following:

!chkimg -f partmgr.sys
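
A quick way to confirm the patch is actually in place before trying to reproduce, assuming symbols for partmgr are loaded. After typing ret, press Enter once more on an empty line to leave the assembler, then check the first instruction of the routine:

```
$$ The first instruction of the routine should now be ret
u partmgr!PmSendTelemetry L1

$$ Summarize where the in-memory image differs from the original binary
!chkimg -d partmgr
```

With the patch applied, !chkimg should report a mismatch at the start of PmSendTelemetry; after the !chkimg -f step above, it should report none there.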

If you can no longer reproduce the behavior with that routine disabled,
here’s the long story that sort of explains it (and a mitigation at the
end):

I put an access breakpoint on the device object’s reference count and upon
surprise removal PartMgr went, for lack of a better phrase, completely up
its own ass. The “notification work item” was endlessly trying to “send
telemetry” and thus referencing/dereferencing the top of the device stack:

nt!IoGetAttachedDeviceReference+11dde3
partmgr!PmCreatePartition+9d
partmgr!PmUpdatePartitionsLegacyRemovable+bb
partmgr!memset+9fb
partmgr!PmGetDriveLayoutEx+6c
partmgr!PmCollectTelemetry+e3
partmgr!PmSendTelemetry+67
partmgr!PmNotificationWorkItem+380
nt!IopProcessWorkItem+fb
nt!ExpWorkerThread+d8
nt!PspSystemThreadStartup+47
nt!KiStartSystemThread+16

It would eventually stop, but only after overflowing the object reference
trace buffer.

Doing some stepping through here, in the surprise remove path PartMgr wakes
up the notification work item (partmgr!PmWakeupNotificationWorkItem). The
work item then proceeds to call partmgr!PmSendTelemetry, which does some
work and…queues the work item. It keeps doing this until a list is empty.
Looking at the list, the only thing on it at this point is a reference to
the device object that we’re trying to surprise remove.

I’m running on a single CPU VM, so the re-queuing of the work item at the
end of the work item causes the surprise remove thread to stall and you get
the “up its ass” behavior (i.e. high priority work item continues to queue
itself until the list is empty, but the list won’t be empty until some other
thread gets to run).

Going back to 1607 this call to partmgr!PmSendTelemetry doesn’t exist and,
given that this doesn’t look quite right, my SWAG is that there’s a race
here causing problems with the reference count on remove. Disabling the
routine seems to cause the system to behave much more sanely on surprise
removal and properly clean up the reference count, hence my suggestion
above.

Of course, this doesn’t actually solve your problem because you can’t go
disabling things on client machines… If this DOES work then I would suggest
simply removing your Unload entry point and preventing your filter from
unloading. You’ll still get dangling device object references out there, but
(if I’m right) this isn’t your bug and exists even without your filter.

-scott
OSR
@OSRDrivers

Javier_Guerrero · July 7, 2017, 2:52am

@scott,

Thanks, that did the trick! So I guess we can confirm the bug. Hopefully it will be addressed soon.

Thanks a lot to all of you guys (H. G., Jamey, Peter, Doron, Slava, Mikae, Scott) for taking the time to read and answer my questions.

Cheers!