Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results


RE: Conflicting statements in

Doron_Holan Member - All Emails Posts: 10,435
Yes, that window exists. You need to put up a gate to stop sending I/O and processing new I/O upon reception of the remove. That also means canceling timers and DPCs and disconnecting interrupts before sending it down. Then you send the remove. Then you wait. Then you do state cleanup. This is what WDF does upon remove.

WDF is not adding anything new here in terms of rules. WDF may be slightly more picky, but the same rules exist in WDM.
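The ordering described above can be sketched as a hypothetical WDM remove handler. This is an illustration only: PDEVICE_EXTENSION and all of its fields (RemovePending, RetryTimer, Interrupt, RemoveLock, NextLower) are assumed names, not a real API.

```c
/* Hypothetical WDM IRP_MN_REMOVE_DEVICE handler following the order above.
   All device-extension names are illustrative. */
NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEVICE_EXTENSION devExt = DeviceObject->DeviceExtension;

    /* 1. Close the gate: dispatch routines check this flag and fail
          new I/O instead of sending it down. */
    devExt->RemovePending = TRUE;

    /* 2. Stop self-generated I/O: cancel timers, flush their DPCs,
          disconnect interrupts. */
    KeCancelTimer(&devExt->RetryTimer);
    KeFlushQueuedDpcs();
    if (devExt->Interrupt != NULL) {
        IoDisconnectInterrupt(devExt->Interrupt);
        devExt->Interrupt = NULL;
    }

    /* 3. Send the remove down the stack. */
    IoSkipCurrentIrpStackLocation(Irp);
    IoCallDriver(devExt->NextLower, Irp);

    /* 4. Wait until every in-flight I/O holding the remove lock drains. */
    IoReleaseRemoveLockAndWait(&devExt->RemoveLock, Irp);

    /* 5. Only now tear down state. */
    IoDetachDevice(devExt->NextLower);
    IoDeleteDevice(DeviceObject);
    return STATUS_SUCCESS;
}
```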

d

Sent from my phone
________________________________
From: Scott Noone
Sent: 11/25/2013 2:41 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about IoReleaseRemoveLockAndWait

OK, so let's say you have a WDM FDO with a KMDF lower filter. The WDM FDO
sends I/Os down the stack in the background under a remove lock. During
remove processing, the FDO:

1. Sends the IRP_MN_REMOVE down the stack

2. Waits on the remove lock

Now we have a window between steps 1 and 2 where I/Os could still be sent to
the lower filter after the lower filter has processed the remove. If the
framework doesn't allow for that, then we'll crash in the same way as we have
here.

I'm thinking that's exactly what the OP is seeing. Here's the crashing
stack:

nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x260
Wdf01000!FxDevice::Dispatch+0x22
Wdf01000!FxDevice::DispatchWithLock+0xa6
nt!IovCallDriver+0x566
nt!KiProcessTimerDpcTable+0x6c
nt!KiProcessExpiredTimerList+0xc6

Clearly KiProcessTimerDpcTable doesn't call IovCallDriver, so the return has
been optimized out. However, the crashing WDF device object is for
"storflt", which is still attached below Disk:

1: kd> !devobj @rcx
Device object (fffffa8003d40370) is for:
\Driver\storflt DriverObject fffffa8002c97e70
Current Irp 00000000 RefCount 0 Type 0000002d Flags 00000010
DevExt 00000000 DevObjExt fffffa8003d404e8
ExtensionFlags (0x9400080a) DOE_DELETE_PENDING, DOE_REMOVE_PROCESSED,
DOE_DESIGNATED_FDO
Unknown flags 0x14000800
AttachedDevice (Upper) fffffa8004062060 \Driver\Disk
Device queue is not busy.

And there's a thread waiting for the equivalent of the remove lock in the
Disk instance:

nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
CLASSPNP!ClassRemoveDevice+0x12a
CLASSPNP! ?? ::NNGAKEGL::`string'+0x17a6
nt!IovCallDriver+0x566
...

Thus I'm assuming that it's Disk calling down into storflt from a timer DPC
and this operation is accounted for in Disk's remove lock. Disk waits for
its remove lock *after* it sends the IRP_MN_REMOVE down to the lower driver,
which would mean we've hit the window between 1 and 2 above.

-scott
OSR

"Doron Holan" wrote in message
news:xxxxx@ntdev...
Yes, you do the wait after the send to let the underlying bus clean up and
complete I/O first. Then you wait on the lock, then you clean up state

d

Sent from my phone
From: Scott Noone
Sent: 11/25/2013 12:42 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about IoReleaseRemoveLockAndWait



Definitely a call into a WDF driver after the remove IRP has been processed.
You need to wait on the remlock after sending the remove IRP


If that's the case, wouldn't that mean you have to wait on the remove lock
*before* sending the remove IRP? Otherwise the window is there between
remove and wait and I/Os arriving at the device.


-scott
OSR


"Doron Holan" wrote in message
news:xxxxx@ntdev...
Definitely a call into a WDF driver after the remove IRP has been processed.
You need to wait on the remlock after sending the remove IRP

d

Sent from my phone
From: Scott Noone
Sent: 11/25/2013 12:00 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about
IoReleaseRemoveLockAndWait


Do you have any timer DPCs in your drivers? It looks to me like someone is
calling into a WDF driver after the device has been removed. From the stack
I can't tell what either of the drivers are though (tail call optimization
obscures the caller, the framework owning the entry points obscures the
callee).

I set a breakpoint on this routine on my test system, and at this point RCX
should contain a device object (RCX+40 is the DeviceExtension field; WDF is
trying to access some header before this field). "!devobj @rcx" or "dt
nt!_DEVICE_OBJECT @rcx" might tell you the driver, which would shed some
light on things.

-scott
OSR

wrote in message news:xxxxx@ntdev...

Thank you Scott.

My driver is WDM. I can recreate the BSOD, but not regularly. Every time I
can, though, it shows my driver's stack to be in the call passing the remove
IRP down, hence my suspicion about the ordering.
I will post my stack as well, but first is the !analyze -v output. I have
Driver Verifier running on the machine and it did not detect anything
special.

<<
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at
an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: ffffffffffffffd0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff880010d0862, address which referenced memory

Debugging Details:
------------------


READ_ADDRESS: ffffffffffffffd0

CURRENT_IRQL: 2

FAULTING_IP:
Wdf01000!FxDevice::Dispatch+22
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0xD1

PROCESS_NAME: System

TRAP_FRAME: fffff8800203d2c0 -- (.trap 0xfffff8800203d2c0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=fffffa8003d40370
rdx=fffff9800b642e10 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880010d0862 rsp=fffff8800203d450 rbp=0000000000000002
r8=fffffa8002c97e70 r9=0000000000000001 r10=fffff9800b642e10
r11=fffff9800b642e02 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
Wdf01000!FxDevice::Dispatch+0x22:
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]
ds:dc18:ffd0=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80001dc9a12 to fffff80001cd0a30

STACK_TEXT:
fffff880`0203ca08 fffff800`01dc9a12 : ffffffff`ffffffd0 fffff880`0201ff40
00000000`00000065 fffff800`01d13888 : nt!RtlpBreakWithStatusInstruction
fffff880`0203ca10 fffff800`01dca7fe : 00000000`00000003 00000000`00000000
fffff800`01d140e0 00000000`000000d1 : nt!KiBugCheckDebugBreak+0x12
fffff880`0203ca70 fffff800`01cd8d04 : fffffa80`036e9e20 fffff880`00e60872
00000000`00000001 fffff800`01d550c4 : nt!KeBugCheck2+0x71e
fffff880`0203d140 fffff800`01cd81a9 : 00000000`0000000a ffffffff`ffffffd0
00000000`00000002 00000000`00000000 : nt!KeBugCheckEx+0x104
fffff880`0203d180 fffff800`01cd6e20 : 00000000`00000000 00000000`a0000003
fffff980`1f920e10 fffff980`0b642e10 : nt!KiBugCheckDispatch+0x69
fffff880`0203d2c0 fffff880`010d0862 : fffffa80`03d40370 00000000`00000002
fffffa80`0473ff40 fffff800`0217ee96 : nt!KiPageFault+0x260
fffff880`0203d450 fffff880`010d0aa6 : fffff980`0b642e10 00000000`00000002
fffffa80`03d40370 fffff880`0203dc18 : Wdf01000!FxDevice::Dispatch+0x22
fffff880`0203d490 fffff800`02182d26 : fffff980`0b642e10 00000000`00000002
fffffa80`03d40370 e83f03ff`ffe00007 :
Wdf01000!FxDevice::DispatchWithLock+0xa6
fffff880`0203d4d0 fffff800`01ce384c : fffff880`0203d5f0 00000000`00000000
00000000`00000001 fffffa80`0473ff40 : nt!IovCallDriver+0x566
fffff880`0203d530 fffff800`01ce36e6 : fffffa80`04360748 fffffa80`04360748
00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x6c
fffff880`0203d5a0 fffff800`01ce35ce : 00000014`79a9100e fffff880`0203dc18
00000000`00089683 fffff880`020185e8 : nt!KiProcessExpiredTimerList+0xc6
fffff880`0203dbf0 fffff800`01ce33b7 : 0000140b`8ff202c5 0000140b`00089683
0000140b`8ff20263 00000000`00000083 : nt!KiTimerExpiration+0x1be
fffff880`0203dc90 fffff800`01cd090a : fffff880`02015180 fffff880`0201ff40
00000000`00000001 fffff800`00000000 : nt!KiRetireDpcList+0x277
fffff880`0203dd40 00000000`00000000 : fffff880`0203e000 fffff880`02038000
fffff880`0203dd00 00000000`00000000 : nt!KiIdleLoop+0x5a


STACK_COMMAND: kb

FOLLOWUP_IP:
Wdf01000!FxDevice::Dispatch+22
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: Wdf01000!FxDevice::Dispatch+22

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: Wdf01000

IMAGE_NAME: Wdf01000.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 5010aa89

FAILURE_BUCKET_ID: X64_0xD1_VRF_Wdf01000!FxDevice::Dispatch+22

BUCKET_ID: X64_0xD1_VRF_Wdf01000!FxDevice::Dispatch+22

Followup: MachineOwner
---------
>>

and the thread's stack related to the PnP device removal that's going
through my drivers (DcsPMF and DcsDf) at the time of the BSOD (note: the
drivers send the IRP down and wait for its completion, so if STATUS_PENDING
is returned there's a KeWaitForSingleObject that wakes up when the IRP
completion routine sets a private event).

<<
*** Stack trace for last set context - .thread/.cxr resets it
Child-SP RetAddr : Args to Child
: Call Site
fffff880`059aac50 fffff800`01cce612 : 00000000`00000017 fffffa80`044153c0
00000000`00000000 00000000`00000008 : nt!KiSwapContext+0x7a
fffff880`059aad90 fffff800`01cdf98f : fffffa80`02ca6001 fffffa80`040621b0
00000000`00000000 00000000`00000000 : nt!KiCommitThreadWait+0x1d2
fffff880`059aae20 fffff880`013e08fa : 00000000`00000000 00000000`00000000
fffffa80`04062000 fffffa80`04062100 : nt!KeWaitForSingleObject+0x19f
fffff880`059aaec0 fffff880`013dcb08 : fffffa80`03d40370 00000000`00000000
00000000`00000002 fffff880`013d7110 : CLASSPNP!ClassRemoveDevice+0x12a
fffff880`059aaf00 fffff800`02182d26 : fffffa80`040621b0 fffffa80`02ca6038
fffff980`1f614c60 fffff800`0217e48e : CLASSPNP! ??
::NNGAKEGL::`string'+0x17a6
fffff880`059aaf90 fffff800`021855ea : fffff980`1f614e50 fffffa80`04062060
fffffa80`044372a0 fffffa80`018fc320 : nt!IovCallDriver+0x566
fffff880`059aaff0 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`044372a0 fffffa80`0421a0b0 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab020 fffff880`017e270c : fffff980`1f614c60 00000000`00000002
fffff980`1f614f18 fffffa80`0421a0b0 : nt!IovCallDriver+0x566
fffff880`059ab080 fffff880`017e6298 : fffffa80`02d26830 fffff980`1f614c60
fffffa80`02d26830 fffffa80`032a1410 :
DcsDf!DiskPerfForwardIrpSynchronous+0xac
fffff880`059ab110 fffff880`017e5778 : fffffa80`02d26830 fffff980`1f614c60
00000000`00000000 00000000`00000000 : DcsDf!DiskPerfRemoveDevice+0x58
fffff880`059ab150 fffff800`02182d26 : fffffa80`02d26830 fffff980`1f614c60
fffff980`1f614c60 fffff800`0217e48e : DcsDf!DiskPerfDispatchPnp+0x148
fffff880`059ab1d0 fffff800`021855ea : fffff980`1f614ee0 fffffa80`02d26830
fffffa80`042f5af0 fffffa80`032a1410 : nt!IovCallDriver+0x566
fffff880`059ab230 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`042f5af0 fffffa80`03212a70 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab260 fffff880`014b6926 : fffff980`1f614c60 00000000`00000002
fffff980`1f614fa8 fffffa80`03212a70 : nt!IovCallDriver+0x566
fffff880`059ab2c0 fffff880`014b82c5 : fffffa80`03405060 fffff980`1f614c60
fffff880`059ab3f0 fffff880`059ab3d8 : DcsPMF!PMFForwardIrpSynchronous+0xd6
fffff880`059ab340 fffff880`014b8d78 : fffffa80`03405060 fffff980`1f614c60
00000000`00000003 00000000`00000000 : DcsPMF!PMFRemoveDevice+0x2e5
fffff880`059ab510 fffff800`02182d26 : fffffa80`03405060 fffff980`1f614c60
fffff980`1f614c60 fffff800`0217e48e : DcsPMF!PMFDispatchPnp+0x738
fffff880`059ab5d0 fffff800`021855ea : fffff980`1f614f70 fffffa80`03405060
fffffa80`03fd2700 fffffa80`043d6eb0 : nt!IovCallDriver+0x566
fffff880`059ab630 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`03fd2700 fffffa80`03b2a010 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab660 fffff880`0108530e : fffff980`1f614c60 fffff980`1f614fb8
00000000`00000000 fffffa80`03b2a010 : nt!IovCallDriver+0x566
fffff880`059ab6c0 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`039c0330 00000000`c00000bb : partmgr!PmPnp+0xfe
fffff880`059ab710 fffff800`01f46d11 : fffffa80`039c0330 fffff880`059ab828
00000000`c00000bb fffffa80`03b44010 : nt!IovCallDriver+0x566
fffff880`059ab770 fffff800`020c5681 : fffffa80`04743060 00000000`00000000
fffffa80`047384a0 00000000`00000801 : nt!IopSynchronousCall+0xe1
fffff880`059ab7e0 fffff800`01ddb063 : fffff8a0`023b6f20 fffff8a0`023b6f20
00000000`00000018 00000000`00000000 : nt!IopRemoveDevice+0x101
fffff880`059ab8a0 fffff800`020c51d4 : fffffa80`047384a0 00000000`00000000
00000000`00000002 00000000`00000000 : nt!PnpRemoveLockedDeviceNode+0x1a3
fffff880`059ab8f0 fffff800`020c52e0 : 00000000`00000000 fffffa80`04743000
fffff8a0`074b8900 fffff800`01edca80 : nt!PnpDeleteLockedDeviceNode+0x44
fffff880`059ab920 fffff800`020c53d9 : fffffa80`04601602 fffffa80`04601600
00000000`00000001 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`059ab990 fffff800`020c5551 : fffffa80`04601600 00000000`00000000
fffffa80`04601600 00000000`00010286 : nt!PnpDelayedRemoveWorker+0x79
fffff880`059ab9e0 fffff800`01ddb289 : 00000000`00000000 fffffa80`035e0a50
00000000`00000001 00000000`00000000 : nt!PnpChainDereferenceComplete+0x131
fffff880`059aba20 fffff800`02156570 : 00000000`00000000 fffffa80`04738400
fffff8a0`0398cef0 00000000`00000001 : nt!PnpIsChainDereferenced+0xc9
fffff880`059abaa0 fffff800`0215680c : fffff880`059abc78 00000000`00010200
fffff880`059abc00 00000000`00000000 : nt!PnpProcessQueryRemoveAndEject+0xff0
fffff880`059abbe0 fffff800`0203f9ae : 00000000`00000000 fffffa80`0466eb00
fffff8a0`0398cef0 fffff880`059abc00 : nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`059abc10 fffff800`01ce2251 : fffff800`01f45b98 fffff8a0`0398cef0
fffff800`01e7e2d8 fffff800`01e7e2d8 : nt! ?? ::NNGAKEGL::`string'+0x552ab
fffff880`059abc70 fffff800`01f76ede : 00000000`00000000 fffffa80`044153c0
00000000`00000080 fffffa80`01876890 : nt!ExpWorkerThread+0x111
fffff880`059abd00 fffff800`01cc9906 : fffff880`02015180 fffffa80`044153c0
fffffa80`028dcb50 00000000`00000202 : nt!PspSystemThreadStartup+0x5a
fffff880`059abd40 00000000`00000000 : fffff880`059ac000 fffff880`059a6000
fffff880`059aa8a0 00000000`00000000 : nt!KxStartSystemThread+0x16
>>


---
NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer



Comments

  • Thierry_Laurent Member Posts: 7
    Scott, Doron: The current code in our drivers performs the IoReleaseRemoveLockAndWait() *before* passing the remove IRP down, and the BSOD I get is with this configuration. In addition, there's no timer DPC routine or self-generated IRP from our drivers that can go down after the remove IRP is received.

    Of course I can try to move the IoReleaseRemoveLockAndWait *after*, as Doron advises, but then I'll need to create an extra mechanism to guard against new I/Os passing in (the gate), which would resemble a lot of what the 'master' remove lock is supposed to do.

    I can create a second remove lock just for the I/O gate purpose (I call it the 'gate lock'), acquire it wherever I acquire the 'master' remove lock, and use IoReleaseRemoveLockAndWait('gate lock') to prevent further I/O access *before* passing the remove IRP down. Then do an IoReleaseRemoveLockAndWait('master lock') *after* the remove IRP comes back.
    But wouldn't doing all that be equivalent to what we do today in the driver (i.e., having the IoReleaseRemoveLockAndWait() call *before* the remove IRP is passed down)?

    Thierry
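The two-lock scheme described above would look roughly like this in the remove path. A sketch only: GateLock, MasterLock, devExt, and NextLower are illustrative names, and both locks are assumed to be acquired on every I/O path.

```c
/* Hypothetical 'gate lock' + 'master lock' ordering in the remove handler.
   Only the placement of the two waits differs from the single-lock case. */
IoReleaseRemoveLockAndWait(&devExt->GateLock, Irp);   /* no new I/O past here */
IoSkipCurrentIrpStackLocation(Irp);
IoCallDriver(devExt->NextLower, Irp);                 /* forward IRP_MN_REMOVE */
IoReleaseRemoveLockAndWait(&devExt->MasterLock, Irp); /* drain in-flight I/O */
```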
  • Alex_Grig Member Posts: 3,238
    >In addition there's no timer DPC routine or self-generated IRPs from our drivers that can go down after the remove IRP is received.

    Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is definitely when the timer fired.
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    Right now, my best guess is that this is an issue between the Disk class driver and KMDF's hard requirement of no I/O after remove. If I'm right, there isn't anything the OP can do about it in their driver. I need to do some more code reading tomorrow, though, to make sure that I'm not missing something.

    -scott
    OSR

  • OSR_Community_User Member Posts: 110,217
    >>In addition there's no timer DPC routine or self-generated IRPs from our
    >> drivers that can go down after the remove IRP is received.
    >
    > Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is
    > definitely when the timer fired.
    >
    >
    OMG! Some bugs never change! We used to get a crash on TSS/360, ca.
    1968, at precisely 2pm, several days a week. Turns out that when a
    process logged out (remember, in those days when you logged on you got a
    process and that was your one-and-only process. From time to time, you
    might cause it to execute code from some file instead of its command loop,
    but that was it: one process) with a timer set, the timer was not deleted,
    and when the timer fired, the system crashed. Our courses started at 20
    minutes after the hour, so one of our professors set up a notification
    alarm: at 2:00, it typed out (literally; we were more than a decade from
    using CRTs as devices!) "Time to get ready for class". So if he never
    logged on, the system was healthy; if he logged on and was still logged on
    at 2:00, the system was healthy, but if he logged on, then logged out
    before 2pm, it was a "System ABEND", which is what we call the BSOD. In
    this case, it typed out "System ABEND" on the console, saved a snapshot of
    the memory to a file on disk, and waited for the operator to reboot it.
    Auto-reboot? Who would want that when you have a human being watching the
    console typeout for every minute the computer is servicing users?

    Oh, yes, and it had this huge, obnoxious bell, the kind you found in
    building fire alarms, inside one of the cabinets. When it crashed, it
    rang the bell. Even a zombie (or a third-shift operator, they were hard
    to tell apart) could not sleep through that bell.

    Here we are, 45 years later, with the same problem!
    joe

  • OSR_Community_User Member Posts: 110,217
    > Scott, Doron: The current code in our drivers currently performs the
    > IoReleaseRemoveLockAndWait() *before* passing the remove IRP down and the
    > BSOD I get is with this configuration. In addition there's no timer DPC
    > routine or self-generated IRPs from our drivers that can go down after the
    > remove IRP is received.
    >
    > Of course I can try to move the IoReleaseRemoveLockAndWait *after* as
    > Doron advises but then I'll need to create this extra mechanism to guard
    > from new I/Os passing in (the gate), which would resemble a lot like what
    > the 'master' remove lock is supposed to do.

    No. The master remove lock blocks the processing of the remove IRP until
    the driver has quiesced. It is in no way involved, ever, in blocking
    incoming IRPs. In WDM drivers, one of the things you had to do was flush
    any IRPs you were queueing because of the IRP_MN_STOP IRP (there could be
    another IRP_MN_START, in which case you dequeued those IRPs and did what
    they asked). When the IRP_MN_STOP comes in, you have to "close the gate"
    against new IRPs, so if you haven't done this, you are already in trouble.
    After the IRP_MN_STOP, any device registers, I/O ports, or interrupts you
    might have set are toast. You will redo these values when the
    IRP_MN_START comes in.

    I don't know how WDF handles this, but it has GOT to be better than the
    horrible mess that WDM was.

    When you flushed the queue during a remove, you completed the IRPs with an
    error code, such as that the device was no longer available. Any new IRPs
    were not even queued, but immediately completed with the same error.

    However, there may have been any number of IRPs "in flight" when the
    IRP_MN_STOP comes in, and they have to complete before you can allow
    processing of the IRP_MN_REMOVE. It is your responsibility to decide what
    to do with an IRP that is "in flight". For example, if it is queued for
    the device to process, you might just move it to the pending queue (you
    have to do this for the IRP_MN_STOP as well). If it has already activated
    the device, you have to decide how you are going to stop the device, such
    as aborting a DMA request, or let the hardware finish what it is doing, as
    appropriate for the problem space of the device.

    My understanding is that the default behavior of WDF is supposed to
    support this, and when and how to do these things is controlled by that
    "state machine" of song and legend, the one that is *actually* correct,
    instead of the mess of nearly-right-and-maybe-fatally-wrong implementation
    models we were given in WDM.

    Basically, if you don't already close that gate for new IRPs in your
    IRP_MN_STOP handler, your driver is wrong. Once you let the stop proceed,
    there is no hardware to talk to. It might come back (after a bus
    reconfiguration) or it might be followed by an IRP_MN_REMOVE.
    joe

    >
    > I can create a second remove lock just for the I/O gate purpose with the
    > remove IRP (I call it the 'gate lock'), acquire it in addition wherever I
    > acquire the 'master' remove lock and use IoReleaseRemoveLockAndWait('gate
    > lock') to prevent further I/O access *before* passing the IRP remove down.
    > Then do a IoReleaseRemoveLockAndWait('master lock') *after* the remove
    > IRP comes back.
    > But wouldn't doing all that be equivalent to what we do today in the
    > driver (i.e. having the IoReleaseRemoveLockAndWait() call *before* the
    > remove IRP is passed down) ?
    >
    > Thierry
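The "close the gate" rule described above can be modeled in a few lines of ordinary C. This is a user-mode model with made-up names, not a driver API: new requests are rejected the moment the gate closes, and remove may proceed only once the in-flight count drains to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* User-mode model of the stop/remove gate; all names are illustrative. */
typedef struct {
    bool gate_closed;  /* set when IRP_MN_STOP/REMOVE processing begins */
    int  in_flight;    /* requests admitted but not yet completed */
} device_state;

/* Returns true if the request was admitted; false = complete with error. */
bool submit_io(device_state *dev)
{
    if (dev->gate_closed)
        return false;      /* new I/O is failed immediately, never queued */
    dev->in_flight++;
    return true;
}

void complete_io(device_state *dev) { dev->in_flight--; }

void close_gate(device_state *dev) { dev->gate_closed = true; }

/* Remove may proceed only when the gate is closed and no I/O remains. */
bool can_remove(const device_state *dev)
{
    return dev->gate_closed && dev->in_flight == 0;
}
```

Everything a real driver does with cancel routines and events reduces to keeping these two pieces of state consistent.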
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    As far as I can tell, this is a legitimate bug that is entirely between
    ClassPnP/Disk and the framework. All it would take is a lower disk filter
    written in KMDF and some really bad timing.

    Using the 8.1 WDK ClassPnP source as a guide, here's the general flow of
    processing an I/O request in the class driver:

    1. ClassReadWrite is called, acquires the remove lock

    2. ServiceTransferRequest calls SubmitTransferPacket, which sends the IRP
    down to the lower filter with completion routine TransferPktComplete set

    3. TransferPktComplete checks to see if the request was successful. If it is
    successful, the remove lock is dropped and the function returns. If it is not
    successful, the remove lock is *not* dropped and a timer DPC is queued to
    retry the operation (by calling SubmitTransferPacket, which brings us back
    to Step 2).
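Step 3 might look something like the following sketch. This is not the actual ClassPnP source; the names (devExt, RetryTimer, RetryDpc) and the retry delay are assumed for illustration.

```c
/* Sketch: on failure the remove lock is deliberately NOT released, so the
   pending retry keeps counting against the remove lock. Illustrative only. */
NTSTATUS TransferPktCompleteSketch(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PDEVICE_EXTENSION devExt = (PDEVICE_EXTENSION)Context;

    if (NT_SUCCESS(Irp->IoStatus.Status)) {
        IoReleaseRemoveLock(&devExt->RemoveLock, Irp);  /* done with this I/O */
        return STATUS_SUCCESS;
    }

    /* Failure: keep the lock held and queue a timer DPC that re-submits the
       packet, bringing us back to step 2. */
    LARGE_INTEGER due;
    due.QuadPart = -100 * 10000LL;                      /* 100 ms, relative */
    KeSetTimer(&devExt->RetryTimer, due, &devExt->RetryDpc);
    return STATUS_MORE_PROCESSING_REQUIRED;
}
```

The remove-lock wait cannot finish while a retry is pending, which is exactly why an uncanceled retry timer firing after IRP_MN_REMOVE is so dangerous.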

    On the PnP side, during remove processing the class driver sends the
    IRP_MN_REMOVE IRP down to the lower driver and then waits on the remove
    lock. As far as I can tell, no attempt is made to make sure that the
    background retry timers are canceled/flushed prior to sending the
    IRP_MN_REMOVE down to the lower driver.

    It's not clear to me that resolving this is simply a matter of making the
    timers go away. It appears as though IRPs being dispatched might also have a
    chance to squeeze in between the IRP_MN_REMOVE and the remove lock being
    released (i.e. I'm not seeing an extra properly serialized state variable
    used to account for this window).

    Bottom line to the OP: I'm not an expert in this driver, so it's possible
    that I'm missing a subtlety in how either of these are handled. However, I
    think you've stumbled upon an O/S bug that's going to require involving
    Microsoft Support to resolve.

    -scott
    OSR


  • Tim_Roberts Member - All Emails Posts: 13,019
    xxxxx@flounder.com wrote:
    >>
    >> Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is
    >> definitely when the timer fired.
    > OMG! Some bugs never change! We used to get a crash on TSS/360, ca.
    > 1968, at precisely 2pm, several days a week.

    You are right about that. In 2004, the communications system in the
    Palmdale Air Traffic Control Center in southern California suddenly
    failed. Controllers couldn't reach aircraft or each other.

    The story is that, because of a maintenance screwup, the computer
    controlling the comm system had been running for about 50 days without
    rebooting. Those of you with a memory for numbers will recall that, if
    you are counting milliseconds, 2^32 of them is about 49.7 days, so day
    50 is when that count needs a 33rd bit. That's the point when GetTickCount rolls over.

    There is now an established procedure to reboot the comm computer once a
    week. I would have thought it was better to fix the app bug, but I must
    not think like a government contractor any more.

    --
    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.

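The arithmetic behind that rollover, as a plain C sketch: a 32-bit millisecond counter wraps after 2^32 ms (about 49.7 days), and an elapsed-time computation survives the wrap only if it is done in unsigned 32-bit arithmetic rather than widened into a larger signed type.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe elapsed time for a 32-bit millisecond tick counter (the
   GetTickCount style of counter). Unsigned subtraction is well-defined
   modulo 2^32, so the result stays correct across the rollover. */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return now - start;
}
```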

  • Thierry_Laurent Member Posts: 7
    Meanwhile I have re-checked that all operations from my driver going downstream are performed with the remove lock held. There were a few places where the remove lock was released right after coming back from IoCallDriver, which could cause an issue if STATUS_PENDING was returned, so I put the remove lock release in a completion routine added to the IRP. The same BSOD occurs.

    I am now trying a new build where I moved the IoReleaseRemoveLockAndWait *after* passing the remove IRP down. For debugging I am also tracing whether any read/write comes in while the PnP state is not 'started', at least to verify whether some I/Os do sneak in between passing the remove down and the IoReleaseRemoveLockAndWait.
    I'll post the results here.
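The completion-routine pattern mentioned above, sketched (devExt, RemoveLock, and NextLower are illustrative names):

```c
/* Release the remove lock only when the IRP actually completes, so a
   STATUS_PENDING return from IoCallDriver still holds the lock. Sketch. */
static NTSTATUS ReleaseLockOnComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PIO_REMOVE_LOCK lock = (PIO_REMOVE_LOCK)Context;
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    IoReleaseRemoveLock(lock, Irp);
    return STATUS_CONTINUE_COMPLETION;
}

NTSTATUS ForwardUnderRemoveLock(PDEVICE_EXTENSION devExt, PIRP Irp)
{
    NTSTATUS status = IoAcquireRemoveLock(&devExt->RemoveLock, Irp);
    if (!NT_SUCCESS(status)) {
        Irp->IoStatus.Status = status;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return status;
    }
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, ReleaseLockOnComplete,
                           &devExt->RemoveLock, TRUE, TRUE, TRUE);
    return IoCallDriver(devExt->NextLower, Irp);
}
```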
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    Is StorFlt your driver as well? Try removing it as a lower disk filter and
    see if the problem goes away (I suspect it will).

    -scott
    OSR

    wrote in message news:xxxxx@ntdev...

    Meanwhile I have re-checked that all operations from my driver going
    downstream were under the remove lock held. There were a few places where
    the remove lock was released right after coming back from IoCallDriver and
    that could cause an issue if it was STATUS_PENDING so I put the release
    remove lock in a completion routine added to the IRP. Same BSOD occurs.

    I am now trying a new build where I did move the removelockandwait *after*
    passing the remove IRP down. For debugging I am also tracing if any
    read/write comes in while the PnP state is not in the 'started' state, at
    least to verify if some I/Os do sneak between passing the remove down and
    the removelockandwait.
    I'll post the results here.


  • Scott_Noone_(OSR) Administrator Posts: 3,142
    **Update and Resolution**

    I took a look at one of these dumps and worked with Thierry to figure out
    what was causing this crash. For those not following along at home, the
    configuration was:

    1. Standard Disk class driver
    2. In-box KMDF lower disk filter
    3. Bunch of third party drivers playing with storage, including filters

    The crash happened in the KMDF filter. The call stack was from Disk sending
    down a retried I/O operation from a timer callback *after* it had processed
    IRP_MN_REMOVE. KMDF crashes if I/O is sent to a framework device after
    IRP_MN_REMOVE has been received, which hardens the rule of no I/O while in
    this state and explains the bugcheck.

    Of course, the question then becomes: whose responsibility is it to make
    sure that Disk cleans up this dangling I/O before the IRP_MN_REMOVE is sent
    down the stack? My first inclination was that this should be Disk's problem
    (and I could still argue that), but something had to be unique about
    Thierry's environment or this would be more common. As it turned out, this
    offending I/O was generated by one of the third party drivers, which was of
    course suspicious.

    The driver appeared to do things correctly. It was careful to maintain an
    open File Object to the disk device while performing I/O, closed that File
    Object as a result of surprise remove notification, and did not send any
    further I/O to the device after processing the surprise remove operation.
    However, it did not *reference* the File Object for each I/O submitted (I
    bet we all remember that step in our driver generated I/O, right?). Thus,
    the sequence of events was:

    1. DriverA creates File Object to Disk

    2. DriverA submits I/O to Disk

    3. User yanks device out, Disk gets IRP_MN_SURPRISE_REMOVE

    4. Disk gets an I/O error from the pending I/O, queues a retry to a timer
    DPC

    5. DriverA receives the surprise remove notification, closes its handle to
    the File Object

    6. No more references to Disk, Disk gets IRP_MN_REMOVE_DEVICE and forwards
    it to KMDF

    7. Disk timer DPC fires, sends I/O down to the jacked up KMDF filter
    instance

    8. Boom...

    To fix this without changing Disk, DriverA now takes a reference to the File
    Object at step 2 and drops it in a registered completion routine. This
    delays the IRP_MN_REMOVE_DEVICE IRP from being sent to Disk until the last
    outstanding I/O operation completes and resolves the issue. Alternatively,
    they could have created a mechanism for flushing out any pending I/O
    operations at step 5.
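The fix can be sketched as follows. The names are illustrative, and the sketch assumes the driver-built IRP has a spare stack location available for the completion routine:

```c
/* Pin the target device for the lifetime of each driver-generated I/O by
   referencing the File Object until that I/O completes. Sketch only. */
static NTSTATUS DropFileObjectRef(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    ObDereferenceObject((PFILE_OBJECT)Context);  /* reference taken at submit */
    return STATUS_CONTINUE_COMPLETION;
}

NTSTATUS SubmitIoToDisk(PDEVICE_OBJECT DiskDevice, PFILE_OBJECT FileObject, PIRP Irp)
{
    ObReferenceObject(FileObject);               /* keeps IRP_MN_REMOVE_DEVICE away */
    IoSetCompletionRoutine(Irp, DropFileObjectRef, FileObject, TRUE, TRUE, TRUE);
    return IoCallDriver(DiskDevice, Irp);
}
```

With the per-I/O reference in place, PnP cannot send IRP_MN_REMOVE_DEVICE to Disk until the completion routine drops the last reference.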

    I wanted to make sure I got this out there as it's an issue that more than a
    few people have a chance of bumping into, especially with more and more KMDF
    filters showing up and enforcing this rule. The documentation for
    GUID_TARGET_DEVICE_REMOVE_COMPLETE should probably provide more guidance in
    this area, as of right now it simply says to close the handle and remove any
    reference to the file object (which is exactly what the OP was doing).

    -scott
    OSR

