Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results


RE: Conflicting statements in

Doron_Holan Member - All Emails Posts: 10,435
Yes, that window exists. You need to put up a gate to stop sending I/O and processing new I/O upon reception of the remove. That also means canceling timers and DPCs and disconnecting interrupts before sending it down. Then you send the remove. Then you wait. Then you do state cleanup. This is what WDF does upon remove.

WDF is not adding anything new here in terms of rules. WDF may be slightly more picky, but the same rules exist in WDM.
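The ordering described above can be sketched as a hypothetical WDM remove handler. This is an illustration only: PDEVICE_EXTENSION and all of its fields (RemovePending, RetryTimer, Interrupt, RemoveLock, NextLower) are assumed names, not a real API.

```c
/* Hypothetical WDM IRP_MN_REMOVE_DEVICE handler following the order above.
   All device-extension names are illustrative. */
NTSTATUS HandleRemoveDevice(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PDEVICE_EXTENSION devExt = DeviceObject->DeviceExtension;

    /* 1. Close the gate: dispatch routines check this flag and fail
          new I/O instead of sending it down. */
    devExt->RemovePending = TRUE;

    /* 2. Stop self-generated I/O: cancel timers, flush their DPCs,
          disconnect interrupts. */
    KeCancelTimer(&devExt->RetryTimer);
    KeFlushQueuedDpcs();
    if (devExt->Interrupt != NULL) {
        IoDisconnectInterrupt(devExt->Interrupt);
        devExt->Interrupt = NULL;
    }

    /* 3. Send the remove down the stack. */
    IoSkipCurrentIrpStackLocation(Irp);
    IoCallDriver(devExt->NextLower, Irp);

    /* 4. Wait until every in-flight I/O holding the remove lock drains. */
    IoReleaseRemoveLockAndWait(&devExt->RemoveLock, Irp);

    /* 5. Only now tear down state. */
    IoDetachDevice(devExt->NextLower);
    IoDeleteDevice(DeviceObject);
    return STATUS_SUCCESS;
}
```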

d

Sent from my phone
________________________________
From: Scott Noone
Sent: 11/25/2013 2:41 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about IoReleaseRemoveLockAndWait

OK, so let's say you have a WDM FDO with a KMDF lower filter. The WDM FDO
sends I/Os down the stack in the background under a remove lock. During
remove processing, the FDO:

1. Sends the IRP_MN_REMOVE down the stack

2. Waits on the remove lock

Now we have a window between steps 1 and 2 where I/Os could still be sent to
the lower filter after the lower filter has processed the remove. If the
framework doesn't allow for that, then we'll crash in the same way as we have
here.

I'm thinking that's exactly what the OP is seeing. Here's the crashing
stack:

nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x260
Wdf01000!FxDevice::Dispatch+0x22
Wdf01000!FxDevice::DispatchWithLock+0xa6
nt!IovCallDriver+0x566
nt!KiProcessTimerDpcTable+0x6c
nt!KiProcessExpiredTimerList+0xc6

Clearly KiProcessTimerDpcTable doesn't call IovCallDriver, so the return has
been optimized out. However, the crashing WDF device object is for
"storflt", which is still attached below Disk:

1: kd> !devobj @rcx
Device object (fffffa8003d40370) is for:
\Driver\storflt DriverObject fffffa8002c97e70
Current Irp 00000000 RefCount 0 Type 0000002d Flags 00000010
DevExt 00000000 DevObjExt fffffa8003d404e8
ExtensionFlags (0x9400080a) DOE_DELETE_PENDING, DOE_REMOVE_PROCESSED,
DOE_DESIGNATED_FDO
Unknown flags 0x14000800
AttachedDevice (Upper) fffffa8004062060 \Driver\Disk
Device queue is not busy.

And there's a thread waiting for the equivalent of the remove lock in the
Disk instance:

nt!KiSwapContext+0x7a
nt!KiCommitThreadWait+0x1d2
nt!KeWaitForSingleObject+0x19f
CLASSPNP!ClassRemoveDevice+0x12a
CLASSPNP! ?? ::NNGAKEGL::`string'+0x17a6
nt!IovCallDriver+0x566
...

Thus I'm assuming that it's Disk calling down into storflt from a timer DPC
and this operation is accounted for in Disk's remove lock. Disk waits for
its remove lock *after* it sends the IRP_MN_REMOVE down to the lower driver,
which would mean we've hit the window between 1 and 2 above.

-scott
OSR

"Doron Holan" wrote in message
news:xxxxx@ntdev...
Yes, you do the wait after the send to let the underlying bus clean up and
complete I/O first. Then you wait on the lock, then you clean up state

d

Sent from my phone
From: Scott Noone
Sent: 11/25/2013 12:42 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about IoReleaseRemoveLockAndWait



Definitely a call into a WDF driver after the remove IRP has been processed.
You need to wait on the remlock after sending the remove IRP


If that's the case, wouldn't that mean you have to wait on the remove lock
*before* sending the remove IRP? Otherwise the window is there between
remove and wait and I/Os arriving at the device.


-scott
OSR


"Doron Holan" wrote in message
news:xxxxx@ntdev...
Definitely a call into a WDF driver after the remove IRP has been processed.
You need to wait on the remlock after sending the remove IRP

d

Sent from my phone
From: Scott Noone
Sent: 11/25/2013 12:00 PM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] Conflicting statements in MSDN about
IoReleaseRemoveLockAndWait


Do you have any timer DPCs in your drivers? It looks to me like someone is
calling into a WDF driver after the device has been removed. From the stack
I can't tell what either of the drivers are though (tail call optimization
obscures the caller, the framework owning the entry points obscures the
callee).

I set a breakpoint on this routine on my test system, and at this point RCX
should contain a device object (RCX+40 is the DeviceExtension field; WDF is
trying to access some header before this field). "!devobj @rcx" or "dt
nt!_DEVICE_OBJECT @rcx" might tell you the driver, which would shed some
light on things.

-scott
OSR

wrote in message news:xxxxx@ntdev...

Thank you Scott.

My driver is WDM. I can recreate the BSOD, but not regularly. Every time I
can, though, it shows my driver's stack to be in the call passing the remove
IRP down, hence my suspicion about the ordering.
I will post my stack as well, but first is the !analyze -v output. I have
Driver Verifier running on the machine and it did not detect anything
special.

<<
DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)
An attempt was made to access a pageable (or completely invalid) address at
an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If kernel debugger is available get stack backtrace.
Arguments:
Arg1: ffffffffffffffd0, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, value 0 = read operation, 1 = write operation
Arg4: fffff880010d0862, address which referenced memory

Debugging Details:
------------------


READ_ADDRESS: ffffffffffffffd0

CURRENT_IRQL: 2

FAULTING_IP:
Wdf01000!FxDevice::Dispatch+22
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0xD1

PROCESS_NAME: System

TRAP_FRAME: fffff8800203d2c0 -- (.trap 0xfffff8800203d2c0)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000000000000 rbx=0000000000000000 rcx=fffffa8003d40370
rdx=fffff9800b642e10 rsi=0000000000000000 rdi=0000000000000000
rip=fffff880010d0862 rsp=fffff8800203d450 rbp=0000000000000002
r8=fffffa8002c97e70 r9=0000000000000001 r10=fffff9800b642e10
r11=fffff9800b642e02 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl zr na po nc
Wdf01000!FxDevice::Dispatch+0x22:
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]
ds:dc18:ffd0=????????????????
Resetting default scope

LAST_CONTROL_TRANSFER: from fffff80001dc9a12 to fffff80001cd0a30

STACK_TEXT:
fffff880`0203ca08 fffff800`01dc9a12 : ffffffff`ffffffd0 fffff880`0201ff40
00000000`00000065 fffff800`01d13888 : nt!RtlpBreakWithStatusInstruction
fffff880`0203ca10 fffff800`01dca7fe : 00000000`00000003 00000000`00000000
fffff800`01d140e0 00000000`000000d1 : nt!KiBugCheckDebugBreak+0x12
fffff880`0203ca70 fffff800`01cd8d04 : fffffa80`036e9e20 fffff880`00e60872
00000000`00000001 fffff800`01d550c4 : nt!KeBugCheck2+0x71e
fffff880`0203d140 fffff800`01cd81a9 : 00000000`0000000a ffffffff`ffffffd0
00000000`00000002 00000000`00000000 : nt!KeBugCheckEx+0x104
fffff880`0203d180 fffff800`01cd6e20 : 00000000`00000000 00000000`a0000003
fffff980`1f920e10 fffff980`0b642e10 : nt!KiBugCheckDispatch+0x69
fffff880`0203d2c0 fffff880`010d0862 : fffffa80`03d40370 00000000`00000002
fffffa80`0473ff40 fffff800`0217ee96 : nt!KiPageFault+0x260
fffff880`0203d450 fffff880`010d0aa6 : fffff980`0b642e10 00000000`00000002
fffffa80`03d40370 fffff880`0203dc18 : Wdf01000!FxDevice::Dispatch+0x22
fffff880`0203d490 fffff800`02182d26 : fffff980`0b642e10 00000000`00000002
fffffa80`03d40370 e83f03ff`ffe00007 :
Wdf01000!FxDevice::DispatchWithLock+0xa6
fffff880`0203d4d0 fffff800`01ce384c : fffff880`0203d5f0 00000000`00000000
00000000`00000001 fffffa80`0473ff40 : nt!IovCallDriver+0x566
fffff880`0203d530 fffff800`01ce36e6 : fffffa80`04360748 fffffa80`04360748
00000000`00000000 00000000`00000000 : nt!KiProcessTimerDpcTable+0x6c
fffff880`0203d5a0 fffff800`01ce35ce : 00000014`79a9100e fffff880`0203dc18
00000000`00089683 fffff880`020185e8 : nt!KiProcessExpiredTimerList+0xc6
fffff880`0203dbf0 fffff800`01ce33b7 : 0000140b`8ff202c5 0000140b`00089683
0000140b`8ff20263 00000000`00000083 : nt!KiTimerExpiration+0x1be
fffff880`0203dc90 fffff800`01cd090a : fffff880`02015180 fffff880`0201ff40
00000000`00000001 fffff800`00000000 : nt!KiRetireDpcList+0x277
fffff880`0203dd40 00000000`00000000 : fffff880`0203e000 fffff880`02038000
fffff880`0203dd00 00000000`00000000 : nt!KiIdleLoop+0x5a


STACK_COMMAND: kb

FOLLOWUP_IP:
Wdf01000!FxDevice::Dispatch+22
fffff880`010d0862 4c8b50d0 mov r10,qword ptr [rax-30h]

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: Wdf01000!FxDevice::Dispatch+22

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: Wdf01000

IMAGE_NAME: Wdf01000.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 5010aa89

FAILURE_BUCKET_ID: X64_0xD1_VRF_Wdf01000!FxDevice::Dispatch+22

BUCKET_ID: X64_0xD1_VRF_Wdf01000!FxDevice::Dispatch+22

Followup: MachineOwner
---------
>>

and the thread's stack related to the PnP device removal that's going
through my drivers (DcsPMF and DcsDf) at the time of the BSOD (note: the
drivers send the IRP down and wait for its completion, so if STATUS_PENDING
is returned there's a KeWaitForSingleObject that wakes up when the IRP
completion routine sets a private event).

<<
*** Stack trace for last set context - .thread/.cxr resets it
Child-SP RetAddr : Args to Child
: Call Site
fffff880`059aac50 fffff800`01cce612 : 00000000`00000017 fffffa80`044153c0
00000000`00000000 00000000`00000008 : nt!KiSwapContext+0x7a
fffff880`059aad90 fffff800`01cdf98f : fffffa80`02ca6001 fffffa80`040621b0
00000000`00000000 00000000`00000000 : nt!KiCommitThreadWait+0x1d2
fffff880`059aae20 fffff880`013e08fa : 00000000`00000000 00000000`00000000
fffffa80`04062000 fffffa80`04062100 : nt!KeWaitForSingleObject+0x19f
fffff880`059aaec0 fffff880`013dcb08 : fffffa80`03d40370 00000000`00000000
00000000`00000002 fffff880`013d7110 : CLASSPNP!ClassRemoveDevice+0x12a
fffff880`059aaf00 fffff800`02182d26 : fffffa80`040621b0 fffffa80`02ca6038
fffff980`1f614c60 fffff800`0217e48e : CLASSPNP! ??
::NNGAKEGL::`string'+0x17a6
fffff880`059aaf90 fffff800`021855ea : fffff980`1f614e50 fffffa80`04062060
fffffa80`044372a0 fffffa80`018fc320 : nt!IovCallDriver+0x566
fffff880`059aaff0 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`044372a0 fffffa80`0421a0b0 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab020 fffff880`017e270c : fffff980`1f614c60 00000000`00000002
fffff980`1f614f18 fffffa80`0421a0b0 : nt!IovCallDriver+0x566
fffff880`059ab080 fffff880`017e6298 : fffffa80`02d26830 fffff980`1f614c60
fffffa80`02d26830 fffffa80`032a1410 :
DcsDf!DiskPerfForwardIrpSynchronous+0xac
fffff880`059ab110 fffff880`017e5778 : fffffa80`02d26830 fffff980`1f614c60
00000000`00000000 00000000`00000000 : DcsDf!DiskPerfRemoveDevice+0x58
fffff880`059ab150 fffff800`02182d26 : fffffa80`02d26830 fffff980`1f614c60
fffff980`1f614c60 fffff800`0217e48e : DcsDf!DiskPerfDispatchPnp+0x148
fffff880`059ab1d0 fffff800`021855ea : fffff980`1f614ee0 fffffa80`02d26830
fffffa80`042f5af0 fffffa80`032a1410 : nt!IovCallDriver+0x566
fffff880`059ab230 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`042f5af0 fffffa80`03212a70 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab260 fffff880`014b6926 : fffff980`1f614c60 00000000`00000002
fffff980`1f614fa8 fffffa80`03212a70 : nt!IovCallDriver+0x566
fffff880`059ab2c0 fffff880`014b82c5 : fffffa80`03405060 fffff980`1f614c60
fffff880`059ab3f0 fffff880`059ab3d8 : DcsPMF!PMFForwardIrpSynchronous+0xd6
fffff880`059ab340 fffff880`014b8d78 : fffffa80`03405060 fffff980`1f614c60
00000000`00000003 00000000`00000000 : DcsPMF!PMFRemoveDevice+0x2e5
fffff880`059ab510 fffff800`02182d26 : fffffa80`03405060 fffff980`1f614c60
fffff980`1f614c60 fffff800`0217e48e : DcsPMF!PMFDispatchPnp+0x738
fffff880`059ab5d0 fffff800`021855ea : fffff980`1f614f70 fffffa80`03405060
fffffa80`03fd2700 fffffa80`043d6eb0 : nt!IovCallDriver+0x566
fffff880`059ab630 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`03fd2700 fffffa80`03b2a010 : nt!ViFilterDispatchPnp+0xea
fffff880`059ab660 fffff880`0108530e : fffff980`1f614c60 fffff980`1f614fb8
00000000`00000000 fffffa80`03b2a010 : nt!IovCallDriver+0x566
fffff880`059ab6c0 fffff800`02182d26 : fffff980`1f614c60 00000000`00000002
fffffa80`039c0330 00000000`c00000bb : partmgr!PmPnp+0xfe
fffff880`059ab710 fffff800`01f46d11 : fffffa80`039c0330 fffff880`059ab828
00000000`c00000bb fffffa80`03b44010 : nt!IovCallDriver+0x566
fffff880`059ab770 fffff800`020c5681 : fffffa80`04743060 00000000`00000000
fffffa80`047384a0 00000000`00000801 : nt!IopSynchronousCall+0xe1
fffff880`059ab7e0 fffff800`01ddb063 : fffff8a0`023b6f20 fffff8a0`023b6f20
00000000`00000018 00000000`00000000 : nt!IopRemoveDevice+0x101
fffff880`059ab8a0 fffff800`020c51d4 : fffffa80`047384a0 00000000`00000000
00000000`00000002 00000000`00000000 : nt!PnpRemoveLockedDeviceNode+0x1a3
fffff880`059ab8f0 fffff800`020c52e0 : 00000000`00000000 fffffa80`04743000
fffff8a0`074b8900 fffff800`01edca80 : nt!PnpDeleteLockedDeviceNode+0x44
fffff880`059ab920 fffff800`020c53d9 : fffffa80`04601602 fffffa80`04601600
00000000`00000001 00000000`00000000 : nt!PnpDeleteLockedDeviceNodes+0xa0
fffff880`059ab990 fffff800`020c5551 : fffffa80`04601600 00000000`00000000
fffffa80`04601600 00000000`00010286 : nt!PnpDelayedRemoveWorker+0x79
fffff880`059ab9e0 fffff800`01ddb289 : 00000000`00000000 fffffa80`035e0a50
00000000`00000001 00000000`00000000 : nt!PnpChainDereferenceComplete+0x131
fffff880`059aba20 fffff800`02156570 : 00000000`00000000 fffffa80`04738400
fffff8a0`0398cef0 00000000`00000001 : nt!PnpIsChainDereferenced+0xc9
fffff880`059abaa0 fffff800`0215680c : fffff880`059abc78 00000000`00010200
fffff880`059abc00 00000000`00000000 : nt!PnpProcessQueryRemoveAndEject+0xff0
fffff880`059abbe0 fffff800`0203f9ae : 00000000`00000000 fffffa80`0466eb00
fffff8a0`0398cef0 fffff880`059abc00 : nt!PnpProcessTargetDeviceEvent+0x4c
fffff880`059abc10 fffff800`01ce2251 : fffff800`01f45b98 fffff8a0`0398cef0
fffff800`01e7e2d8 fffff800`01e7e2d8 : nt! ?? ::NNGAKEGL::`string'+0x552ab
fffff880`059abc70 fffff800`01f76ede : 00000000`00000000 fffffa80`044153c0
00000000`00000080 fffffa80`01876890 : nt!ExpWorkerThread+0x111
fffff880`059abd00 fffff800`01cc9906 : fffff880`02015180 fffffa80`044153c0
fffffa80`028dcb50 00000000`00000202 : nt!PspSystemThreadStartup+0x5a
fffff880`059abd40 00000000`00000000 : fffff880`059ac000 fffff880`059a6000
fffff880`059aa8a0 00000000`00000000 : nt!KxStartSystemThread+0x16
>>


---
NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer



Comments

  • Thierry_Laurent Member Posts: 7
    Scott, Doron: The current code in our drivers performs the IoReleaseRemoveLockAndWait() *before* passing the remove IRP down, and the BSOD I get is with this configuration. In addition, there's no timer DPC routine or self-generated IRP from our drivers that can go down after the remove IRP is received.

    Of course I can try to move the IoReleaseRemoveLockAndWait *after*, as Doron advises, but then I'll need to create an extra mechanism to guard against new I/Os passing in (the gate), which would resemble a lot of what the 'master' remove lock is supposed to do.

    I can create a second remove lock just for the I/O gate purpose (I call it the 'gate lock'), acquire it wherever I acquire the 'master' remove lock, and use IoReleaseRemoveLockAndWait('gate lock') to prevent further I/O access *before* passing the remove IRP down. Then do an IoReleaseRemoveLockAndWait('master lock') *after* the remove IRP comes back.
    But wouldn't doing all that be equivalent to what we do today in the driver (i.e., having the IoReleaseRemoveLockAndWait() call *before* the remove IRP is passed down)?

    Thierry
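The two-lock scheme described above would look roughly like this in the remove path. A sketch only: GateLock, MasterLock, devExt, and NextLower are illustrative names, and both locks are assumed to be acquired on every I/O path.

```c
/* Hypothetical 'gate lock' + 'master lock' ordering in the remove handler.
   Only the placement of the two waits differs from the single-lock case. */
IoReleaseRemoveLockAndWait(&devExt->GateLock, Irp);   /* no new I/O past here */
IoSkipCurrentIrpStackLocation(Irp);
IoCallDriver(devExt->NextLower, Irp);                 /* forward IRP_MN_REMOVE */
IoReleaseRemoveLockAndWait(&devExt->MasterLock, Irp); /* drain in-flight I/O */
```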
  • Alex_Grig Member Posts: 3,238
    >In addition there's no timer DPC routine or self-generated IRPs from our drivers that can go down after the remove IRP is received.

    Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is definitely when the timer fired.
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    Right now, my best guess is that this is an issue between the Disk class driver and KMDF's hard requirement of no I/O after remove. If I'm right, there isn't anything the OP can do about it in their driver. I need to do some more code reading tomorrow, though, to make sure that I'm not missing something.

    -scott
    OSR

  • OSR_Community_User Member Posts: 110,217
    >>In addition there's no timer DPC routine or self-generated IRPs from our
    >> drivers that can go down after the remove IRP is received.
    >
    > Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is
    > definitely when the timer fired.
    >
    >
    OMG! Some bugs never change! We used to get a crash on TSS/360, ca.
    1968, at precisely 2pm, several days a week. Turns out that when a
    process logged out (remember, in those days when you logged on you got a
    process and that was your one-and-only process. From time to time, you
    might cause it to execute code from some file instead of its command loop,
    but that was it: one process) with a timer set, the timer was not deleted,
    and when the timer fired, the system crashed. Our courses started at 20
    minutes after the hour, so one of our professors set up a notification
    alarm: at 2:00, it typed out (literally; we were more than a decade from
    using CRTs as devices!) "Time to get ready for class". So if he never
    logged on, the system was healthy; if he logged on and was still logged on
    at 2:00, the system was healthy, but if he logged on, then logged out
    before 2pm, it was a "System ABEND", which is what we call the BSOD. In
    this case, it typed out "System ABEND" on the console, saved a snapshot of
    the memory to a file on disk, and waited for the operator to reboot it.
    Auto-reboot? Who would want that when you have a human being watching the
    console typeout for every minute the computer is servicing users?

    Oh, yes, and it had this huge, obnoxious bell, the kind you found in
    building fire alarms, inside one of the cabinets. When it crashed, it
    rang the bell. Even a zombie (or a third-shift operator, they were hard
    to tell apart) could not sleep through that bell.

    Here we are, 45 years later, with the same problem!
    joe

  • OSR_Community_User Member Posts: 110,217
    > Scott, Doron: The current code in our drivers currently performs the
    > IoReleaseRemoveLockAndWait() *before* passing the remove IRP down and the
    > BSOD I get is with this configuration. In addition there's no timer DPC
    > routine or self-generated IRPs from our drivers that can go down after the
    > remove IRP is received.
    >
    > Of course I can try to move the IoReleaseRemoveLockAndWait *after* as
    > Doron advises but then I'll need to create this extra mechanism to guard
    > from new I/Os passing in (the gate), which would resemble a lot like what
    > the 'master' remove lock is supposed to do.

    No. The master remove lock blocks the processing of the remove IRP until
    the driver has quiesced. It is in no way involved, ever, in blocking
    incoming IRPs. In WDM drivers, one of the things you had to do was flush
    any IRPs you were queueing because of the IRP_MN_STOP IRP (there could be
    another IRP_MN_START, in which case you dequeued those IRPs and did what
    they asked). When the IRP_MN_STOP comes in, you have to "close the gate"
    against new IRPs, so if you haven't done this, you are already in trouble.
    After the IRP_MN_STOP, any device registers, I/O ports, or interrupts you
    might have set are toast. You will redo these values when the
    IRP_MN_START comes in.

    I don't know how WDF handles this, but it has GOT to be better than the
    horrible mess that WDM was.

    When you flushed the queue during a remove, you completed the IRPs with an
    error code, such as that the device was no longer available. Any new IRPs
    were not even queued, but immediately completed with the same error.

    However, there may have been any number of IRPs "in flight" when the
    IRP_MN_STOP comes in, and they have to complete before you can allow
    processing of the IRP_MN_REMOVE. It is your responsibility to decide what
    to do with an IRP that is "in flight". For example, if it is queued for
    the device to process, you might just move it to the pending queue (you
    have to do this for the IRP_MN_STOP as well). If it has already activated
    the device, you have to decide how you are going to stop the device, such
    as aborting a DMA request, or let the hardware finish what it is doing, as
    appropriate for the problem space of the device.

    My understanding is that the default behavior of WDF is supposed to
    support this, and when and how to do these things is controlled by that
    "state machine" of song and legend, the one that is *actually* correct,
    instead of the mess of nearly-right-and-maybe-fatally-wrong implementation
    models we were given in WDM.

    Basically, if you don't already close that gate for new IRPs in your
    IRP_MN_STOP handler, your driver is wrong. Once you let the stop proceed,
    there is no hardware to talk to. It might come back (after a bus
    reconfiguration) or it might be followed by an IRP_MN_REMOVE.
    joe

    >
    > I can create a second remove lock just for the I/O gate purpose with the
    > remove IRP (I call it the 'gate lock'), acquire it in addition wherever I
    > acquire the 'master' remove lock and use IoReleaseRemoveLockAndWait('gate
    > lock') to prevent further I/O access *before* passing the IRP remove down.
    > Then do a IoReleaseRemoveLockAndWait('master lock') *after* the remove
    > IRP comes back.
    > But wouldn't doing all that be equivalent to what we do today in the
    > driver (i.e. having the IoReleaseRemoveLockAndWait() call *before* the
    > remove IRP is passed down) ?
    >
    > Thierry
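The "close the gate" rule described above can be modeled in a few lines of ordinary C. This is a user-mode model with made-up names, not a driver API: new requests are rejected the moment the gate closes, and remove may proceed only once the in-flight count drains to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* User-mode model of the stop/remove gate; all names are illustrative. */
typedef struct {
    bool gate_closed;  /* set when IRP_MN_STOP/REMOVE processing begins */
    int  in_flight;    /* requests admitted but not yet completed */
} device_state;

/* Returns true if the request was admitted; false = complete with error. */
bool submit_io(device_state *dev)
{
    if (dev->gate_closed)
        return false;      /* new I/O is failed immediately, never queued */
    dev->in_flight++;
    return true;
}

void complete_io(device_state *dev) { dev->in_flight--; }

void close_gate(device_state *dev) { dev->gate_closed = true; }

/* Remove may proceed only when the gate is closed and no I/O remains. */
bool can_remove(const device_state *dev)
{
    return dev->gate_closed && dev->in_flight == 0;
}
```

Everything a real driver does with cancel routines and events reduces to keeping these two pieces of state consistent.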
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    As far as I can tell, this is a legitimate bug that is entirely between
    ClassPnP/Disk and the framework. All it would take is a lower disk filter
    written in KMDF and some really bad timing.

    Using the 8.1 WDK ClassPnP source as a guide, here's the general flow of
    processing an I/O request in the class driver:

    1. ClassReadWrite is called, acquires the remove lock

    2. ServiceTransferRequest calls SubmitTransferPacket, which sends the IRP
    down to the lower filter with completion routine TransferPktComplete set

    3. TransferPktComplete checks to see if the request was successful. If it is
    successful, the remove lock is dropped and the function returns. If it is not
    successful, the remove lock is *not* dropped and a timer DPC is queued to
    retry the operation (by calling SubmitTransferPacket, which brings us back
    to Step 2).
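Step 3 might look something like the following sketch. This is not the actual ClassPnP source; the names (devExt, RetryTimer, RetryDpc) and the retry delay are assumed for illustration.

```c
/* Sketch: on failure the remove lock is deliberately NOT released, so the
   pending retry keeps counting against the remove lock. Illustrative only. */
NTSTATUS TransferPktCompleteSketch(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PDEVICE_EXTENSION devExt = (PDEVICE_EXTENSION)Context;

    if (NT_SUCCESS(Irp->IoStatus.Status)) {
        IoReleaseRemoveLock(&devExt->RemoveLock, Irp);  /* done with this I/O */
        return STATUS_SUCCESS;
    }

    /* Failure: keep the lock held and queue a timer DPC that re-submits the
       packet, bringing us back to step 2. */
    LARGE_INTEGER due;
    due.QuadPart = -100 * 10000LL;                      /* 100 ms, relative */
    KeSetTimer(&devExt->RetryTimer, due, &devExt->RetryDpc);
    return STATUS_MORE_PROCESSING_REQUIRED;
}
```

The remove-lock wait cannot finish while a retry is pending, which is exactly why an uncanceled retry timer firing after IRP_MN_REMOVE is so dangerous.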

    On the PnP side, during remove processing the class driver sends the
    IRP_MN_REMOVE IRP down to the lower driver and then waits on the remove
    lock. As far as I can tell, no attempt is made to make sure that the
    background retry timers are canceled/flushed prior to sending the
    IRP_MN_REMOVE down to the lower driver.

    It's not clear to me that resolving this is simply a matter of making the
    timers go away. It appears as though IRPs being dispatched might also have a
    chance to squeeze in between the IRP_MN_REMOVE and the remove lock being
    released (i.e. I'm not seeing an extra properly serialized state variable
    used to account for this window).

    Bottom line to the OP: I'm not an expert in this driver, so it's possible
    that I'm missing a subtlety in how either of these are handled. However, I
    think you've stumbled upon an O/S bug that's going to require involving
    Microsoft Support to resolve.

    -scott
    OSR


  • Tim_Roberts Member - All Emails Posts: 13,019
    xxxxx@flounder.com wrote:
    >>
    >> Do you cancel your timers when you get REMOVE_DEVICE? Because the crash is
    >> definitely when the timer fired.
    > OMG! Some bugs never change! We used to get a crash on TSS/360, ca.
    > 1968, at precisely 2pm, several days a week.

    You are right about that. In 2004, the communications system in the
    Palmdale Air Traffic Control Center in southern California suddenly
    failed. Controllers couldn't reach aircraft or each other.

    The story is that, because of a maintenance screwup, the computer
    controlling the comm system had been running for about 50 days without
    rebooting. Those of you with a memory for numbers will recall that, if
    you are counting milliseconds, 2^32 of them is about 49.7 days, so day
    50 is when that count needs a 33rd bit. That's the point when GetTickCount rolls over.

    There is now an established procedure to reboot the comm computer once a
    week. I would have thought it was better to fix the app bug, but I must
    not think like a government contractor any more.

    --
    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.

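The arithmetic behind that rollover, as a plain C sketch: a 32-bit millisecond counter wraps after 2^32 ms (about 49.7 days), and an elapsed-time computation survives the wrap only if it is done in unsigned 32-bit arithmetic rather than widened into a larger signed type.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe elapsed time for a 32-bit millisecond tick counter (the
   GetTickCount style of counter). Unsigned subtraction is well-defined
   modulo 2^32, so the result stays correct across the rollover. */
uint32_t elapsed_ms(uint32_t start, uint32_t now)
{
    return now - start;
}
```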

  • Thierry_Laurent Member Posts: 7
    Meanwhile I have re-checked that all operations from my driver going downstream are performed with the remove lock held. There were a few places where the remove lock was released right after coming back from IoCallDriver, which could cause an issue if STATUS_PENDING was returned, so I put the remove lock release in a completion routine added to the IRP. The same BSOD occurs.

    I am now trying a new build where I moved the IoReleaseRemoveLockAndWait *after* passing the remove IRP down. For debugging I am also tracing whether any read/write comes in while the PnP state is not 'started', at least to verify whether some I/Os do sneak in between passing the remove down and the IoReleaseRemoveLockAndWait.
    I'll post the results here.
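The completion-routine pattern mentioned above, sketched (devExt, RemoveLock, and NextLower are illustrative names):

```c
/* Release the remove lock only when the IRP actually completes, so a
   STATUS_PENDING return from IoCallDriver still holds the lock. Sketch. */
static NTSTATUS ReleaseLockOnComplete(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    PIO_REMOVE_LOCK lock = (PIO_REMOVE_LOCK)Context;
    if (Irp->PendingReturned)
        IoMarkIrpPending(Irp);
    IoReleaseRemoveLock(lock, Irp);
    return STATUS_CONTINUE_COMPLETION;
}

NTSTATUS ForwardUnderRemoveLock(PDEVICE_EXTENSION devExt, PIRP Irp)
{
    NTSTATUS status = IoAcquireRemoveLock(&devExt->RemoveLock, Irp);
    if (!NT_SUCCESS(status)) {
        Irp->IoStatus.Status = status;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return status;
    }
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, ReleaseLockOnComplete,
                           &devExt->RemoveLock, TRUE, TRUE, TRUE);
    return IoCallDriver(devExt->NextLower, Irp);
}
```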
  • Scott_Noone_(OSR) Administrator Posts: 3,142
    Is StorFlt your driver as well? Try removing it as a lower disk filter and
    see if the problem goes away (I suspect it will).

    -scott
    OSR

    wrote in message news:xxxxx@ntdev...

    Meanwhile I have re-checked that all operations from my driver going
    downstream were under the remove lock held. There were a few places where
    the remove lock was released right after coming back from IoCallDriver and
    that could cause an issue if it was STATUS_PENDING so I put the release
    remove lock in a completion routine added to the IRP. Same BSOD occurs.

    I am now trying a new build where I did move the removelockandwait *after*
    passing the remove IRP down. For debugging I am also tracing if any
    read/write comes in while the PnP state is not in the 'started' state, at
    least to verify if some I/Os do sneak between passing the remove down and
    the removelockandwait.
    I'll post the results here.


  • Scott_Noone_(OSR) Administrator Posts: 3,142
    **Update and Resolution**

    I took a look at one of these dumps and worked with Thierry to figure out
    what was causing this crash. For those not following along at home, the
    configuration was:

    1. Standard Disk class driver
    2. In-box KMDF lower disk filter
    3. Bunch of third party drivers playing with storage, including filters

    The crash happened in the KMDF filter. The call stack was from Disk sending
    down a retried I/O operation from a timer callback *after* it had processed
    IRP_MN_REMOVE. KMDF crashes if I/O is sent to a framework device after
    IRP_MN_REMOVE has been received, which hardens the rule of no I/O while in
    this state and explains the bugcheck.

    Of course, the question then becomes: whose responsibility is it to make
    sure that Disk cleans up this dangling I/O before the IRP_MN_REMOVE is sent
    down the stack? My first inclination was that this should be Disk's problem
    (and I could still argue that), but something had to be unique about
    Thierry's environment or this would be more common. As it turned out, this
    offending I/O was generated by one of the third party drivers, which was of
    course suspicious.

    The driver appeared to do things correctly. It was careful to maintain an
    open File Object to the disk device while performing I/O, closed that File
    Object as a result of surprise remove notification, and did not send any
    further I/O to the device after processing the surprise remove operation.
    However, it did not *reference* the File Object for each I/O submitted (I
    bet we all remember that step in our driver generated I/O, right?). Thus,
    the sequence of events was:

    1. DriverA creates File Object to Disk

    2. DriverA submits I/O to Disk

    3. User yanks device out, Disk gets IRP_MN_SURPRISE_REMOVE

    4. Disk gets an I/O error from the pending I/O, queues a retry to a timer
    DPC

    5. DriverA receives the surprise remove notification, closes its handle to
    the File Object

    6. No more references to Disk, Disk gets IRP_MN_REMOVE_DEVICE and forwards
    it to KMDF

    7. Disk timer DPC fires, sends I/O down to the jacked up KMDF filter
    instance

    8. Boom...

    To fix this without changing Disk, DriverA now takes a reference to the File
    Object at step 2 and drops it in a registered completion routine. This
    delays the IRP_MN_REMOVE_DEVICE IRP from being sent to Disk until the last
    outstanding I/O operation completes and resolves the issue. Alternatively,
    they could have created a mechanism for flushing out any pending I/O
    operations at step 5.
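The fix can be sketched as follows. The names are illustrative, and the sketch assumes the driver-built IRP has a spare stack location available for the completion routine:

```c
/* Pin the target device for the lifetime of each driver-generated I/O by
   referencing the File Object until that I/O completes. Sketch only. */
static NTSTATUS DropFileObjectRef(PDEVICE_OBJECT DeviceObject, PIRP Irp, PVOID Context)
{
    ObDereferenceObject((PFILE_OBJECT)Context);  /* reference taken at submit */
    return STATUS_CONTINUE_COMPLETION;
}

NTSTATUS SubmitIoToDisk(PDEVICE_OBJECT DiskDevice, PFILE_OBJECT FileObject, PIRP Irp)
{
    ObReferenceObject(FileObject);               /* keeps IRP_MN_REMOVE_DEVICE away */
    IoSetCompletionRoutine(Irp, DropFileObjectRef, FileObject, TRUE, TRUE, TRUE);
    return IoCallDriver(DiskDevice, Irp);
}
```

With the per-I/O reference in place, PnP cannot send IRP_MN_REMOVE_DEVICE to Disk until the completion routine drops the last reference.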

    I wanted to make sure I got this out there as it's an issue that more than a
    few people have a chance of bumping into, especially with more and more KMDF
    filters showing up and enforcing this rule. The documentation for
    GUID_TARGET_DEVICE_REMOVE_COMPLETE should probably provide more guidance in
    this area, as of right now it simply says to close the handle and remove any
    reference to the file object (which is exactly what the OP was doing).

    -scott
    OSR

