Lingering file object

OSR_Community_User · June 28, 2012, 2:38pm

It’s been a long time since I’ve turned to this forum, but I’m stumped on a
problem that doesn’t make any sense.

I’m working on a DMA driver for Vista/Win7 for a custom compute engine that
supports multiple channels of DMA. The data flow through the various DMA
channels is all interdependent in such a way that if the driver receives a
cancellation for a DMA request that is queued up for one of the DMA
channels, it needs to idle all of the DMA engines, cancel all of the
outstanding requests for all of the DMA engines, and reinitialize
everything. This, in itself, is not a real problem because the applications
that use this device don’t ever explicitly call CancelIo or CancelIoEx, and thus the only
real reason any queued up or in-progress DMA requests would get cancelled is
if the application crashes or is killed without performing a clean shutdown
of things. Furthermore, I now seem to have all of the cancellation and
cleanup working just fine - or so it appears.

To test that the cancellation is working correctly, I run a script that
continuously fires up an application that uses the hardware, and then after
a random period of time kills the running app while it’s in the middle of
doing intensive DMA transfers on multiple channels simultaneously. The
script keeps doing this in a loop until something goes wrong.

Because of how the hardware works in conjunction with its
corresponding application software, it only makes sense for one application
to ever have a particular device open at any given time. Thus, the driver
implements a flag in the device object’s device extension that gets
atomically set during EvtDeviceFileCreate using InterlockedExchange, and
then cleared in EvtFileClose. If the flag is found to have been previously
set in EvtDeviceFileCreate, it completes the create request with
STATUS_SHARING_VIOLATION.
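
The exclusive-open gate described above can be modeled in portable C11, as a minimal sketch rather than the actual driver code. Here `atomic_exchange` stands in for `InterlockedExchange`, and the names `device_ext`, `on_file_create`, `on_file_close`, and the two status values are hypothetical placeholders for the real KMDF callbacks and NTSTATUS codes.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for NTSTATUS values; not the real definitions. */
enum open_status { OPEN_OK = 0, OPEN_SHARING_VIOLATION = 1 };

/* Models the device extension that holds the "in use" flag. */
struct device_ext {
    atomic_int in_use;   /* 0 = free, 1 = opened by some application */
};

static void device_ext_init(struct device_ext *ext)
{
    atomic_init(&ext->in_use, 0);
}

/* Models the create path: atomically claim the device. */
static enum open_status on_file_create(struct device_ext *ext)
{
    /* atomic_exchange returns the previous value, as InterlockedExchange does. */
    int previous = atomic_exchange(&ext->in_use, 1);
    return previous ? OPEN_SHARING_VIOLATION : OPEN_OK;
}

/* Models the close path: release the claim so the next open can succeed. */
static void on_file_close(struct device_ext *ext)
{
    int previous = atomic_exchange(&ext->in_use, 0);
    assert(previous == 1);   /* a close without a matching open is a bug */
    (void)previous;
}
```

In this model, a second `on_file_create` fails until `on_file_close` runs. If the close path is never reached, the flag stays set and every later open is refused, which is the symptom described in this post.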

So, after running the script all day long without issue, eventually (like
after 500+ iterations), it stops because after killing the app during the
prior iteration, it cannot open the device again when firing up a new app to
run for the next iteration. EvtDeviceFileCreate gets called by the new app,
but sees that the flag is set, and fails the request. The crux of the
problem is that EvtFileClose never got called following the killing of the
app during the prior iteration. The docs state, “The framework calls a
driver’s EvtFileClose callback function when the last handle for a file
object has been closed and released, and all outstanding I/O requests have
been completed or canceled.”

Fortunately, this happened while running a debug version of the driver
with WinDbg attached, so I’ve got the condition captured in a live debug
session. I’ve verified that all outstanding I/O requests from the app that
was last killed were correctly cancelled and no I/O requests are left
outstanding for the device. I’ve poked around looking for all sorts of
things, hoping to find some dangling DMA transaction object or something
else out of sorts, but as far as I can tell, everything was cleaned up
correctly following the killing of the prior app, except that
!wdfopenhandles reveals that there’s still a lingering open file object on
the device, indicating that the file handle used by the app that was last
killed somehow still hasn’t been closed and released properly, and until
that occurs, the EvtFileClose callback won’t get called (in addition to
clearing the flag, this callback also does some other final cleanup tasks
that are required before the device can be used by another app, but that’s
outside of the scope of this discussion - the key issue is that EvtFileClose
isn’t getting called).

Oddly, even though there seems to be a handle still open for the file
object, the app that was killed is gone - i.e. it’s not lingering around
waiting for the last I/O to complete or anything like that. So I can’t
figure out why the file object is still out there. Windows should have taken
care of closing all of the handles to the file object when the app was
killed, so why is there a handle still not closed or released?

Once in this state, the driver is effectively unusable since you can’t open
it with a new app. Furthermore, if you try to uninstall it via Device
Manager, it says you have to reboot. And, at least according to my client (I
haven’t tried it yet myself), when you try to reboot the system in this
condition it BSODs during shutdown. Not good all around.

Note that this is currently being tested on 64-bit Win7.

Does anybody have any suggestions of what I might be missing or what else I
could look for that might be the cause of this problem? I’ve still got the
live debug session going where I’ve captured this state of the device, so
hopefully somebody will suggest a debugger command that I haven’t tried yet
that might provide some insight to the source of the problem.

Looking for a “duh!” moment. I just haven’t had one yet.

Thanks,

Jay

Jay Talbott
Principal Consulting Engineer
SysPro Consulting, LLC
http://www.sysproconsulting.com

Peter_Viscarola_OSR · June 28, 2012, 2:48pm

May I ask a question about basic assumptions, before I even READ the rest of your post?

Why handle cancel of in-progress requests at all? Will they not complete with certainty within a “relatively short” time (like, you know, a few seconds)?

In my experience, the best way to implement cancel on most DMA devices is to not implement it at all, at least not for in-progress requests.

And before people go all “it says in the book” on me, I’m not advocating leaving out cancel for requests that can pend for an arbitrary or long period of time. I *do*, on the other hand, strongly advocate implementing cancel only when absolutely NECESSARY and never elsewhere.

Peter
OSR

Alex_Grig · June 28, 2012, 2:52pm

IRP_MJ_CLOSE is sent when the last reference to the file object is released. That last reference could be dropped by the last IRP completing, by the last handle closing, or by any other dereference. Did you receive IRP_MJ_CLEANUP on that file object? It’s sent when the last handle is closed. If a process is being killed, this happens after its IRPs are completed.
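
The difference between the two IRPs can be illustrated with a small model in portable C, offered as a simplified sketch and not as the I/O manager's real bookkeeping. It assumes that each open handle and each in-flight request holds one reference on the file object; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of a file object's lifetime; all names are hypothetical. */
struct file_object_model {
    int  handle_count;     /* open user-mode handles */
    int  reference_count;  /* handles plus in-flight requests */
    bool cleanup_sent;     /* models IRP_MJ_CLEANUP */
    bool close_sent;       /* models IRP_MJ_CLOSE */
};

/* Dropping the last reference is what triggers the close. */
static void fo_release_reference(struct file_object_model *fo)
{
    assert(fo->reference_count > 0);
    if (--fo->reference_count == 0)
        fo->close_sent = true;
}

/* A successful create yields one handle, which holds one reference. */
static void fo_open(struct file_object_model *fo)
{
    fo->handle_count = 1;
    fo->reference_count = 1;
    fo->cleanup_sent = false;
    fo->close_sent = false;
}

/* Each request in flight takes its own reference on the file object. */
static void fo_start_request(struct file_object_model *fo)
{
    fo->reference_count++;
}

static void fo_complete_request(struct file_object_model *fo)
{
    fo_release_reference(fo);
}

/* Closing the last handle triggers cleanup, then drops the handle's reference. */
static void fo_close_handle(struct file_object_model *fo)
{
    assert(fo->handle_count > 0);
    if (--fo->handle_count == 0)
        fo->cleanup_sent = true;
    fo_release_reference(fo);
}
```

With one request still in flight, closing the handle sets `cleanup_sent` but not `close_sent`. In this model the close arrives only once that request is completed, so a request that is never completed leaves the file object lingering indefinitely.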

OSR_Community_User · June 28, 2012, 3:25pm

The DMA engines have an interdependence on the state of the HW such that
they might not start right away after being programmed on the device, or
might start, but not complete, until all the data generated by the compute
engine is ready to transfer back to the host, and that can be dependent on
data getting sent or received on another DMA channel. Thus, a DMA transfer
that is programmed on the device might or might not complete in a timely
manner.

After computing the s/g list for a particular transfer and programming the
descriptor list for the transfer, the corresponding request is parked in a
manual queue. While on the queue, the corresponding transfer might or might
not have yet been started on the device. The request remains in the queue
until the DMA is completed or the request is cancelled. The
EvtIoCanceledOnQueue callback for the queue doesn’t cancel the request, but
starts a process of idling/resetting all of the DMA engines on the device
(required due to their interdependence). Once that process completes, all of
the queued up requests for all of the DMA engines are now invalid, so they
are then all completed as cancelled.

Note that while idling/resetting the DMA engines, if any in-progress DMA
transfers actually complete normally, the DPC still processes them as a
normal request. But once everything has been brought to a halt, it cancels
all other outstanding queued up requests.
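
The cancel-everything flow described above might be modeled roughly as follows, in portable C. This is only a sketch under stated assumptions: the channel count, queue depth, and all type and function names are hypothetical, the hardware idle/reset is reduced to clearing a flag, and the normal-completion path through the DPC is not modeled.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_CHANNELS = 4, MAX_QUEUED = 8 };   /* hypothetical sizes */

enum req_state { REQ_QUEUED, REQ_DONE_OK, REQ_DONE_CANCELLED };

struct request { enum req_state state; };

struct channel {
    int             running;             /* 1 while the engine may be active */
    struct request *queued[MAX_QUEUED];  /* models the per-channel manual queue */
    size_t          count;
};

struct device { struct channel ch[NUM_CHANNELS]; };

/* Park a programmed request on a channel's queue; returns 0 if the queue is full. */
static int queue_request(struct device *dev, size_t c, struct request *r)
{
    assert(c < NUM_CHANNELS);
    struct channel *ch = &dev->ch[c];
    if (ch->count == MAX_QUEUED)
        return 0;
    r->state = REQ_QUEUED;
    ch->queued[ch->count++] = r;
    ch->running = 1;
    return 1;
}

/* Called when any one queued request is cancelled. */
static size_t on_request_cancelled(struct device *dev)
{
    size_t cancelled = 0;

    /* Step 1: idle/reset every engine, because the channels are interdependent. */
    for (size_t c = 0; c < NUM_CHANNELS; c++)
        dev->ch[c].running = 0;

    /* Step 2: everything still queued is now invalid; complete it as cancelled. */
    for (size_t c = 0; c < NUM_CHANNELS; c++) {
        struct channel *ch = &dev->ch[c];
        for (size_t i = 0; i < ch->count; i++) {
            ch->queued[i]->state = REQ_DONE_CANCELLED;
            cancelled++;
        }
        ch->count = 0;
    }
    return cancelled;
}
```

The ordering is the point of the sketch: no request is completed as cancelled until every engine has been idled, so a completed request can never have a transfer still in flight against its buffers.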

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · June 28, 2012, 3:33pm

I have an EvtFileCleanup callback, but it doesn’t really leave behind any
easy-to-detect breadcrumbs to indicate whether it actually got called, so at
this point I don’t know for certain whether or not it got called for the
file object that is still lingering. If a request cancellation didn’t
already trigger the process of idling/resetting the DMA engines, the cleanup
routine initiates it so as to put everything into a known state when an app
normally shuts down. But since the cancel already triggered it, and I can
tell that it all occurred successfully, I have no real indication if
EvtFileCleanup was called.

Peter_Viscarola_OSR · June 28, 2012, 4:03pm

THIS would be a very interesting crash for you to see. WHERE is it crashing and why?

You’ve basically described the following case:

There’s a file object with no I/O requests pending in your driver.
The user process that opened this file object is gone.
There is either still an open handle or open reference on this file object.

Hmmmmm…

Do you reference the file object anywhere?

What do you do in EvtFileCleanup?

You ARE aware that you can still get requests on a file object, even AFTER your EvtFileCleanup routine has been called, right? There are some really squirrelly race conditions there.

I agree with Mr. Grig (who usually has excellent insights on these things) that it would be *very* interesting to know if the file object is cleaned-up or not. Have your cleanup callback set a flag in the file object when it’s called.

Peter
OSR

OSR_Community_User · June 28, 2012, 5:08pm

Actually, I just got their crash dump, which I’m going to start analyzing
soon.

However, in the meantime, I did finally find an outstanding request in the
driver that I’m live debugging, so I’m trying to figure out how it got left
behind.

To answer your questions…

No, the file object is never referenced in the driver.

The cleanup (and cancellation, for that matter) sets a flag indicating that
the device is no longer in a “ready” state before idling/resetting the
hardware. Since the driver specifies device-level synchronization scope, the
setting of this flag is serialized with anything that pays attention to it.
If the device is not in a “ready” state, any new requests that come in for
the device are completed with an appropriate status code. Thus, by the time
we get the hardware idled and reset and get to actually cleaning out the
outstanding requests, anything outstanding should be in the manual queues
that I mentioned earlier. This resolves the race condition you mentioned.
And, yes, all DPCs, timer routines, etc. are included in the
synchronization scope.
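
A rough model of that “ready” gate, in portable C, is sketched below. It is single-threaded and simply assumes the serialization that the device-level synchronization scope provides in the real driver. The names `gated_device`, `dispatch_request`, `begin_teardown`, and `finish_reinit` are hypothetical, and the point at which the flag is set again is an assumption, since the post doesn’t say when that happens.

```c
#include <assert.h>
#include <stdbool.h>

enum dispatch_result { DISPATCH_QUEUED, DISPATCH_REJECTED };

/* Hypothetical model of the device state guarded by the sync scope. */
struct gated_device {
    bool ready;    /* false while the hardware is being idled/reset */
    int  queued;   /* requests parked in the manual queues */
};

static void gated_init(struct gated_device *dev)
{
    dev->ready = true;
    dev->queued = 0;
}

/* New requests are accepted only while the device is ready. */
static enum dispatch_result dispatch_request(struct gated_device *dev)
{
    if (!dev->ready)
        return DISPATCH_REJECTED;   /* completed at once with an error status */
    dev->queued++;
    return DISPATCH_QUEUED;
}

/* Cancel or cleanup path: close the gate first, then flush what is parked. */
static int begin_teardown(struct gated_device *dev)
{
    dev->ready = false;          /* nothing new can slip in after this point */
    int flushed = dev->queued;   /* hardware idle/reset is not modeled here */
    dev->queued = 0;
    return flushed;
}

/* Assumed re-arm step, run once reinitialization has finished. */
static void finish_reinit(struct gated_device *dev)
{
    assert(dev->queued == 0);
    dev->ready = true;
}
```

Because the flag is cleared before the queues are flushed, any request that arrives during teardown is rejected instead of being parked behind the flush.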

And, yes, at this point it would be advantageous to add something to
indicate if the cleanup routine got called. I just wanted to keep working
with my existing live debug session before making any changes to the code.

And now that I found the outstanding request still lingering in my live
debugging session, I want to keep digging to see why it didn’t get cancelled
along with the rest of them. I’m pretty sure I know what to do to fix this,
but it doesn’t explain how it got that way, which I’d like to understand
before just blindly “fixing” it.

Actually, now that I’m thinking about it, I have a pretty good idea of where
to look in the code for what’s going on…

Jay

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-506868-xxxxx@lists.osr.com] On Behalf Of xxxxx@osr.com
Sent: Thursday, June 28, 2012 1:02 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Lingering file object

> From what little you've said above, I don't like your client's hardware
> design. Maybe it's great, but from that brief description it has "fix it up
> in the driver" written all over it.
>
> But, be that as it may:

Peter_Viscarola_OSR · June 29, 2012, 9:13am

Yes, it sounds like this WOULD prevent that race. Good job.

Find out why you’ve got that request hanging around… and report back when done!

Peter
OSR

OSR_Community_User · June 29, 2012, 6:50pm

OK, I think I’ve got it fixed now. It ultimately had to do with parts of the
code that I inherited, which in certain situations didn’t complete a request
when it should have, and in other situations completed cancelled requests
when it shouldn’t have. It was from the original author’s failed attempt at
getting the cancellation code right (at least with my implementation it
doesn’t complete cancelled requests DURING in-progress DMA transfers…). I
had thought that I had addressed all of the portions of the code in question,
but I found a couple of spots where I had missed putting in a required
change. I also added a fail-safe that, following a cancellation, does a final
check for any remaining requests that didn’t get completed by the processing
to that point. Theoretically, if I fixed everything else up correctly, this
fail-safe code should never end up with any requests to complete.
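
A fail-safe of that kind could look something like the following portable C model. It is a sketch only: the request table, the state names, and `failsafe_sweep` are hypothetical and not taken from the driver, which would walk its WDF queues instead of an array.

```c
#include <assert.h>
#include <stddef.h>

enum sweep_state { SWEEP_PENDING, SWEEP_COMPLETED, SWEEP_CANCELLED };

/* Hypothetical record of one request the driver still knows about. */
struct tracked_request { enum sweep_state state; };

/*
 * Final pass after cancellation processing: complete anything still pending
 * as cancelled and report how many stragglers were found. A non-zero result
 * means an earlier code path missed a request and deserves a closer look.
 */
static size_t failsafe_sweep(struct tracked_request *reqs, size_t n)
{
    size_t stragglers = 0;
    for (size_t i = 0; i < n; i++) {
        if (reqs[i].state == SWEEP_PENDING) {
            reqs[i].state = SWEEP_CANCELLED;
            stragglers++;
        }
    }
    return stragglers;
}
```

Returning the count, and not merely completing the leftovers silently, keeps the underlying bug visible: a debug build could log or assert whenever the sweep finds anything.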

Of course, when I resumed testing another race condition issue came up that
caused a different BSOD. But I already knew that one was out there and I had
it on my to-do list to address - it just hadn’t reared its ugly head before,
so I just hadn’t gotten to it yet (I needed to synchronize some code
segments in my EvtIoInCallerContext callback with the rest of the driver
code).

Now that I’ve fixed that, the cancellation testing seems to be proceeding
without incident. I ran 350 iterations this morning before stopping it to
make some other (unrelated) code changes, and since then it’s run almost
700, which is the most it’s ever gone through my client’s cancellation test
script without failing (it used to fail about 1 in 50 iterations, so if
nothing else, we are definitely making progress). I’ll leave it running over
the weekend, but at the moment I’m feeling pretty good that I’m on top of it
(famous last words, I know…).
