How to handle serious errors in a context cleanup callback

If I have a context that contains a pointer to an ERESOURCE, and my
context cleanup callback gets called by the fltmgr while the resource is
still held, what should the appropriate result be?

Obviously, if the callback is being called, nobody holds a reference to
the context and the resource is held erroneously. I.e. somewhere in my
code, I acquired it and never released it, then called FltReleaseContext
to decrement the reference count.

I suppose the options are:

  1. Try to acquire the lock and wait. This is probably a lost cause,
    things are already screwed up at this point, and there’s no reason to
    expect it to ever be released, also this would cause a kernel thread I
    don’t own to wait. This sounds like an all-around terrible idea.

  2. Pull the rug out from under the holder by freeing the memory the
    resource occupies. If the holder never tries to release the resource,
    it wouldn’t cause problems for the holder, but I don’t know what happens
    within the kernel. If there’s any record of the ERESOURCE held
    somewhere, and we destroy it, what’s going to happen? If the holder
    does try to release the resource eventually, it’ll probably bugcheck the
    system.

  3. Raise an exception with the expectation that it probably won’t be
    handled and bring down the system. Things could be unrecoverably
    screwed up anyway if I’m in the situation to begin with. Could further
    execution corrupt data? Probably not, but it could cause more erroneous
    behavior. This seems like a reasonable approach for a checked build.

  4. Leak the memory. No immediate harm, but eventually, things will
    grind to a halt when the NonPagedPool gets burned up. This might be
    preferable to 3 on a free build.

  5. ASSERT that the resource isn’t held. Hopefully by the time the code
    goes to production, anything that could cause the situation to arise in
    the first place has been ironed out of the code.

Any thoughts or suggestions as to which of these is the most correct
thing to do? Obviously, if somebody is holding a pointer to memory that
I clean up in the callback, the system will bugcheck if they dereference
it. Options 2 and 3 seem like the most analogous way to deal with
this, with option 3 potentially being the easier one to track down in
testing (it’s guaranteed to fail immediately). Option 4 seems like the
least destructive thing to do to an end user in the short term.

~Eric

Hi Eric,

If I have a context that contains a pointer to an ERESOURCE, and my
context cleanup callback gets called by the fltmgr while the resource is
still >held, what should the appropriate result be?

How about none of them & reviewing your code to fix the bug. :slight_smile:
Maybe fifth one will help you ensure that you don’t proceed if this
situation occurs. But won’t help very much in finding out WHY it is still
held.
But if in case the problem is not identified before the stipulated time, I
think fourth solution is fine.

Regards,
Ayush Gupta

This looks like be a design issue in your code, and you need to re-think
your design so that no context-related locks can be held when the context is
destroyed.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

You have to fix the bug. (I guess that is more or less option 5.)
!locks will get you to the owning thread. Presumably the thread is
acquiring the resource in your code. You need to fix that to solve
your problem.

Your cleanup code must wait for the resource to be free. That at
least will not crash the system outright.

On Jan 16, 2008 11:18 AM, Eric Diven wrote:
> If I have a context that contains a pointer to an ERESOURCE, and my
> context cleanup callback gets called by the fltmgr while the resource is
> still held, what should the appropriate result be?
>
> Obviously, if the callback is being called, nobody holds a reference to
> the context and the resource is held erroneously. I.e. somewhere in my
> code, I acquired it and never released it, then called FltReleaseContext
> to decrement the reference count.
>
> I suppose the options are:
>
> 1) Try to acquire the lock and wait. This is probably a lost cause,
> things are already screwed up at this point, and there’s no reason to
> expect it to ever be released, also this would cause a kernel thread I
> don’t own to wait. This sounds like an all-around terrible idea.
>
> 2) Pull the rug out from under the holder by freeing the memory the
> resource occupies. If the holder never tries to release the resource,
> it wouldn’t cause problems for the holder, but I don’t know what happens
> within the kernel. If there’s any record of the ERESOURCE held
> somewhere, and we destroy it, what’s going to happen? If the holder
> does try to release the resource eventually, it’ll probably bugcheck the
> system.
>
> 3) Raise an exception with the expectation that it probably won’t be
> handled and bring down the system. Things could be unrecoverably
> screwed up anyway if I’m in the situation to begin with. Could further
> execution corrupt data? Probably not, but it could cause more erroneous
> behavior. This seems like a reasonable approach for a checked build.
>
> 4) Leak the memory. No immediate harm, but eventually, things will
> grind to a halt when the NonPagedPool gets burned up. This might be
> preferable to 3 on a free build.
>
> 5) ASSERT that the resource isn’t held. Hopefully by the time the code
> goes to production, anything that could cause the situation to arise in
> the first place has been ironed out of the code.
>
> Any thoughts or suggestions as to which of these is the most correct
> thing to do? Obviously, if somebody is holding a pointer to memory that
> I clean up in the callback, the system will bugcheck if they dereference
> it. Options 2 and 3 seem like the most analogous way to deal with
> this, with option 3 potentially being the easier one to track down in
> testing (it’s guaranteed to fail immediately). Option 4 seems like the
> least destructive thing to do to an end user in the short term.
>
> ~Eric
>
>
> —
> NTFSD is sponsored by OSR
>
> For our schedule debugging and file system seminars
> (including our new fs mini-filter seminar) visit:
> http://www.osr.com/seminars
>
> You are currently subscribed to ntfsd as: unknown lmsubst tag argument: ‘’
> To unsubscribe send a blank email to xxxxx@lists.osr.com
>


Mark Roddy

It is designed so that nobody holds the resource unless they have a
reference to the context. Having the resource held when the callback
gets called is something I expect to never happen.

If, when the callback gets called, somebody does hold the resource, it
means there’s a bug, and that’s the point to trying to handle it in a
way that’s both non-destructive and informative. I’m not planning on it
happening, but if it does, I’d like to at least have put enough thought
into the resolution that I don’t cause irreproducible crashes or data
corruption on a real system if it slips through testing.

Like I said, this is roughly comparable to dereferencing a pointer after
you’ve freed the memory it points to, except I know what happens when I
do that, and that the result is an immediate bugcheck. At least that
makes it somewhat easy enough to pinpoint the cause of the problem. I
don’t know what happens in the case I’m describing, except for the
obvious cases of leaking memory or intentionally bugchecking the system
through an unhandled exception.

~Eric

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Maxim S.
Shatskih
Sent: Wednesday, January 16, 2008 12:37 PM
To: Windows File Systems Devs Interest List
Subject: Re:[ntfsd] How to handle serious errors in a context cleanup
callback

This looks like be a design issue in your code, and you need to
re-think your design so that no context-related locks can be held when
the context is destroyed.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


NTFSD is sponsored by OSR

For our schedule debugging and file system seminars
(including our new fs mini-filter seminar) visit:
http://www.osr.com/seminars

You are currently subscribed to ntfsd as: xxxxx@edsiohio.com
To unsubscribe send a blank email to xxxxx@lists.osr.com

Context destruction can only be called from the MJ_CLOSE path of the last
file referencing this FCB.

Since there no more files, the destruction cannot race with IRPs/FastIos on
another files.

And, this very file is addrefed by IO for the duration of IRPs and FastIos
on it, so, MJ_CLOSE can fire only if the there is no IRPs/FastIos on it, and
there is no chances to initiate more, since IO or the upper driver have derefed
this file to zero.

The race can occur only if some driver - yours or upper - addrefs the file
object, then derefs it, and submit IRPs or FastIos thru already derefed
PFILE_OBJECT
, which is a major bug.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

“Eric Diven” wrote in message news:xxxxx@ntfsd…
It is designed so that nobody holds the resource unless they have a
reference to the context. Having the resource held when the callback
gets called is something I expect to never happen.

If, when the callback gets called, somebody does hold the resource, it
means there’s a bug, and that’s the point to trying to handle it in a
way that’s both non-destructive and informative. I’m not planning on it
happening, but if it does, I’d like to at least have put enough thought
into the resolution that I don’t cause irreproducible crashes or data
corruption on a real system if it slips through testing.

Like I said, this is roughly comparable to dereferencing a pointer after
you’ve freed the memory it points to, except I know what happens when I
do that, and that the result is an immediate bugcheck. At least that
makes it somewhat easy enough to pinpoint the cause of the problem. I
don’t know what happens in the case I’m describing, except for the
obvious cases of leaking memory or intentionally bugchecking the system
through an unhandled exception.

~Eric

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Maxim S.
Shatskih
Sent: Wednesday, January 16, 2008 12:37 PM
To: Windows File Systems Devs Interest List
Subject: Re:[ntfsd] How to handle serious errors in a context cleanup
callback

This looks like be a design issue in your code, and you need to
re-think your design so that no context-related locks can be held when
the context is destroyed.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


NTFSD is sponsored by OSR

For our schedule debugging and file system seminars
(including our new fs mini-filter seminar) visit:
http://www.osr.com/seminars

You are currently subscribed to ntfsd as: xxxxx@edsiohio.com
To unsubscribe send a blank email to xxxxx@lists.osr.com

Okay, I think I’ve been unclear about the purpose of the resource which
is causing some confusion. It’s there to protect a couple of mutable
pieces of data within the context. One thread is going to be updating
the state contained in the context as an operation progresses, and other
threads are going to be checking the state. It’s not quite as simple as
that, but that’s the gist of it.

The scenario where the resource is held on destruction occurs when
either a writer (exclusive acquire) or a reader (shared acquire) has the
resource and didn’t release it. That, of course would be a major bug in
my code, and that’s what I’m trying to prevent from causing mayhem if it
should crop up. That’s why I was saying it should never be held when
the destructor gets called.

I suppose that because the resource was acquired in a pre/postop
callback, and there are no longer any references to the file, I’m
guaranteed that nothing will ever try to release the resource, because
like you said, the cleanup callback only occurs after the reference
count on the FCB hits zero.

The whole situation occurring seems rather unlikely - it’s a “can’t
happen” case in the code that indicates something very wrong if it does,
which would hopefully get fixed in testing. I think in light of that,
the best approach would be to halt and catch fire in the checked version
(so I can catch the bug if it occurs), and probably just let it leak in
the release version (since it really should never happen in a release
version).

~Eric

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Maxim S.
Shatskih
Sent: Wednesday, January 16, 2008 3:27 PM
To: Windows File Systems Devs Interest List
Subject: Re:[ntfsd] How to handle serious errors in a context cleanup
callback

Context destruction can only be called from the MJ_CLOSE path of the
last file referencing this FCB.

Since there no more files, the destruction cannot race with
IRPs/FastIos on another files.

And, this very file is addrefed by IO for the duration of IRPs and
FastIos on it, so, MJ_CLOSE can fire only if the there is no
IRPs/FastIos on it, and there is no chances to initiate more, since IO
or the upper driver have derefed this file to zero.

The race can occur only if some driver - yours or upper - addrefs
the file object, then derefs it, and submit IRPs or FastIos thru
already derefed PFILE_OBJECT
, which is a major bug.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

>happen" case in the code that indicates something very wrong if it does,

which would hopefully get fixed in testing.

Then use asserts for this.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> I’m not planning on it

happening, but if it does, I’d like to at least have put enough thought
into the resolution that I don’t cause irreproducible crashes or data
corruption on a real system if it slips through testing.

Why not add a ULONG and PCHAR to the structure and instead of calling
FltAcquire have a macro which stuffs __LINE and __FILE into these fields?
Then if things go wrong have a smoking gun (Ex locks - you might want a
queue for shared ones, depending on how hot the lock is)