KERNEL_APC_PENDING_DURING_EXIT (20) Bugcheck

Hello,
I am getting a KERNEL_APC_PENDING_DURING_EXIT (20) bugcheck when a ~30-minute network backup job is almost complete. During this period the driver processes thousands of file write events. When my mini-filter driver is disabled the backup succeeds, so it seems the filter driver is to blame.

The arguments are:
Arguments:
Arg1: 00000000, The address of the APC found pending during exit.
Arg2: 0000fffe, The thread’s APC disable count
Arg3: 00000000, The current IRQL
Arg4: 00000001
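
As a reading aid for Arg2: the bugcheck documentation describes a negative APC disable count as APCs having been disabled without being re-enabled. The sketch below is a plain user-mode helper, not driver code, and assumes the count is a signed 16-bit value held in the low word of Arg2. The function name is made up for illustration.

```c
#include <stdint.h>

/* Interpret the low 16 bits of bugcheck 0x20 Arg2 as a signed count.
   Assumption: the per-thread APC disable count is a signed 16-bit
   value, so a raw 0xfffe reads as -2. */
static int apc_disable_count_from_arg2(uint32_t arg2)
{
    uint16_t low = (uint16_t)(arg2 & 0xFFFFu);

    /* Sign-extend by hand to avoid implementation-defined narrowing. */
    return (low >= 0x8000u) ? (int)low - 0x10000 : (int)low;
}
```

Under that assumption, calling it with 0x0000fffe gives -2, which would be consistent with two disable operations that were never undone on this thread.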

Having searched the list and Google, I am still unsure what is causing it. There are no direct calls in the driver to KeEnterCriticalRegion, KeWaitForSingleObject, KeWaitForMultipleObjects, KeWaitForMutexObject, or FsRtlEnterFileSystem.

What other APIs can disable APCs? My driver mainly uses Flt… and Rtl… calls.
Any advice on how to proceed? Thanks

> What other APIs can disable APCs? My driver mainly uses Flt… and Rtl… calls.

Hello.

I have no experience with mini-filters, but as the documentation states, FltAcquireResourceExclusive internally disables normal kernel APCs too.
So one possible guess might be an orphaned ERESOURCE as described in the following article:

http://blogs.msdn.com/b/ntdebugging/archive/2008/04/23/the-debug-ninja-speaks-debugging-a-stop-0x20.aspx

Hopefully somebody more experienced will correct me if I am wrong, or suggest other possible causes of the problem.

Regards

Lenka

As Lenka says, FltAcquireResourceExclusive/Shared does the
FsRtlEnterFileSystem thing (which is a mixed blessing).

So your first reaction should be to do a !locks in WinDbg and see what
resources still think they are owned by this thread. That may help you find
the mismatched FltReleaseResource calls (if they are the cause).

Are you running Verifier? It can help in this situation by turning this into
an APC_INDEX_MISMATCH (or at least the Verifier equivalent).

If all else fails, it is not difficult (but not for release drivers) to build
(OS-release-specific) code to look at the APC disable count. Then you can
compare the values at times when you believe that your lock state is
unchanged (although expect a few false positives to start with as you
discover some of the intricacies of the re-entrancy that you had forgotten
about).

Rod

Thanks for your replies.

I don't call FltAcquireResourceExclusive or FltAcquireResourceShared; the only FltRelease… calls made are FltReleaseFileNameInformation and FltReleaseContext.

I will enable verifier for the driver and see if this can provide any more information.

A KeEnterCriticalRegion (or FsRtlEnterFileSystem) without a matching KeLeaveCriticalRegion (or FsRtlLeaveFileSystem) will cause this problem.

The problem with this is that when it happens on the process exit you can’t usually tell when this happened. If you use a checked kernel the OS will do this check on every thread return to user mode, rather than on thread exit, but even that isn’t very helpful (but at least it’s closer to the right TIME when the problem first started.)

Another possibility is that the bug causing this problem is not in YOUR driver but in a driver that you call. The sequence I’ve seen trigger this in the past is:

__try {
    KeEnterCriticalRegion();
    /* processing that throws a fault when YOUR driver calls it because it doesn’t like something */
    KeLeaveCriticalRegion();
} __except (EXCEPTION_EXECUTE_HANDLER) {
    /* cleanup, but a missing KeLeaveCriticalRegion */
}

This will be blamed on your filter because the exception only happens due to a latent bug in the other driver that you trigger with a call that you make.

At OSR we use an ugly technique for tracking these down: we have an internal library routine that extracts the apc disable count (from the thread). We then store the current value; execute code; and then verify that the current value NOW is the same as it was before the code execution. When you find a mismatch, you know something happened within the code block.
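
The snapshot-and-compare idea can be modeled in user mode. This is only a sketch of the bookkeeping: the counter, the enter/leave stubs, and the buggy callee are all stand-ins invented for illustration, and a real driver would need OS-release-specific code to read the count from the current thread.

```c
#include <stdio.h>

/* Stand-in for the per-thread APC disable count (hypothetical;
   a real driver reads this from the thread, per OS release). */
static int g_mock_apc_disable = 0;

static void mock_enter_critical_region(void) { g_mock_apc_disable--; }
static void mock_leave_critical_region(void) { g_mock_apc_disable++; }

/* Hypothetical callee with the bug described above: the error path
   returns without the matching leave. */
static int buggy_callee(int fail)
{
    mock_enter_critical_region();
    if (fail) {
        return -1;  /* missing mock_leave_critical_region() */
    }
    mock_leave_critical_region();
    return 0;
}

/* Snapshot the count, run fn(arg), and compare afterwards.
   Returns 1 when the count is unchanged, 0 on a mismatch. */
static int call_is_balanced(int (*fn)(int), int arg, const char *label)
{
    int before = g_mock_apc_disable;
    (void)fn(arg);
    int after = g_mock_apc_disable;

    if (before != after) {
        fprintf(stderr, "%s: APC disable count %d -> %d\n",
                label, before, after);
        return 0;
    }
    return 1;
}
```

Wrapping each suspect call this way narrows the search: call_is_balanced(buggy_callee, 1, "error path") reports a mismatch, while the success path does not.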

It’s not a pleasant bug to track down (and it isn’t necessarily in the other driver, either. It could be some bogus parameter that you are passing that is being caught by the structured exception handler.)

Tony
OSR


I managed to reproduce the bug when doing a short network backup, and noticed in the debugger that I had an ASSERT failure of (KeGetCurrentIrql() == PASSIVE_LEVEL) from my IRP_MJ_WRITE PreOperation callback. I need to be at passive level to call FltCreateFile, FltReadFile and FltWriteFile from this method. It appears that during the backup process a few pre-write events are dispatched at APC_LEVEL.

Is it feasible that, because of the assert failure, the driver entered a state where APCs were disabled, hence the later bugcheck when the thread exits?

After changing the assert to an if… check that returns when not at passive level, the backup completes without the bugcheck, and the required actions of the filter driver seem to happen correctly. I assume that a pre-write is subsequently called for the file at passive level.