Minifilter Rundown Protection, and hung / suspended processes

I’m working on a security product minifilter project.
As part of the product, we monitor file, registry, process and network callbacks, and employ various out-of-context techniques (system threads, work items, DPCs and APCs).

During filter unload, we use several rundown protection techniques to make sure there are no outstanding callbacks (or APCs or DPCs) still performing work in our driver context.

Recently we ran into the following scenario:

  1. A Windows process A crashes while at least one of its threads holds rundown protection on our filter.
  2. WER suspends process A and attempts to take a memory dump (or whatever else it does).
  3. WER processing takes a long time or hangs.
  4. Our filter receives a stop command.
  5. During the unload procedure, we hang waiting in vain for the rundown references that process A took to drop to zero
    (which they never will, since process A will never reach the release code while it is stuck between suspension and death).
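
The hang in step 5 follows from how rundown protection counts references. The sketch below is a single-threaded, user-mode model of that bookkeeping, not the kernel's EX_RUNDOWN_REF implementation. `RundownModel` and the `model_*` functions are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Hypothetical user-mode model of rundown bookkeeping (illustration only). */
typedef struct {
    long refs;            /* outstanding acquisitions */
    bool rundown_started; /* set once unload begins waiting */
} RundownModel;

/* Models ExAcquireRundownProtection: fails once rundown has started. */
bool model_acquire(RundownModel *m)
{
    if (m->rundown_started)
        return false;
    m->refs++;
    return true;
}

/* Models ExReleaseRundownProtection. */
void model_release(RundownModel *m)
{
    m->refs--;
}

/* Models the start of ExWaitForRundownProtectionRelease: blocks new
   acquisitions and reports whether the wait could return right away. */
bool model_begin_rundown(RundownModel *m)
{
    m->rundown_started = true;
    return m->refs == 0;
}

/* True when a blocked waiter would be released. */
bool model_wait_satisfied(const RundownModel *m)
{
    return m->rundown_started && m->refs == 0;
}
```

In terms of the scenario: the thread in process A performs the acquire in step 1 and is suspended before its release. The unload in step 5 begins rundown, finds one reference outstanding, and its wait is satisfied only if that thread is eventually resumed and releases.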

What is the safe and correct way to handle this, without getting our driver stuck forever?

I considered making the rundown protection granular per process (to identify callbacks whose process no longer exists),
however, it seems that while WER is processing it, process A is not marked as dead per se.

We could try guessing that it won’t come back, but I’m not sure whether unloading would have adverse effects on WER trying to dump its memory/stacks.

Any ideas will be appreciated.

I’ll answer the generic problem because I’m vague on your details. Ignoring the case of system worker threads, a driver should anticipate that the thread it is executing on may be suspended at any moment. In general this doesn’t matter, but it does matter if you acquire some global resource: a low-priority, low-privilege thread A might acquire the resource, be suspended, and then starve privileged thread B, which needs the same resource to do its work.

In general, the only global resource that a driver acquires is some type of lock (e.g. mutex, spinlock). To make life easy, most locking primitives disable thread suspension before acquiring the resource, precisely to avoid your scenario (ERESOURCEs being an exception). If you invent your own locking primitive or global resource, then you’re responsible for disabling thread suspension yourself. You can do this by entering a critical region (KeEnterCriticalRegion) before acquiring your resource and leaving it (KeLeaveCriticalRegion) after you release it.
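
The bracket described above could look roughly like the following. This is a sketch under assumptions, not code from the driver being discussed: `g_Rundown`, `GUARDED_WORK`, `RunGuarded` and `CountCall` are hypothetical names, and the `#else` branch contains user-mode stand-ins (not the WDK definitions) so the control flow can be exercised outside a driver build. `_KERNEL_MODE` is the macro MSVC defines when compiling with /kernel.

```c
#ifdef _KERNEL_MODE
#include <ntddk.h>
#else
/* User-mode stand-ins: assumptions for illustration, NOT the WDK definitions. */
typedef unsigned char BOOLEAN;
typedef struct { long Count; int RundownActive; } EX_RUNDOWN_REF;
int g_CriticalDepth;  /* models how many critical regions the thread has entered */
void KeEnterCriticalRegion(void) { g_CriticalDepth++; }
void KeLeaveCriticalRegion(void) { g_CriticalDepth--; }
void ExInitializeRundownProtection(EX_RUNDOWN_REF *r)
{
    r->Count = 0;
    r->RundownActive = 0;
}
BOOLEAN ExAcquireRundownProtection(EX_RUNDOWN_REF *r)
{
    if (r->RundownActive)
        return 0;          /* unload is already waiting: refuse */
    r->Count++;
    return 1;
}
void ExReleaseRundownProtection(EX_RUNDOWN_REF *r) { r->Count--; }
#endif

/* Hypothetical driver-global; a real driver would call
   ExInitializeRundownProtection(&g_Rundown) once, e.g. in DriverEntry. */
EX_RUNDOWN_REF g_Rundown;

typedef void (*GUARDED_WORK)(void *Context);  /* hypothetical work callback */

/* Runs Work under rundown protection with suspension deferred.
   Returns 1 if Work ran, 0 if unload had already started. */
int RunGuarded(GUARDED_WORK Work, void *Context)
{
    int ran = 0;

    KeEnterCriticalRegion();                      /* defer suspension first */
    if (ExAcquireRundownProtection(&g_Rundown)) { /* then take the reference */
        Work(Context);
        ExReleaseRundownProtection(&g_Rundown);   /* drop it before leaving */
        ran = 1;
    }
    KeLeaveCriticalRegion();                      /* suspension may now occur */
    return ran;
}

/* Hypothetical work item used to exercise the helper. */
void CountCall(void *Context) { ++*(int *)Context; }
```

The ordering is the point: the reference is only ever held inside the critical region, so a suspend request that arrives in between is deferred until after the release. This only helps if `Work` returns promptly; a callback that blocks inside the region still holds its reference.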

I’ll give some more details then.

The legacy code uses a variation of counters and EX_RUNDOWN_REF to protect against driver code being unloaded before callbacks (and DPCs, APCs) have finished.

To the best of my understanding, these counters are unnecessary for normal callbacks (i.e. callbacks registered through FLT_REGISTRATION, as well as Process, Registry and Handle callbacks), since the unregister functions (i.e. FltUnregisterFilter, CmUnRegisterCallback etc.) should handle rundown protection themselves.

I plan to remove them, but I fear FltUnregisterFilter might hang for the same reason (internal counting of outstanding callbacks) in the scenario described.

What I have seen is that when a process crashes and is picked up by WER, it remains suspended for quite a while (sometimes even stuck),
while one of its threads still has one of our callbacks on its stack.

KeFlushQueuedDpcs takes care of pending DPCs. If you’re using APCs you’re off an undocumented cliff, so you’re screwed in any case. Entering a critical region before acquiring your reference might be sufficient. It assumes you can handle a thread being suspended before it enters the critical region, you tearing everything down, and then the thread waking up and trying to get your resource reference. Though honestly I’m still not sure what you’re trying to protect, so I can’t say for certain (sorry for being dense).

Thanks for the response!

The thing is, I’m not sure I need to protect anything. The existing protections were put in years ago, by a whole different team, and I’m not sure they solve the problem they’re supposed to solve. I haven’t touched it so far, because it’s legacy code, and wasn’t an issue until recently.

Here’s what happened in a test suite that made me look at it.

Some user process crashed due to a memory exception on thread 1, while thread 2 was trying to read a file and was inside a callback in our filter. This was caught by WER, which suspended all threads of the user process. This is an automated test suite, and apparently some window was displayed (either report/close or JIT debugging), but there is no one to respond, so the process remains suspended while WER waits in vain.

The other tests finish running, and the test suite tries to unload our filter. However, the unload hangs, because thread 2 (which was trying to read a file) incremented the rundown counter, which it will now never decrement.

Suppose I remove the offending counter. I’m assuming that FltUnregisterFilter does some housekeeping, to prevent the driver code from being unloaded while some app is stuck in a callback potentially running driver code.

Question 1: What will normally happen as thread 2 is suspended (and the program itself crashed) while still in our callback?

Question 2: If our driver does unload successfully (leaving thread 2 supposedly in our code), is there some risk for stability, in case WER (or some windows mechanism) probes the suspended process’ memory?

I’ll try to replicate this on the weekend, by intentionally crashing an app while hanging one of its threads, and see how this works. I was hoping someone knows how this should work.

> Suppose I remove the offending counter. I’m assuming that FltUnregisterFilter does some housekeeping, to prevent the driver code from being unloaded while some app is stuck in a callback potentially running driver code.

Correct. FltUnregisterFilter will not return until all of your callbacks have finished executing.

> Question 1: What will normally happen as thread 2 is suspended (and the program itself crashed) while still in our callback?

The unload call will hang while FltMgr waits for all rundown references to be released.

> The unload call will hang while FltMgr waits for all rundown references to be released.
So a follow-up question: what is the common practice to prevent this from happening?

So far I’ve witnessed two scenarios in which user apps get stuck in our code and prevent unload:

  1. When the app crashes and is suspended by WER for debugging, until WER finishes (which could be indefinite if it’s showing UI that the user doesn’t click).
  2. Windows 10 Modern / Immersive applications (such as SearchUI and so on), which often have suspended threads (I still haven’t figured out the rules for this).

Both seem out of my control, but I would rather minimize the chances for this.

Practically speaking, no one tries to do anything to mitigate this. The worst thing that happens is that your attempt to unload your filter hangs. As long as that doesn’t cause anything else to hang, you have an annoying but rare edge case.

I see. While this is indeed rare, I could imagine a case where a routine upgrade rollout hangs “between versions” until the issue is identified and manually resolved. This would leave the endpoint unprotected for that period, which isn’t ideal, especially as manual resolution could take time for large-scale deployments or critical servers that aren’t easily taken offline for maintenance.

Maybe the right approach is to rethink the rollout process to mitigate such cases.

In any case, I really appreciate your patience and detailed replies. Many thanks.