I have just analyzed, once more, a very strange scenario with FltSendMessage in our minifilter. We have reproduced it three or four times in the last few days, but each time it occurred during quite long and heavy stress testing, so it is very hard to reproduce.
We use FltSendMessage to send requests to a user-mode process, with up to N requests in parallel at any time (properly synchronized using a counter and a spinlock around the counter). In user mode, N threads each wait in FilterGetMessage to receive one message. Each thread uses its own overlapped structure and calls WaitForMultipleObjects to wait for either the overlapped structure to signal message-received or a termination event, followed by GetOverlappedResult if necessary. The thread then processes the message and sends a reply using FilterReplyMessage.
The minifilter waits in FltSendMessage for up to 11000 ms to receive an answer. If no answer is received, the call times out with STATUS_TIMEOUT.
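The limit of N parallel requests described above could be sketched roughly as follows. This is only a portable illustration, not our driver code: the real minifilter uses a counter guarded by a spinlock, while this sketch substitutes a C11 atomic compare-exchange so that it compiles outside the kernel. The names MAX_IN_FLIGHT, g_in_flight, try_acquire_slot and release_slot are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical limit; the post only calls it N. */
#define MAX_IN_FLIGHT 4

/* Number of requests currently sent and still waiting for a reply. */
static atomic_int g_in_flight = 0;

/* Try to reserve one of the MAX_IN_FLIGHT slots before sending a message.
 * Returns false when all slots are taken, so the caller can defer or skip. */
static bool try_acquire_slot(void)
{
    int current = atomic_load(&g_in_flight);
    while (current < MAX_IN_FLIGHT) {
        /* On failure, 'current' is refreshed with the latest value
         * and the loop condition is checked again. */
        if (atomic_compare_exchange_weak(&g_in_flight, &current, current + 1))
            return true;
    }
    return false;
}

/* Release the slot once the send call has returned, whatever its status
 * (answer received, timeout, or any error). */
static void release_slot(void)
{
    atomic_fetch_sub(&g_in_flight, 1);
}
```

The important point for the scenario discussed here is that the slot has to be released on every return path of the send call, including unexpected statuses, otherwise the counter would slowly leak slots.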
We usually send tens of thousands of requests (send message, wait for answer cycles) without problems. Then, very rarely, the FltSendMessage call suddenly returns STATUS_THREAD_IS_TERMINATING. It usually returns very quickly (so it is NOT a timeout), even before we could process the message in user mode, and WITHOUT the user-mode thread that processes the message (or any other of our user-mode processing threads) actually being terminated. When we then process the message in user mode and call FilterReplyMessage, it returns ERROR_FLT_NO_WAITER_FOR_REPLY (normally unexpected for us, but logical in this situation).
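To make the outcomes above explicit, here is a minimal sketch of how the two return values could be classified. The numeric values are the ones from ntstatus.h and winerror.h, repeated under local names so the sketch compiles without the Windows headers; the enum and function names are hypothetical and not part of any Windows API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from ntstatus.h / winerror.h under local names. */
#define SKETCH_STATUS_SUCCESS                0x00000000u
#define SKETCH_STATUS_TIMEOUT                0x00000102u
#define SKETCH_STATUS_THREAD_IS_TERMINATING  0xC000004Bu
#define SKETCH_ERROR_FLT_NO_WAITER_FOR_REPLY 0x801F0020u

typedef enum {
    SEND_REPLIED,       /* user mode answered in time */
    SEND_TIMED_OUT,     /* no answer within the timeout */
    SEND_WAIT_ABORTED,  /* send returned STATUS_THREAD_IS_TERMINATING */
    SEND_FAILED         /* any other status */
} send_outcome;

/* Classify the NTSTATUS returned by FltSendMessage.
 * STATUS_TIMEOUT is a success code (NT_SUCCESS is true for it),
 * so it has to be compared explicitly. */
static send_outcome classify_send_status(uint32_t status)
{
    switch (status) {
    case SKETCH_STATUS_SUCCESS:               return SEND_REPLIED;
    case SKETCH_STATUS_TIMEOUT:               return SEND_TIMED_OUT;
    case SKETCH_STATUS_THREAD_IS_TERMINATING: return SEND_WAIT_ABORTED;
    default:                                  return SEND_FAILED;
    }
}

/* True when FilterReplyMessage reports that the kernel side is no longer
 * waiting for this reply, as happened in the case described above. */
static bool reply_has_no_waiter(uint32_t hr)
{
    return hr == SKETCH_ERROR_FLT_NO_WAITER_FOR_REPLY;
}
```

In the rare case from this post, the kernel side would see SEND_WAIT_ABORTED and the user-mode side would later see reply_has_no_waiter() return true for the same message.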
For example, one of our concrete cases is:
- before FltSendMessage, timestamp 18:18:31.91
- message received in user mode, after GetOverlappedResult, timestamp 18:18:31.91
- after FltSendMessage, timestamp 18:18:31.97, STATUS_THREAD_IS_TERMINATING
- message processed in user mode, before FilterReplyMessage, timestamp 18:18:32.24
- after FilterReplyMessage, timestamp 18:18:32.24, ERROR_FLT_NO_WAITER_FOR_REPLY
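The intervals in this trace can be worked out with a small helper. This is just a sketch for the arithmetic, assuming timestamps in the HH:MM:SS.cc form used above (two digits of hundredths of a second); the function names are made up for illustration.

```c
#include <stdio.h>

/* Convert "HH:MM:SS.cc" (cc = hundredths of a second) to milliseconds
 * since midnight. Returns -1 if the string does not match the format. */
static long timestamp_to_ms(const char *ts)
{
    int h, m, s, cs;
    if (sscanf(ts, "%d:%d:%d.%d", &h, &m, &s, &cs) != 4)
        return -1;
    return ((h * 60L + m) * 60L + s) * 1000L + cs * 10L;
}

/* Elapsed milliseconds between two valid timestamps on the same day. */
static long elapsed_ms(const char *from, const char *to)
{
    return timestamp_to_ms(to) - timestamp_to_ms(from);
}
```

With the values above, elapsed_ms("18:18:31.91", "18:18:31.97") gives 60 ms between calling FltSendMessage and its return, far below the 11000 ms timeout, and elapsed_ms("18:18:31.97", "18:18:32.24") gives 270 ms between that return and the FilterReplyMessage call in user mode.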
The strange part is that we precisely trace the start and termination of all of our message-processing threads. In this particular case, all threads were gracefully terminated upon request (by signaling the request-terminate event that WaitForMultipleObjects was waiting on in each thread) LONG after this event. All threads terminated between timestamps 18:39:27.82 and 18:39:27.84, so almost 11 minutes later (after we found this particular FltSendMessage error during the test). Also, apart from this error, all other messages / requests in the whole stress test, both before and after this particular request, were processed correctly, both in the minifilter and in all of the user-mode threads.
Any ideas, opinions, or similar behavior observed?
Thank you very much,
Virus Analyst, SOFTWIN