In the past year my development responsibilities expanded to include the maintenance and troubleshooting of a software driver written in C++ using the DriverWorks framework. The revision history shows very few changes to much of the code in the 15+ years since it was originally written. Recently, a bugcheck started occurring and I was able to track down the root cause: a double-remove operation performed on an item in a doubly linked list. While analyzing the source code and a crash dump, I noticed that the callback function registered via PsSetLoadImageNotifyRoutine() appears to be called simultaneously by two different threads reporting image load events for the same process, yet it is coded with an implicit assumption that image load notifications for a given process will occur sequentially, not concurrently. Based on the content of the crash dump and the double-remove operation that caused the bugcheck, that assumption is incorrect.
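To make the failure mode concrete, here is a minimal user-mode sketch of the race and of one way to make the removal idempotent. It is not the driver's code: ListEntry, TrackedList, removeOnce and raceDoubleRemove are hypothetical names, std::mutex merely stands in for whichever kernel lock would be appropriate, and the self-linking convention after removal is an assumption of this sketch.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// User-mode stand-in for the kernel's LIST_ENTRY (hypothetical model).
struct ListEntry {
    ListEntry* flink;
    ListEntry* blink;
};

struct TrackedList {
    std::mutex lock;   // stand-in for the kernel lock the driver would use
    ListEntry head;

    TrackedList() { head.flink = head.blink = &head; }

    void insertTail(ListEntry* e) {
        std::lock_guard<std::mutex> g(lock);
        e->blink = head.blink;
        e->flink = &head;
        head.blink->flink = e;
        head.blink = e;
    }

    // Unlinks an entry that was previously inserted. Returns true only for
    // the caller that actually unlinked it; a second caller sees the
    // self-linked entry and backs off instead of corrupting the list.
    bool removeOnce(ListEntry* e) {
        std::lock_guard<std::mutex> g(lock);
        if (e->flink == e) return false;   // already removed
        e->blink->flink = e->flink;
        e->flink->blink = e->blink;
        e->flink = e->blink = e;           // mark as unlinked
        return true;
    }
};

// Two threads race to remove the same entry, as the two image-load
// callbacks appear to do; returns how many removals succeeded.
int raceDoubleRemove() {
    TrackedList list;
    ListEntry item{};
    list.insertTail(&item);
    int removed[2] = {0, 0};
    std::thread a([&] { removed[0] = list.removeOnce(&item) ? 1 : 0; });
    std::thread b([&] { removed[1] = list.removeOnce(&item) ? 1 : 0; });
    a.join();
    b.join();
    return removed[0] + removed[1];
}
```

Whichever thread wins the lock unlinks the entry and the other sees that it is already gone, so exactly one removal succeeds. Without the lock and the already-removed check, both threads can rewrite the neighbours' links, which is the pattern the crash dump points to.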
PsSetCreateProcessNotifyRoutine(), PsSetCreateThreadNotifyRoutine() and PsSetLoadImageNotifyRoutine() are used by this driver.
The NTDDK documentation for PCREATE_PROCESS_NOTIFY_ROUTINE states “The operating system calls the driver’s process-notify routine at PASSIVE_LEVEL inside a critical region with normal kernel APCs disabled.”
The NTDDK documentation for PCREATE_THREAD_NOTIFY_ROUTINE states “The driver’s thread-notify routine runs at IRQL = PASSIVE_LEVEL or APC_LEVEL.”
The NTDDK documentation for PLOAD_IMAGE_NOTIFY_ROUTINE states “The operating system calls the driver’s load-image notify routine at PASSIVE_LEVEL inside a critical region with normal kernel APCs always disabled and sometimes with both kernel and special APCs disabled.”
I read the OSR article “Understanding Critical Regions” and am very clear in my understanding that a critical region is not a critical section. There is explicit mention that acquiring any Spinlocks, Kernel Mutexes, Fast Mutexes or Guarded Mutexes will implicitly result in entering a critical region, while other locking/synchronization constructs do not implicitly enter a critical region.
Two of the callbacks are documented as being called by the kernel while in a critical region, so it’s clear that normal kernel APCs will not be delivered to the thread executing the callback in those cases. However, no statements are made about any kind of serialization of the invocation of those callbacks by the kernel, especially as it relates to notifications of a given type for the same process or for different processes.
Also of note, thread create/delete notifications are not documented as happening within a critical region, and the driver code I’m working on has a function for that callback that makes other calls that ultimately lead to obtaining and releasing a lock while manipulating a linked list. I need to determine whether the lock mechanism being used is one of the types that implicitly enter a critical region or one of the types that do not. If it is the latter, then that represents another place where a fault could occur in the driver code, leading to a deadlock or some other eventually fatal problem if the thread were to be interrupted by an APC while it held the lock.
Question: What documentation, if any, defines the concurrency rules for when callback functions are called to deliver the notification of image load events and thread create/delete events?
It’s obvious that a process create event and a process delete event should each only ever be delivered a single time for a given process, and that the two cannot, by definition, be delivered concurrently for the same process by different threads.
What is of concern is that this driver had been working just fine, as-is, until recently, and the bugcheck occurs only in a specific runtime environment. I suspect that there is some negative interaction between two or more security enhancement products being present and functional at the same time, such that a newly created process is being subjected to DLL injection or some other code injection that results in additional threads being created and DLLs being loaded in a way that causes the simultaneous delivery of image load events. The lack of locking in the callback functions then leads to the double-remove and subsequent bugcheck. Also of note, it has not yet been possible to reproduce the crash in a controlled development lab environment, but that is likely due to not having sufficiently duplicated the runtime environment in a virtual machine. In the environment where the failure happens, though, it is easily repeatable, and the complete memory dump that is produced always shows the same underlying double-remove as the cause of the bugcheck.
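Since the crash only reproduces in that one environment, one way to confirm the concurrency there without waiting for a bugcheck would be a re-entrancy counter in the callback. The following is a user-mode sketch of that idea only; ReentrancyProbe and simulateOverlap are hypothetical names, the counter is global rather than per process, and in the driver the std::atomic operations would presumably be replaced with InterlockedIncrement() and InterlockedDecrement().

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical diagnostic: tracks how many threads are inside the
// callback at the same time and remembers the highest value seen.
struct ReentrancyProbe {
    std::atomic<int> inFlight{0};
    std::atomic<int> maxSeen{0};

    void enter() {
        int now = inFlight.fetch_add(1) + 1;
        int prev = maxSeen.load();
        // Raise maxSeen to 'now' unless another thread already stored more.
        while (now > prev && !maxSeen.compare_exchange_weak(prev, now)) {
        }
    }

    void leave() { inFlight.fetch_sub(1); }
};

// Simulates two overlapping callback invocations and returns the peak
// concurrency the probe observed.
int simulateOverlap() {
    ReentrancyProbe probe;
    std::atomic<int> arrived{0};
    auto callback = [&] {
        probe.enter();
        arrived.fetch_add(1);
        // Hold both threads inside the "callback" until both have entered,
        // so the overlap is forced rather than left to scheduling luck.
        while (arrived.load() < 2) std::this_thread::yield();
        probe.leave();
    };
    std::thread a(callback);
    std::thread b(callback);
    a.join();
    b.join();
    return probe.maxSeen.load();
}
```

A peak value above 1 recorded in the failing environment would show directly that the callback is being entered concurrently, independent of the list corruption it eventually causes.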