BSODs from DPC- or stack-related system memory corruption with a storage filter driver

MikeS · August 29, 2022, 11:40pm

Hi guys,

We have a storage filter driver that attaches at a couple of levels in the driver stack. The system BSODs occasionally, with no consistency as to where. All of the BSODs are related to some kind of memory corruption, though. Mostly it crashes with either:

“IRQL_NOT_LESS_OR_EQUAL (a), an attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.” or,
“KERNEL_SECURITY_CHECK_FAILURE (139), ExceptionCode: c0000409 (Security check failure or stack buffer overrun).”
I will provide a few WinDbg outputs of the BSODs below; however, don’t pay too much attention to any particular one, because these are just some of them, and the others fail pretty much all over the place, at different points: within our driver, in the system kernel and even at the application level (see “My NOTES” within the outputs below). Unfortunately, I can’t post complete BSOD outputs here because of the message size limitation, so I will cut most of them.
We also ran this on different Windows 10 x64 systems with the same results, so it is definitely not specific to one machine.
One more thing to note. In quite a few BSODs we’ve seen the following error message from the “!timer” command: “Timer at has wrong Blink! (Blink , should be )”, and it referred to one of our timer DPC procedures. So, I would very much appreciate any insights on how a timer’s Blink address can get corrupted. I was hoping our DPC code was the issue; however, some BSODs occurred without that DPC even being initialized by KeInitializeDpc/KeInitializeTimer.
With Verifier enabled, the crash is usually in our timer DPC handler while it accesses a member of the DPC DeferredContext, which is located in NonPaged Pool (see one of the outputs below); however, that’s not always the case. As I mentioned above, BSODs occur even without our DPC being initialized, as it gets initialized only when certain conditions occur.
I have two guesses at this point: either something happens to our DPC timer/handler, or to a system thread stack. So, I was wondering if you could have a look at this and let me know where we should dig further.
I would greatly appreciate any ideas on what might be happening here. Please let me know should you need more information on anything.

Thank you!
Mike

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.
Arguments:
Arg1: ffffb04e0000002d, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff800410c556b, address which referenced memory

READ_ADDRESS: Unable to get offset of nt!_MI_VISIBLE_STATE.SpecialPool
Unable to get value of nt!_MI_VISIBLE_STATE.SessionSpecialPool
ffffb04e0000002d

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x469
nt!RtlpHpLfhSubsegmentDecommitPages+0xcb
nt!RtlpHpLfhOwnerCompact+0x90
nt!RtlpHpLfhContextCompact+0xaf
nt!RtlpHpHeapCompact+0x76
nt!ExpHpCompactionRoutine+0x207
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28

KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure. The corruption could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: fffff8887b1e69c0, Address of the trap frame for the exception that caused the bugcheck
Arg3: fffff8887b1e6918, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

CURRENT_IRQL: 2
ERROR_CODE: (NTSTATUS) 0xc0000409 - The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x323
nt!RtlpHpLfhSlotAllocate+0x19d53c
nt!ExAllocateHeapPool+0x2b1
nt!ExAllocatePoolWithTag+0x64
nt!ObWaitForMultipleObjects+0x399
nt!NtWaitForMultipleObjects+0x119
nt!KiSystemServiceCopyEnd+0x25
nt!KiServiceLinkage
dxgkrnl!CTokenManager::ProcessTokens+0x1bd
dxgkrnl!CTokenManager::TokenThread+0x79
dxgkrnl!NtTokenManagerThread+0x1be
nt!KiSystemServiceCopyEnd+0x25
win32u!NtTokenManagerThread+0x14
dwmcore!CGlobalSurfaceManager::ProcessKernelTokens+0x114
dwmcore!CGlobalSurfaceManager::s_TokenThreadMain+0x9
KERNEL32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

KERNEL_SECURITY_CHECK_FAILURE (139)
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: ffff9b82e8e36e90, Address of the trap frame for the exception that caused the bugcheck
Arg3: ffff9b82e8e36de8, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x30e
nt!RtlpHpLfhSlotAllocate+0x1830fa
nt!ExAllocateHeapPool+0x98b
nt!ExAllocatePoolWithTag+0x3d
Wdf01000!FxPoolAllocator+0x73 [minkernel\wdf\framework\shared\object\wdfpool.cpp @ 337]
Wdf01000!FxIoTarget::FormatIoctlRequest+0x329 [minkernel\wdf\framework\shared\targets\general\km\fxiotargetkm.cpp @ 373]
Wdf01000!FxIoTargetSendIoctl+0x158 [minkernel\wdf\framework\shared\targets\general\fxiotargetapi.cpp @ 1193]
Wdf01000!imp_WdfIoTargetSendIoctlSynchronously+0x48 [minkernel\wdf\framework\shared\targets\general\fxiotargetapi.cpp @ 1421]

My NOTES: This one occurred from within our application (as you will see, it wasn’t even calling our driver):

KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure. The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: ffff81088d55b550, Address of the trap frame for the exception that caused the bugcheck
Arg3: ffff81088d55b4a8, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

PROCESS_NAME: .exe

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x30e
nt!KiInsertTimerTable+0x14bc6c
nt!KiCommitThreadWait+0x4e4
nt!KeWaitForSingleObject+0x520
win32kfull!UmfdClientSendAndWaitForCompletion+0x12e
win32kfull!UmfdQueryAdvanceWidths+0xdc
win32kfull!RFONTOBJ::bGetWidthTable+0x123
win32kfull!NtGdiGetWidthTable+0x192
nt!KiSystemServiceCopyEnd+0x28
win32u!NtGdiGetWidthTable+0x14
gdi32full!bFillWidthTableForGTE+0x166
gdi32full!pcfLocateCFONT+0x310
gdi32full!GetTextExtentPointWInternal+0x141
gdi32full!GetTextExtentPoint32W+0xe
GDI32!GetTextExtentPoint32WStub+0x43
_7ff671310000!CGraphWnd::DrawData+0x2b5
_7ff671310000!CGraphWnd::OnPaint+0x11b

My NOTES: This one occurred in our driver code while accessing a member within our DPC DeferredContext, which was located in NonPaged Pool:

IRQL_NOT_LESS_OR_EQUAL (a)
Arguments:
Arg1: ffffa78dd9d80f50, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff8067c8953d1, address which referenced memory

PROCESS_NAME: svchost.exe

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x454
nt!KiCancelTimer+0x41
nt!KiSetTimerEx+0x7b
nt!KeSetTimer+0x14
VerifierExt!KeSetTimer_wrapper+0x3e
nt!VerifierKeSetTimer+0x10
!Xxxxx::timerDpc+0x2c8

KERNEL_SECURITY_CHECK_FAILURE (139)
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: ffff848326fb1280, Address of the trap frame for the exception that caused the bugcheck
Arg3: ffff848326fb11d8, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

PROCESS_NAME: MsMpEng.exe
STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x30e
nt!RtlpHpLfhSubsegmentFreeBlock+0x17f972
nt!ExFreePoolWithTag+0xad3
nt!IopFreeIrp+0x1ab
nt!IopCompleteRequest+0x89c
nt!IoRemoveIoCompletion+0x26c
nt!NtWaitForWorkViaWorkerFactory+0x99a
nt!KiSystemServiceCopyEnd+0x28
ntdll!NtWaitForWorkViaWorkerFactory+0x14
ntdll!TppWorkerThread+0x274
KERNEL32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Tim_Roberts · August 30, 2022, 12:14am

“Blink” refers to the backwards link in a LIST_ENTRY structure. This usually means you have written off the end of an array and damaged the link pointers in the enclosing structure, although it can also come from removing an element twice… Some APIs validate the pointers, some just crash when they’re bad. This can also happen if you allocate something with a LIST_ENTRY on the stack, instead of in heap. When the function returns, the stack gets whacked.
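
To make the “double remove” case concrete, here is a minimal user-mode sketch in C of a doubly linked list whose remove routine checks that both neighbours still point back at the entry before unlinking it. The names (ListEntry, checked_remove, demo_double_remove) are made up for illustration and this is not the kernel’s code; it only models the kind of consistency check that, as far as I know, ends in bugcheck 0x139 with Arg1 = 3 when it fails in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for LIST_ENTRY: forward and backward links. */
typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

/* An empty list is a head that points at itself in both directions. */
static void init_head(ListEntry *head) {
    head->Flink = head;
    head->Blink = head;
}

static void insert_tail(ListEntry *head, ListEntry *e) {
    ListEntry *last = head->Blink;
    e->Flink = head;
    e->Blink = last;
    last->Flink = e;
    head->Blink = e;
}

/* Unlink e only if both neighbours still point back at it.
   Returns false (and leaves the list untouched) on a mismatch. */
static bool checked_remove(ListEntry *e) {
    ListEntry *next = e->Flink;
    ListEntry *prev = e->Blink;
    if (next->Blink != e || prev->Flink != e) {
        return false;               /* the kernel would fast-fail here */
    }
    prev->Flink = next;
    next->Blink = prev;
    return true;                    /* e keeps its stale Flink/Blink */
}

/* Removes the same entry twice; the second attempt must be rejected. */
static bool demo_double_remove(void) {
    ListEntry head, a, b;
    init_head(&head);
    insert_tail(&head, &a);
    insert_tail(&head, &b);

    bool first = checked_remove(&a);    /* a is really on the list */
    bool second = checked_remove(&a);   /* a's links are now stale */

    /* The list itself is still intact: head <-> b. */
    assert(head.Flink == &b && b.Blink == &head);
    return first && !second;
}
```

Calling demo_double_remove() returns true: the first removal succeeds and the second is rejected, because the stale Flink/Blink left in the removed entry no longer match its former neighbours. Without the check, the second removal would silently rewrite the links of entries that are no longer its neighbours.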

MikeS · August 30, 2022, 12:23am

@Tim_Roberts
Thank you, Tim.
I don’t think this is related to writing off the end of an array, since the Blink addresses look like valid pointers. They just point to the wrong place. Here is a “!timer” command output: “Timer at ffffb08ca6fa31f0 has wrong Blink (Blink fffff8042c4fc848, should be fffff8042c4fbb28)”.
Also, we don’t allocate any lists on the stack, only in NonPaged Pool.
Do you think this can be related to stack corruption in any system thread? Is there any way to track this down?
It’s just that, as I said, BSODs occur even without “Blinks” being corrupted.
Thank you!

MikeS · August 30, 2022, 12:30am

I am particularly wondering about the following BSOD, which occurred in an application without touching any third-party driver. So this must be serious kernel-level memory corruption that is somehow triggered by our driver.

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x30e
nt!KiInsertTimerTable+0x14bc6c
nt!KiCommitThreadWait+0x4e4
nt!KeWaitForSingleObject+0x520
win32kfull!UmfdClientSendAndWaitForCompletion+0x12e
win32kfull!UmfdQueryAdvanceWidths+0xdc
win32kfull!RFONTOBJ::bGetWidthTable+0x123
win32kfull!NtGdiGetWidthTable+0x192
nt!KiSystemServiceCopyEnd+0x28
win32u!NtGdiGetWidthTable+0x14
gdi32full!bFillWidthTableForGTE+0x166
gdi32full!pcfLocateCFONT+0x310
gdi32full!GetTextExtentPointWInternal+0x141
gdi32full!GetTextExtentPoint32W+0xe
GDI32!GetTextExtentPoint32WStub+0x43
_7ff671310000!CGraphWnd::DrawData+0x2b5
_7ff671310000!CGraphWnd::OnPaint+0x11b

MikeS · August 30, 2022, 1:25am

Just to highlight another weird thing I am scratching my head about (it is among the WinDbg outputs I posted above)… Verifier usually crashes at a memory read/write operation within our DPC handler, while accessing its members allocated in a nonpaged pool. It’s like a nonpaged pool got paged out.
Just to note: there is no leak in nonpaged pool, as Task Manager shows its usage to be stable and very low.
Well, this BSOD just shows once again that there is a very critical issue affecting the system kernel. So, I am wondering if someone can give me an idea of how to track down this kind of issue, when BSODs occur all over the place.

This is from the WinDbg output:
IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.
Arguments:
Arg1: ffffa78dd9d80f50, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff8067c8953d1, address which referenced memory

Debugging Details:
…
WRITE_ADDRESS: ffffa78dd9d80f50 Nonpaged pool
CURRENT_IRQL: 2
FAULTING_IP:
nt!KiCancelTimer+41
fffff806`7c8953d1 f00fba2b07 lock bts dword ptr [rbx],7
…
STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x454
nt!KiCancelTimer+0x41
nt!KiSetTimerEx+0x7b
nt!KeSetTimer+0x14
VerifierExt!KeSetTimer_wrapper+0x3e
nt!VerifierKeSetTimer+0x10
OUR_DRIVER!XxxContext::timerDpc+0x2c8
nt!KiProcessExpiredTimerList+0x159
nt!KiRetireDpcList+0x4a7
nt!KxRetireDpcList+0x5
nt!KiDispatchInterruptContinue
nt!KiDpcInterruptBypass+0x25
nt!KiInterruptDispatchNoLockNoEtw+0xb1
ntdll!NtQueryInformationProcess+0x14
KERNELBASE!GetSystemInfoInternal+0xbd
KERNELBASE!GetNativeSystemInfo+0x6b
rpcss!GetCatalogClientFlagForProcess+0x23
rpcss!ActivateFromPropertiesPreamble+0x4ca
rpcss!PerformScmStage+0x9a0
rpcss!SCMActivatorCreateInstance+0x134
RPCRT4!Invoke+0x73
RPCRT4!NdrStubCall2+0x42c
RPCRT4!NdrServerCall2+0x1a
RPCRT4!DispatchToStubInCNoAvrf+0x18
RPCRT4!RPC_INTERFACE::DispatchToStubWorker+0x1a0
RPCRT4!RPC_INTERFACE::DispatchToStub+0xcb
RPCRT4!LRPC_SCALL::DispatchRequest+0x35f
RPCRT4!LRPC_SCALL::HandleRequest+0x7fa
RPCRT4!LRPC_ADDRESS::HandleRequest+0x341
RPCRT4!LRPC_ADDRESS::ProcessIO+0x8a2
RPCRT4!LrpcIoComplete+0xc5
ntdll!TppAlpcpExecuteCallback+0x260
ntdll!TppWorkerThread+0x3c8
KERNEL32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Peter_Viscarola_OSR · August 30, 2022, 8:33pm

@MikeS said:
Verifier usually crashes at a memory read/write operation within our DPC handler, while accessing its members allocated in a nonpaged pool. It’s like a nonpaged pool got paged out.

With Verifier, that would sound to ME like a “use after free” error.

P
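
A use-after-free and a “wrong Blink” report can be the same bug seen from two sides. The following user-mode sketch in C assumes a context block that embeds a list entry (as a timer does) and shows what a list checker sees if that block is released and overwritten while it is still linked. TimerCtx, g_slot and reuse_slot are hypothetical names; the single static slot merely stands in for a pool block being handed to a new owner, so the example itself never touches freed memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

/* Hypothetical context that embeds a list entry, like a KTIMER does. */
typedef struct TimerCtx {
    ListEntry link;
    int channel;
} TimerCtx;

static ListEntry g_head;   /* stands in for a system-owned queue head */
static TimerCtx g_slot;    /* stands in for one nonpaged pool block   */

static void head_init(void) {
    g_head.Flink = &g_head;
    g_head.Blink = &g_head;
}

static void insert_head(ListEntry *e) {
    e->Flink = g_head.Flink;
    e->Blink = &g_head;
    g_head.Flink->Blink = e;
    g_head.Flink = e;
}

static void remove_entry(ListEntry *e) {
    e->Blink->Flink = e->Flink;
    e->Flink->Blink = e->Blink;
}

/* Simulates the block being freed and overwritten by its next owner. */
static void reuse_slot(void) {
    g_slot.link.Flink = NULL;
    g_slot.link.Blink = NULL;
    g_slot.channel = -1;
}

/* The check a list walker performs on the first entry of the queue. */
static bool first_entry_blink_ok(void) {
    return g_head.Flink->Blink == &g_head;
}

/* BUG: the block is reused while the queue still points at it. */
static bool demo_free_while_linked(void) {
    head_init();
    insert_head(&g_slot.link);
    assert(g_head.Flink == &g_slot.link);
    reuse_slot();
    return first_entry_blink_ok();   /* queue now sees a wrong Blink */
}

/* Correct order: unlink first, then release the block. */
static bool demo_unlink_then_free(void) {
    head_init();
    insert_head(&g_slot.link);
    remove_entry(&g_slot.link);
    reuse_slot();
    return first_entry_blink_ok();   /* queue is empty and consistent */
}
```

Here demo_free_while_linked() returns false and demo_unlink_then_free() returns true. On a real system the damaged entry is found later by whichever thread happens to walk or modify the list, which would fit crashes in unrelated processes. With Verifier’s special pool the freed page is, as far as I understand it, made inaccessible, so the first stale access faults immediately instead.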

MikeS · August 30, 2022, 10:49pm

@“Peter_Viscarola_(OSR)” said:
With Verifier, that would sound to ME like a “use after free” error.

Thanks, Peter. You always give fresh ideas to think over, and on past issues some of them really cleared things up!
That is not the case here, though. Here is the source code where that Verifier crash occurs:
void ProcedureX()
{
    int nChannel = m_nChannel; // this access to m_nChannel, which is allocated in nonpaged pool, worked fine according to memory.dmp
    … // some other code
    OtherProcedure(m_nChannel); // the crash occurs here while accessing m_nChannel, with “An attempt was made to access a pageable (or completely invalid) address”
}
Pretty weird, huh? As I mentioned above, the system crashes everywhere for no sensible reason.
I am still trying to see if all of this is happening due to corruption of the system timer queue.
Coming back to your message above in this thread about a wrong Blink in the timer’s LIST_ENTRY, which may get corrupted by removing an element twice or by some other improper create/delete sequence: that does look like one of the possible causes of all this. However, I am not sure how one can actually corrupt a timer’s list entry.
Could you please elaborate on how a timer can actually be removed twice, or what can be done to it to corrupt it at all? I have looked at all the calls that handle the timer, namely KeInitializeTimer, KeInitializeDpc, KeSetTimer and KeCancelTimer, and I can’t see how an improper sequence of those calls can actually damage the timer queue. According to the WDM documentation, both KeSetTimer and KeCancelTimer are supposed to check the system timer queue and just return TRUE or FALSE depending on whether the timer is already in the queue, without adding an extra copy or removing the element twice.
That being said, according to our latest findings, we still somehow corrupt the system timer queue just by using that API, so I am still wondering how one can manage to do that.
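
One sequence that could do it, offered as a hypothesis and not a diagnosis of this driver, is initializing a timer again while it is still queued. The user-mode sketch below in C models a set/cancel pair that is safe to call repeatedly because it consults an “inserted” flag, and then shows how resetting that flag behind the queue’s back leaves an entry whose Blink disagrees with the queue. ToyTimer and all toy_* functions are invented for this illustration and do not reflect the real KTIMER layout or the kernel’s timer table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ListEntry {
    struct ListEntry *Flink;
    struct ListEntry *Blink;
} ListEntry;

/* Toy model only: NOT the real KTIMER. */
typedef struct ToyTimer {
    ListEntry link;
    bool inserted;
} ToyTimer;

static ListEntry g_queue;

static void queue_init(void) { g_queue.Flink = g_queue.Blink = &g_queue; }

/* Resets the timer without looking at the queue (the dangerous part). */
static void toy_init_timer(ToyTimer *t) {
    t->link.Flink = t->link.Blink = NULL;
    t->inserted = false;
}

static void unlink_entry(ListEntry *e) {
    e->Blink->Flink = e->Flink;
    e->Flink->Blink = e->Blink;
}

static void link_tail(ListEntry *e) {
    ListEntry *last = g_queue.Blink;
    e->Flink = &g_queue;
    e->Blink = last;
    last->Flink = e;
    g_queue.Blink = e;
}

/* Returns true if the timer was already queued (it is re-queued once). */
static bool toy_set_timer(ToyTimer *t) {
    bool was = t->inserted;
    if (was) unlink_entry(&t->link);
    link_tail(&t->link);
    t->inserted = true;
    return was;
}

/* Returns true if the timer was queued; a second cancel is a no-op. */
static bool toy_cancel_timer(ToyTimer *t) {
    bool was = t->inserted;
    if (was) { unlink_entry(&t->link); t->inserted = false; }
    return was;
}

/* Walks the queue and checks that every Flink is matched by a Blink. */
static bool queue_is_consistent(void) {
    const ListEntry *e = &g_queue;
    int guard = 0;
    do {
        if (e->Flink == NULL || e->Flink->Blink != e) return false;
        e = e->Flink;
    } while (e != &g_queue && ++guard < 64);
    return e == &g_queue;
}

/* Repeated set and cancel calls leave the queue intact. */
static bool demo_well_behaved(void) {
    static ToyTimer a, b;
    queue_init();
    toy_init_timer(&a); toy_init_timer(&b);
    toy_set_timer(&a); toy_set_timer(&b);
    bool again = toy_set_timer(&a);      /* already queued */
    bool c1 = toy_cancel_timer(&a);      /* was queued     */
    bool c2 = toy_cancel_timer(&a);      /* not queued now */
    return again && c1 && !c2 && queue_is_consistent();
}

/* BUG: a is re-initialized while queued, then set again. */
static bool demo_reinit_while_queued(void) {
    static ToyTimer a, b;
    queue_init();
    toy_init_timer(&a); toy_init_timer(&b);
    toy_set_timer(&a); toy_set_timer(&b);
    assert(queue_is_consistent());
    toy_init_timer(&a);                  /* flag cleared, still linked */
    toy_set_timer(&a);                   /* linked a second time       */
    return queue_is_consistent();
}
```

In this model demo_well_behaved() returns true and demo_reinit_while_queued() returns false: after the second insert the queue head still points at the timer, but the timer’s Blink now points at the other timer, which resembles the “has wrong Blink” report. If some path in the driver can reach the initialization code a second time, or frees the memory holding the timer without cancelling it first, that path would be worth ruling out.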

MikeS · August 30, 2022, 10:51pm

Just wanted to add that “// some other code” above has nothing to do with timers or memory management…

Tim_Roberts · August 30, 2022, 11:23pm

Based on the name, I’m assuming m_nChannel is a C++ class member variable. What structure is it? Are you absolutely, positively certain that it is in NonPagedPool and has not been freed?

MikeS · August 31, 2022, 12:29am

Tim, that is just one weird crash out of the very many we have, so don’t bother digging into it further. Yes, I am sure it is in NonPagedPool, because WinDbg says so next to the read/write address that causes the crash, like “WRITE_ADDRESS: ffffa78dd9d80f50 Nonpaged pool”. I am also sure it hasn’t been freed within the above procedure, as I followed its assembly code in the listing file, looking at every single opcode.
Well, I also provided above another crash that occurred within our application right from OnPaint(), without even going into any third-party code. So I now see that Windows runs into all sorts of unexplainable crashes if the system timer queue gets corrupted.
All I want to understand at this point is how we manage to corrupt the timer queue just by using the KeXxx API…
P.S.: The most common thing across all of our crashes is that the “!timer” command reports something like “Timer at ffffb08ca6fa31f0 has wrong Blink! (Blink fffff8042c4fc848, should be fffff8042c4fbb28)”, pointing to one of our timers. The crashes have nothing else in common. It just crashes in all sorts of unexpected places that have nothing to do with our driver. It never crashes without our driver, though.

MikeS · August 31, 2022, 12:45am

BTW, most of the time the BSODs occur within the system itself. So most likely it is related to corruption of system resources, in particular the system timer queue, because, judging by the memory dumps, we didn’t find any other corruption anywhere in the system.
Just an example of the two most common BSOD scenarios:

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an interrupt request level (IRQL) that is too high. This is usually caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: ffffb04e0000002d, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000000, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: fffff800410c556b, address which referenced memory

STACK_TEXT:
nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiPageFault+0x469
nt!RtlpHpLfhSubsegmentDecommitPages+0xcb
nt!RtlpHpLfhOwnerCompact+0x90
nt!RtlpHpLfhContextCompact+0xaf
nt!RtlpHpHeapCompact+0x76
nt!ExpHpCompactionRoutine+0x207
nt!ExpWorkerThread+0x105
nt!PspSystemThreadStartup+0x55
nt!KiStartSystemThread+0x28

KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure. The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 0000000000000003, A LIST_ENTRY has been corrupted (i.e. double remove).
Arg2: fffff8887b1e69c0, Address of the trap frame for the exception that caused the bugcheck
Arg3: fffff8887b1e6918, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

nt!KeBugCheckEx
nt!KiBugCheckDispatch+0x69
nt!KiFastFailDispatch+0xd0
nt!KiRaiseSecurityCheckFailure+0x323
nt!RtlpHpLfhSlotAllocate+0x19d53c
nt!ExAllocateHeapPool+0x2b1
nt!ExAllocatePoolWithTag+0x64
nt!ObWaitForMultipleObjects+0x399
nt!NtWaitForMultipleObjects+0x119
nt!KiSystemServiceCopyEnd+0x25
nt!KiServiceLinkage
dxgkrnl!CTokenManager::ProcessTokens+0x1bd
dxgkrnl!CTokenManager::TokenThread+0x79
dxgkrnl!NtTokenManagerThread+0x1be
nt!KiSystemServiceCopyEnd+0x25
win32u!NtTokenManagerThread+0x14
dwmcore!CGlobalSurfaceManager::ProcessKernelTokens+0x114
dwmcore!CGlobalSurfaceManager::s_TokenThreadMain+0x9
KERNEL32!BaseThreadInitThunk+0x14
ntdll!RtlUserThreadStart+0x21

Scott_Noone_OSR · September 2, 2022, 7:17pm

Turn on Verifier for ntoskrnl.exe as well as your driver. Might give you a better crash…