Re[2]: Query about mini filter kernel stack corruption

I suspect that the corruption is somehow causing the misleading
information. As suggested, run with verifier, it may show something
different.

Pete


Kernel Drivers
Windows File System and Device Driver Consulting
www.KernelDrivers.com
866.263.9295

------ Original Message ------
From: mandar.nanivadekar@hp.com
To: “Windows File Systems Devs Interest List”
Sent: 9/29/2016 6:16:31 PM
Subject: RE:[ntfsd] Query about mini filter kernel stack corruption

>Thanks all, I will take another look at my code and also try Driver Verifier.
>
>Tony, one query I have is: in some of the routines I check the remaining
>kernel stack size for each thread using IoGetRemainingStackSize() and
>print it. I can see that the reported stack size suddenly increases.
>
>I thought that RIP becomes NULL because the stack goes bad, and the
>stack size suddenly increases for all the threads in all the routines
>that were working normally so far.
>
>The crash does not happen immediately, but after a few more I/O calls
>once the stack seems to have gone bad.
>Can the stack go bad because of an access violation, or is this a
>different problem altogether?
>
>PostReadWrite: setting up job offset = 26287767552, DataSize = 32768,
>ThreadID =
>0000000000000930, Stack Size = 18904
>...
>PostReadWrite: setting up job offset = 26289569792, DataSize = 32768,
>ThreadID =
>0000000000000DE8, Stack Size = 18904
>
>MessageNotify: Result = 1 Address = FFFFD00022602000, offset = 0,
>DataSize =
>32768 ThreadID = 00000000000003B4 StackSize = 21960 STOFF = 26287767552
>IOPLENGTH = 32768
>
>PostReadWrite: setting up job offset = 26289602560, DataSize = 32768,
>ThreadID =
>0000000000000F00, Stack Size = 18904
>
>MessageNotify: Result = 1 Address = FFFFD00022614000, offset = 0,
>DataSize =
>32768 ThreadID = 00000000000003B4 StackSize = 21960 STOFF = 26287800320
>IOPLENGTH = 32768
>
>PostReadWrite: setting up job offset = 26289635328, DataSize = 32768,
>ThreadID =
>0000000000000930, Stack Size = 556267992
>
>MessageNotify: Result = 1 Address = FFFFD00022653000, offset = 0,
>DataSize =
>32768 ThreadID = 00000000000003B4 StackSize = 544646600 STOFF =
>26287833088
>IOPLENGTH = 32768
>
>PostReadWrite: setting up job offset = 26289668096, DataSize = 32768,
>ThreadID =
>0000000000000324, Stack Size = 576137688
>

By itself, an increase in stack size isn’t proof of anything. But when that happens, do you see discontinuous stacks? Perhaps someone is using KeExpandKernelStackAndCallout (https://msdn.microsoft.com/en-us/library/windows/hardware/ff552030(v=vs.85).aspx)?

There are basically two ways that your RIP can go bad: you load something into a register and jump/call through it, or you pop it off the stack (via a ret/iret) after you’ve damaged the stack.

Since this was pointing at a worker thread, I suggested the former category (workItem->WorkRoutine == 0). Perhaps it is the latter, though I haven’t seen anything you’ve presented thus far that suggests this is the case.

If it is the latter, then it is likely a stack scribble of some sort. Verifier doesn’t usually help with that class of problem. However, if you can break into the debugger at the point where you see the stack size grow, as you suggested, you could set an access breakpoint on the stack location holding the return address ($ra contains the return address, so look for that value on the stack).
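In WinDbg terms that might look something like the sketch below (the ba address is a placeholder for whichever stack slot turns out to hold the return address, not a value from a real dump):

```
kd> r @$ra                        ; return address for the current frame
kd> dps @rsp L40                  ; walk the stack looking for that value
kd> ba w8 <address-of-that-slot>  ; break on any write to that stack slot
```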

Tony
OSR

I enabled Driver Verifier, and also made changes so that I can break into the debugger when the stack size suddenly increases. The following is the latest !analyze -v output.

FOLLOWUP_IP:
HPEDpHsmX64!HpArcHsmDeferredReadWriteCompletion+421 [f:\svn\storage_optimizer\storage_optimizer\trunk\filterdriver\src\hparchsm.c @ 3986]
fffff801`b89e87f1 4c8b5c2448 mov r11,qword ptr [rsp+48h]

FAILED_INSTRUCTION_ADDRESS:
+0
00000000`00000000 ?? ???

BUGCHECK_STR: AV

ANALYSIS_VERSION: 10.0.10240.9 amd64fre

LAST_CONTROL_TRANSFER: from ffffe00049931070 to 0000000000000000

IP_IN_FREE_BLOCK: 0

STACK_TEXT:
ffffd00020a3faf0 ffffe00049931070 : fffff802064f62f9 0000000000000000 ffffe000494c3a20 ffff2801983cbca6 : 0x0
ffffd00020a3faf8 fffff802064f62f9 : 0000000000000000 ffffe000494c3a20 ffff2801983cbca6 0000000000000000 : 0xffffe00049931070
ffffd00020a3fb00 fffff801b89e87f1 : ffffe0004a45f9b8 ffffe000494a4a20 ffffe00049931070 0000000000000000 : nt!VerifierExReleaseFastMutex+0x6d
ffffd00020a3fb30 fffff801b89eb425 : ffffe000499a9900 ffffe000498ae880 ffffe000498ae880 0000000001900010 : HPEDpHsmX64!HpArcHsmDeferredReadWriteCompletion+0x421 [f:\svn\storage_optimizer\storage_optimizer\trunk\filterdriver\src\hparchsm.c @ 3986]
ffffd00020a3fba0 fffff80205f40794 : fffff801b89f0300 0000000000000000 0000000000000000 0000000000000000 : HPEDpHsmX64!ThreadPoolCallBack+0x265 [f:\svn\storage_optimizer\storage_optimizer\trunk\filterdriver\src\threadpool.c @ 235]
ffffd00020a3fc00 fffff80205fcb5c6 : ffffd0005f96e180 ffffe000498ae880 ffffd0005f97a3c0 0000000000000001 : nt!PspSystemThreadStartup+0x58
ffffd00020a3fc60 0000000000000000 : ffffd00020a40000 ffffd00020a3a000 0000000000000000 00000000`00000000 : nt!KiStartSystemThread+0x16

FAULTING_SOURCE_LINE: f:\svn\storage_optimizer\storage_optimizer\trunk\filterdriver\src\hparchsm.c

FAULTING_SOURCE_FILE: f:\svn\storage_optimizer\storage_optimizer\trunk\filterdriver\src\hparchsm.c

FAULTING_SOURCE_LINE_NUMBER: 3986

FAULTING_SOURCE_CODE:
3982: if (!SkipIO) {
3983: if(pHpArcHsmContext)
3984: {
3985: ExAcquireFastMutex(&pHpArcHsmContext->StreamLock);

3986: pHpArcHsmContext->BitmapFileUseCount--;
3987: ExReleaseFastMutex(&pHpArcHsmContext->StreamLock);
3988: }
3989: }
3990:
3991: if (pHpArcHsmContext)

SYMBOL_STACK_INDEX: 3

SYMBOL_NAME: HPEDpHsmX64!HpArcHsmDeferredReadWriteCompletion+421

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: HPEDpHsmX64

IMAGE_NAME: HPEDpHsmX64.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 57f3bbcb

STACK_COMMAND: .cxr 0xffffd00020a3f0c0 ; kb

7: kd> !verifier

Verify Flags Level 0x00000830

STANDARD FLAGS:
(0x00000000) Automatic Checks
(0x00000001) Special pool
(0x00000002) Force IRQL checking
(0x00000008) Pool tracking
(0x00000010) I/O verification
(0x00000020) Deadlock detection
(0x00000080) DMA checking
(0x00000100) Security checks
(0x00000800) Miscellaneous checks
(0x00020000) DDI compliance checking

ADDITIONAL FLAGS:
(0x00000004) Randomized low resources simulation
(0x00000200) Force pending I/O requests
(0x00000400) IRP logging
(0x00002000) Invariant MDL checking for stack
(0x00004000) Invariant MDL checking for driver
(0x00008000) Power framework delay fuzzing
(0x00010000) Port/miniport interface checking
(0x00040000) Systematic low resources simulation
(0x00080000) DDI compliance checking (additional)
(0x00200000) NDIS/WIFI verification
(0x00800000) Kernel synchronization delay fuzzing
(0x01000000) VM switch verification
(0x02000000) Code integrity checks

Indicates flag is enabled

Summary of All Verifier Statistics

RaiseIrqls 0x0
AcquireSpinLocks 0x0
Synch Executions 0x0
Trims 0x0

Pool Allocations Attempted 0xe6f9a
Pool Allocations Succeeded 0xe6f9a
Pool Allocations Succeeded SpecialPool 0x0
Pool Allocations With NO TAG 0x0
Pool Allocations Failed 0x0

Current paged pool allocations 0x0 for 00000000 bytes
Peak paged pool allocations 0x0 for 00000000 bytes
Current nonpaged pool allocations 0x0 for 00000000 bytes
Peak nonpaged pool allocations 0x0 for 00000000 bytes

  1. The bugcheck still seems to be due to RIP being 0:
    LAST_CONTROL_TRANSFER: from ffffe00049931070 to 0000000000000000 - and it seems to happen after a call to ExReleaseFastMutex. However, the custom call to __debugbreak, where the check for a large stack size was made, has not been executed.
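For reference, the large-stack check presumably looks something like the sketch below (a user-mode stand-in; `stack_size_is_bogus` is a hypothetical name, and in the driver the nonzero case is where the __debugbreak would go). On x64 a kernel thread stack is normally 6 pages (24 KB), so any reported remaining size beyond that is garbage by definition:

```c
#include <stddef.h>

/* Normal x64 kernel-thread stack: 6 pages = 24 KB. */
#define KERNEL_STACK_BYTES (6u * 4096u)

/* Returns nonzero when a reported remaining-stack value is impossible.
 * In the driver, this is where __debugbreak() would be called. */
int stack_size_is_bogus(size_t remaining) {
    return remaining > KERNEL_STACK_BYTES;
}
```

With the values from the earlier log, 18904 and 21960 pass this check, while 556267992 trips it.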

However, I am unable to understand why a call to ExReleaseFastMutex would result in RIP being set to NULL.

I will post more details later.
Thanks.

I’d suspect that something else is triggering the problem, not the acquire fast mutex.

(1) Why don’t you use special pool?
(2) Are you sure that pHpArcHsmContext is not NULL?
(3) Curious: why do you use a fast mutex around that decrement? Reprogramming the TPR and taking out a KEVENT is heavy machinery for a simple decrement.

This stack looks quite different than before.

Tony
OSR

Hello,

  1. I have enabled special pool and will post the verifier output again.
  2. I believe that pHpArcHsmContext is not NULL, unless something drastically changes after the if(pHpArcHsmContext) condition.
  3. As for the decrement operation, the driver maintains a bitmap that tracks which sectors are cached locally versus which need to be fetched from remote storage. This variable is accessed in multiple places.

As for the different stack: I have seen the driver crash in multiple places, but the issue always seems to be that RIP is NULL.

>>
I’d suspect that something else is triggering the problem, not the acquire fast
mutex.
<<<

Is it possible that !analyze -v points to a different source location because of the corruption, while the problem lies somewhere else?

Thanks.

I’m not sure special pool will tell us more, but having that information present may help debug the issue. I suspect this (the mutex) is still a red herring, but I was just curious why you didn’t use interlocked operations rather than a mutex.
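For what it’s worth, the interlocked-style version of that decrement is a single atomic read-modify-write. Below is a minimal user-mode sketch using C11 atomics as a stand-in for the kernel’s InterlockedDecrement on a LONG (names here mirror the driver’s BitmapFileUseCount but are illustrative only):

```c
#include <stdatomic.h>

/* Stand-in for pHpArcHsmContext->BitmapFileUseCount. */
static atomic_long BitmapFileUseCount = 5;

/* Atomically decrements and returns the new value, mirroring the
 * semantics of InterlockedDecrement. This one operation replaces the
 * acquire-mutex / decrement / release-mutex sequence. */
long decrement_use_count(void) {
    return atomic_fetch_sub(&BitmapFileUseCount, 1) - 1;
}
```

Note this is only a drop-in replacement if the fast mutex protects nothing but the counter itself; if it also guards other state, the mutex still has a job to do.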

Absolutely. This is problematic when there is stack damage. Which reminds me: have you looked at the stack to see if the region below the RSP value has been zeroed? As I noted before, there are really only two ways to get 0 into RIP - either call/jump to it (usually through a NULL pointer) or pop it off the stack. If the stack values below the current RSP (so, “dps @rsp - 100” and then start walking forward) aren’t zeros, then it suggests the former. If they are, then it suggests the latter.

Tony
OSR

A quick update,

Enabling Driver Verifier with special pool made the operation run slowly, and the driver does not crash. I am re-analyzing the code for possible issues.
I will also use interlocked operations for the bitmap use-count decrement.
When it crashes again, I will look at RSP to deduce why RIP becomes NULL.

Thanks.
Mandar

Hello,

Sorry, I was on leave and hence could not provide an update.
After more testing and debugging, the following seem to be causing the problem.

  1. Some parts of the code did not have NULL checks, and I have added them.
  2. The return value of MmGetSystemAddressForMdlSafe was not being checked for NULL; a check has been added.
  3. FltGetStreamContext raises an exception in certain cases; for the moment I fail the I/O, but this needs more analysis, as I am not sure why an exception would be raised.
  4. The internal job queue was not being locked properly, and the relevant locking has been added.
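On item 2: MmGetSystemAddressForMdlSafe returns NULL when the system cannot map the MDL (for example, under resource pressure), so the result must be checked before use. The sketch below shows the shape of that check in user-mode C; `map_mdl_safe` and `touch_buffer` are hypothetical stand-ins, not the driver’s real routines:

```c
#include <stddef.h>

/* Hypothetical stand-in for MmGetSystemAddressForMdlSafe; in the
 * kernel the call can legitimately return NULL. */
void *map_mdl_safe(void *mdl) {
    return mdl;  /* pretend the MDL's buffer is the MDL itself */
}

/* Returns 0 on success, -1 when the mapping fails. A driver would
 * complete the I/O with STATUS_INSUFFICIENT_RESOURCES in the failure
 * case instead of dereferencing a NULL system address. */
int touch_buffer(void *mdl) {
    unsigned char *sysaddr = map_mdl_safe(mdl);
    if (sysaddr == NULL)
        return -1;       /* fail the I/O, don't crash */
    (void)sysaddr[0];    /* safe to access the mapping here */
    return 0;
}
```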

I am planning to do more testing and will post the results later.

Thanks,
Mandar

I am doing more testing and I again saw the random crash (although it is less frequent now).

This time the stack seems completely misleading: it shows that the exception occurs because of KeAcquireSpinLock. Again RIP is NULL, but the module is ntkrnlmp.exe rather than the driver; however, the issue should still be in the driver.

I checked the code and there does not seem to be anything obvious that would cause KeReleaseSpinLock to fail.

Please let me know if you have any pointers.
Thanks
Thanks

Windows 8 Kernel Version 9600 MP (8 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 9600.17031.amd64fre.winblue_gdr.140221-1952
Machine Name:
Kernel base = 0xfffff800ae476000 PsLoadedModuleList = 0xfffff800ae7402d0
Debug session time: Wed Oct 19 23:53:38.441 2016 (UTC - 4:00)
System Uptime: 4 days 20:48:26.185
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M (1000007e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but …
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: ffffffffc0000005, The exception code that was not handled
Arg2: 0000000000000000, The address that the exception occurred at
Arg3: ffffd001050068b8, Exception Record Address
Arg4: ffffd001050060c0, Context Record Address

Debugging Details:

TRIAGER: Could not open triage file : e:\dump_analysis\program\triage\modclass.ini, error 2

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at “0x%08lx” referenced memory at “0x%08lx”. The memory could not be “%s”.

FAULTING_IP:
+22
00000000`00000000 ?? ???

EXCEPTION_RECORD: ffffd001050068b8 -- (.exr 0xffffd001050068b8)
ExceptionAddress: 0000000000000000
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000008
Parameter[1]: 0000000000000000
Attempt to execute non-executable address 0000000000000000

CONTEXT: ffffd001050060c0 -- (.cxr 0xffffd001050060c0)
rax=0000000000000000 rbx=ffffe000161a8010 rcx=ffffe0001a0ac380
rdx=ffffe00015e9a7f0 rsi=ffffe000161ac040 rdi=ffffd00105006b78
rip=0000000000000000 rsp=ffffd00105006af0 rbp=0000000000000080
r8=0000000000000000 r9=ffffe0001a0ac388 r10=fffff8002c443120
r11=0000000000000000 r12=0000000000000000 r13=fffff800ae476000
r14=fffff8002ca29300 r15=fffff8002ca244d0
iopl=0 nv up ei ng nz na po nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010286
00000000`00000000 ?? ???
Resetting default scope

DEFAULT_BUCKET_ID: WIN8_DRIVER_FAULT_SERVER

PROCESS_NAME: System

CURRENT_IRQL: 0

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at “0x%08lx” referenced memory at “0x%08lx”. The memory could not be “%s”.

EXCEPTION_PARAMETER1: 0000000000000008

EXCEPTION_PARAMETER2: 0000000000000000

WRITE_ADDRESS: fffff800ae72dce0: Unable to get special pool info
fffff800ae72dce0: Unable to get special pool info
unable to get nt!MmNonPagedPoolStart
unable to get nt!MmSizeOfNonPagedPoolInBytes
0000000000000000

FOLLOWUP_IP:
nt!KxWaitForSpinLockAndAcquire+23
fffff800`ae545663 4885c0 test rax,rax

FAILED_INSTRUCTION_ADDRESS:
+23
00000000`00000000 ?? ???

BUGCHECK_STR: AV

LAST_CONTROL_TRANSFER: from ffffe00015e9a7f0 to 0000000000000000

STACK_TEXT:
ffffd00105006af0 ffffe00015e9a7f0 : ffffe0001a0ac380 ffffe000161ac040 ffffe0001a0ac388 0000000000000001 : 0x0
ffffd00105006af8 ffffe0001a0ac380 : ffffe000161ac040 ffffe0001a0ac388 0000000000000001 fffff800ae545663 : 0xffffe00015e9a7f0
ffffd00105006b00 ffffe000161ac040 : ffffe0001a0ac388 0000000000000001 fffff800ae545663 ffffd00105006b78 : 0xffffe0001a0ac380
ffffd00105006b08 ffffe0001a0ac388 : 0000000000000001 fffff800ae545663 ffffd00105006b78 fffff8002ca219fb : 0xffffe000161ac040
ffffd00105006b10 0000000000000001 : fffff800ae545663 ffffd00105006b78 fffff8002ca219fb ffffe0001a0ac380 : 0xffffe0001a0ac388
ffffd00105006b18 fffff800ae545663 : ffffd00105006b78 fffff8002ca219fb ffffe0001a0ac380 ffffe000161ac040 : 0x1
ffffd00105006b20 fffff800ae546472 : 0000000000000001 ffffe00019400c40 ffffe000161a8010 0000000000000000 : nt!KxWaitForSpinLockAndAcquire+0x23
ffffd00105006b50 0000000000000000 : ffffe00100000000 ffffe00019e81808 ffffe000161ac040 fffff8002ca24735 : nt!KeReleaseSpinLock+0x22

SYMBOL_STACK_INDEX: 6

SYMBOL_NAME: nt!KxWaitForSpinLockAndAcquire+23

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

IMAGE_NAME: ntkrnlmp.exe

DEBUG_FLR_IMAGE_TIMESTAMP: 53085af2

STACK_COMMAND: .cxr 0xffffd001050060c0 ; kb

FAILURE_BUCKET_ID: X64_AV_NULL_IP_nt!KxWaitForSpinLockAndAcquire+23

BUCKET_ID: X64_AV_NULL_IP_nt!KxWaitForSpinLockAndAcquire+23

Followup: MachineOwner

I think KeReleaseSpinLock is a red herring. I suspect this is some sort of stack corruption and it is the return address that’s loading the 0 into RIP.

While there are excellent tools for finding various kinds of corruption in the kernel, stack corruption can still occur. One thing you might want to do is look further in the stack (e.g., “dps”) and see if you can obtain some information from the previously-used portion of the stack (that is, addresses below the current RSP) as to what part of your code might have been executing.

Tony
OSR

Thanks Tony.

I will try this. Another query I have is: in what cases will FltGetStreamContext raise an exception (the input parameters are valid)?

FltGetStreamContext is not documented to raise an exception and I have never experienced it raising an exception. You need to diagnose this further.

-scott
OSR
@OSRDrivers

The following are my (rather strange) observations

  1. I observed that FltGetStreamContext was raising an exception, leading to bugcheck SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e).
    I need to root-cause this further; however, this behavior is not consistent.

  2. After catching the exception and marking the I/O as failed, the data copy operation works fine on a physical machine. However, I still see the driver raise a bugcheck when it is running on a virtual machine.

Tony, dps did not give much info on the source of the exception. I am doing more analysis.

Thanks.

Invalid pointer dereferences that fall in the user-addressable range
(including NULL) raise an exception. Catching the exception has just masked
the bug; you need to track it down.

-scott
OSR
@OSRDrivers
