Kernel corruption or application error?

milan_durovic · September 9, 2024, 1:29am

I'm experiencing a Windows crash, but the kernel stack doesn't show any driver code. !analyze -v suggests that the kernel state may be corrupt, but I can't figure out anything more from the crash dump file. For full disclosure: we recently installed some Windows updates, and before that we added a kernel driver that reads core temperatures. This driver uses KeSetSystemAffinityThreadEx to switch to each CPU core in turn and execute an instruction that reads the core temperature; after the sampling is done, it reverts to the original affinity. I'm mentioning this in case it could be a factor, although the MS documentation does not suggest this could create headaches.

Does the crash report below ring a bell? I'm trying to figure out if the problem is related to this driver, or to the Windows kernel itself:

3: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

KERNEL_SECURITY_CHECK_FAILURE (139)
A kernel component has corrupted a critical data structure.  The corruption
could potentially allow a malicious user to gain control of this machine.
Arguments:
Arg1: 000000000000001e, Type of memory safety violation
Arg2: ffff93061436a660, Address of the trap frame for the exception that caused the bugcheck
Arg3: ffff93061436a5b8, Address of the exception record for the exception that caused the bugcheck
Arg4: 0000000000000000, Reserved

Debugging Details:
------------------

Failed to find runtime module (coreclr.dll or clr.dll or libcoreclr.so), 0x80004005
Extension commands need it in order to have something to do.
For more information see https://go.microsoft.com/fwlink/?linkid=2135652

KEY_VALUES_STRING: 1

    Key  : Analysis.CPU.Sec
    Value: 2

    Key  : Analysis.DebugAnalysisProvider.CPP
    Value: Create: 8007007e on L54928

    Key  : Analysis.DebugData
    Value: CreateObject

    Key  : Analysis.DebugModel
    Value: CreateObject

    Key  : Analysis.Elapsed.Sec
    Value: 54

    Key  : Analysis.Memory.CommitPeak.Mb
    Value: 163

    Key  : Analysis.System
    Value: CreateObject


BUGCHECK_CODE:  139

BUGCHECK_P1: 1e

BUGCHECK_P2: ffff93061436a660

BUGCHECK_P3: ffff93061436a5b8

BUGCHECK_P4: 0

TRAP_FRAME:  ffff93061436a660 -- (.trap 0xffff93061436a660)
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=ffff9785fb103640 rbx=0000000000000000 rcx=000000000000001e
rdx=0000000000000001 rsi=0000000000000000 rdi=0000000000000000
rip=fffff8077b2bce88 rsp=ffff93061436a7f0 rbp=ffff93061436a870
 r8=0000000000000001  r9=0000000000000000 r10=fffff8077b7762c0
r11=0000000000000000 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0         nv up ei ng nz na po cy
nt!KiDeferredReadyThread+0xe38:
fffff807`7b2bce88 cd29            int     29h
Resetting default scope

EXCEPTION_RECORD:  ffff93061436a5b8 -- (.exr 0xffff93061436a5b8)
ExceptionAddress: fffff8077b2bce88 (nt!KiDeferredReadyThread+0x0000000000000e38)
   ExceptionCode: c0000409 (Security check failure or stack buffer overrun)
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 000000000000001e
Subcode: 0x1e FAST_FAIL_INVALID_NEXT_THREAD 

BLACKBOXBSD: 1 (!blackboxbsd)


BLACKBOXPNP: 1 (!blackboxpnp)


PROCESS_NAME:  Bootstrap.exe

ERROR_CODE: (NTSTATUS) 0xc0000409 - The system detected an overrun of a stack-based buffer in this application. This overrun could potentially allow a malicious user to gain control of this application.

EXCEPTION_CODE_STR:  c0000409

EXCEPTION_PARAMETER1:  000000000000001e

EXCEPTION_STR:  0xc0000409

STACK_TEXT:  
ffff9306`1436a338 fffff807`7b3e7f69     : 00000000`00000139 00000000`0000001e ffff9306`1436a660 ffff9306`1436a5b8 : nt!KeBugCheckEx
ffff9306`1436a340 fffff807`7b3e8490     : 00000000`00000000 fffff807`7a317700 00000000`00000000 00000000`00000000 : nt!KiBugCheckDispatch+0x69
ffff9306`1436a480 fffff807`7b3e645d     : ffff9785`00009000 00000000`0000000f fffff807`7b7762c0 fffff807`7b7762c0 : nt!KiFastFailDispatch+0xd0
ffff9306`1436a660 fffff807`7b2bce88     : 00000000`00000001 ffff9785`fb1037b0 ffff9785`fb103640 ffff9785`f5f35980 : nt!KiRaiseSecurityCheckFailure+0x31d
ffff9306`1436a7f0 fffff807`7b2bbdb7     : ffffc000`def46180 fffff807`7b2bdd09 00000185`4acc43a0 fffff807`00000000 : nt!KiDeferredReadyThread+0xe38
ffff9306`1436a8b0 fffff807`7b2bb877     : ffff9785`f6f5e960 00000000`00000000 ffff9785`f6f5e960 00000000`00000000 : nt!KiExitDispatcher+0x187
ffff9306`1436a920 fffff807`7b81ba62     : ffff9306`00000001 ffff9306`1436aa80 ffff9306`1436aa00 fffff807`7bc8baf6 : nt!KeSetEvent+0xb7
ffff9306`1436a9b0 fffff807`7b3e76c5     : ffff9785`fb219080 00000185`46531e00 ffff9785`00000000 ffff9785`f6f5e960 : nt!NtSetEvent+0x92
ffff9306`1436aa00 00007ffd`87770194     : 00007ffd`8418c90d 0000004b`f617f238 00000000`00000001 00000000`00000000 : nt!KiSystemServiceCopyEnd+0x25
0000004b`fc9dec88 00007ffd`8418c90d     : 0000004b`f617f238 00000000`00000001 00000000`00000000 00000185`46531e00 : ntdll!NtSetEvent+0x14
0000004b`fc9dec90 00007ffd`6fc168e5     : 00007ffd`7007cea0 00007ffd`6fc16810 0000004b`fc9ded18 00007ffd`6fc16810 : KERNELBASE!SetEvent+0xd
0000004b`fc9decc0 00007ffd`1119ec89     : 00000000`00000001 00000000`00000000 00000185`46531eb0 00007ffd`6fcab340 : coreclr!ObjectNative::Pulse+0xd5 [D:\a\_work\1\s\src\coreclr\classlibnative\bcltype\objectnative.cpp @ 283] 
0000004b`fc9dee20 00000000`00000001     : 00000000`00000000 00000185`46531eb0 00007ffd`6fcab340 0000004b`fc9dee20 : 0x00007ffd`1119ec89
0000004b`fc9dee28 00000000`00000000     : 00000185`46531eb0 00007ffd`6fcab340 0000004b`fc9dee20 00000185`46531b58 : 0x1


SYMBOL_NAME:  nt!KiFastFailDispatch+d0

MODULE_NAME: nt

IMAGE_NAME:  ntkrnlmp.exe

STACK_COMMAND:  .thread ; .cxr ; kb

BUCKET_ID_FUNC_OFFSET:  d0

FAILURE_BUCKET_ID:  0x139_1e_INVALID_NEXT_THREAD_nt!KiFastFailDispatch

OS_VERSION:  10.0.17763.1

BUILDLAB_STR:  rs5_release

OSPLATFORM_TYPE:  x64

OSNAME:  Windows 10

FAILURE_ID_HASH:  {bef176cd-c482-4279-6644-552334c6dc54}

Followup:     MachineOwner

Any help is appreciated.

Regards,
Milan

Tim_Roberts · September 9, 2024, 2:01am

> I'm trying to figure out if the problem is related to this driver, or to the Windows kernel itself:

It's your driver.

KiDeferredReadyThread detected an invalid entry in its ready-to-run thread list and did a fastfail (int 29h).

Are you doing all of this thread hackery in the context of the main system process, or have you created your own thread to sacrifice? Are you 100% sure you are specifying a processor that exists?
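As a rough illustration of the "processor that exists" check, here is a minimal sketch in portable C++ rather than driver code. `Affinity` is a stand-in for KAFFINITY, and both helper names are hypothetical. In a real driver the `active` mask would presumably come from KeQueryActiveProcessors, and on systems with more than 64 logical processors a KAFFINITY only describes one processor group.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the kernel's KAFFINITY bitmask; 64 bits is assumed here.
using Affinity = std::uint64_t;

// Hypothetical helper: builds the mask for processor `index`.
// Returns 0 when the index does not fit, because shifting a 64-bit
// value by 64 or more is undefined behaviour in C++.
Affinity MaskForProcessor(unsigned index) {
    return index < 64 ? (Affinity{1} << index) : Affinity{0};
}

// Hypothetical helper: true only if `mask` selects exactly one
// processor and that processor is present in `active`.
bool IsSingleActiveProcessor(Affinity mask, Affinity active) {
    const bool exactly_one_bit = mask != 0 && (mask & (mask - 1)) == 0;
    return exactly_one_bit && (mask & active) == mask;
}
```

A sampling loop could skip any index for which `IsSingleActiveProcessor` returns false instead of passing the mask on to the affinity call.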

milan_durovic · September 9, 2024, 3:22am

Thanks Tim. I don't know the answer to your question, because this trickery is happening in a timer callback (WDFTIMER is periodic, 100ms period, so nothing challenging). The sampling code looks like this:

KAFFINITY CPUTemperatureTracker::ReadCurrentCoreTemperatures() {
    KAFFINITY original_affinity = 1;
    KAFFINITY overheated_cores_bitmask = 0;
    // To read the temperature of each CPU core, we must execute *readmsr* instruction on that core,
    // and to do that, we have to keep changing the processor affinity
    for (unsigned int i = 0; i < CPUCount_; i++) {
        KAFFINITY new_affinity = ((KAFFINITY)1) << i;
        if (i == 0)
            original_affinity = KeSetSystemAffinityThreadEx(new_affinity);
        else
            KeSetSystemAffinityThreadEx(new_affinity);
        __int64 result = __readmsr(IA32_THERM_STATUS);
        if (result & VALID_READING_MASK) {
            unsigned int temperature = (result & TEMPERATURE_MASK) >> 16;
            currentSample_[i] = temperature;
            valid_reads_++;
            if (temperature <= OVERHEATED_THRESHOLD)
                overheated_cores_bitmask = overheated_cores_bitmask | new_affinity;
        } else {
            currentSample_[i] = INVALID_TEMPERATURE;
            invalid_reads_++;
        }
    }
    KeRevertToUserAffinityThreadEx(original_affinity);
    return overheated_cores_bitmask;
}
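The constants used in the snippet above are not shown in the post. The following is a sketch of how the MSR value might be decoded, assuming the IA32_THERM_STATUS layout as I understand it from the Intel SDM (bit 31 = reading valid, bits 22:16 = digital readout). The constant and type names are made up for illustration. Under that layout the readout is the number of degrees below the throttling temperature (TjMax), so a smaller value means a hotter core, which would explain a `<=` comparison against the threshold.

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit layout; the poster's actual VALID_READING_MASK and
// TEMPERATURE_MASK values are not given in the thread.
constexpr std::uint64_t kReadingValidBit = 1ull << 31;  // "reading valid" flag
constexpr std::uint64_t kReadoutMask = 0x7Full << 16;   // digital readout, bits 22:16
constexpr unsigned kReadoutShift = 16;

struct ThermReading {
    bool valid;                    // false when the valid bit is clear
    unsigned degrees_below_tjmax;  // distance to TjMax, not an absolute temperature
};

// Decodes a raw 64-bit MSR value into a reading.
ThermReading DecodeThermStatus(std::uint64_t msr_value) {
    if ((msr_value & kReadingValidBit) == 0) {
        return {false, 0};
    }
    const auto readout =
        static_cast<unsigned>((msr_value & kReadoutMask) >> kReadoutShift);
    return {true, readout};
}
```

Keeping the raw value unsigned avoids sign surprises when masking; `__readmsr` itself returns an unsigned 64-bit value.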

Mark_Roddy · September 9, 2024, 8:18pm

I suspect your calls to this function are at DISPATCH_LEVEL as you are using a periodic timer. Note that there is no affinity switch until the thread is running at < DISPATCH_LEVEL. This is a bug, if my analysis is accurate, but it might not be the bug that is causing the failure.

milan_durovic · September 9, 2024, 9:33pm

Thanks Mark. That may be the case, but I would expect the OS to either:

- return some kind of error code indicating that the current IRQL is wrong, or
- crash right in that call instead of returning an error.

I wouldn't expect the OS to say nothing, silently corrupt some thread structures, and then crash later on in unrelated code. If that's what's happening, then it's an OS bug.

Tim_Roberts · September 9, 2024, 9:45pm

Kernel code is trusted code. There is not as much validation as you would find in a user-mode API. If it were me, I would spin off my own long-running thread instead of using a periodic timer, which borrows a system thread.

Your ACPI BIOS is already monitoring these temperatures, and there is code to initiate a shutdown if they exceed a given temperature. You can read them using WMI from user-mode, with no kernel programming required. See "How can we get a CPU temperature through WMI?" on Stack Overflow.

milan_durovic · September 9, 2024, 11:56pm

Thanks Tim. I was aware of the WMI method you suggested, but gave up on it because it seemed unreliable, as evidenced by the comments on that same Stack Overflow thread.

I thought that creating my own thread just for this purpose might be more resource-intensive than simply using the timer facility, which I guess takes a time slice from some other, existing thread, but only briefly and only every 100ms.

I'm not convinced by your argument that the kernel does less checking than a typical user-mode API. If anything, I'd expect more, because the stakes are much higher: a single application crashing versus the whole system going down. Plus, the call stack shows that the kernel does keep checking its own state. If my use of an OS API is wrong, the simplest thing would be to return an error code or crash the system right at that point, rather than accepting the call and then crashing much later.

Mark_Roddy · September 10, 2024, 6:53am

You’re allowed to call it at DISPATCH_LEVEL. The docs clearly indicate that. The CPU switch is deferred until the calling thread drops below DISPATCH_LEVEL.
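To make the consequence of that deferral concrete, here is a toy model in portable C++. It is not how the Windows scheduler is implemented, and every name in it is hypothetical. It only encodes the rule described in this thread: an affinity change requested at DISPATCH_LEVEL stays pending until the IRQL drops. Under that rule, a sampling loop running at DISPATCH_LEVEL would execute every read on the CPU it started on.

```cpp
#include <cassert>
#include <vector>

// Toy model only; real IRQLs and scheduler state are far richer.
enum class Irql { Passive, Apc, Dispatch };

struct ModelThread {
    Irql irql = Irql::Passive;
    unsigned current_cpu = 0;
    bool has_pending = false;
    unsigned pending_cpu = 0;

    // Immediate below DISPATCH_LEVEL, deferred at DISPATCH_LEVEL.
    void SetAffinityToCpu(unsigned cpu) {
        if (irql == Irql::Dispatch) {
            has_pending = true;
            pending_cpu = cpu;
        } else {
            current_cpu = cpu;
            has_pending = false;
        }
    }

    // A pending switch takes effect once the IRQL drops below DISPATCH_LEVEL.
    void LowerIrql(Irql new_irql) {
        irql = new_irql;
        if (irql != Irql::Dispatch && has_pending) {
            current_cpu = pending_cpu;
            has_pending = false;
        }
    }
};

// Mimics the sampling loop: records the CPU each "MSR read" would run on.
std::vector<unsigned> SampleAllCpus(ModelThread& thread, unsigned cpu_count) {
    std::vector<unsigned> executed_on;
    for (unsigned i = 0; i < cpu_count; ++i) {
        thread.SetAffinityToCpu(i);
        executed_on.push_back(thread.current_cpu);
    }
    return executed_on;
}
```

In this model a thread at `Irql::Dispatch` that starts on CPU 3 reports CPU 3 for every sample, while the same loop at `Irql::Passive` visits each CPU in turn. That matches the suggestion to do the sampling from a context that runs below DISPATCH_LEVEL, such as a dedicated thread.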