Help needed to solve WIN2012R2 BSOD MI_CHECK_KERNEL_NOEXECUTE_FAULT

john-7 · March 25, 2020, 1:44pm

Hello,

I am seeing a strange Win2012R2 BSOD.
We have a WFP driver to inspect inbound/outbound traffic.
However, this intermittent problem is seen only on Win2012R2, and even after applying all the correct symbols I am not able to resolve the call stack that leads to this BSOD.
Please let me know if there are any WinDbg commands to troubleshoot this further.

Child-SP RetAddr Call Site

00 ffffd001b668e538 fffff80400d9dba0 hal!HalpAcpiPmRegisterReadPort+0x1b
01 ffffd001b668e540 fffff80400dbd213 hal!HalpAcpiPmRegisterRead+0x30
02 ffffd001b668e570 fffff804007c6e01 hal!HaliHaltSystem+0x53
03 ffffd001b668e5b0 fffff804007c6a4d nt!KiBugCheckDebugBreak+0x99
04 ffffd001b668e610 fffff804007513a4 nt!KeBugCheck2+0xc49
05 ffffd001b668ed10 fffff804006e796c nt!KeBugCheckEx+0x104
06 ffffd001b668ed50 fffff804007f558f nt!MI_CHECK_KERNEL_NOEXECUTE_FAULT+0x64
07 ffffd001b668ed90 fffff804006448c3 nt!MiRaisedIrqlFault+0x1c7
08 ffffd001b668edf0 fffff8040075e957 nt!MmAccessFault+0x103
09 ffffd001b668eef0 ffffe001a80038ca nt!KiPageFault+0x317
0a ffffd001b668f088 ffffe001a84e7e65 0xffffe001a80038ca
0b ffffd001b668f090 b3b74bdee4453415 0xffffe001a84e7e65
0c ffffd001b668f098 ffffd001b668f100 0xb3b74bdee4453415
0d ffffd001b668f0a0 ffffe001a84dd15c 0xffffd001b668f100
0e ffffd001b668f0a8 0000000000000001 0xffffe001a84dd15c
0f ffffd001b668f0b0 ffffe001a5864be0 0x1
10 ffffd001b668f0b8 0000000000000000 0xffffe001a5864be0

According to the trap frame, 0xffffe001`a80038ca is the faulting address, and the page it lives on has no execute permission.

2: kd> !pte 0xffffe001`a80038ca
                                           VA ffffe001a80038ca
PXE at FFFFF6FB7DBEDE00    PPE at FFFFF6FB7DBC0030    PDE at FFFFF6FB78006A00    PTE at FFFFF6F000D40018
contains 000000000054D863  contains 000000000054C863  contains 000000000D7E3863  contains 80000001339F1963
pfn 54d       ---DA--KWEV  pfn 54c       ---DA--KWEV  pfn d7e3      ---DA--KWEV  pfn 1339f1    -G-DA--KW-V
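
To make the flags in the !pte output easier to read, here is a minimal Python sketch that decodes the raw entry values by hand. It assumes the architectural x64 page-table bit layout (valid in bit 0, write in bit 1, global in bit 8, page frame number in bits 12..51, no-execute in bit 63); `decode_pte` and the constant names are hypothetical helpers, not part of any debugger API.

```python
# Architectural x64 page-table-entry bits. The names are hypothetical helpers.
PTE_VALID = 1 << 0
PTE_WRITE = 1 << 1
PTE_USER = 1 << 2
PTE_ACCESSED = 1 << 5
PTE_DIRTY = 1 << 6
PTE_GLOBAL = 1 << 8
PTE_NX = 1 << 63
PFN_MASK = 0x000F_FFFF_FFFF_F000  # physical frame number, bits 12..51


def decode_pte(value: int) -> dict:
    """Split a raw 64-bit page-table entry into its PFN and main flags."""
    return {
        "pfn": (value & PFN_MASK) >> 12,
        "valid": bool(value & PTE_VALID),
        "writable": bool(value & PTE_WRITE),
        "user": bool(value & PTE_USER),
        "accessed": bool(value & PTE_ACCESSED),
        "dirty": bool(value & PTE_DIRTY),
        "global": bool(value & PTE_GLOBAL),
        "executable": not (value & PTE_NX),  # NX set means no execute
    }


# Raw values copied from the "contains" row of the !pte output.
pde = decode_pte(0x000000000D7E3863)
pte = decode_pte(0x80000001339F1963)

print(hex(pde["pfn"]), "executable:", pde["executable"])
print(hex(pte["pfn"]), "executable:", pte["executable"])
```

Only the last-level PTE has bit 63 set, which corresponds to the missing E in its flag column: the page is valid, writable kernel data, not code.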

The address 0xffffe001a80038ca does not fall within the range of any loaded module. Also, when the addresses 0xffffe001a84e7e65 and 0xb3b74bdee4453415 are unassembled, the output looks strange, and I am not sure why instructions at 0xffffe001`a80038ca are being executed.

I don’t have access to private symbols.

2: kd> .trap ffffd001b668eef0
NOTE: The trap frame does not contain all registers.
Some register values may be zeroed or incorrect.
rax=0000000080040031 rbx=0000000000000000 rcx=fffff6fb7dbedf80
rdx=ffffd001b668f450 rsi=0000000000000000 rdi=0000000000000000
rip=ffffe001a80038ca rsp=ffffd001b668f088 rbp=ffffd001b668f100
 r8=0000000000000000  r9=0000000000000000 r10=7010008004002001
r11=0000000080050031 r12=0000000000000000 r13=0000000000000000
r14=0000000000000000 r15=0000000000000000
iopl=0 nv up ei pl nz na pe nc
ffffe001a80038ca 0000 add byte ptr [rax],al ds:0000000080040031=??
2: kd> u 0xffffe001a80038ca
ffffe001a80038ca 0000 add byte ptr [rax],al
ffffe001a80038cc 0000 add byte ptr [rax],al
ffffe001a80038ce 0000 add byte ptr [rax],al
ffffe001a80038d0 0000 add byte ptr [rax],al
ffffe001a80038d2 0000 add byte ptr [rax],al
ffffe001a80038d4 0000 add byte ptr [rax],al
ffffe001a80038d6 0000 add byte ptr [rax],al
ffffe001a80038d8 0000 add byte ptr [rax],al
2: kd> u 0xffffe001a80038c8
ffffe001a80038c8 0000 add byte ptr [rax],al
ffffe001a80038ca 0000 add byte ptr [rax],al
ffffe001a80038cc 0000 add byte ptr [rax],al
ffffe001a80038ce 0000 add byte ptr [rax],al
ffffe001a80038d0 0000 add byte ptr [rax],al
ffffe001a80038d2 0000 add byte ptr [rax],al
ffffe001a80038d4 0000 add byte ptr [rax],al
ffffe001a80038d6 0000 add byte ptr [rax],al
2: kd> u 0xffffe001a84e7e63
ffffe001a84e7e63 b1ff mov cl,0FFh
ffffe001a84e7e65 85c0 test eax,eax
ffffe001a84e7e67 740b je ffffe001a84e7e74
ffffe001a84e7e69 488bd3 mov rdx,rbx
ffffe001a84e7e6c 498bcd mov rcx,r13
ffffe001a84e7e6f e8f2bdb1ff call ffffe001a8003c66
ffffe001a84e7e74 0f20e1 mov rcx,cr4
ffffe001a84e7e77 48f7c180000200 test rcx,20080h

Any suggestions on what the issue could be and how it can be resolved would be appreciated.

Thanks.

rod_widdowson · March 25, 2020, 1:52pm

0xffffe001 a80038c looks suspiciously like a quadword read at a longword boundary.

Compare with your stack addresses b668e538 fffff804

Is there anything interesting at a80038c 0xffffe001?
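
For readers unfamiliar with the pattern being described, the following Python sketch shows what a quadword read at a longword (4-byte) offset produces: the upper half of one 8-byte value fused with the lower half of the next. Treating the Child-SP and RetAddr values of frame 00 as two adjacent little-endian stack slots is an assumption made purely for illustration, and `read_qword` is a hypothetical helper.

```python
import struct


def read_qword(buf: bytes, offset: int) -> int:
    """Read a little-endian 64-bit value starting at the given byte offset."""
    return struct.unpack_from("<Q", buf, offset)[0]


# Two adjacent 8-byte slots, filled with values taken from frame 00 of the
# stack in this thread (illustrative layout only).
slots = struct.pack("<QQ", 0xFFFFD001B668E538, 0xFFFFF80400D9DBA0)

aligned = read_qword(slots, 0)  # correct read on an 8-byte boundary
skewed = read_qword(slots, 4)   # read that starts 4 bytes too late

print(hex(aligned))
print(hex(skewed))
```

The skewed value has the high dword of the first slot (ffffd001) as its low half and the low dword of the second slot (00d9dba0) as its high half, which is why a pointer-looking value made of two recognizable halves can hint at this kind of bug.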

john-7 · March 25, 2020, 7:23pm

I rechecked, and the address 0xffffe001 a80038ca is aligned on an 8-byte boundary when it should be aligned on a 16-byte boundary.
The unassembly of this address shows the following.

2: kd> u ffffe001a80038ca
ffffe001a80038ca 0000 add byte ptr [rax],al
ffffe001a80038cc 0000 add byte ptr [rax],al
ffffe001a80038ce 0000 add byte ptr [rax],al
ffffe001a80038d0 0000 add byte ptr [rax],al
ffffe001a80038d2 0000 add byte ptr [rax],al
ffffe001a80038d4 0000 add byte ptr [rax],al
ffffe001a80038d6 0000 add byte ptr [rax],al
ffffe001a80038d8 0000 add byte ptr [rax],al

However, I am not sure how this address came to be executed; the previous function in the stack does not show an explicit call to ffffe001`a80038ca.
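
On the alignment question raised in this post, a small Python sketch can make the arithmetic explicit. It reports the largest power-of-two boundary a value sits on by isolating its lowest set bit; `alignment` is a hypothetical helper, and the two inputs are the rip and rsp values from the trap frame earlier in the thread.

```python
def alignment(addr: int) -> int:
    """Largest power-of-two boundary that addr sits on (its lowest set bit)."""
    return addr & -addr


rip = 0xFFFFE001A80038CA  # faulting instruction pointer from the trap frame
rsp = 0xFFFFD001B668F088  # stack pointer from the same trap frame

print("rip sits on a", alignment(rip), "byte boundary")
print("rsp sits on a", alignment(rsp), "byte boundary")
```

By this arithmetic the rip value ends in ...ca and is only 2-byte aligned, while the rsp value ends in ...088 and is 8-byte but not 16-byte aligned, so the 8-versus-16 observation may fit the stack pointer better than the faulting address.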

Scott_Noone_OSR · March 26, 2020, 12:03am

My guess is that you overran the stack and corrupted the return address. That’s clearly not code (note that the opcodes are all “00”).

rod_widdowson · March 26, 2020, 2:31pm

Verifier?

john-7 · March 26, 2020, 7:13pm

Hello rod, I will recheck whether Verifier is ON, since this failed in the testing environment.
Scott, I will recheck, but I am not sure how my driver would be corrupting the stack. When I dumped all kernel stacks, my driver was not active at the time of the crash; I will still recheck everything. I am still wondering why, despite having all the symbols, the starting address does not map to any module.

Scott_Noone_OSR · March 26, 2020, 7:58pm

The scenario would be that the stack gets corrupted and then unwound, so you wouldn’t need to see your driver on the stack to be at fault.

Can you put the dump someplace? I can take a look and let you know if I find anything.

john-7 · March 27, 2020, 9:01pm

Thanks Scott, I will need to get approval before I upload the dump. Let me see what I can do. Is there anything else I can try in the meantime to find the possible source of the error?

Scott_Noone_OSR · March 30, 2020, 2:32pm

As for suggestions, I’ll quote myself

Your instruction pointer is trashed, probably a stack overflow of some kind. You can try dumping the pre-bugcheck stack save area and see if you can reconstruct the surrounding call stack:

dps nt!KiPreBugcheckStackSaveArea L6000/8

Other than that, start adding tracing to your code and figure out the last functions called before the crash.
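
Since WinDbg evaluates numbers as hex by default, L6000/8 in the command above asks for 0x6000/8 = 0xc00 pointer-sized entries, which is a lot of output to read by eye. As a rough aid, the Python sketch below keeps only the dps lines that resolved to a symbol, which are the likely return addresses when reconstructing a stack. The `symbolized` helper and the sample text are hypothetical; the sample merely reuses values from this thread in dps layout and is not real saved-stack output.

```python
def symbolized(dps_text: str):
    """Yield (slot, value, symbol) for dps lines that resolved to a symbol."""
    for line in dps_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # plain data: address and value only, no symbol
        slot, value, symbol = parts
        try:
            yield (int(slot.replace("`", ""), 16),
                   int(value.replace("`", ""), 16),
                   symbol.strip())
        except ValueError:
            continue  # not a dps data line (prompt, banner, unreadable memory)


# Hypothetical sample in dps layout, built from values seen in this thread.
sample = """\
ffffd001`b668ed48  fffff804`007f558f nt!MiRaisedIrqlFault+0x1c7
ffffd001`b668ed50  00000000`00000000
ffffd001`b668ed58  ffffe001`a80038ca
"""

entries_requested = 0x6000 // 8  # what L6000/8 evaluates to
hits = list(symbolized(sample))
print(hex(entries_requested), hits)
```

Entries with no symbol are not necessarily uninteresting (the trashed rip in this thread has none), so this is only a first-pass filter.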

john-7 · April 9, 2020, 7:30am

Thanks Scott, sorry for a late reply. I am not allowed to upload the dump.
The dps nt!KiPreBugcheckStackSaveArea L6000/8 command shows the stack containing only zeros.
I did not know this command, is there some web article which talks about such commands?

What I figured out is that the thread which crashes is not a thread owned by our driver; it is some other system thread, part of ntkrnlmp. Also, we have another driver which tries to touch the IDT on a CPU. However, this thread gets switched out due to a sleep. When it comes back it may not be on the same CPU, but it still holds the pointer to the old IDT. My current guess is that this could be the issue, as this crash is very intermittent.
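
The hypothesis above can be pictured with a toy model in Python. This is not kernel code and says nothing about how Windows lays out its interrupt tables; `Cpu` and `patch_after_sleep` are hypothetical stand-ins that only show why a per-CPU pointer captured before a sleep becomes stale if the thread resumes on a different processor.

```python
class Cpu:
    """Toy stand-in for a processor that owns its own interrupt table."""

    def __init__(self, number: int, vectors: int = 4):
        self.number = number
        self.idt = [f"cpu{number}-default-{v}" for v in range(vectors)]


def patch_after_sleep(cpus, start_cpu, resume_cpu, vector, handler):
    """Capture a table pointer, 'sleep', then write through that pointer."""
    table = cpus[start_cpu].idt  # pointer captured before the wait
    # ... the thread sleeps here; the scheduler may resume it elsewhere ...
    current = cpus[resume_cpu]
    table[vector] = handler      # the write goes through the old pointer
    return current.idt[vector] == handler  # was the current CPU patched?


cpus = [Cpu(0), Cpu(1)]
same_cpu = patch_after_sleep(cpus, 0, 0, 1, "hook-a")  # no migration
migrated = patch_after_sleep(cpus, 0, 1, 2, "hook-b")  # resumed on CPU 1

print(same_cpu, migrated)
print(cpus[0].idt[2], cpus[1].idt[2])
```

In the migrated case the write lands in CPU 0's table while the thread is running on CPU 1, so the outcome depends on where the scheduler resumes the thread, which would be consistent with an intermittent failure.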

Peter_Viscarola_OSR · April 9, 2020, 7:06pm

I did not know this command, is there some web article which talks about such commands?

Docs here.

Peter

john-7 · June 1, 2020, 10:32am

Thanks Peter.

Scott, sorry for the late reply, but we were testing the fix in multiple scenarios.

After multiple runs, the conclusion is that our driver touches the IDT and then calls KeDelayExecutionThread() to wait for some action.
After the delay it comes back and tries to modify the IDT again at DISPATCH_LEVEL.
However, in between these actions, it seems that some other thread reads the IDT and gets a corrupted value.
Doing all of our work at DISPATCH_LEVEL, without lowering the IRQL until the expected action is complete, solved the issue.
The crash has not been seen so far, at least; it is still being tested.

Thanks.