Bug check by calling KeInsertQueueDpc in NMI callback

Christian_Kindahl · May 19, 2010, 6:02am

Hi,

I am having some problems with queuing a DPC from a NMI callback. The result is a KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e) bug check.

I am writing a driver that listens to NMIs generated by the processors performance counters. I register the NMI callback using KeRegisterNmiCallback. In the callback routine all I do is reset the performance counter in order to receive further interrupts (this involves writing to a MSR and unmaksing a bit in the APIC memory space), this works fine. At least that’s how it appears, I have encountered no problem and no bug checks.

However, when I try to queue a DPC the system bug checks:

KeInsertQueueDpc(&cpu_dpc_int[KeGetCurrentProcessorNumber()],NULL,NULL);

Each processor core has it’s own KDPC object that is allocated in non-pagable memory. I have confirmed that the entire structure is readable from the NMI callback routine. I have also tried calling the DPC handler directly which works. The DPC routine is empty except for a single call to DbgPrint.

It should be noted that this problem does not occur on 64-bit (AMD64) versions of Windows, I have confirmed that it works on both Vista and Windows 7. It should also be noted that if generating NMIs with a very long time period, for example once every second everything works on 32-bit versions as well (perhaps by luck?).

I have stripped my driver clean trying to make sure that nothing else interferes. The driver passes through the driver verifier without any problems (with and without the call to KeInsertQueueDpc).

I am probably misunderstanding something fundamental, hopefully some of you can shed some light on the issue.

Regards,
Christian

Below follows the output from WinDbg, the process name (lsass.exe in this case) seems to be random:

Loading Dump File [C:\Windows\MEMORY.DMP]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: C:\Users\developer\Desktop\nmi_test;C:\Symbols
Executable search path is:
Windows 7 Kernel Version 7600 MP (2 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7600.16385.x86fre.win7_rtm.090713-1255
Machine Name:
Kernel base = 0x8281b000 PsLoadedModuleList = 0x82963810
Debug session time: Wed May 19 08:31:52.295 2010 (GMT+2)
System Uptime: 0 days 16:55:00.355
Loading Kernel Symbols
…
…
…
Loading User Symbols
PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details
Loading unloaded module list
…
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but …
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 828d8441, The address that the exception occurred at
Arg3: 8293e7a0, Trap Frame
Arg4: 00000000

Debugging Details:

PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details
PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

FAULTING_IP:
nt!KiDispatchException+27a
828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h]

TRAP_FRAME: 8293e7a0 – (.trap 0xffffffff8293e7a0)
ESP EDITED! New esp=8293eb00
ErrCode = 00000000
eax=8293edfc ebx=8293ef20 ecx=8a3012c0 edx=00000000 esi=00000000 edi=8293eb00
eip=828d8441 esp=8293e814 ebp=8293ef04 iopl=0 nv up ei pl nz na pe nc
cs=0000 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000206
nt!KiDispatchException+0x27a:
828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h] ds:0023:8293ebc4=00000000
Resetting default scope

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0x8E

PROCESS_NAME: lsass.exe

CURRENT_IRQL: 0

EXCEPTION_RECORD: 8293ef20 – (.exr 0xffffffff8293ef20)
ExceptionAddress: 00000000
ExceptionCode: c000001d (Illegal instruction)
ExceptionFlags: 00000000
NumberParameters: 0

LAST_CONTROL_TRANSFER: from 828d8372 to 828f7d10

STACK_TEXT:
8293e294 828d8372 0000008e c0000005 828d8441 nt!KeBugCheckEx+0x1e
8293e6bc 8284db08 8293edfc 00000000 8293e7a0 nt!KiDispatchException+0x1ac
8293e770 82861fa7 8293edfc 8293e824 00000000 nt!KiRaiseException+0x185
8293e78c 8285e42a 8293edfc 8293e824 00000000 nt!NtRaiseException+0x33
8293e78c 828d8441 8293edfc 8293e824 00000000 nt!KiFastCallEntry+0x12a
8293ef04 8285f016 8293ef20 00000000 8293ef74 nt!KiDispatchException+0x27a
8293ef6c 8285efb2 00000000 00000000 00000023 nt!CommonDispatchException+0x4a
8293ef74 00000000 00000023 00000023 0102f6c0 nt!Kei386EoiHelper+0x17a

STACK_COMMAND: kb

FOLLOWUP_IP:
nt!KiDispatchException+27a
828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h]

SYMBOL_STACK_INDEX: 0

SYMBOL_NAME: nt!KiDispatchException+27a

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

IMAGE_NAME: ntkrpamp.exe

DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bc007

FAILURE_BUCKET_ID: 0x8E_VRF_nt!KiDispatchException+27a

BUCKET_ID: 0x8E_VRF_nt!KiDispatchException+27a

Followup: MachineOwner

Scott_Noone_OSR · May 19, 2010, 11:33am

I don’t know anything about those NMI callbacks or what’s legal to do from
them, though queuing a DPC seems reasonable enough. Buried in that output is
STATUS_ILLEGAL_INSTRUCTION, which certainly seems like it would be some sort
of corruption. Your stack is hosed though and WinDBG isn’t helping much, I
suspect that between the task gate and the exception it’s entirely confused.

Any chance of getting a copy of the kernel summary dump? I’d like to see if
I can get a more meaningful stack trace or more information about the state
of the machine in general.

Or maybe someone who knows will respond with, “you can’t queue a DPC from
those callbacks” and that will be that.

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

wrote in message news:xxxxx@ntdev…
> Hi,
>
> I am having some problems with queuing a DPC from a NMI callback. The
> result is a KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e) bug check.
>
> I am writing a driver that listens to NMIs generated by the processors
> performance counters. I register the NMI callback using
> KeRegisterNmiCallback. In the callback routine all I do is reset the
> performance counter in order to receive further interrupts (this involves
> writing to a MSR and unmaksing a bit in the APIC memory space), this works
> fine. At least that’s how it appears, I have encountered no problem and no
> bug checks.
>
> However, when I try to queue a DPC the system bug checks:
>
> KeInsertQueueDpc(&cpu_dpc_int[KeGetCurrentProcessorNumber()],NULL,NULL);
>
> Each processor core has it’s own KDPC object that is allocated in
> non-pagable memory. I have confirmed that the entire structure is readable
> from the NMI callback routine. I have also tried calling the DPC handler
> directly which works. The DPC routine is empty except for a single call to
> DbgPrint.
>
> It should be noted that this problem does not occur on 64-bit (AMD64)
> versions of Windows, I have confirmed that it works on both Vista and
> Windows 7. It should also be noted that if generating NMIs with a very
> long time period, for example once every second everything works on 32-bit
> versions as well (perhaps by luck?).
>
> I have stripped my driver clean trying to make sure that nothing else
> interferes. The driver passes through the driver verifier without any
> problems (with and without the call to KeInsertQueueDpc).
>
> I am probably misunderstanding something fundamental, hopefully some of
> you can shed some light on the issue.
>
> Regards,
> Christian
>
> Below follows the output from WinDbg, the process name (lsass.exe in this
> case) seems to be random:
>
> Microsoft (R) Windows Debugger Version 6.11.0001.404 X86
> Copyright (c) Microsoft Corporation. All rights reserved.
>
>
> Loading Dump File [C:\Windows\MEMORY.DMP]
> Kernel Summary Dump File: Only kernel address space is available
>
> Symbol search path is: C:\Users\developer\Desktop\nmi_test;C:\Symbols
> Executable search path is:
> Windows 7 Kernel Version 7600 MP (2 procs) Free x86 compatible
> Product: WinNt, suite: TerminalServer SingleUserTS
> Built by: 7600.16385.x86fre.win7_rtm.090713-1255
> Machine Name:
> Kernel base = 0x8281b000 PsLoadedModuleList = 0x82963810
> Debug session time: Wed May 19 08:31:52.295 2010 (GMT+2)
> System Uptime: 0 days 16:55:00.355
> Loading Kernel Symbols
> …
> …
> …
> Loading User Symbols
> PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details
> Loading unloaded module list
> …
> 0: kd> !analyze -v
> ***
> *
>
> * Bugcheck Analysis
>
> *
>
>
>
> KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
> This is a very common bugcheck. Usually the exception address pinpoints
> the driver/function that caused the problem. Always note this address
> as well as the link date of the driver/image that contains this address.
> Some common problems are exception code 0x80000003. This means a hard
> coded breakpoint or assertion was hit, but this system was booted
> /NODEBUG. This is not supposed to happen as developers should never have
> hardcoded breakpoints in retail code, but …
> If this happens, make sure a debugger gets connected, and the
> system is booted /DEBUG. This will let us see why this breakpoint is
> happening.
> Arguments:
> Arg1: c0000005, The exception code that was not handled
> Arg2: 828d8441, The address that the exception occurred at
> Arg3: 8293e7a0, Trap Frame
> Arg4: 00000000
>
> Debugging Details:
> ------------------
>
> PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details
> PEB is paged out (Peb.Ldr = 7ffdb00c). Type “.hh dbgerr001” for details
>
> EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx
> referenced memory at 0x%08lx. The memory could not be %s.
>
> FAULTING_IP:
> nt!KiDispatchException+27a
> 828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h]
>
> TRAP_FRAME: 8293e7a0 – (.trap 0xffffffff8293e7a0)
> ESP EDITED! New esp=8293eb00
> ErrCode = 00000000
> eax=8293edfc ebx=8293ef20 ecx=8a3012c0 edx=00000000 esi=00000000
> edi=8293eb00
> eip=828d8441 esp=8293e814 ebp=8293ef04 iopl=0 nv up ei pl nz na pe
> nc
> cs=0000 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
> efl=00000206
> nt!KiDispatchException+0x27a:
> 828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h]
> ds:0023:8293ebc4=00000000
> Resetting default scope
>
> DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT
>
> BUGCHECK_STR: 0x8E
>
> PROCESS_NAME: lsass.exe
>
> CURRENT_IRQL: 0
>
> EXCEPTION_RECORD: 8293ef20 – (.exr 0xffffffff8293ef20)
> ExceptionAddress: 00000000
> ExceptionCode: c000001d (Illegal instruction)
> ExceptionFlags: 00000000
> NumberParameters: 0
>
> LAST_CONTROL_TRANSFER: from 828d8372 to 828f7d10
>
> STACK_TEXT:
> 8293e294 828d8372 0000008e c0000005 828d8441 nt!KeBugCheckEx+0x1e
> 8293e6bc 8284db08 8293edfc 00000000 8293e7a0 nt!KiDispatchException+0x1ac
> 8293e770 82861fa7 8293edfc 8293e824 00000000 nt!KiRaiseException+0x185
> 8293e78c 8285e42a 8293edfc 8293e824 00000000 nt!NtRaiseException+0x33
> 8293e78c 828d8441 8293edfc 8293e824 00000000 nt!KiFastCallEntry+0x12a
> 8293ef04 8285f016 8293ef20 00000000 8293ef74 nt!KiDispatchException+0x27a
> 8293ef6c 8285efb2 00000000 00000000 00000023
> nt!CommonDispatchException+0x4a
> 8293ef74 00000000 00000023 00000023 0102f6c0 nt!Kei386EoiHelper+0x17a
>
>
> STACK_COMMAND: kb
>
> FOLLOWUP_IP:
> nt!KiDispatchException+27a
> 828d8441 8b87c4000000 mov eax,dword ptr [edi+0C4h]
>
> SYMBOL_STACK_INDEX: 0
>
> SYMBOL_NAME: nt!KiDispatchException+27a
>
> FOLLOWUP_NAME: MachineOwner
>
> MODULE_NAME: nt
>
> IMAGE_NAME: ntkrpamp.exe
>
> DEBUG_FLR_IMAGE_TIMESTAMP: 4a5bc007
>
> FAILURE_BUCKET_ID: 0x8E_VRF_nt!KiDispatchException+27a
>
> BUCKET_ID: 0x8E_VRF_nt!KiDispatchException+27a
>
> Followup: MachineOwner
> ---------
>
>

Christian_Kindahl · May 19, 2010, 12:41pm

Thanks for the reply! Yes, some kind of corruption sounds reasonable. That’s why I tried to strip my driver of all code not related to setting up the interrupt.

I have uploaded a copy of the kernel dump here:
http://www.nemalabs.com/tmp/memory_dmp.zip

Thanks!

Scott_Noone_OSR · May 19, 2010, 5:13pm

You have some seriously bad mojo/juju going on here and I’d say from the
looks of it queuing a DPC from this callback is a bad idea.

The faulting thread is fairly hosed:

ChildEBP RetAddr
82931294 828cb382 nt!KeBugCheckEx+0x1e
829316bc 82840b08 nt!KiDispatchException+0x1ac
82931770 82854fe7 nt!KiRaiseException+0x185
8293178c 8285144a nt!NtRaiseException+0x33
8293178c 828cb451 nt!KiFastCallEntry+0x12a
82931f04 82852036 nt!KiDispatchException+0x27a
82931f6c 82851fd2 nt!CommonDispatchException+0x4a
82931f74 00000000 nt!KiExceptionExit+0x17a

This is indeed a STATUS_ILLEGAL_INSTRUCTION exception being raised. The
illegal instruction seems to be in the user mode state though, so I can’t
get to the IP. However, you’ll note something interesting about the stack
pointer values here. They’re not within the bounds of the current thread:

Stack Init a538cfd0 Current a538cad0 Base a538d000 Limit a538a000 Call 0

And they’re not within the bounds of the DPC stack:

0: kd> dt nt!_kprcb 82937d20 DpcStack
+0x1908 DpcStack : 0x8078b000 Void

So where did this stack come from? The answer is in the task register, which
indicates that we’re still in the NMI task:

0: kd> r @tr
tr=00000058
0: kd> !idt -a

Dumping IDT:

00: 82852220 nt!KiTrap00
01: 828523b0 nt!KiTrap01
02: Task Selector = 0x0058 << NMI

Which means that we’re on the double fault stack:

0: kd> x nt!KiDoubleFaultStack
82932000 nt!KiDoubleFaultStack =

(Stacks grow down, so the double fault stack range is 8292f000-82932000)

This is where things get really freaky…If you check out the ready list
you’ll find an interesting thread that’s ready to run on this processor:

0: kd> !ready
Processor 0: Ready Threads at priority 4
THREAD a54f0938 Cid 0d30.0f10 Teb: 7ffaf000 Win32Thread: 00000000 ???
Processor 1: No threads in READY state

0: kd> !THREAD a54f0938
THREAD a54f0938 Cid 0d30.0f10 Teb: 7ffaf000 Win32Thread: 00000000 ???
IRP List:
a8aece28: (0006,01d8) Flags: 40020800 Mdl: 00000000
Not impersonating
DeviceMap 86207868
Owning Process a5438030 Image: svchost.exe
Attached Process N/A Image: N/A
Wait Start TickCount 15842 Ticks: 0
Context Switch Count 1850
UserTime 00:00:01.778
KernelTime 00:00:00.624
Win32 Start Address 0x777ed63e
Stack Init a23c6fd0 Current 82931eb4 Base a23c7000 Limit a23c4000 Call 0
Priority 4 BasePriority 4 UnusualBoost 0 ForegroundBoost 0 IoPriority 0
PagePriority 1
ChildEBP RetAddr Args to Child
82931ecc 82c23924 82939654 82937d00 82931dcc nt!KiDispatchInterrupt+0xe2
(FPO: [Uses EBP] [0,0,3])
82931f04 82852036 82931f20 00000000 82931f74
hal!HalpDispatchSoftwareInterrupt+0x5e (FPO: [Non-Fpo])
82931f6c 82851fd2 00000000 00000000 00000023 nt!CommonDispatchException+0x4a
(FPO: [0,20,0])
82931f74 00000000 00000023 00000023 005fe910 nt!KiExceptionExit+0x17a

Check out the stack addresses, this thread is also using the double fault
stack. So, we have two threads trying to use the same execution stack at the
same time. I’d say that this could lead to “unpredictable” results.

As to why this is happening, I suspect that the O/S is expecting the trip
into the NMI TSS and back to be synchronous. By queuing the DPC you’re
tinkering with the IRQL and that’s enough to throw this processor back into
gen pop before it gets back. Without being able to play with this more I
can’t confirm that though, so hopefully someone with a more intimate
understanding of these NMI callbacks can say for certain whether or not this
will work.

It also wouldn’t explain why this works for you on x64, though it’s possible
that it’s timing related and if you tried hard enough you could get it to
happen there as well.

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

wrote in message news:xxxxx@ntdev…
> Thanks for the reply! Yes, some kind of corruption sounds reasonable.
> That’s why I tried to strip my driver of all code not related to setting
> up the interrupt.
>
> I have uploaded a copy of the kernel dump here:
> http://www.nemalabs.com/tmp/memory_dmp.zip
>
> Thanks!
>

OSR_Community_User · May 19, 2010, 5:21pm

I’ve tried this before, which I was able to get to work (as I recall), but only in one VERY specific scenario on EXACTLY two machines (the only place that it was ever going to run). I was also doing so decidedly not friendly stuff beforehand that went outside the API, so I’m not sure how much this means in general, or if it helps at all.

I think that the system would have to be assuming that the whole operation is synchronous, because of the way that callback cancellation works.

Either way, damn impressive analysis, Snoone.

Good luck,

mm

OSR_Community_User · May 19, 2010, 5:38pm

I believe that NMI callbacks occur at IRQL == HIGH_LEVEL, which means you can’t call much of anything, including DbgPrint.

Good luck,

mm

Christian_Kindahl · May 20, 2010, 3:34am

My hat off to you Scott, your analysis is much appreciated.

MM: Yes, they occur at HIGH_LEVEL. The call to DbgPrint was in the DPC routine.

Thank you very much! I’ll try to achieve my goal without using the NMI callback.