Bugcheck help

Steve_Prochniak · October 8, 2009, 12:20pm

In doing some live migration testing on a Win2k3 VM, I ran into this bugcheck after 20 or so migrates under load. Two things confuse me.

There is no obvious way that this code can be running at IRQL 0x1b, and the function that calls this one starts with a PAGED_CODE macro. Can the recorded IRQL be bogus in any situations?
We store and initialize some KEVENT objects in space mapped from a CmResourceTypeMemory. Something tells me that this could be a bad idea, but I technically don't know why.

40051 is the value of KEVENT->Header.Lock plus 0x50.

0: kd> dt -b KEVENT f8a9482c
dump_XenVbd!KEVENT
+0x000 Header : _DISPATCHER_HEADER
+0x000 Type : 0x1 ''
+0x001 Abandoned : 0 ''
+0x001 Absolute : 0 ''
+0x001 NpxIrql : 0 ''
+0x001 Signalling : 0 ''
+0x002 Size : 0x4 ''
+0x002 Hand : 0x4 ''
+0x003 Inserted : 0 ''
+0x003 DebugActive : 0 ''
+0x003 DpcActive : 0 ''
+0x000 Lock : 262145
+0x004 SignalState : 1
+0x008 WaitListHead : _LIST_ENTRY [0xf8a95834 - 0xf8a94834]
+0x000 Flink : 0xf8a95834
+0x004 Blink : 0xf8a94834

0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 00040051, memory referenced
Arg2: d000001b, IRQL
Arg3: 00000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation (only on chips which support this level of status)
Arg4: 808312b0, address which referenced memory

Debugging Details:

WRITE_ADDRESS: 00040051

CURRENT_IRQL: 1b

FAULTING_IP:
nt!KiUnwaitThread+8
808312b0 095650 or dword ptr [esi+50h],edx

DEFAULT_BUCKET_ID: DRIVER_FAULT

BUGCHECK_STR: 0xA

PROCESS_NAME: System

TRAP_FRAME: f88a2c8c -- (.trap 0xfffffffff88a2c8c)
ErrCode = 00000002
eax=f8a95834 ebx=00000001 ecx=00040001 edx=00000100 esi=00040001 edi=f8a94834
eip=808312b0 esp=f88a2d00 ebp=f88a2d04 iopl=0 nv up ei ng nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010282
nt!KiUnwaitThread+0x8:
808312b0 095650 or dword ptr [esi+50h],edx ds:0023:00040051=????????
Resetting default scope

LAST_CONTROL_TRANSFER: from 808312b0 to 8088c963

STACK_TEXT:
f88a2c8c 808312b0 badb0d00 00000100 00000000 nt!KiTrap0E+0x2a7
f88a2d04 8082818e 00000001 821a1020 8214c018 nt!KiUnwaitThread+0x8
f88a2d20 f8266173 00a9482c 00000001 00000000 nt!KeSetEvent+0x84
f88a2d38 f8266937 8215b608 15120f0c 00000001 xenpci!XenPciSuspendPDO+0x93 [c:\projects\ovmwinpvdriver.hg\xenpci\pnp_fdo.c @ 1306]
f88a2d6c 808ec1eb 8214c018 814bcd08 808ae5fc xenpci!XenPciSuspendWorker+0x137 [c:\projects\ovmwinpvdriver.hg\xenpci\pnp_fdo.c @ 1535]
f88a2d80 80880441 81d96408 00000000 821a1020 nt!IopProcessWorkItem+0x13
f88a2dac 80949b7c 81d96408 00000000 00000000 nt!ExpWorkerThread+0xeb
f88a2ddc 8088e062 80880356 00000001 00000000 nt!PspSystemThreadStartup+0x2e
00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16

Scott_Noone_OSR · October 8, 2009, 12:42pm

> 1) There is no obvious way that this code can be running at IRQL 0x1b,

and the function that calls this one starts with a >PAGED_CODE macro. Can
the recorded IRQL be bogus in any situations?

The IRQL is 1b because the dispatcher database lock is held at the time of
the crash.

40051 is the value of KEVENT->Header.Lock plus 0x50.

You’re actually inside nt!KiUnwaitThread, the:

or dword ptr [esi+50h],edx

is accessing the KTHREAD to unwait (the WaitStatus field, specifically):

0: kd> u nt!kiunwaitthread
nt!KiUnwaitThread:
80830e62 mov edi,edi
80830e64 push ebp
80830e65 mov ebp,esp
80830e67 push esi
80830e68 mov esi,ecx >>>ECX is the KTHREAD
80830e6a or dword ptr [esi+50h],edx >>> OR KTHREAD->WaitStatus with
second param

Moral of the story is that your KEVENT is trashed. More specifically, the
WaitListHead has a bogus KWAIT_BLOCK on it. I think if you:

dt nt!_kwait_block 0xf8a95834

You’ll see that the thread address in the wait block is 40001.

As for “why” it’s trashed that’s another question…

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

wrote in message news:xxxxx@ntdev…
> In doing some live migration testing on a Win2k3 VM, I ran into this
> bugcheck after 20 or so migrates under load. Two things confuse me.
>
> 1) There is no obvious way that this code can be running at IRQL 0x1b,
> and the function that calls this one starts with a PAGED_CODE macro. Can
> the recorded IRQL be bogus in any situations?
>
> 2) We store and initialize some KEVENT objects in space mapped from a
> CmResourceTypeMemory. Something tells me that this could be a bad idea,
> but I technically don’t know why.
>
> 40051 is the value of KEVENT->Header.Lock plus 0x50.
>
> 0: kd> dt -b KEVENT f8a9482c
> dump_XenVbd!KEVENT
> +0x000 Header : _DISPATCHER_HEADER
> +0x000 Type : 0x1 ‘’
> +0x001 Abandoned : 0 ‘’
> +0x001 Absolute : 0 ‘’
> +0x001 NpxIrql : 0 ‘’
> +0x001 Signalling : 0 ‘’
> +0x002 Size : 0x4 ‘’
> +0x002 Hand : 0x4 ‘’
> +0x003 Inserted : 0 ‘’
> +0x003 DebugActive : 0 ‘’
> +0x003 DpcActive : 0 ‘’
> +0x000 Lock : 262145
> +0x004 SignalState : 1
> +0x008 WaitListHead : _LIST_ENTRY [0xf8a95834 - 0xf8a94834]
> +0x000 Flink : 0xf8a95834
> +0x004 Blink : 0xf8a94834
>
> 0: kd> !analyze -v
> ***
> *
>
> * Bugcheck Analysis
>
> *
>
>
>
> IRQL_NOT_LESS_OR_EQUAL (a)
> An attempt was made to access a pageable (or completely invalid) address
> at an
> interrupt request level (IRQL) that is too high. This is usually
> caused by drivers using improper addresses.
> If a kernel debugger is available get the stack backtrace.
> Arguments:
> Arg1: 00040051, memory referenced
> Arg2: d000001b, IRQL
> Arg3: 00000001, bitfield :
> bit 0 : value 0 = read operation, 1 = write operation
> bit 3 : value 0 = not an execute operation, 1 = execute operation (only on
> chips which support this level of status)
> Arg4: 808312b0, address which referenced memory
>
> Debugging Details:
> ------------------
>
>
> WRITE_ADDRESS: 00040051
>
> CURRENT_IRQL: 1b
>
> FAULTING_IP:
> nt!KiUnwaitThread+8
> 808312b0 095650 or dword ptr [esi+50h],edx
>
> DEFAULT_BUCKET_ID: DRIVER_FAULT
>
> BUGCHECK_STR: 0xA
>
> PROCESS_NAME: System
>
> TRAP_FRAME: f88a2c8c – (.trap 0xfffffffff88a2c8c)
> ErrCode = 00000002
> eax=f8a95834 ebx=00000001 ecx=00040001 edx=00000100 esi=00040001
> edi=f8a94834
> eip=808312b0 esp=f88a2d00 ebp=f88a2d04 iopl=0 nv up ei ng nz na po
> nc
> cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
> efl=00010282
> nt!KiUnwaitThread+0x8:
> 808312b0 095650 or dword ptr [esi+50h],edx
> ds:0023:00040051=???
> Resetting default scope
>
> LAST_CONTROL_TRANSFER: from 808312b0 to 8088c963
>
> STACK_TEXT:
> f88a2c8c 808312b0 badb0d00 00000100 00000000 nt!KiTrap0E+0x2a7
> f88a2d04 8082818e 00000001 821a1020 8214c018 nt!KiUnwaitThread+0x8
> f88a2d20 f8266173 00a9482c 00000001 00000000 nt!KeSetEvent+0x84
> f88a2d38 f8266937 8215b608 15120f0c 00000001 xenpci!XenPciSuspendPDO+0x93
> [c:\projects\ovmwinpvdriver.hg\xenpci\pnp_fdo.c @ 1306]
> f88a2d6c 808ec1eb 8214c018 814bcd08 808ae5fc
> xenpci!XenPciSuspendWorker+0x137
> [c:\projects\ovmwinpvdriver.hg\xenpci\pnp_fdo.c @ 1535]
> f88a2d80 80880441 81d96408 00000000 821a1020 nt!IopProcessWorkItem+0x13
> f88a2dac 80949b7c 81d96408 00000000 00000000 nt!ExpWorkerThread+0xeb
> f88a2ddc 8088e062 80880356 00000001 00000000
> nt!PspSystemThreadStartup+0x2e
> 00000000 00000000 00000000 00000000 00000000 nt!KiThreadStartup+0x16
>
>
>

Steve_Prochniak · October 8, 2009, 12:58pm

You’re exactly right…

0: kd> dt nt!_kwait_block 0xf8a95834
+0x000 WaitListEntry : _LIST_ENTRY [0xf8a95834 - 0xf8a94834]
+0x008 Thread : 0x00040001 _KTHREAD
+0x00c Object : (null)
+0x010 NextWaitBlock : 0xf8a94844 _KWAIT_BLOCK
+0x014 WaitKey : 0x4844
+0x016 WaitType : 0xa9 ‘’
+0x017 SpareByte : 0xf8 ‘’

What does the “Lock” value of the KEVENT object represent? Is there any significance to this value of 0x00040001 that both the kthread value and the KEVENT->Lock share?

Scott_Noone_OSR · October 8, 2009, 2:31pm

> What does the “Lock” value of the KEVENT object represent? Is there any

significance to this value of 0x00040001 that both the >kthread value and
the KEVENT->Lock share?

The 040001 is because Lock is a union of the first LONG. Size and type are 4
and 1, respectively:

0: kd> dt -b KEVENT f8a9482c
dump_XenVbd!KEVENT
+0x000 Header : _DISPATCHER_HEADER
+0x000 Type : 0x1 ‘’
+0x001 Abandoned : 0 ‘’
+0x001 Absolute : 0 ‘’
+0x001 NpxIrql : 0 ‘’
+0x001 Signalling : 0 ‘’
+0x002 Size : 0x4 ‘’
+0x002 Hand : 0x4 ‘’
+0x003 Inserted : 0 ‘’
+0x003 DebugActive : 0 ‘’
+0x003 DpcActive : 0 ‘’
+0x000 Lock : 262145
+0x004 SignalState : 1

The Lock field itself is also used as a lock, but that doesn’t seem to be
coming into play here. I think it’s just a red herring or indicates that the
wait block buffer has been recycled already.

Typically the wait blocks used are the wait blocks that are part of the
thread object. Does !pool 0xf8a95834 show anything? Maybe you’ve somehow
managed to terminate a thread prematurely? This sort of thing can also
happen if you call KeWaitForSingleObject with Altertable set to TRUE or with
a wait type of user mode, your thread can wake up for reasons other than the
event it’s waiting on being signaled. We actually just had a bug like this
in one of our kits recently.

Or do you have any calls to KeWaitForMultipleObjects where you specify your
own wait block array?

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

wrote in message news:xxxxx@ntdev…
> You’re exactly right…
>
> 0: kd> dt nt!_kwait_block 0xf8a95834
> +0x000 WaitListEntry : _LIST_ENTRY [0xf8a95834 - 0xf8a94834]
> +0x008 Thread : 0x00040001 _KTHREAD
> +0x00c Object : (null)
> +0x010 NextWaitBlock : 0xf8a94844 _KWAIT_BLOCK
> +0x014 WaitKey : 0x4844
> +0x016 WaitType : 0xa9 ‘’
> +0x017 SpareByte : 0xf8 ‘’
>
> What does the “Lock” value of the KEVENT object represent? Is there any
> significance to this value of 0x00040001 that both the kthread value and
> the KEVENT->Lock share?
>

Steve_Prochniak · October 8, 2009, 3:06pm

0: kd> !pool 0xf8a95834
Pool page f8a95834 region is Unknown
f8a95000 is not a valid large pool allocation, checking large session pool…
bf7f4000: Unable to get contents of pool block

I wonder if this is because the memory resides on a PCI card and isn’t resident in the crash dump.

There are a lot of threads in play here and I’m thinking now that it makes more sense to revisit the architecture than it does to actually track down the specifics of this crash. I think you’re definitely on the right track with a thread that terminated prematurely - I expected to find another thread running on a different processor with a stack that would point to some sort of contention with this one, but there’s nothing obvious.

Thanks for the help - much appreciated!

Scott_Noone_OSR · October 8, 2009, 3:39pm

> I wonder if this is because the memory resides on a PCI card and isn’t

resident in the crash dump.

If it’s not a pool address !pool won’t show it, which is why you’re getting
the error. Odd that your wait blocks are on the PCI card, though I see that
the address of the wait block and event are pretty close. Do you call
KeWaitForMultipleObjects with a wait block array allocated from your PCI
memory?

-scott

–
Scott Noone
Consulting Associate
OSR Open Systems Resources, Inc.
http://www.osronline.com

wrote in message news:xxxxx@ntdev…
> 0: kd> !pool 0xf8a95834
> Pool page f8a95834 region is Unknown
> f8a95000 is not a valid large pool allocation, checking large session
> pool…
> bf7f4000: Unable to get contents of pool block
>
> I wonder if this is because the memory resides on a PCI card and isn’t
> resident in the crash dump.
>
> There are a lot of threads in play here and I’m thinking now that it makes
> more sense to revisit the architecture than it does to actually track down
> the specifics of this crash. I think you’re definitely on the right track
> with a thread that terminated prematurely - I expected to find another
> thread running on a different processor with a stack that would point to
> some sort of contention with this one, but there’s nothing obvious.
>
> Thanks for the help - much appreciated!
>

Mark_Roddy · October 8, 2009, 3:43pm

Why are your even objects in PCI memory? That is just wrong, even in a
VM environment. They need to be shared with another VM or a parent or
DOM 0 partition? Doesn’t seem like a viable design.

Mark Roddy

On Thu, Oct 8, 2009 at 3:06 PM, wrote:
> 0: kd> !pool 0xf8a95834
> Pool page f8a95834 region is Unknown
> f8a95000 is not a valid large pool allocation, checking large session pool…
> bf7f4000: Unable to get contents of pool block
>
> I wonder if this is because the memory resides on a PCI card and isn’t resident in the crash dump.
>
> There are a lot of threads in play here and I’m thinking now that it makes more sense to revisit the architecture than it does to actually track down the specifics of this crash. ?I think you’re definitely on the right track with a thread that terminated prematurely - I expected to find another thread running on a different processor with a stack that would point to some sort of contention with this one, but there’s nothing obvious.
>
> Thanks for the help - much appreciated!
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
>

Steve_Prochniak · October 8, 2009, 6:10pm

I’m not quite sure… It looks like a bunch of things are put there that are needed to be shared between the various devices - bus, disk and net, but I need to examine it a bit closer and ask around about the original intention. I don’t think it’d work as a cross-VM sharing mechanism, because I believe that each VM gets it’s own “Platform PCI” card.

Scott, no the events are used for syncronization between threads in different devices and they all are waiting for single objects around the time we’re trying to migrate to a different server.