What's going on when the system get hang?

I have an sdio driver on winxp sp3 platform(when run it on server2008, never encounter the freezing ), when the driver run for a short time(sevral minute later), the system get hang. It seems that the thread of the idle process is alived and the IRQL of the processor is very high(0x1c), the scheduler of the system has no chance to runing. The thread is always lively. I got the following information:
kd> !pcr
KPCR for Processor 0 at ffdff000:
Major 1 Minor 1
NtTib.ExceptionList: 8054a324
NtTib.StackBase: 8054ad70
NtTib.StackLimit: 80547f80
NtTib.SubSystemTib: 00000000
NtTib.Version: 00000000
NtTib.UserPointer: 00000000
NtTib.SelfTib: 00000000

SelfPcr: ffdff000
Prcb: ffdff120
Irql: 0000001c
IRR: 00000004
IDR: ffff20e8
InterruptMode: 00000000
IDT: 8003f400
GDT: 8003f000
TSS: 80042000

CurrentThread: 805537c0
NextThread: fcfdcb30
IdleThread: 805537c0

DpcQueue: 0x821d8e0c 0xf77cb430 [Normal] sdbus!SdbusInterruptDpc
0x80553e20 0x80500e00 [Normal] nt!KiReadyThread

kd> !thread 805537c0
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 162323 (0:00:27:05.567)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054aa34 806d77cf badb0d00 00003003 f8b201d2 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aaa4 8054236b 01001111 0000003a 8054aab8 hal!HalBeginSystemInterrupt+0x83 (FPO: [3,0,0])
8054aae4 804f9e57 00000282 804f9ebf 820b9438 nt!RtlCopyMemoryNonTemporal+0x177
8054ab40 f80adf5a 00000000 81e00000 00000000 nt!KiGetVectorInfo+0x71 (FPO: [2,0,4])
8054ab64 80500d98 81e05030 81e05008 74dc3280 NDIS!ndisMTimerDpcX+0x7a (FPO: [4,0,4])
8054abf4 f8972a84 ffd00118 ffffffff e52cf4e0 nt!KiFindReadyThread+0x70 (FPO: [0,0,0])
8054ac80 80500eaf 80553a20 805537c0 ffdff000 kdcom!Dbg1394_EnablePhysicalAccess+0x9a (FPO: [1,0,4])
8054acac 80542b9d 80553e20 00000000 000095f9 nt!KiReadyThread+0x103 (FPO: [0,2,0])
8054acd0 80542b16 00000000 0000000e 00000000 nt!RtlpCaptureContext+0x9
00000000 00000000 00000000 00000000 00000000 nt!ExecuteHandler2+0x4a

kd> g
Break instruction exception - code 80000003 (first chance)


You are seeing this message because you pressed either
CTRL+C (if you run kd.exe) or,
CTRL+BREAK (if you run WinDBG),
on your debugger machine’s keyboard.

THIS IS NOT A BUG OR A SYSTEM CRASH

If you did not intend to break into the debugger, press the “g” key, then
press the “Enter” key now. This message might immediately reappear. If it
does, press “g” and “Enter” again.


nt!RtlSetBits+0x70:
80528bec cc int 3
kd> !thread 805537c0
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 162515 (0:00:27:07.490)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054a9e8 806d7658 badb0d00 00006042 f77e7110 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aa58 f879dffb 00006042 822020e0 8205f300 hal!READ_PORT_USHORT+0x8 (FPO: [1,0,0])
8054aa78 f7821e91 002029dc 81e73d84 8205f3b4 usbuhci!UhciInterruptService+0x2f (FPO: [1,3,0])
8054aa8c 805423c1 81e73b20 82202028 806d77cf USBPORT!USBPORT_InterruptService+0x3d (FPO: [2,0,0])
00000000 00000000 00000000 00000000 00000000 nt!RtlCopyMemoryNonTemporal+0x1cd

kd> !thread 805537c0
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 162515 (0:00:27:07.490)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054a9e8 806d7658 badb0d00 00006042 f77e7110 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aa58 f879dffb 00006042 822020e0 8205f300 hal!READ_PORT_USHORT+0x8 (FPO: [1,0,0])
8054aa78 f7821e91 002029dc 81e73d84 8205f3b4 usbuhci!UhciInterruptService+0x2f (FPO: [1,3,0])
8054aa8c 805423c1 81e73b20 82202028 806d77cf USBPORT!USBPORT_InterruptService+0x3d (FPO: [2,0,0])
00000000 00000000 00000000 00000000 00000000 nt!RtlCopyMemoryNonTemporal+0x1cd

kd> g
Break instruction exception - code 80000003 (first chance)


You are seeing this message because you pressed either
CTRL+C (if you run kd.exe) or,
CTRL+BREAK (if you run WinDBG),
on your debugger machine’s keyboard.

THIS IS NOT A BUG OR A SYSTEM CRASH

If you did not intend to break into the debugger, press the “g” key, then
press the “Enter” key now. This message might immediately reappear. If it
does, press “g” and “Enter” again.


nt!RtlSetBits+0x70:
80528bec cc int 3
kd> bl
0 e f59fbd40 0001 (0001) apwsdio!HwSdioInit

kd> !thread 805537c0
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 163666 (0:00:27:19.016)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054aa34 806d77cf badb0d00 00003003 f8b201d2 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aaa4 8054236b 01001111 0000003a 8054aab8 hal!HalBeginSystemInterrupt+0x83 (FPO: [3,0,0])
8054aae4 804f9e57 00000282 804f9ebf 820b9438 nt!RtlCopyMemoryNonTemporal+0x177
8054ab40 f80adf5a 00000000 81e00000 00000000 nt!KiGetVectorInfo+0x71 (FPO: [2,0,4])
8054ab64 80500d98 81e05030 81e05008 74dc3280 NDIS!ndisMTimerDpcX+0x7a (FPO: [4,0,4])
8054abf4 f8972a84 ffd00118 ffffffff e52cf4e0 nt!KiFindReadyThread+0x70 (FPO: [0,0,0])
8054ac80 80500eaf 80553a20 805537c0 ffdff000 kdcom!Dbg1394_EnablePhysicalAccess+0x9a (FPO: [1,0,4])
8054acac 80542b9d 80553e20 00000000 000095f9 nt!KiReadyThread+0x103 (FPO: [0,2,0])
8054acd0 80542b16 00000000 0000000e 00000000 nt!RtlpCaptureContext+0x9
00000000 00000000 00000000 00000000 00000000 nt!ExecuteHandler2+0x4a

kd> !thread 805537c0 30
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
kd> !thread 805537c0 14
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
kd> !thread 805537c0 ie
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 163666 (0:00:27:19.016)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054aa34 806d77cf badb0d00 00003003 f8b201d2 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aaa4 8054236b 01001111 0000003a 8054aab8 hal!HalBeginSystemInterrupt+0x83 (FPO: [3,0,0])
8054aae4 804f9e57 00000282 804f9ebf 820b9438 nt!RtlCopyMemoryNonTemporal+0x177
8054ab40 f80adf5a 00000000 81e00000 00000000 nt!KiGetVectorInfo+0x71 (FPO: [2,0,4])
8054ab64 80500d98 81e05030 81e05008 74dc3280 NDIS!ndisMTimerDpcX+0x7a (FPO: [4,0,4])
8054abf4 f8972a84 ffd00118 ffffffff e52cf4e0 nt!KiFindReadyThread+0x70 (FPO: [0,0,0])
8054ac80 80500eaf 80553a20 805537c0 ffdff000 kdcom!Dbg1394_EnablePhysicalAccess+0x9a (FPO: [1,0,4])
8054acac 80542b9d 80553e20 00000000 000095f9 nt!KiReadyThread+0x103 (FPO: [0,2,0])
8054acd0 80542b16 00000000 0000000e 00000000 nt!RtlpCaptureContext+0x9
00000000 00000000 00000000 00000000 00000000 nt!ExecuteHandler2+0x4a

kd> !pcr
KPCR for Processor 0 at ffdff000:
Major 1 Minor 1
NtTib.ExceptionList: 8054a324
NtTib.StackBase: 8054ad70
NtTib.StackLimit: 80547f80
NtTib.SubSystemTib: 00000000
NtTib.Version: 00000000
NtTib.UserPointer: 00000000
NtTib.SelfTib: 00000000

SelfPcr: ffdff000
Prcb: ffdff120
Irql: 0000001c
IRR: 00000004
IDR: ffff20e8
InterruptMode: 00000000
IDT: 8003f400
GDT: 8003f000
TSS: 80042000

CurrentThread: 805537c0
NextThread: fcfdcb30
IdleThread: 805537c0

DpcQueue: 0x821d8e0c 0xf77cb430 [Normal] sdbus!SdbusInterruptDpc
0x80553e20 0x80500e00 [Normal] nt!KiReadyThread

kd> g
Break instruction exception - code 80000003 (first chance)


You are seeing this message because you pressed either
CTRL+C (if you run kd.exe) or,
CTRL+BREAK (if you run WinDBG),
on your debugger machine’s keyboard.

THIS IS NOT A BUG OR A SYSTEM CRASH

If you did not intend to break into the debugger, press the “g” key, then
press the “Enter” key now. This message might immediately reappear. If it
does, press “g” and “Enter” again.


nt!RtlSetBits+0x70:
80528bec cc int 3
kd> bl
0 e f59fbd40 0001 (0001) apwsdio!HwSdioInit

kd> g
Break instruction exception - code 80000003 (first chance)


You are seeing this message because you pressed either
CTRL+C (if you run kd.exe) or,
CTRL+BREAK (if you run WinDBG),
on your debugger machine’s keyboard.

THIS IS NOT A BUG OR A SYSTEM CRASH

If you did not intend to break into the debugger, press the “g” key, then
press the “Enter” key now. This message might immediately reappear. If it
does, press “g” and “Enter” again.


nt!RtlSetBits+0x70:
80528bec cc int 3
kd> !thread 805537c0 ie
THREAD 805537c0 Cid 0000.0000 Teb: 00000000 Win32Thread: 00000000 RUNNING on processor 0
Not impersonating
Owning Process 0 Image:
Attached Process 80553a20 Image: Idle
Wait Start TickCount 28759 Ticks: 745853 (0:02:04:29.270)
Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051
Stack Init 8054af80 Current 8054accc Base 8054af80 Limit 80547f80 Call 0
Priority 16 BasePriority 0 PriorityDecrement 0 DecrementCount 0
ChildEBP RetAddr Args to Child
8054a9e8 806d7658 badb0d00 00006042 00000008 nt!RtlSetBits+0x70 (FPO: [3,0,0])
8054aa58 f879dffb 00006042 822020e0 8205f300 hal!READ_PORT_USHORT+0x8 (FPO: [1,0,0])
8054aa78 f7821e91 002029dc 81e73d84 8205f3b4 usbuhci!UhciInterruptService+0x2f (FPO: [1,3,0])
8054aa8c 805423c1 81e73b20 82202028 806d77cf USBPORT!USBPORT_InterruptService+0x3d (FPO: [2,0,0])
00000000 00000000 00000000 00000000 00000000 nt!RtlCopyMemoryNonTemporal+0x1cd

Any one can give some suggestion about the hang? The thread is not waiting for some event or lock, but it always not go to wait state to let the system scheduler to run, that’s really strange.

>when the driver run for a short time(sevral minute later), the system get hang.

Deadlock on some lock usually (but not always).

!process 0 7 helps a lot.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Hi, Maxim, thanks for your information. The idle process is the directly evil process that cause the hang(of course i believe it is not cause by it). Some very important information in the above is:

Context Switch Count 46103
UserTime 00:00:00.000
KernelTime 00:02:41.051

those are not changed, but the following always changed:

Ticks: 745853 (0:02:04:29.270)(Does this show that the thread is alive?)

The DPC queue never been removed when using !pcr command.

Judging from the fact that IRQL is above DPC level, it looks like your driver’s ISR relies upon spinlock, i.e does something ISRs should not do under Windows. Apparently, what happens here is that DPC acquires spinlock, interrupt occurs while it is still being held, and ISR tries to acquire it recursively. Another possibility is that non-ISR code uses interrupt spinlock as a “regular” one, i.e. uses it in calls that don’t raise IRQL to the
appropriate level - if interrupt occurs while lock is being held the system tries to acquire this spinlock recursively before calling ISR, effectively hanging infinitely…

Anton Bassov



> The idle process is the directly evil process that cause the hang…

It seems to be one of the funniest statements I’ve EVER seen in this NG…

Anton Bassov

I don’t think the IRQL is valid in w2k3 debug data. The idle thread
here appears to be constantly busy processing dpcs or interrupts, but
for usb or ndis.

An always-busy DPC or greater thread on a cpu can certainly hang the
system by, for example, blocking that thread indicated as ‘next
thread’ in the PCR data from ever running, even on an MP system where
the other processors are free to do useful work.

Mark Roddy

On Wed, Jul 29, 2009 at 7:23 AM, wrote:
> Judging from the fact that IRQL is above DPC level, it looks like your driver’s ISR relies upon spinlock, i.e does something ISRs should not do under Windows. Apparently, what happens here is that DPC acquires spinlock, interrupt occurs while it is still being held, and ISR tries to acquire it recursively. Another possibility is that non-ISR code uses interrupt spinlock as a “regular” one, i.e. uses it in calls that don’t raise IRQL to the
> appropriate level - if interrupt occurs while lock is being held the system tries to acquire this spinlock recursively before calling ISR, effectively hanging infinitely…
>
>
> Anton Bassov
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
>

Mark wrote:
I don’t think the IRQL is valid in w2k3 debug data. The idle thread here appears to be constantly busy processing dpcs or interrupts, but for usb or ndis.

What may cause to this if the IRQL here shows is not valid? It seems i have encountered serveral times for this IRQL(i forgot the detail when i encoutered it last time).

here is the IRQL:
Software IRQL:

PASSIVE_LEVEL 0 // Passive release level
LOW_LEVEL 0 // Lowest interrupt level
APC_LEVEL 1 // APC interrupt level
DISPATCH_LEVEL 2 // Dispatcher level

Hardware IRQL:

DIRQL: from 3 to 26 for device ISR

PROFILE_LEVEL 27 (0x1B) // timer used for profiling.
CLOCK1_LEVEL 28 (0x1C) // Interval clock 1 level - Not used on x86
CLOCK2_LEVEL 28 (0x1C) // Interval clock 2 level
SYNCH_LEVEL 28 (0x1C) // synchronization level
IPI_LEVEL 29 (0x1D) // Interprocessor interrupt level
POWER_LEVEL 30 (0x1E) // Power failure level
HIGH_LEVEL 31 (0x1F) // Highest interrupt level

How my driver can raise this kind of hang of the system? Can any body give more suggestion for the further test?

Thanks very much

> I don’t think the IRQL is valid in w2k3 debug data.

Please pay a special attention to HalBeginSystemInterrupt()'s presence in the call stack, which means that we are speaking about interrupt context here…

The idle thread here appears to be constantly busy processing dpcs or interrupts, but for usb or ndis.

Exactly - it is processing interrupts/dpcs for other drivers when interrupt occurs, and is unable to accomplish its interrupt/dpc processing because it deadlocks when the OP’s device interrupts.
Therefore, interrupts/dpc it was earlier processing are still in the call stack…

Please note that the culprit itself does not necessarily have to be in the call stack - it may well happen that
it had already successfully deadlocked another CPU, so that spinlock the target CPU is trying to acquire is held by another CPU without any chance of ever getting released. …

Anton Bassov

IIRC, !irql reports meaningful data for dump targets starting with Srv03, previously, it just gave you the bug check IRQL (interrupts off), which was not so useful.

  • S

-----Original Message-----
From: Mark Roddy
Sent: Wednesday, July 29, 2009 06:47
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] What’s going on when the system get hang?

I don’t think the IRQL is valid in w2k3 debug data. The idle thread
here appears to be constantly busy processing dpcs or interrupts, but
for usb or ndis.

An always-busy DPC or greater thread on a cpu can certainly hang the
system by, for example, blocking that thread indicated as ‘next
thread’ in the PCR data from ever running, even on an MP system where
the other processors are free to do useful work.

Mark Roddy

On Wed, Jul 29, 2009 at 7:23 AM, wrote:
> Judging from the fact that IRQL is above DPC level, it looks like your driver’s ISR relies upon spinlock, i.e does something ISRs should not do under Windows. Apparently, what happens here is that DPC acquires spinlock, interrupt occurs while it is still being held, and ISR tries to acquire it recursively. Another possibility is that non-ISR code uses interrupt spinlock as a “regular” one, i.e. uses it in calls that don’t raise IRQL to the
> appropriate level - if interrupt occurs while lock is being held the system tries to acquire this spinlock recursively before calling ISR, effectively hanging infinitely…
>
>
> Anton Bassov
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer
>


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

> IIRC, !irql reports meaningful data for dump targets starting with Srv03,

Yes, I vaguely recall that this command was not particularly useful under XP (at least SP2 - I don’t know if
SP3 is different)…

Anton Bassov

Hi, guys
The guilty driver is other third party’s driver. When i do not start the driver, then no hang occurs.
Sdio interrupt is process by system first, in the interrupt callback routine provide by me, i just Acknowledge the interrupt to clear the sdio interrupt status, no other step takes, so there should no more synchronization issue

There is a strange action from the system that the sdbus driver always call the sdbusinterrupt(this routine may queue a DPC for my driver’s callback routine) rountine althrough i don’t insert our sdio device and there is no other sdio device too. Can any one explain why this happend?