Deadlock in worker threads

I see a deadlock on a 24 thread (two six-core) system. All “delayed” worker threads are stuck in nt!KeSetSystemGroupAffinityThread. Examples:

0: kd> !thread fffffa801eeed8a0
THREAD fffffa801eeed8a0 Cid 0004.0fb4 Teb: 0000000000000000 Win32Thread: 0000000000000000 ???
Not impersonating
DeviceMap fffff8a000008500
Owning Process fffffa80136f2040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 3518454 Ticks: 45279 (0:00:11:46.356)
Context Switch Count 691
UserTime 00:00:00.000
KernelTime 00:00:00.046
Win32 Start Address nt!ExpWorkerThread (0xfffff800054f2910)
Stack Init fffff88008d9cdb0 Current fffff88008d9bf90
Base fffff88008d9d000 Limit fffff88008d97000 Call 0
Priority 15 BasePriority 12 UnusualBoost 3 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff88008d9bfd0 fffff800054bd372 : fffff880026a4180 fffffa801eeed8a0 fffffa801eeedaa0 0000000000000000 : nt!KiSwapContext+0x7a
fffff88008d9c110 fffff800054bd9ac : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KeSetSystemGroupAffinityThread+0x18a
fffff88008d9c180 fffff800057a6bbc : fffffa803a5f7460 fffff88008d9c240 fffff880096fffd0 fffff8000566cbc0 : nt!PopExecuteOnTargetProcessors+0xdc
fffff88008d9c220 fffff800057f2419 : 0000000000000000 fffff88008d9c700 0000000000000018 fffff8a009dcc1c0 : nt!PpmCapturePerformanceDistribution+0x9c
fffff88008d9c290 fffff800057f2949 : fffff8a009dcc1c0 0000000000000000 0000000000000006 0000000000000000 : nt!ExpQuerySystemInformation+0x14d9
fffff88008d9c640 fffff800054e78d3 : fffff8a009de0000 fffff800054e687d 0000000000000041 fffff8a009dcc000 : nt!NtQuerySystemInformation+0x4d
fffff88008d9c680 fffff800054e3e70 : fffff8800198def4 0000000000020000 fffff88008d9c844 fffff8a009dcc198 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff88008d9c680) fffff88008d9c818 fffff8800198def4 : 0000000000020000 fffff88008d9c844 fffff8a009dcc198 fffffa8000000000 : nt!KiServiceLinkage fffff88008d9c820 fffff8800198e3ed : fffffa8018b5ce70 fffff88001c583b7 0000000020206f49 fffffa801eeed8a0 : cng!GatherRandomKey+0x294 fffff88008d9cbe0 fffff800057def4d : 0000000000000001 0000000000000001 fffffa801ef58900 fffffa801eeed8a0 : cng!scavengingWorkItemRoutine+0x3d fffff88008d9cc80 fffff800054f2a21 : fffff80005685658 fffff800057def01 fffffa801eeed800 0037003600310020 : nt!IopProcessWorkItem+0x3d fffff88008d9ccb0 fffff80005785cce : 0075006c005c0033 fffffa801eeed8a0 0000000000000080 fffffa80136f2040 : nt!ExpWorkerThread+0x111 fffff88008d9cd40 fffff800054d9fe6 : fffff88002715180 fffffa801eeed8a0 fffff880027204c0 0033002000300020 : nt!PspSystemThreadStartup+0x5a fffff88008d9cd80 0000000000000000 : fffff88008d9d000 0000000000000000 0000000000000000 00000000`00000000 : nt!KxStartSystemThread+0x16

0: kd> !thread fffffa801efe5a10
THREAD fffffa801efe5a10 Cid 0004.08ec Teb: 0000000000000000 Win32Thread: 0000000000000000 ???
Not impersonating
DeviceMap fffff8a000008500
Owning Process fffffa80136f2040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 3518454 Ticks: 45279 (0:00:11:46.356)
Context Switch Count 687
UserTime 00:00:00.000
KernelTime 00:00:00.046
Win32 Start Address nt!ExpWorkerThread (0xfffff800054f2910)
Stack Init fffff88009212db0 Current fffff88009211f90
Base fffff88009213000 Limit fffff8800920d000 Call 0
Priority 15 BasePriority 12 UnusualBoost 3 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff88009211fd0 fffff800054bd372 : fffff880026a4180 fffffa801efe5a10 fffffa801efe5c10 0000000000000000 : nt!KiSwapContext+0x7a
fffff88009212110 fffff800054bd9ac : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KeSetSystemGroupAffinityThread+0x18a
fffff88009212180 fffff800057a6bbc : fffffa803a5f7460 fffff88009212240 fffff8800971ffd0 fffff8000566cbc0 : nt!PopExecuteOnTargetProcessors+0xdc
fffff88009212220 fffff800057f2419 : 0000000000000000 fffff88009212700 0000000000000018 fffff8a00b2fc1c0 : nt!PpmCapturePerformanceDistribution+0x9c
fffff88009212290 fffff800057f2949 : fffff8a00b2fc1c0 0000000000000000 0000000000000006 0000000000000000 : nt!ExpQuerySystemInformation+0x14d9
fffff88009212640 fffff800054e78d3 : fffff8a00b310000 fffff800054e687d 0000000000000041 fffff8a00b2fc000 : nt!NtQuerySystemInformation+0x4d
fffff88009212680 fffff800054e3e70 : fffff8800198def4 0000000000020000 fffff88009212844 fffff8a00b2fc198 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff88009212680) fffff88009212818 fffff8800198def4 : 0000000000020000 fffff88009212844 fffff8a00b2fc198 fffffa8000000000 : nt!KiServiceLinkage fffff88009212820 fffff8800198e3ed : fffffa8018b5ce70 fffff88001c583b7 0000000020206f49 fffffa801efe5a10 : cng!GatherRandomKey+0x294 fffff88009212be0 fffff800057def4d : 0000000000000001 0000000000000001 fffffa803a048290 fffffa801efe5a10 : cng!scavengingWorkItemRoutine+0x3d fffff88009212c80 fffff800054f2a21 : fffff80005685658 fffff800057def01 fffffa801efe5a00 0000000000000000 : nt!IopProcessWorkItem+0x3d fffff88009212cb0 fffff80005785cce : 0000000000000000 fffffa801efe5a10 0000000000000080 fffffa80136f2040 : nt!ExpWorkerThread+0x111 fffff88009212d40 fffff800054d9fe6 : fffff880028e2180 fffffa801efe5a10 fffff880028ed4c0 0000000000000246 : nt!PspSystemThreadStartup+0x5a fffff88009212d80 0000000000000000 : fffff88009213000 0000000000000000 0000000000000000 00000000`00000000 : nt!KxStartSystemThread+0x16

0: kd> !thread fffffa80136d5b60
THREAD fffffa80136d5b60 Cid 0004.0d34 Teb: 0000000000000000 Win32Thread: 0000000000000000 ???
Not impersonating
DeviceMap fffff8a000008500
Owning Process fffffa80136f2040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 3518454 Ticks: 45279 (0:00:11:46.356)
Context Switch Count 701
UserTime 00:00:00.000
KernelTime 00:00:00.062
Win32 Start Address nt!ExpWorkerThread (0xfffff800054f2910)
Stack Init fffff880093bbdb0 Current fffff880093baf90
Base fffff880093bc000 Limit fffff880093b6000 Call 0
Priority 15 BasePriority 12 UnusualBoost 3 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff880093bafd0 fffff800054bd372 : fffff880026a4180 fffffa80136d5b60 fffffa80136d5d60 0000000000000000 : nt!KiSwapContext+0x7a
fffff880093bb110 fffff800054bd9ac : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KeSetSystemGroupAffinityThread+0x18a
fffff880093bb180 fffff800057a6bbc : fffffa803a5f7460 fffff880093bb240 fffff880096bffd0 fffff8000566cbc0 : nt!PopExecuteOnTargetProcessors+0xdc
fffff880093bb220 fffff800057f2419 : 0000000000000000 fffff880093bb700 0000000000000018 fffff8a009d8c1c0 : nt!PpmCapturePerformanceDistribution+0x9c
fffff880093bb290 fffff800057f2949 : fffff8a009d8c1c0 0000000000000000 0000000000000006 0000000000000000 : nt!ExpQuerySystemInformation+0x14d9
fffff880093bb640 fffff800054e78d3 : fffff8a009da0000 fffff800054e687d 0000000000000041 fffff8a009d8c000 : nt!NtQuerySystemInformation+0x4d
fffff880093bb680 fffff800054e3e70 : fffff8800198def4 0000000000020000 fffff880093bb844 fffff8a009d8c198 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880093bb680) fffff880093bb818 fffff8800198def4 : 0000000000020000 fffff880093bb844 fffff8a009d8c198 fffffa8000000000 : nt!KiServiceLinkage fffff880093bb820 fffff8800198e3ed : fffffa8018b5ce70 0000000000000000 0000000020206f49 fffffa80136d5b60 : cng!GatherRandomKey+0x294 fffff880093bbbe0 fffff800057def4d : 0000000000000001 0000000000000001 fffffa802aacb2d0 fffffa80136d5b60 : cng!scavengingWorkItemRoutine+0x3d fffff880093bbc80 fffff800054f2a21 : fffff80005685658 fffff800057def01 fffffa80136d5b00 0000000000000000 : nt!IopProcessWorkItem+0x3d fffff880093bbcb0 fffff80005785cce : 0000000000000000 fffffa80136d5b60 0000000000000080 fffffa80136f2040 : nt!ExpWorkerThread+0x111 fffff880093bbd40 fffff800054d9fe6 : fffff88002715180 fffffa80136d5b60 fffff880027204c0 0000000000000246 : nt!PspSystemThreadStartup+0x5a fffff880093bbd80 0000000000000000 : fffff880093bc000 0000000000000000 0000000000000000 00000000`00000000 : nt!KxStartSystemThread+0x16

0: kd> !thread fffffa801370b680
THREAD fffffa801370b680 Cid 0004.0044 Teb: 0000000000000000 Win32Thread: 0000000000000000 ???
Not impersonating
DeviceMap fffff8a000008500
Owning Process fffffa80136f2040 Image: System
Attached Process N/A Image: N/A
Wait Start TickCount 3518454 Ticks: 45279 (0:00:11:46.356)
Context Switch Count 25026
UserTime 00:00:00.000
KernelTime 00:00:04.399
Win32 Start Address nt!ExpWorkerThread (0xfffff800054f2910)
Stack Init fffff88002fd4db0 Current fffff88002fd3f90
Base fffff88002fd5000 Limit fffff88002fcf000 Call 0
Priority 15 BasePriority 12 UnusualBoost 3 ForegroundBoost 0 IoPriority 2 PagePriority 5
Child-SP RetAddr : Args to Child : Call Site
fffff88002fd3fd0 fffff800054bd372 : fffff880026a4180 fffffa801370b680 fffffa801370b880 0000000000000000 : nt!KiSwapContext+0x7a
fffff88002fd4110 fffff800054bd9ac : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KeSetSystemGroupAffinityThread+0x18a
fffff88002fd4180 fffff800057a6bbc : fffffa803a5f7460 fffff88002fd4240 fffff88000b44fd0 fffff8000566cbc0 : nt!PopExecuteOnTargetProcessors+0xdc
fffff88002fd4220 fffff800057f2419 : 0000000000000000 fffff88002fd4700 0000000000000018 fffff8a00258f1c0 : nt!PpmCapturePerformanceDistribution+0x9c
fffff88002fd4290 fffff800057f2949 : fffff8a00258f1c0 0000000000000000 0000000000000006 0000000000000000 : nt!ExpQuerySystemInformation+0x14d9
fffff88002fd4640 fffff800054e78d3 : fffff8a0025a0000 fffff800054e687d 0000000000000041 fffff8a00258f000 : nt!NtQuerySystemInformation+0x4d
fffff88002fd4680 fffff800054e3e70 : fffff8800198def4 0000000000020000 fffff88002fd4844 fffff8a00258f198 : nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff88002fd4680) fffff88002fd4818 fffff8800198def4 : 0000000000020000 fffff88002fd4844 fffff8a00258f198 fffffa8000000000 : nt!KiServiceLinkage fffff88002fd4820 fffff8800198e3ed : fffffa8018b5ce70 00000000000007ff 0000000020206f49 fffffa801370b680 : cng!GatherRandomKey+0x294 fffff88002fd4be0 fffff800057def4d : 0000000000000001 0000000000000001 fffffa801fa00160 fffffa801370b680 : cng!scavengingWorkItemRoutine+0x3d fffff88002fd4c80 fffff800054f2a21 : fffff80005685658 fffff800057def01 fffffa801370b600 fffff80005685658 : nt!IopProcessWorkItem+0x3d fffff88002fd4cb0 fffff80005785cce : 0000000000000000 fffffa801370b680 0000000000000080 fffffa80136f2040 : nt!ExpWorkerThread+0x111 fffff88002fd4d40 fffff800054d9fe6 : fffff88002871180 fffffa801370b680 fffff8800287c4c0 0000000000000000 : nt!PspSystemThreadStartup+0x5a fffff88002fd4d80 0000000000000000 : fffff88002fd5000 fffff88002fcf000 fffff88002fd49e0 00000000`00000000 : nt!KxStartSystemThread+0x16

Anybody knows what’s going on?

Also, there is no hard deadlock on any processor. There is one runaway thread (not holding any lock), but why would it block all other threads on a 24-thread box?

On Tue, Apr 12, 2011 at 2:22 PM, wrote:

> PopExecuteOnTargetProcessors

So you have one runaway thread and a routine that is attempting to execute
“on all (target) processors” - perhaps this is the cause of the problem?

A runaway thread is a private one, not one of Worker threads. Also, it runs on PASSIVE_LEVEL and default priority:

Priority 16 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

, so it can be preempted.

BUT, worker threads seem to be on:

Priority 15 BasePriority 12 UnusualBoost 3 ForegroundBoost 0 IoPriority 2 PagePriority 5

Could that be a problem?

Also, it doesn’t seem a "routine, that wants to execute an all processors. If it were so, that would make more sense to use DPC-type facilities.

As present, such behavior is a clear design flaw. A thread could be simply a default priority thread (or several such threads) running a multi-hours calculation in user mode, and that could cause all system worker threads to get stuck? And the system get unresponsive then? Because the worker threads are also used by HID drivers. WTF?

A big Windows deficiency is lack of runaway (or CPU-intensive) thread detection and their priority demotion. It’s been very big bitch of mine in single-CPU era, and I thought in this age it would not be so. I was mistaken.

>Also, it runs on PASSIVE_LEVEL and default priority:

Priority 16 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2
PagePriority 5

16 isn’t the default priority of threads in the O/S. Note there that the
BasePriority is 8, which is much more likely to be the default priority of
the thread (usually 7-9, though it’s O/S specific).

Could that be a problem?

Sure. Priority 16 is higher than priority 15.

Also, it doesn’t seem a "routine, that wants to execute an all processors.
If it were so, that would make more sense to use DPC-type >facilities.

Why? The code here is trying to capture some per-processor info and the way
it does that is to change the affinity of the thread in a loop capturing
along the way. Seems reasonable to me.

thread could be simply a default priority thread (or several such threads)
running a multi-hours calculation in user mode, and that could >cause all
system worker threads to get stuck? And the system get unresponsive then?
Because the worker threads are also used by >HID drivers. WTF?

16 is a real time thread priority and, as such, it can be (ab)used to bring
the entire system to its knees. This is why it requires that your token have
the SeIncreaseBasePriorityPrivilege enabled, which by default is only
available to administrators.

-scott


Scott Noone
Consulting Associate and Chief System Problem Analyst
OSR Open Systems Resources, Inc.
http://www.osronline.com