Deadlock occuring in the SRV when receiving a packet...

Hi All,

I am wondering if anyone has come across a similar problem to the one I am
currently debugging. It involves a 3rd party filesystem (ours) that has an
exported SMB share, of a directory. We are doing a lot of small io to the
file system via the share, at an arbitrary time (while running a test
which creates 1500 files and writes the filename into each of the files)
the machine (SMB server) deadlocks.

The OS is Windows 2000 SP3…

The offending stack looks like:

ChildEBP RetAddr Args to Child
f081fa98 80468f0e 00000001 05000002 00000030
nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
f081fa98 8046838b 00000001 05000002 00000030 nt!KeUpdateSystemTime+0x13e
(FPO: [0,2] TrapFrame @ f081faac)
f081fb20 804006da fcac24e8 f081fb48 00000002 nt!KiTrap0E+0x98 (FPO: [0,0]
TrapFrame @ f081fb20)
f081fb98 f7feb8ca fcb78da0 fcc33b30 00000000
nt!ExpInterlockedPopEntrySListFault (FPO: [0,2,0])
f081fbbc f7febaf3 fcb28d08 fcb318e8 fcc37608
srv!SrvFsdGetReceiveWorkItem+0x1b (FPO: [Non-Fpo])
f081fbd8 f86017a5 fcb28e50 fcc33b30 00000e20
srv!SrvFsdTdiReceiveHandler+0xdb (FPO: [Non-Fpo])
f081fc18 f8601592 fcc40030 00000000 00000e20 netbt!RcvHandlrNotOs+0xf4
(FPO: [Non-Fpo])
f081fc5c f8629ae2 fcc40030 fcc37608 00000e20 netbt!TdiReceiveHandler+0x425
(FPO: [Non-Fpo])
f081fca8 f86292a3 fcc37608 00001850 f081fdc4 tcpip!IndicateData+0x1f2
(FPO: [Non-Fpo])
f081fd10 f8626436 00000000 b83ca8c0 863ca8c0 tcpip!TCPRcv+0xad1 (FPO:
[Non-Fpo])
f081fd5c f8626b27 f8628ef2 fcc57c28 ffad402e tcpip!DeliverToUser+0xf9
(FPO: [Non-Fpo])
f081fe14 f862730b fcc57c28 ffad4042 00000060 tcpip!IPRcvPacket+0x58e (FPO:
[Non-Fpo])
f081fe54 f862737d 00000000 fccd7e48 ffad4020
tcpip!ARPRcvIndicationNew+0x172 (FPO: [Non-Fpo])
f081fe90 fc84b403 fcc57268 00000000 fccdb008 tcpip!ARPRcvPacket+0x5c (FPO:
[Non-Fpo])
f081fee8 f0465fbe fcd56100 f081ff48 00000001
NDIS!ethFilterDprIndicateReceivePacket+0x2ea (FPO: [Non-Fpo])
f081ffa8 f046256d 00cdb008 804021f0 fcd56130
el90xbc5!UpCompleteNdis40PlusEvent+0x25e (FPO: [Non-Fpo])
f081ffc4 fc8353d9 fccdb008 f08331f4 f0833124 el90xbc5!NICInterrupt+0x83
(FPO: [EBP 0x804021f0] [1,1,4])
f081ffe0 804645d4 fccdb1f4 fccdb1e0 00000000 NDIS!ndisMDpc+0xc8 (FPO:
[Non-Fpo])
f081fff4 804026fa f0833098 00000000 00000000 nt!KiRetireDpcList+0x30 (FPO:
[0,0,0])

I broke into the machine and this is the same stack everytime… There are
other worker threads in the system that are waiting to do io (they are
doing io to our filesystem), and these are all write behind threads (the
number of write behind thread varies between 2-3).

I have done a bit of disassembly and come up with the trigger for the
deadlock (the following is the ExInterlockedPopEntrySList):

nt!ExInterlockedPopEntrySList:
804006c4 53 push ebx
804006c5 55 push ebp
804006c6 8be9 mov ebp,ecx
nt!ExpInterlockedPopEntrySListResume:
804006c8 8b5504 mov edx,[ebp+0x4]
804006cb 8b4500 mov eax,[ebp]
804006ce 0bc0 or eax,eax
804006d0 7410 jz nt!ExpInterlockedPopEntrySListFault+0x8
(804006e2)
804006d2 8bca mov ecx,edx
804006d4 81c1ffff0000 add ecx,0xffff
nt!ExpInterlockedPopEntrySListFault:
804006da 8b18 mov ebx,[eax]
804006dc 0fc74d00 cmpxchg8b qword ptr [ebp]
804006e0 75ec jnz nt!ExpInterlockedPopEntrySListResume+0x6
(804006ce)
804006e2 5d pop ebp
804006e3 5b pop ebx
804006e4 c3 ret

It is looping around 804006c8 -> 804006ce and faulting at 804006da. As eax
has a dud value in it:
kd> r
eax=05000000 ebx=fcb78ddc ecx=001cffff edx=001c0000 esi=fcb78da0
edi=804006c4
eip=804006da esp=f081fb94 ebp=fcb78de8 iopl=0 nv up ei pl nz na po
nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
efl=00000206

Why eax was clobbered I am still to determine why, I believe eax is the
work item. The threads I am responsible for seem to be just fine, I cannot
fault them in that I don’t beleive they where responsible for clobbering
memory.

Another issue is memory on 2 of the 3 pools is very tight, but the standby
list is just fine:

kd> !memusage
loading PFN database
loading (99% complete)
Zeroed: 0 ( 0 kb)
Free: 1 ( 4 kb)
Standby: 6964 ( 27856 kb)
Modified: 584 ( 2336 kb)
ModifiedNoWrite: 0 ( 0 kb)
Active/Valid: 24880 ( 99520 kb)
Transition: 0 ( 0 kb)
Unknown: 0 ( 0 kb)
TOTAL: 32429 (129716 kb)
Building kernel map
Finished building kernel map

Another other “thing”, is this does not occur on a dual processor system,
only uni proc systems…

It implies my locking heirachy in the write behind worker threads is
busted
but I have not been able to fault it (i.e. via code inspection and
testing). This also does not occur if we run this same test directly to
the
file system, so what is the SVR doing???

Is this a known issue? As I have not been able to find it on the MS msdn
webpage, the osr ntdev and ntfsd lists, google (you name it, I’ve been
there).

Thanks,

Ian Costello

Should have also added what the queues look like:

kd> !exqueue
Dumping ExWorkerQueue: 8046D3E0

**** Critical WorkQueue( current = 1 maximum = 1 )
WARNING: active threads = maximum active threads in the queue. No new
workitems schedulable in this queue until they finish or block.
THREAD fcda2820 Cid 8.c Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda25a0 Cid 8.10 Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda2320 Cid 8.14 Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda1020 Cid 8.18 Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda1da0 Cid 8.1c Teb: 00000000 Win32Thread: 00000000 RUNNING

**** Delayed WorkQueue( current = 0 maximum = 1 )
THREAD fcda1b20 Cid 8.20 Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda18a0 Cid 8.24 Teb: 00000000 Win32Thread: 00000000 WAIT
THREAD fcda1620 Cid 8.28 Teb: 00000000 Win32Thread: 00000000 WAIT

**** HyperCritical WorkQueue( current = 0 maximum = 1 )
THREAD fcda13a0 Cid 8.2c Teb: 00000000 Win32Thread: 00000000 WAIT

2 of the Critical Worker threads are write behind operations (I recently changed
them from delayed to critical).

Thanks,


Ian Costello (xxxxx@adacel.com.au)

Adacel Technologies Ltd
250 Bay Street, Brighton, Victoria
Post Code 3168, Australia

Phone: +61 3 8534 5535
Fax: +61 3 9596 2960
Web: www.adacel.com.au

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com]On Behalf Of
xxxxx@adacel.com.au
Sent: Tuesday, May 27, 2003 5:05 PM
To: File Systems Developers
Subject: [ntfsd] Deadlock occuring in the SRV when receiving
a packet…

Hi All,

I am wondering if anyone has come across a similar problem to
the one I am
currently debugging. It involves a 3rd party filesystem
(ours) that has an
exported SMB share, of a directory. We are doing a lot of
small io to the
file system via the share, at an arbitrary time (while running a test
which creates 1500 files and writes the filename into each of
the files)
the machine (SMB server) deadlocks.

The OS is Windows 2000 SP3…

The offending stack looks like:

ChildEBP RetAddr Args to Child
f081fa98 80468f0e 00000001 05000002 00000030
nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
f081fa98 8046838b 00000001 05000002 00000030
nt!KeUpdateSystemTime+0x13e
(FPO: [0,2] TrapFrame @ f081faac)
f081fb20 804006da fcac24e8 f081fb48 00000002 nt!KiTrap0E+0x98
(FPO: [0,0]
TrapFrame @ f081fb20)
f081fb98 f7feb8ca fcb78da0 fcc33b30 00000000
nt!ExpInterlockedPopEntrySListFault (FPO: [0,2,0])
f081fbbc f7febaf3 fcb28d08 fcb318e8 fcc37608
srv!SrvFsdGetReceiveWorkItem+0x1b (FPO: [Non-Fpo])
f081fbd8 f86017a5 fcb28e50 fcc33b30 00000e20
srv!SrvFsdTdiReceiveHandler+0xdb (FPO: [Non-Fpo])
f081fc18 f8601592 fcc40030 00000000 00000e20 netbt!RcvHandlrNotOs+0xf4
(FPO: [Non-Fpo])
f081fc5c f8629ae2 fcc40030 fcc37608 00000e20
netbt!TdiReceiveHandler+0x425
(FPO: [Non-Fpo])
f081fca8 f86292a3 fcc37608 00001850 f081fdc4 tcpip!IndicateData+0x1f2
(FPO: [Non-Fpo])
f081fd10 f8626436 00000000 b83ca8c0 863ca8c0 tcpip!TCPRcv+0xad1 (FPO:
[Non-Fpo])
f081fd5c f8626b27 f8628ef2 fcc57c28 ffad402e tcpip!DeliverToUser+0xf9
(FPO: [Non-Fpo])
f081fe14 f862730b fcc57c28 ffad4042 00000060
tcpip!IPRcvPacket+0x58e (FPO:
[Non-Fpo])
f081fe54 f862737d 00000000 fccd7e48 ffad4020
tcpip!ARPRcvIndicationNew+0x172 (FPO: [Non-Fpo])
f081fe90 fc84b403 fcc57268 00000000 fccdb008
tcpip!ARPRcvPacket+0x5c (FPO:
[Non-Fpo])
f081fee8 f0465fbe fcd56100 f081ff48 00000001
NDIS!ethFilterDprIndicateReceivePacket+0x2ea (FPO: [Non-Fpo])
f081ffa8 f046256d 00cdb008 804021f0 fcd56130
el90xbc5!UpCompleteNdis40PlusEvent+0x25e (FPO: [Non-Fpo])
f081ffc4 fc8353d9 fccdb008 f08331f4 f0833124
el90xbc5!NICInterrupt+0x83
(FPO: [EBP 0x804021f0] [1,1,4])
f081ffe0 804645d4 fccdb1f4 fccdb1e0 00000000 NDIS!ndisMDpc+0xc8 (FPO:
[Non-Fpo])
f081fff4 804026fa f0833098 00000000 00000000
nt!KiRetireDpcList+0x30 (FPO:
[0,0,0])

I broke into the machine and this is the same stack
everytime… There are
other worker threads in the system that are waiting to do io
(they are
doing io to our filesystem), and these are all write behind
threads (the
number of write behind thread varies between 2-3).

I have done a bit of disassembly and come up with the trigger for the
deadlock (the following is the ExInterlockedPopEntrySList):

nt!ExInterlockedPopEntrySList:
804006c4 53 push ebx
804006c5 55 push ebp
804006c6 8be9 mov ebp,ecx
nt!ExpInterlockedPopEntrySListResume:
804006c8 8b5504 mov edx,[ebp+0x4]
804006cb 8b4500 mov eax,[ebp]
804006ce 0bc0 or eax,eax
804006d0 7410 jz nt!ExpInterlockedPopEntrySListFault+0x8
(804006e2)
804006d2 8bca mov ecx,edx
804006d4 81c1ffff0000 add ecx,0xffff
nt!ExpInterlockedPopEntrySListFault:
804006da 8b18 mov ebx,[eax]
804006dc 0fc74d00 cmpxchg8b qword ptr [ebp]
804006e0 75ec jnz nt!ExpInterlockedPopEntrySListResume+0x6
(804006ce)
804006e2 5d pop ebp
804006e3 5b pop ebx
804006e4 c3 ret

It is looping around 804006c8 -> 804006ce and faulting at
804006da. As eax
has a dud value in it:
kd> r
eax=05000000 ebx=fcb78ddc ecx=001cffff edx=001c0000 esi=fcb78da0
edi=804006c4
eip=804006da esp=f081fb94 ebp=fcb78de8 iopl=0 nv up
ei pl nz na po
nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
efl=00000206

Why eax was clobbered I am still to determine why, I believe
eax is the
work item. The threads I am responsible for seem to be just
fine, I cannot
fault them in that I don’t beleive they where responsible for
clobbering
memory.

Another issue is memory on 2 of the 3 pools is very tight,
but the standby
list is just fine:

kd> !memusage
loading PFN database
loading (99% complete)
Zeroed: 0 ( 0 kb)
Free: 1 ( 4 kb)
Standby: 6964 ( 27856 kb)
Modified: 584 ( 2336 kb)
ModifiedNoWrite: 0 ( 0 kb)
Active/Valid: 24880 ( 99520 kb)
Transition: 0 ( 0 kb)
Unknown: 0 ( 0 kb)
TOTAL: 32429 (129716 kb)
Building kernel map
Finished building kernel map

Another other “thing”, is this does not occur on a dual
processor system,
only uni proc systems…

It implies my locking heirachy in the write behind worker threads is
busted
but I have not been able to fault it (i.e. via code inspection and
testing). This also does not occur if we run this same test
directly to
the
file system, so what is the SVR doing???

Is this a known issue? As I have not been able to find it on
the MS msdn
webpage, the osr ntdev and ntfsd lists, google (you name it, I’ve been
there).

Thanks,

Ian Costello


You are currently subscribed to ntfsd as: xxxxx@adacel.com.au
To unsubscribe send a blank email to xxxxx@lists.osr.com

Try !process 0 7 and look at the second deadlocked thread.

Max

----- Original Message -----
From:
To: “File Systems Developers”
Sent: Tuesday, May 27, 2003 11:04 AM
Subject: [ntfsd] Deadlock occuring in the SRV when receiving a
packet…

> Hi All,
>
> I am wondering if anyone has come across a similar problem to the
one I am
> currently debugging. It involves a 3rd party filesystem (ours) that
has an
> exported SMB share, of a directory. We are doing a lot of small io
to the
> file system via the share, at an arbitrary time (while running a
test
> which creates 1500 files and writes the filename into each of the
files)
> the machine (SMB server) deadlocks.
>
> The OS is Windows 2000 SP3…
>
> The offending stack looks like:
>
> ChildEBP RetAddr Args to Child
> f081fa98 80468f0e 00000001 05000002 00000030
> nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
> f081fa98 8046838b 00000001 05000002 00000030
nt!KeUpdateSystemTime+0x13e
> (FPO: [0,2] TrapFrame @ f081faac)
> f081fb20 804006da fcac24e8 f081fb48 00000002 nt!KiTrap0E+0x98 (FPO:
[0,0]
> TrapFrame @ f081fb20)
> f081fb98 f7feb8ca fcb78da0 fcc33b30 00000000
> nt!ExpInterlockedPopEntrySListFault (FPO: [0,2,0])
> f081fbbc f7febaf3 fcb28d08 fcb318e8 fcc37608
> srv!SrvFsdGetReceiveWorkItem+0x1b (FPO: [Non-Fpo])
> f081fbd8 f86017a5 fcb28e50 fcc33b30 00000e20
> srv!SrvFsdTdiReceiveHandler+0xdb (FPO: [Non-Fpo])
> f081fc18 f8601592 fcc40030 00000000 00000e20
netbt!RcvHandlrNotOs+0xf4
> (FPO: [Non-Fpo])
> f081fc5c f8629ae2 fcc40030 fcc37608 00000e20
netbt!TdiReceiveHandler+0x425
> (FPO: [Non-Fpo])
> f081fca8 f86292a3 fcc37608 00001850 f081fdc4
tcpip!IndicateData+0x1f2
> (FPO: [Non-Fpo])
> f081fd10 f8626436 00000000 b83ca8c0 863ca8c0 tcpip!TCPRcv+0xad1
(FPO:
> [Non-Fpo])
> f081fd5c f8626b27 f8628ef2 fcc57c28 ffad402e
tcpip!DeliverToUser+0xf9
> (FPO: [Non-Fpo])
> f081fe14 f862730b fcc57c28 ffad4042 00000060 tcpip!IPRcvPacket+0x58e
(FPO:
> [Non-Fpo])
> f081fe54 f862737d 00000000 fccd7e48 ffad4020
> tcpip!ARPRcvIndicationNew+0x172 (FPO: [Non-Fpo])
> f081fe90 fc84b403 fcc57268 00000000 fccdb008 tcpip!ARPRcvPacket+0x5c
(FPO:
> [Non-Fpo])
> f081fee8 f0465fbe fcd56100 f081ff48 00000001
> NDIS!ethFilterDprIndicateReceivePacket+0x2ea (FPO: [Non-Fpo])
> f081ffa8 f046256d 00cdb008 804021f0 fcd56130
> el90xbc5!UpCompleteNdis40PlusEvent+0x25e (FPO: [Non-Fpo])
> f081ffc4 fc8353d9 fccdb008 f08331f4 f0833124
el90xbc5!NICInterrupt+0x83
> (FPO: [EBP 0x804021f0] [1,1,4])
> f081ffe0 804645d4 fccdb1f4 fccdb1e0 00000000 NDIS!ndisMDpc+0xc8
(FPO:
> [Non-Fpo])
> f081fff4 804026fa f0833098 00000000 00000000 nt!KiRetireDpcList+0x30
(FPO:
> [0,0,0])
>
> I broke into the machine and this is the same stack everytime…
There are
> other worker threads in the system that are waiting to do io (they
are
> doing io to our filesystem), and these are all write behind threads
(the
> number of write behind thread varies between 2-3).
>
> I have done a bit of disassembly and come up with the trigger for
the
> deadlock (the following is the ExInterlockedPopEntrySList):
>
> nt!ExInterlockedPopEntrySList:
> 804006c4 53 push ebx
> 804006c5 55 push ebp
> 804006c6 8be9 mov ebp,ecx
> nt!ExpInterlockedPopEntrySListResume:
> 804006c8 8b5504 mov edx,[ebp+0x4]
> 804006cb 8b4500 mov eax,[ebp]
> 804006ce 0bc0 or eax,eax
> 804006d0 7410 jz nt!ExpInterlockedPopEntrySListFault+0x8
> (804006e2)
> 804006d2 8bca mov ecx,edx
> 804006d4 81c1ffff0000 add ecx,0xffff
> nt!ExpInterlockedPopEntrySListFault:
> 804006da 8b18 mov ebx,[eax]
> 804006dc 0fc74d00 cmpxchg8b qword ptr [ebp]
> 804006e0 75ec jnz nt!ExpInterlockedPopEntrySListResume+0x6
> (804006ce)
> 804006e2 5d pop ebp
> 804006e3 5b pop ebx
> 804006e4 c3 ret
>
> It is looping around 804006c8 -> 804006ce and faulting at 804006da.
As eax
> has a dud value in it:
> kd> r
> eax=05000000 ebx=fcb78ddc ecx=001cffff edx=001c0000 esi=fcb78da0
> edi=804006c4
> eip=804006da esp=f081fb94 ebp=fcb78de8 iopl=0 nv up ei pl nz
na po
> nc
> cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
> efl=00000206
>
> Why eax was clobbered I am still to determine why, I believe eax is
the
> work item. The threads I am responsible for seem to be just fine, I
cannot
> fault them in that I don’t beleive they where responsible for
clobbering
> memory.
>
> Another issue is memory on 2 of the 3 pools is very tight, but the
standby
> list is just fine:
>
> kd> !memusage
> loading PFN database
> loading (99% complete)
> Zeroed: 0 ( 0 kb)
> Free: 1 ( 4 kb)
> Standby: 6964 ( 27856 kb)
> Modified: 584 ( 2336 kb)
> ModifiedNoWrite: 0 ( 0 kb)
> Active/Valid: 24880 ( 99520 kb)
> Transition: 0 ( 0 kb)
> Unknown: 0 ( 0 kb)
> TOTAL: 32429 (129716 kb)
> Building kernel map
> Finished building kernel map
>
> Another other “thing”, is this does not occur on a dual processor
system,
> only uni proc systems…
>
> It implies my locking heirachy in the write behind worker threads is
> busted
> but I have not been able to fault it (i.e. via code inspection and
> testing). This also does not occur if we run this same test directly
to
> the
> file system, so what is the SVR doing???
>
> Is this a known issue? As I have not been able to find it on the MS
msdn
> webpage, the osr ntdev and ntfsd lists, google (you name it, I’ve
been
> there).
>
> Thanks,
>
> Ian Costello
>
> —
> You are currently subscribed to ntfsd as: xxxxx@storagecraft.com
> To unsubscribe send a blank email to xxxxx@lists.osr.com

Max,

Thanks for the tip. I generally use !process 0 0 and then manually dump
all the processes that are off interest… This does help, but in this
case the thread that is deadlocked is spinning in a tight loop on a dodgy
value (which I believe is a WORK_ITEM retrieved from the work item queue)
in the ExpInterlockedPopEntrySListFault assembler code. I believe the
memory corruption occurred prior to the thread being used to service an
interrupt as it is in the context of a DPC and will not be context
switched
out until it has completed. The ExpInterlockedPopEntrySListFault code
looks like it has code in it specific to an MP type guard, that is to
fault when another process on another processor starts fiddling with a
value that is passed in to the ExpInterlockedPopEntrySListFault routine.

I am, however, using stock standard worker thread routines
(ExInitializeWorkItem(), ExQueueWorkItem(), & ExFreeWorkItem() for NT4 and
IoAllocateWorkItem(), IoQueueWorkItem() and IoFreeWorkItem() for W2K) to
actually do the io at the back-end of the file system… I did note that
the osr guys use worker threads do so but with their own implementation,
does that imply there is something “fishy” with the worker item routines?
(or they just want more control over the priorities etc of the worker
threads).

Maybe I have done something that has potentially caused corruption with
the
worker items, but I can’t determine what at this stage…

Try !process 0 7 and look at the second deadlocked thread.

PS am currently spinning up a version of “home grown” worker thread pool
and see if the problem goes, if it does then will have to go back and
revisit the original code, really want to understand the issue as the work
item routines are really easy to use???

Hope that makes sense :wink:

Thanks,

Ian Costello

Deadlock is not really the correct terminology to use in this case as the
thread has hung the system as it is spinning in a really tight loop.
Basically polling for the work items in the queue to drop to 0, but as a
the pointer to the work item is a dud continues in this fashion… Tried
using Driver Verify to check if we where corrupting memory and it doesn’t
seem that we are.

At a loss here…

Ian