Hi All,
I am wondering if anyone has come across a similar problem to the one I am
currently debugging. It involves a 3rd party filesystem (ours) that has an
exported SMB share, of a directory. We are doing a lot of small io to the
file system via the share, at an arbitrary time (while running a test
which creates 1500 files and writes the filename into each of the files)
the machine (SMB server) deadlocks.
The OS is Windows 2000 SP3…
The offending stack looks like:
ChildEBP RetAddr Args to Child
f081fa98 80468f0e 00000001 05000002 00000030
nt!RtlpBreakWithStatusInstruction (FPO: [1,0,0])
f081fa98 8046838b 00000001 05000002 00000030 nt!KeUpdateSystemTime+0x13e
(FPO: [0,2] TrapFrame @ f081faac)
f081fb20 804006da fcac24e8 f081fb48 00000002 nt!KiTrap0E+0x98 (FPO: [0,0]
TrapFrame @ f081fb20)
f081fb98 f7feb8ca fcb78da0 fcc33b30 00000000
nt!ExpInterlockedPopEntrySListFault (FPO: [0,2,0])
f081fbbc f7febaf3 fcb28d08 fcb318e8 fcc37608
srv!SrvFsdGetReceiveWorkItem+0x1b (FPO: [Non-Fpo])
f081fbd8 f86017a5 fcb28e50 fcc33b30 00000e20
srv!SrvFsdTdiReceiveHandler+0xdb (FPO: [Non-Fpo])
f081fc18 f8601592 fcc40030 00000000 00000e20 netbt!RcvHandlrNotOs+0xf4
(FPO: [Non-Fpo])
f081fc5c f8629ae2 fcc40030 fcc37608 00000e20 netbt!TdiReceiveHandler+0x425
(FPO: [Non-Fpo])
f081fca8 f86292a3 fcc37608 00001850 f081fdc4 tcpip!IndicateData+0x1f2
(FPO: [Non-Fpo])
f081fd10 f8626436 00000000 b83ca8c0 863ca8c0 tcpip!TCPRcv+0xad1 (FPO:
[Non-Fpo])
f081fd5c f8626b27 f8628ef2 fcc57c28 ffad402e tcpip!DeliverToUser+0xf9
(FPO: [Non-Fpo])
f081fe14 f862730b fcc57c28 ffad4042 00000060 tcpip!IPRcvPacket+0x58e (FPO:
[Non-Fpo])
f081fe54 f862737d 00000000 fccd7e48 ffad4020
tcpip!ARPRcvIndicationNew+0x172 (FPO: [Non-Fpo])
f081fe90 fc84b403 fcc57268 00000000 fccdb008 tcpip!ARPRcvPacket+0x5c (FPO:
[Non-Fpo])
f081fee8 f0465fbe fcd56100 f081ff48 00000001
NDIS!ethFilterDprIndicateReceivePacket+0x2ea (FPO: [Non-Fpo])
f081ffa8 f046256d 00cdb008 804021f0 fcd56130
el90xbc5!UpCompleteNdis40PlusEvent+0x25e (FPO: [Non-Fpo])
f081ffc4 fc8353d9 fccdb008 f08331f4 f0833124 el90xbc5!NICInterrupt+0x83
(FPO: [EBP 0x804021f0] [1,1,4])
f081ffe0 804645d4 fccdb1f4 fccdb1e0 00000000 NDIS!ndisMDpc+0xc8 (FPO:
[Non-Fpo])
f081fff4 804026fa f0833098 00000000 00000000 nt!KiRetireDpcList+0x30 (FPO:
[0,0,0])
I broke into the machine and this is the same stack everytime… There are
other worker threads in the system that are waiting to do io (they are
doing io to our filesystem), and these are all write behind threads (the
number of write behind thread varies between 2-3).
I have done a bit of disassembly and come up with the trigger for the
deadlock (the following is the ExInterlockedPopEntrySList):
nt!ExInterlockedPopEntrySList:
804006c4 53 push ebx
804006c5 55 push ebp
804006c6 8be9 mov ebp,ecx
nt!ExpInterlockedPopEntrySListResume:
804006c8 8b5504 mov edx,[ebp+0x4]
804006cb 8b4500 mov eax,[ebp]
804006ce 0bc0 or eax,eax
804006d0 7410 jz nt!ExpInterlockedPopEntrySListFault+0x8
(804006e2)
804006d2 8bca mov ecx,edx
804006d4 81c1ffff0000 add ecx,0xffff
nt!ExpInterlockedPopEntrySListFault:
804006da 8b18 mov ebx,[eax]
804006dc 0fc74d00 cmpxchg8b qword ptr [ebp]
804006e0 75ec jnz nt!ExpInterlockedPopEntrySListResume+0x6
(804006ce)
804006e2 5d pop ebp
804006e3 5b pop ebx
804006e4 c3 ret
It is looping around 804006c8 -> 804006ce and faulting at 804006da. As eax
has a dud value in it:
kd> r
eax=05000000 ebx=fcb78ddc ecx=001cffff edx=001c0000 esi=fcb78da0
edi=804006c4
eip=804006da esp=f081fb94 ebp=fcb78de8 iopl=0 nv up ei pl nz na po
nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000
efl=00000206
Why eax was clobbered I am still to determine why, I believe eax is the
work item. The threads I am responsible for seem to be just fine, I cannot
fault them in that I don’t beleive they where responsible for clobbering
memory.
Another issue is memory on 2 of the 3 pools is very tight, but the standby
list is just fine:
kd> !memusage
loading PFN database
loading (99% complete)
Zeroed: 0 ( 0 kb)
Free: 1 ( 4 kb)
Standby: 6964 ( 27856 kb)
Modified: 584 ( 2336 kb)
ModifiedNoWrite: 0 ( 0 kb)
Active/Valid: 24880 ( 99520 kb)
Transition: 0 ( 0 kb)
Unknown: 0 ( 0 kb)
TOTAL: 32429 (129716 kb)
Building kernel map
Finished building kernel map
Another other “thing”, is this does not occur on a dual processor system,
only uni proc systems…
It implies my locking heirachy in the write behind worker threads is
busted
but I have not been able to fault it (i.e. via code inspection and
testing). This also does not occur if we run this same test directly to
the
file system, so what is the SVR doing???
Is this a known issue? As I have not been able to find it on the MS msdn
webpage, the osr ntdev and ntfsd lists, google (you name it, I’ve been
there).
Thanks,
Ian Costello