Debugging a hang in ntfs.sys

Folks,

I’m currently investigating a multi-process hang on Win 8.1. Long story
short - I managed to get PowerShell hanging on a call to the Win32
implementation of MoveFile (the move is triggered by a script PowerShell is
processing). Once the hang is reproduced, the file it’s trying to move
becomes ‘locked’ in the sense that if I open it in notepad.exe, notepad
starts hanging as well.

So I’ve enabled local kernel debugging and figured out that the hanging
thread is waiting on this IRP:

lkd> !irp ffffe00086b996d0
Irp is active with 10 stacks 9 is current (= 0xffffe00086b999e0)
No Mdl: No System Buffer: Thread ffffe00087890880: Irp stack trace.
cmd flg cl Device File Completion-Context
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[IRP_MJ_CREATE(0), N/A(0)]
0 e1 ffffe00083a56030 ffffe00087e2bf20 fffff8006e6e2480-ffffe00087d54b60 Success Error Cancel pending
\FileSystem\Ntfs
Args: ffffd000cfbb87c0 01000060 00030080 00000000
[IRP_MJ_CREATE(0), N/A(0)]
0 0 ffffe00084e548d0 ffffe00087e2bf20 00000000-00000000
\FileSystem\FltMgr
Args: ffffd000cfbb87c0 01000060 00030080 00000000

The queue for the ntfs driver seems empty:

lkd> !devobj ffffe00083a56030
Device object (ffffe00083a56030) is for:
\FileSystem\Ntfs DriverObject ffffe00084795490
Current Irp 00000000 RefCount 0 Type 00000008 Flags 00060000
Dacl ffffc10121d53100 DevExt ffffe00083a56180 DevObjExt ffffe00083a58280
ExtensionFlags (0x00000800) DOE_DEFAULT_SD_PRESENT
Characteristics (0000000000)
AttachedDevice (Upper) ffffe00084e548d0 \FileSystem\FltMgr
Device queue is not busy.

Would anyone have any clues as to how to proceed with this investigation?

It’s worth noting that other than the affected file (the subject of this
hang), the OS is working properly and nothing else is hanging (i.e. NTFS is
not generally stalled).

Please advise! Your help will be much appreciated, as kernel-mode debugging
is something relatively new to me :)

Thanks,
Alex

Well, the first step is to figure out which threads are blocked. Since this is a file system, I’d start with “!locks” and look for locks where there are threads waiting for a given lock. Then figure out why the thread which owns that lock is blocked.

Tony
OSR

Well, I’ve done that already - that’s how I arrived at the above IRPs.

In a nutshell - there’s a lock owned by the hung powershell process that’s blocking 9 other threads (most of them in searchhost.exe). The powershell thread itself is blocked for the reason I explained above - i.e. it’s waiting on some file I/O to complete.

Now I’m trying to figure out why NTFS isn’t completing the I/O, and that’s where I got stuck.

Can you show the stack traces? That would be generally helpful.

You might also want to check APC state (“!apc”) as that can cause problems.

Normally, a create hang is due to a lock, not an I/O, so this is definitely unusual.

Tony
OSR

Sure.

This is the lock

Resource @ 0xffffe00083dde4c8 Exclusively owned
Contention Count = 21
NumberOfExclusiveWaiters = 21
Threads: ffffe00086c64740-01<*>
Threads Waiting On Exclusive Access:
ffffe000887c1880 ffffe000891d0080 ffffe000873a6080
ffffe000844b5880
ffffe000860ce080 ffffe0008601f040 ffffe0008724d800
ffffe000874b8080
ffffe00086c2a680 ffffe00087c9d080 ffffe000870b0080
ffffe00087283080
ffffe000844af080 ffffe00083ab5880 ffffe00086615880
ffffe000866e7880
ffffe00086608440 ffffe0008432e080 ffffe000842b75c0
ffffe000889fa880
ffffe00087560880

lkd> !thread ffffe00086c64740
THREAD ffffe00086c64740 Cid 03ec.034c Teb: 000000007eed4000 Win32Thread: fffff901406a0b50 WAIT: (Executive) KernelMode Alertable
ffffd000d0115618 NotificationEvent
IRP List:
ffffe00087667010: (0006,03a0) Flags: 00060830 Mdl: 00000000
ffffe000865ba4b0: (0006,03a0) Flags: 00060000 Mdl: 00000000
Not impersonating
DeviceMap ffffc00025b28c50
Owning Process ffffe00086c78080 Image: powershell.exe
Attached Process N/A Image: N/A
Wait Start TickCount 8522158 Ticks: 2455146 (0:10:39:21.656)
Context Switch Count 34408 IdealProcessor: 1
UserTime 00:00:06.375
KernelTime 00:00:02.203
Win32 Start Address 0x000000007441a8c0
Stack Init ffffd000d0115dd0 Current ffffd000d0115140
Base ffffd000d0116000 Limit ffffd000d0110000 Call 0
Priority 15 BasePriority 8 UnusualBoost 4 ForegroundBoost 2 IoPriority 2 PagePriority 5
Child-SP         RetAddr          : Args to Child                                                              : Call Site
ffffd000d0115180 fffff802600c334e : fffff80260317180 ffffe00086c64740 fffff80200000009 0000000000000000 : nt!KiSwapContext+0x76
ffffd000d01152c0 fffff802600c2dc9 : 0000000000000001 fffff802600753c8 0000000000000000 0000000000000001 : nt!KiSwapThread+0x14e
ffffd000d0115360 fffff80260061543 : ffffd000d01153f8 0000000000000000 fffffff600000002 00000001ffffffff : nt!KiCommitThreadWait+0x129
ffffd000d01153e0 fffff80260483f87 : ffffd000d0115618 ffffc00000000000 0000000000000000 0000000000000000 : nt!KeWaitForSingleObject+0x373
ffffd000d0115470 fffff80260481e8b : ffffe000878e2160 ffffd000d01158e0 ffffd000d0115618 ffffe00087667010 : nt!FsRtlCancellableWaitForMultipleObjects+0xcf
ffffd000d01154e0 fffff802601de96c : ffffd000d0115618 ffffd000d01158e0 ffffd000d0115618 0000000000000000 : nt!FsRtlCancellableWaitForSingleObject+0x27
ffffd000d0115520 fffff802600f5244 : ffffc00031f2cb30 0000000000000000 0000000000002000 ffffc00031f2cb30 : nt!FsRtlpWaitOnIrp+0x1c4
ffffd000d01155a0 fffff802600f56ec : ffffd000d01158e0 ffffe00087667320 ffffe00087667010 ffffc00031f2cb00 : nt!FsRtlpOplockBreakByCacheFlags+0xcc4
ffffd000d0115670 fffff8006e9a45b7 : ffffe00000000000 ffff80004d61d4c1 ffffd00000000000 ffffe000876673b0 : nt!FsRtlCheckOplockEx+0x354
ffffd000d0115770 ffffe00000000000 : ffff80004d61d4c1 ffffd00000000000 ffffe000876673b0 0000000000000000 : 0xfffff8006e9a45b7
ffffd000d0115778 ffff80004d61d4c1 : ffffd00000000000 ffffe000876673b0 0000000000000000 0000000000000000 : 0xffffe00000000000
ffffd000d0115780 ffffd00000000000 : ffffe000876673b0 0000000000000000 0000000000000000 0000000000000001 : 0xffff80004d61d4c1
ffffd000d0115788 ffffe000876673b0 : 0000000000000000 0000000000000000 0000000000000001 fffff8026005f62a : 0xffffd00000000000
ffffd000d0115790 0000000000000000 : 0000000000000000 0000000000000001 fffff8026005f62a ffffe00085060100 : 0xffffe000`876673b0

This is the first irp

lkd> !irp ffffe00087667010
Irp is active with 10 stacks 9 is current (= 0xffffe00087667320)
No Mdl: System buffer=ffffe00086854b00: Thread ffffe00086c64740: Irp stack trace.
cmd flg cl Device File Completion-Context
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[IRP_MJ_SET_INFORMATION(6), N/A(0)]
0 e0 ffffe00083a56030 ffffe00086364220 fffff8006e6e2480-ffffe00083f018b0 Success Error Cancel
\FileSystem\Ntfs
Args: 000001a2 0000000a ffffe000870954e0 00000000
[IRP_MJ_SET_INFORMATION(6), N/A(0)]
0 0 ffffe00084e548d0 ffffe00086364220 00000000-00000000
\FileSystem\FltMgr
Args: 000001a2 0000000a ffffe000870954e0 00000000

and the second

lkd> !irp ffffe000865ba4b0
Irp is active with 10 stacks 10 is current (= 0xffffe000865ba808)
No Mdl: No System Buffer: Thread ffffe00086c64740: Irp stack trace.
cmd flg cl Device File Completion-Context
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000
[N/A(0), N/A(0)]
0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[IRP_MJ_DIRECTORY_CONTROL(c), N/A(2)]
1 1 ffffe00083a56030 ffffe00086e9c210 00000000-00000000 pending
\FileSystem\Ntfs
Args: 00000020 0000001b 00000000 00000000

As for the APC state - I’ll need to figure out what that is (as I said, I’m new to kernel debugging) and then investigate the results myself. Here’s what !apc returns (in case it rings a bell)

lkd> !apc
*** Enumerating APCs in all processes
Process ffffe00083a29400 System
Process ffffe000850ad400 smss.exe
Process ffffe00085449740 csrss.exe
Process ffffe00083b0e440 csrss.exe
Process ffffe00083ab0080 wininit.exe
Process ffffe00085672080 winlogon.exe
Process ffffe00085651080 services.exe
Process ffffe000857748c0 lsass.exe
Process ffffe000853c78c0 svchost.exe
Process ffffe000853d68c0 svchost.exe
Process ffffe00085b4c580 LogonUI.exe
Process ffffe000853c58c0 dwm.exe
Process ffffe00085b624c0 svchost.exe
Process ffffe00085b5b8c0 svchost.exe
Process ffffe00085bd78c0 svchost.exe
Process ffffe000860448c0 svchost.exe
Process ffffe0008608b800 svchost.exe
Process ffffe0008605d8c0 spoolsv.exe
Process ffffe000861d3640 svchost.exe
Process ffffe000860118c0 armsvc.exe
Process ffffe000861c3300 svchost.exe
Process ffffe000862b7240 vmtoolsd.exe
Process ffffe00086383640 vmware-usbarbi
Process ffffe00086512640 svchost.exe
Process ffffe00086a58280 msdtc.exe
Process ffffe0008554c8c0 csrss.exe
Process ffffe0008633e080 winlogon.exe
Process ffffe00086b478c0 dwm.exe
Process ffffe000865f43c0 taskhostex.exe
Process ffffe000865a8400 rdpclip.exe
Process ffffe0008444b8c0 explorer.exe
Process ffffe000844458c0 svchost.exe
Process ffffe000871aa640 SearchIndexer.
Process ffffe00086292080 vmtoolsd.exe
Process ffffe0008445b8c0 cmd.exe
Process ffffe00083e878c0 conhost.exe
Process ffffe00086464240 wmpnetwk.exe
Process ffffe0008633d080 procexp.exe
Process ffffe00083ebf8c0 procexp64.exe
Thread ffffe00083ab5880 ApcStateIndex 0 ApcListHead ffffe00083ab5918 [KERNEL
]
KAPC @ ffffe00083ab5b08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00087219080 mmc.exe
Process ffffe0008502e8c0 VMware.PowerCL
Process ffffe0008549d8c0 powershell.exe
Process ffffe000868aa7c0 conhost.exe
Process ffffe000851a58c0 SearchProtocol
Thread ffffe00086ad5080 ApcStateIndex 0 ApcListHead ffffe00086ad5118 [KERNEL
]
KAPC @ ffffe00086ad5308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000865638c0 powershell.exe
Thread ffffe000850f1080 ApcStateIndex 0 ApcListHead ffffe000850f1118 [KERNEL
]
KAPC @ ffffe000850f1308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
KAPC @ ffffe000870e7170
Type 12
KernelRoutine fffff8026046e4b8 nt!CmpPostApc+0
RundownRoutine fffff802605d2a04 nt!CmpPostApcRunDown+0
Process ffffe000880b80c0 conhost.exe
Process ffffe000844b18c0 SearchProtocol
Thread ffffe00084475080 ApcStateIndex 0 ApcListHead ffffe00084475118 [KERNEL
]
KAPC @ ffffe00084475308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00083bea8c0 SearchProtocol
Thread ffffe0008429a880 ApcStateIndex 0 ApcListHead ffffe0008429a918 [KERNEL
]
KAPC @ ffffe0008429ab08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00086c78080 powershell.exe
Thread ffffe00086c64740 ApcStateIndex 0 ApcListHead ffffe00086c647d8 [KERNEL
]
KAPC @ ffffe00086c649c8
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
KAPC @ ffffe000881c1a50
Type 12
KernelRoutine fffff8026046e4b8 nt!CmpPostApc+0
RundownRoutine fffff802605d2a04 nt!CmpPostApcRunDown+0
Process ffffe0008506f8c0 conhost.exe
Process ffffe00086bad8c0 SearchProtocol
Thread ffffe000863fa880 ApcStateIndex 0 ApcListHead ffffe000863fa918 [KERNEL
]
KAPC @ ffffe000863fab08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00086e6e8c0 SearchProtocol
Thread ffffe00086859080 ApcStateIndex 0 ApcListHead ffffe00086859118 [KERNEL
]
KAPC @ ffffe00086859308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000844b50c0 SearchProtocol
Thread ffffe00086c2a680 ApcStateIndex 0 ApcListHead ffffe00086c2a718 [KERNEL
]
KAPC @ ffffe00086c2a908
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008551d8c0 SearchProtocol
Thread ffffe000870b0080 ApcStateIndex 0 ApcListHead ffffe000870b0118 [KERNEL
]
KAPC @ ffffe000870b0308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00086e640c0 SearchProtocol
Thread ffffe000844b5880 ApcStateIndex 0 ApcListHead ffffe000844b5918 [KERNEL
]
KAPC @ ffffe000844b5b08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000877d38c0 SearchProtocol
Thread ffffe000860ce080 ApcStateIndex 0 ApcListHead ffffe000860ce118 [KERNEL
]
KAPC @ ffffe000860ce308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000877d90c0 SearchProtocol
Thread ffffe000842ab080 ApcStateIndex 0 ApcListHead ffffe000842ab118 [KERNEL
]
KAPC @ ffffe000842ab308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00087888200 SearchProtocol
Thread ffffe0008724d800 ApcStateIndex 0 ApcListHead ffffe0008724d898 [KERNEL
]
KAPC @ ffffe0008724da88
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00086e708c0 SearchProtocol
Thread ffffe00086838800 ApcStateIndex 0 ApcListHead ffffe00086838898 [KERNEL
]
KAPC @ ffffe00086838a88
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00087fa60c0 SearchProtocol
Thread ffffe000844af080 ApcStateIndex 0 ApcListHead ffffe000844af118 [KERNEL
]
KAPC @ ffffe000844af308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008869d080 kd.exe
Process ffffe00083f568c0 windbg.exe
Process ffffe000848152c0 calc.exe
Process ffffe00086998200 procexp.exe
Process ffffe00086a0b380 procexp64.exe
Process ffffe00084a381c0 SearchProtocol
Thread ffffe000842b6080 ApcStateIndex 0 ApcListHead ffffe000842b6118 [KERNEL
]
KAPC @ ffffe000842b6308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008813e8c0 SearchProtocol
Thread ffffe00087c9d080 ApcStateIndex 0 ApcListHead ffffe00087c9d118 [KERNEL
]
KAPC @ ffffe00087c9d308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000879e2700 cmd.exe
Process ffffe000870a5200 conhost.exe
Process ffffe000877926c0 notepad.exe
Process ffffe000888aa680 notepad.exe
Process ffffe000889d98c0 svchost.exe
Thread ffffe000889fa880 ApcStateIndex 0 ApcListHead ffffe000889fa918 [KERNEL
]
KAPC @ ffffe000889fab08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000843cb440 svchost.exe
Thread ffffe00087560880 ApcStateIndex 0 ApcListHead ffffe00087560918 [KERNEL
]
KAPC @ ffffe00087560b08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008899e8c0 svchost.exe
Thread ffffe00087283080 ApcStateIndex 0 ApcListHead ffffe00087283118 [KERNEL
]
KAPC @ ffffe00087283308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008663f340 SearchProtocol
Thread ffffe00084447080 ApcStateIndex 0 ApcListHead ffffe00084447118 [KERNEL
]
KAPC @ ffffe00084447308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000843b88c0 SearchProtocol
Thread ffffe000842b75c0 ApcStateIndex 0 ApcListHead ffffe000842b7658 [KERNEL
]
KAPC @ ffffe000842b7848
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000889d78c0 svchost.exe
Thread ffffe000874b8080 ApcStateIndex 0 ApcListHead ffffe000874b8118 [KERNEL
]
KAPC @ ffffe000874b8308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00088f658c0 svchost.exe
Thread ffffe000873a6080 ApcStateIndex 0 ApcListHead ffffe000873a6118 [KERNEL
]
KAPC @ ffffe000873a6308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00086643080 svchost.exe
Thread ffffe000891d0080 ApcStateIndex 0 ApcListHead ffffe000891d0118 [KERNEL
]
KAPC @ ffffe000891d0308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe000888d78c0 svchost.exe
Thread ffffe0008432e080 ApcStateIndex 0 ApcListHead ffffe0008432e118 [KERNEL
]
KAPC @ ffffe0008432e308
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008941f8c0 svchost.exe
Thread ffffe00086608440 ApcStateIndex 0 ApcListHead ffffe000866084d8 [KERNEL
]
KAPC @ ffffe000866086c8
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008940f8c0 svchost.exe
Thread ffffe000866e7880 ApcStateIndex 0 ApcListHead ffffe000866e7918 [KERNEL
]
KAPC @ ffffe000866e7b08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe0008878b8c0 svchost.exe
Thread ffffe000887c1880 ApcStateIndex 0 ApcListHead ffffe000887c1918 [KERNEL
]
KAPC @ ffffe000887c1b08
Type 12
KernelRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
RundownRoutine fffff8026013e3ac nt!EmpCheckErrataList+0
Process ffffe00088ecb400 svchost.exe
Process ffffe000894978c0 TSTheme.exe
Process ffffe00088932700 SearchProtocol
Process ffffe000862567c0 SearchFilterHo

That was very helpful.

nt!FsRtlCheckOplockEx+0x354

The thread that is “waiting on I/O” is waiting for an oplock break. A different thread owns the oplock and isn’t dealing with the break request. The APC list is quite busy, but it’s full of fluff (I’ve never seen this command generate so much output before… interesting).

Two of the threads have interesting looking APCs on them (EmpCheckErrataList doesn’t sound very interesting to me).

ffffe00086c64740 ffffe000850f1080

both have a registry APC pending:

KernelRoutine fffff8026046e4b8 nt!CmpPostApc+0

Not a smoking gun, but definitely interesting.

FsRtlCheckOplockEx is documented here: https://msdn.microsoft.com/en-us/library/windows/hardware/ff545771(v=vs.85).aspx

I’d suspect that this might be related to that kernel APC that’s blocked (on two threads) as oplock callbacks are implemented as kernel APCs.

Check the wait IRQL on those two: I’d expect they are both waiting at APC_LEVEL.
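
(Off the top of my head, something along these lines should show the wait IRQL and the APC-disable counts on a thread - a sketch using the public-symbol field names, so adjust as needed:

lkd> dt nt!_KTHREAD ffffe00086c64740 WaitIrql KernelApcDisable SpecialApcDisable

A raised wait IRQL or a non-zero APC-disable count on the oplock owner would explain why the break is never processed.)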

Are you running any non-Microsoft kernel drivers on this machine? In particular, I’d be suspicious of file system filter drivers (“fltmc” from an elevated command prompt will list those on the box).

One possibility is that this thread owns the oplock and, because it has blocked APCs (standard practice upon entering a file system driver), it’s self-deadlocking.

Anything special about the file you are moving?

Tony
OSR

Interesting situation.

I bet the thread that is supposed to handle the oplock break is either not ready to receive APCs (IRQL >= APC_LEVEL, or it is inside a critical or guarded region), or it is itself blocked waiting on some other lock that depends on the oplock break. So first you might want to track down the IRQL on that thread and see whether this is the case. If it is, it means some filter has a bug where it doesn’t restore the IRQL or exit the region, and from there on there is not much you can do.

In order to prove this, use FileSpy on the PowerShell script and see exactly what it is doing to that file, then write a small Win32 project that reproduces it every time with any file. Then try unloading any non-Microsoft filters (use the fltmc command in CMD) and see without which one the problem goes away.
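
Something along these lines should be enough for the Win32 side (a rough sketch, with made-up paths - the point is just a same-volume rename of a file the indexer is holding open):

/* repro.c - minimal move, roughly what FileInfo.MoveTo() ends up doing */
#include <windows.h>
#include <stdio.h>

int wmain(void)
{
    const wchar_t *src = L"C:\\logs\\run.log";           /* file the indexer has open */
    const wchar_t *dst = L"C:\\logs\\archive\\run.log";

    CreateDirectoryW(L"C:\\logs\\archive", NULL);        /* destination folder, like the script */

    if (!MoveFileExW(src, dst, MOVEFILE_REPLACE_EXISTING)) {
        fwprintf(stderr, L"MoveFileEx failed: %lu\n", GetLastError());
        return 1;
    }
    return 0;
}

If the hang reproduces, that MoveFileExW call simply never returns.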


Gabriel Bercea

Windows Driver Consulting

www.kasardia.com

Thanks for the posts, guys!

I did some further reading and tried to educate myself on kernel I/O, APCs, oplocks, etc. The two threads that have not processed the kernel APCs are the same two threads belonging to the 2 powershell processes that are hung.

There’s nothing special about these scripts. They’re simply moving log files to a different folder by calling some powershell primitives, which are implemented via .NET’s FileInfo.MoveTo()…

Also, the IRQL is 0 (at least that’s what !irql returns), so it doesn’t seem to be at APC or dispatch level.

To summarize my understanding:

  1. Both powershell threads (belonging to 2 separate processes) are trying to move some files
  2. Somewhere in their call stacks they obtain some locks, which causes a bunch of other processes to get blocked waiting on those (hence the result of !locks)
  3. The PS threads send an IRP down to ntfs to initiate the I/O, and then get blocked waiting on an oplock break to happen
  4. The same two PS threads, being blocked on the oplock break, cannot process APCs either (hence the result of !apc)

So I’m wondering now - how can I find out who owns the oplocks that cause this? I tried this:

lkd> !fltkd.oplock
oplock [addr] [flags] Dump oplock given address of owning
FILE_OBJECT or stuck IRP (tells you what oplock it is waiting for)
where [flags] specify what information to dump:
0x00000001 [addr] points to an OPLOCK (default is FILE_OBJECT or IRP)

But… I’m not sure how to read the result or dig further (I gather the 0x1 flag just means [addr] points at an OPLOCK structure rather than a FILE_OBJECT or IRP).

Best hypothesis I have so far for what’s happening:

  1. searchindexer, while running, decided to index the log files and obtained oplocks on them
  2. the powershell processes managed to take a lock on something the searchindexer needed and blocked it
  3. when trying to move a file, powershell ended up waiting on an oplock break that couldn’t happen, as the indexer’s threads are now blocked by powershell… hence the deadlock

The question is - how do I prove this? I need to figure out who owns these oplocks…

P.S. Here’s the minifilter list you asked for, though I’m not sure how it helps:

Filter Name                     Num Instances    Altitude        Frame
------------------------------  -------------    ------------    -----
vsepflt                                 4         328200            0
luafv                                   1         135000            0
npsvctrig                               1          46000            0
FileInfo                                4          45000            0
Wof                                     0          40700            0

Hi again Alexander,
If you want to see who’s doing what, you need to catch the activity on that
file. You can do this with FileSpy. You can get it from:
http://zezula.net/en/fstools/filespy.html
What I would do is set up the path filter to include the file you think is
causing the deadlock, and maybe select only the System process, SearchIndexer
and the powershell process in the process filter. Select to only see CREATE,
SET and CLEANUP operations on the file and run your scenario. You should see
from there who is doing what.

On a second note, !fltkd.oplock is pretty straightforward to use: you can
give it a FileObject pointer or an IRP pointer to see the state of an oplock.
Here is a command output using the pending IRP that requested and was granted
the oplock:

3: kd> !fltkd.oplock fffff980145b0ca0 <- THIS ADDR IS AN IRP

Note: Display may not make sense if this IRP is not waiting on an oplock!

Blocking Oplock: fffff8a00d565540 RWH Granted
State Flags : [00007040] Exclusive CacheReadLevel
CacheHandleLevel CacheWriteLevel
Excl. Oplock Request Irp : fffff980145b0ca0
Excl. Oplock File Object : fffffa8034dab7e0
Excl. Oplock Owning Proc.: fffffa8031ce0940

After I broke the oplock from another process (but before the break was
acknowledged), I used the FileObject to see the status:

6: kd> !fltkd.oplock fffffa8034dab7e0 <— THIS ADDR IS A FO

Oplock: fffff8a00d565540 RWH Breaking to RH
State Flags : [00507040] Exclusive CacheReadLevel
CacheHandleLevel CacheWriteLevel BreakToCacheRead BreakToCacheHandle
Excl. Oplock Request Irp : 0000000000000000
Excl. Oplock File Object : fffffa8034dab7e0
Excl. Oplock Owning Proc.: fffffa8031ce0940

You can see here that the oplock has broken from RWH to RH and is awaiting
acknowledgement. After you acknowledge, you will see something like this as
command output:

4: kd> !fltkd.oplock fffffa8033ec6720

Oplock: fffff8a00d565540 No Oplocks Held
State Flags : [00000001] NoOplock
Excl. Oplock Request Irp :
Excl. Oplock File Object :
Excl. Oplock Owning Proc.:

Anyway, I believe it would also be useful if you could show us the
powershell script; maybe I could get a repro myself with it.

Regards,
Gabriel
www.kasardia.com


Well, that’s the thing - the repro is quite hard to get, which is why tools like FileSpy are not very applicable.

The PS script is part of a test automation framework which runs periodically on various VMs (of different OSes), and for some reason it freezes on this Win 8.1 box. It hangs on random ps1 files being executed, in some base logic that moves the log file from one location to another on the same drive (the destination is created by the same script).

As pointed out, the whole thing is hard to reproduce, as it freezes at random places (read - random files) and at a random PID (a new powershell instance is created for each script being executed).

I currently have a repro, kd and windbg all set up, and got my investigation to the IRPs at the beginning of the post :)

Thanks for the help on the oplocks, I’ll see if I can dig some further info out of the IRPs on the hung thread. Lots of fun with this so far :D

Here are the oplock details for the offending IRP (ffffe00087667010):

!fltkd.oplock ffffe00087667010

Blocking Oplock: ffffc00031f2cb30 RH Granted/Atomic Requested
State Flags : [00013000] CacheReadLevel CacheHandleLevel AtomicOplockRequest
Excl. Oplock Request Irp :
Excl. Oplock File Object :
Excl. Oplock Owning Proc.:

R/Level 2 Oplocks : (ffffc00031f2cb58) Count=0

RH Oplocks : (ffffc00031f2cb68) Count=9
Irp Address File Object Owning Process
---------------- ---------------- ----------------
(atomic request) ffffe00086d4e3d0 ffffe000895233c0 (SearchProtocolHost.exe)
(atomic request) ffffe0008683e630 ffffe00088e3a340 (SearchProtocolHost.exe)
(atomic request) ffffe000844f4d10 ffffe000843b88c0 (SearchProtocolHost.exe)
(atomic request) ffffe000849aecf0 ffffe0008813e8c0 (SearchProtocolHost.exe)
(atomic request) ffffe00087e7df20 ffffe00087fa60c0 (SearchProtocolHost.exe)
(atomic request) ffffe00086364070 ffffe00087888200 (SearchProtocolHost.exe)
(atomic request) ffffe000848df1f0 ffffe000877d38c0 (SearchProtocolHost.exe)
(atomic request) ffffe00087082c90 ffffe00086e640c0 (SearchProtocolHost.exe)
(atomic request) ffffe00084f0c7f0 ffffe0008551d8c0 (SearchProtocolHost.exe)

Breaking RH Oplocks : (ffffc00031f2cb78) Count=1
File Object Owning Process Breaking To
---------------- ---------------- -----------
ffffe00086b46940 ffffe000844b50c0 R (again SearchProtocolHost.exe)

I/O Requests Awaiting Oplock Break Acknowledgement: (ffffc00031f2cb88) Count=2
Irp Address Completion Rtn. Completion Ctx. BreakAllRH?
---------------- ---------------- ---------------- -----------
ffffe00087667010 fffff802605e0f74 ffffd000d0115618 No or N/A
ffffe00086b996d0 fffff8006ea845ac ffffe00086b72de8 No or N/A

What I get from this is that the PS thread (ffffe00086c64740) is hung waiting for an oplock break to be acknowledged, and the breaking RH oplock is owned by SearchProtocolHost.exe (ffffe000844b50c0).

The SearchProtocolHost.exe is itself hung:
lkd> !process ffffe000844b50c0 2
PROCESS ffffe000844b50c0
SessionId: 0 Cid: 0d38 Peb: 7ff7ecdef000 ParentCid: 0350
DirBase: 8564f000 ObjectTable: ffffc00025923b00 HandleCount: &lt;Data Not Accessible&gt;
Image: SearchProtocolHost.exe

THREAD ffffe00086c2a680 Cid 0d38.0f5c Teb: 00007ff7ecde9000 Win32Thread: fffff90140737010 WAIT: (WrResource) KernelMode Non-Alertable
ffffe00084d77370 SynchronizationEvent

The thread ffffe00086c2a680 is hung waiting on a resource owned by PowerShell (visible in the following result from !locks):

Resource @ 0xffffe00083dde4c8 Exclusively owned
Contention Count = 25
NumberOfExclusiveWaiters = 25
Threads: ffffe00086c64740-01<*>
Threads Waiting On Exclusive Access:
ffffe00083ab5880 ffffe0008432e080 ffffe00087560880
ffffe000842b75c0
ffffe000889fa880 ffffe00084a8c540 ffffe00089233880
ffffe000891d0080
ffffe000873a6080 ffffe00087283080 ffffe0008975b080
ffffe000887c1880
ffffe000874b8080 ffffe00086615880 ffffe00086608440
ffffe00087c9d080
ffffe0008724d800 ffffe000879fa600 ffffe000870b0080
ffffe000844af080
ffffe000866e7880 ffffe000844b5880 ffffe000860ce080
ffffe00086c2a680 (the hung thread)
ffffe0008601f040

So… from this data, this seems to me to be the chain of events leading to the deadlock:
1. powershell obtained a lock on resource 0xffffe00083dde4c8
2. searchprotocolhost obtained an RH oplock on the problematic file
3. searchprotocolhost then tried to acquire resource 0xffffe00083dde4c8 and got blocked on it
4. powershell tried to break the RH oplock down to R for the same file, but since searchprotocolhost got blocked in step #3 and couldn’t acknowledge the break, powershell went to sleep as well

… all leading to the deadlock I’m seeing.

Does the above line of events make sense?

If yes, then… how do I figure out who’s to blame - PowerShell or searchprotocolhost? Ideas? :)

That’s a great hang. Neither is “at fault” in any obvious way, though I’ve often wondered why we don’t see more of this sort of thing, since applications are permitted to take out oplocks (and the very type that needs the holder to call back and process the break).
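
For anyone following along, this is roughly what the holder’s side of such an oplock looks like from user mode - a sketch only (Win7+ headers, error handling trimmed; the dump above shows SearchProtocolHost actually requests these atomically at create time, but the pattern is equivalent):

/* oplock_holder.c - takes a read|handle (RH) oplock and waits for the break */
#define _WIN32_WINNT 0x0603
#include <windows.h>
#include <winioctl.h>

void hold_rh_oplock(const wchar_t *path)
{
    HANDLE h = CreateFileW(path, GENERIC_READ,
                           FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE,
                           NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);

    REQUEST_OPLOCK_INPUT_BUFFER  in  = { REQUEST_OPLOCK_CURRENT_VERSION, (WORD)sizeof(in),
                                         OPLOCK_LEVEL_CACHE_READ | OPLOCK_LEVEL_CACHE_HANDLE,
                                         REQUEST_OPLOCK_INPUT_FLAG_REQUEST };
    REQUEST_OPLOCK_OUTPUT_BUFFER out = { REQUEST_OPLOCK_CURRENT_VERSION, (WORD)sizeof(out) };
    OVERLAPPED ov = { 0 };
    ov.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    /* The FSCTL stays pending until something (e.g. the rename) breaks the oplock. */
    DeviceIoControl(h, FSCTL_REQUEST_OPLOCK, &in, sizeof(in), &out, sizeof(out), NULL, &ov);

    /* The event signals when the break arrives; the break is only acknowledged once
       the holder reacts (closing the handle is the simplest acknowledgement). If the
       holder's thread is blocked elsewhere - as SearchProtocolHost is here - the
       rename waits forever. */
    WaitForSingleObject(ov.hEvent, INFINITE);
    CloseHandle(h);
}

Nothing in that pattern is wrong per se; it just assumes the holder can always get around to handling the break.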

I’d suggest you take this one up with Microsoft, because in the end everything you have is standard, so it’s their bug (and they can decide how to “fix this”).

In the interim, in your case, you might want to stop the search indexer while you’re running your PowerShell script.
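
(If memory serves, the service behind it is “WSearch”, so an elevated “net stop WSearch” should keep SearchIndexer/SearchProtocolHost quiet for the duration of the run.)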

Tony
OSR

Yup, I think I’ll hand it over to them :)

But before calling it a day and ruining all the fun for myself - two questions:

  • is it possible to get more info about this resource (is it a CS, etc.)?
  • how could I go about figuring out when/where the PS process obtained the lock on resource 0xffffe00083dde4c8? Is there an alternative to ‘~e!dso’ that SOS offers?

I’d like to see where the PS process acquired the lock and what it’s using it for.

Thanks,
Alex