File object locked indefinitely

Lenka_Trochtova · November 14, 2011, 9:17am

Hello,

I am porting a legacy encryption FSFD to Windows 7. The FSFD uses the technique of shadow file objects. The shadow file objects are used for both encrypted files and selected directories, which (may) contain encrypted files.

I use our own automated tests. When running the tests on the network share (smb), the crashes occasionally occur during the test which does directory moving and renaming (it uses Win32 MoveFileEx API). The crashes occur only on the 64-bit version of the operating system and only if verifier is completely disabled and the release version of the FSFD is used. (Probably some fragile timing issue?)

The crashes happen when NtSetFileInfo function tries to wait on an event on its stack. The KEVENT structure is apparently corrupted and the crash happens when the thread is being added into its corrupted wait list.

Just for the experiment I tried to enforce synchronous completion for directory renames in our driver (the driver waits on the lower driver to complete the directory rename request). The crashes stopped occurring, but occasional deadlocks started to occur instead. The deadlocks look always the same:
The stuck thread belongs to the automated test’s process and it is trying to close a handle to the previously renamed directory. The thread waits in the IopAcquireFileObjectLock indefinitely. The file object associated with the handle being closed is locked - i.e. it is busy and its synchronization event FILE_OBJECT::Lock is not signaling. What is signaling is the second file object’s event - the notification event FILE_OBJECT::Event. When inspecting the deadlock, I cannot see any thread with our driver on stack. Neither is there any thread with the file object’s address on its stack (with the exception of the already mentioned stuck thread).

I am quite puzzled, it seems like the file object has been locked and hasn’t been unlocked, but I have no idea how such thing could have happened. As far as I know the file lock should be used only by the I/O Manager, but I suppose the I/O Manager would handle it correctly. I cannot find any other thread working with the file object, there are no pending APCs and trying to find the irp with the specified file object using !irpfind gives me nothing as well.

Do you have any idea what might be causing this? Have you ever seen anything similar? Do you have any tips for me what should I try next?

Regards

Lenka

OSR_Community_User · November 14, 2011, 10:25pm

> Just for the experiment I tried to enforce synchronous completion for directory renames in our driver

(the driver waits on the lower driver to complete the directory rename request).

I have a strong suspect that crashes turned to deadlocks due to this magic.

FILE_OBJECT::Lock is not signaling.

Any OwnerThread field there? and on checked build?

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Lenka_Trochtova · November 15, 2011, 3:59am

On Tue, Nov 15, 2011 at 4:24 AM, Maxim S. Shatskih
wrote:
>> Just for the experiment I tried to enforce synchronous completion for directory renames in our driver
>>(the driver waits on the lower driver to complete the directory rename request).
>
> I have a strong suspect that crashes turned to deadlocks due to this magic.

Yes, I also think so. That’s why I mentioned these two problems
together. I think it is actually the same problem and I hoped
inspecting it as a deadlock as well would reveal more information
about it, but it haven’t helped much.

>>FILE_OBJECT::Lock is not signaling.
>
> Any OwnerThread field there?

FILE_OBJECT::Lock is a KEVENT. I think KEVENT structure has no
OwnerThread entry.

> and on checked build?

Checked build of Windows? I don’t use it. Is it worth the effort which
is needed to install it? I have read about other people having trouble
installing it.

Regarding the checked build of the FSFD, the crashes don’t occur when
it is used. I haven’t verified that the deadlocks would stop to occur
too, but I will do it. But I suppose the deadlocks will stop occurring
with the checked FSFD, because they stopped occurring when I used
verifier (the same as with the crashes).

Thank you for your response.

Regards

Lenka

OSR_Community_User · November 15, 2011, 7:00am

What was your analysis of the crash? I’m assuming you are using driver verifier?

Tony
OSR

Lenka_Trochtova · November 15, 2011, 9:59am

Thank you for your response.

On Tue, Nov 15, 2011 at 1:01 PM, Tony Mason wrote:
> ?I’m assuming you are using driver verifier?

No, when driver verifier is used the system does not crash and
everything works fine.
I can reproduce the crash only if driver verifier is completely disabled.

> What was your analysis of the crash?

!analyze -v

Bugcheck Analysis

IRQL_NOT_LESS_OR_EQUAL (a)
An attempt was made to access a pageable (or completely invalid) address at an
interrupt request level (IRQL) that is too high. This is usually
caused by drivers using improper addresses.
If a kernel debugger is available get the stack backtrace.
Arguments:
Arg1: 0000000000000000, memory referenced
Arg2: 0000000000000002, IRQL
Arg3: 0000000000000001, bitfield :
bit 0 : value 0 = read operation, 1 = write operation
bit 3 : value 0 = not an execute operation, 1 = execute operation
(only on chips which support this level of status)
Arg4: fffff8000288388c, address which referenced memory

Debugging Details:
------------------

WRITE_ADDRESS: 0000000000000000

CURRENT_IRQL: 2

FAULTING_IP:
nt!KeWaitForSingleObject+17c
fffff8000288388c 4c8938 mov qword ptr [rax],r15 DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT BUGCHECK_STR: 0xA PROCESS_NAME: IFDTest.exe TRAP_FRAME: fffff88001e67860 -- (.trap 0xfffff88001e67860) NOTE: The trap frame does not contain all registers. Some register values may be zeroed or incorrect. rax=0000000000000000 rbx=0000000000000000 rcx=fffff88001e67b60 rdx=0000000000000000 rsi=0000000000000000 rdi=0000000000000000 rip=fffff8000288388c rsp=fffff88001e679f0 rbp=0000000000000000 r8=fffff78000000008 r9=0000000000000000 r10=0000000000000001 r11=fffff800029f4e80 r12=0000000000000000 r13=0000000000000000 r14=0000000000000000 r15=0000000000000000 iopl=0 nv up ei pl zr na po nc nt!KeWaitForSingleObject+0x17c: fffff8000288388c 4c8938 mov qword ptr [rax],r15
ds:0010:0000000000000000=???????????????? Resetting default scope LAST_CONTROL_TRANSFER: from fffff80002971682 to fffff80002872660 STACK_TEXT: fffff88001e66fa8 fffff80002971682 : 0000000000000000
fffffa80034e8770 0000000000000065 fffff800028b8b14 : nt!DbgBreakPointWithStatus fffff88001e66fb0 fffff8000297246e : 0000000000000003
0000000000000000 fffff800028b56e0 000000000000000a : nt!KiBugCheckDebugBreak+0x12 fffff88001e67010 fffff8000287a704 : fffffa8002b50010
0000000000000001 fffffa8003772010 fffff800028f9312 : nt!KeBugCheck2+0x71e fffff88001e676e0 fffff80002879b69 : 000000000000000a
0000000000000000 0000000000000002 0000000000000001 : nt!KeBugCheckEx+0x104 fffff88001e67720 fffff800028787e0 : 0000000000100004
0000000000000000 0000000000000105 0000000000000000 : nt!KiBugCheckDispatch+0x69 fffff88001e67860 fffff8000288388c : 0000000000000001
fffff88001e679b0 000000006c546953 fffffa8002b50010 : nt!KiPageFault+0x260 fffff88001e679f0 fffff80002bdd555 : 0000000000000000
fffffa8000000000 0000000000000001 fffffa8002e03100 : nt!KeWaitForSingleObject+0x17c fffff88001e67a90 fffff80002879853 : 0000000000000190
fffffa80034e8770 000000000415d6d8 fffff88000000092 : nt! ?? ::NNGAKEGL::string’+0x55da4

fffff88001e67bb0 000000007757012a : 0000000077404905 0000000000000002 00000000c000000d 0000000000000000 :
nt!KiSystemServiceCopyEnd+0x13

000000000415d6b8 0000000077404905 : 0000000000000002 00000000c000000d 0000000000000000 0000000000000000 :
ntdll!NtSetInformationFile+0xa

000000000415d6c0 0000000000000002 : 00000000c000000d 0000000000000000 0000000000000000 000000000000000a :
kernel32!MoveFileExW+0x275

000000000415d6c8 00000000c000000d : 0000000000000000 0000000000000000 000000000000000a 0000000000204020 :

0x2

000000000415d6d0 0000000000000000 : 0000000000000000 000000000000000a 0000000000204020 0000000000000000 :

0xc000000d

STACK_COMMAND: kb

FOLLOWUP_IP:
nt!KeWaitForSingleObject+17c
fffff8000288388c 4c8938 mov qword ptr [rax],r15 SYMBOL_STACK_INDEX: 6 SYMBOL_NAME: nt!KeWaitForSingleObject+17c FOLLOWUP_NAME: MachineOwner MODULE_NAME: nt IMAGE_NAME: ntkrnlmp.exe DEBUG_FLR_IMAGE_TIMESTAMP: 4b88cfeb FAILURE_BUCKET_ID: X64_0xA_nt!KeWaitForSingleObject+17c BUCKET_ID: X64_0xA_nt!KeWaitForSingleObject+17c Followup: MachineOwner --------- This is the stack without parameters: nt!DbgBreakPointWithStatus nt!KiBugCheckDebugBreak+0x12 nt!KeBugCheck2+0x71e nt!KeBugCheckEx+0x104 nt!KiBugCheckDispatch+0x69 nt!KiPageFault+0x260 nt!KeWaitForSingleObject+0x17c nt! ?? ::NNGAKEGL::string’+0x55da4
nt!KiSystemServiceCopyEnd+0x13
ntdll!NtSetInformationFile+0xa
kernel32!MoveFileExW+0x275
0x2
0xc000000d

nt! ?? ::NNGAKEGL::string'+0x55da4 seems to be actually part of nt!NtSetInformationFile. The thread tries to wait on the KEVENT on its stack. According to the disassembly of nt!KeWaitForSingleObject, the address of the event is: rcx - 8 = fffff88001e67b60 - 8 = fffff88001e67b58. 0: kd> dt -r _KEVENT fffff88001e67b58 ntdll!_KEVENT +0x000 Header : _DISPATCHER_HEADER +0x000 Type : 0x81 '' +0x001 TimerControlFlags : 0 '' +0x001 Absolute : 0y0 +0x001 Coalescable : 0y0 +0x001 KeepShifting : 0y0 +0x001 EncodedTolerableDelay : 0y00000 (0) +0x001 Abandoned : 0 '' +0x001 Signalling : 0 '' +0x002 ThreadControlFlags : 0 '' +0x002 CpuThrottled : 0y0 +0x002 CycleProfiling : 0y0 +0x002 CounterProfiling : 0y0 +0x002 Reserved : 0y00000 (0) +0x002 Hand : 0 '' +0x002 Size : 0 '' +0x003 TimerMiscFlags : 0 '' +0x003 Index : 0y000000 (0) +0x003 Inserted : 0y0 +0x003 Expired : 0y0 +0x003 DebugActive : 0 '' +0x003 ActiveDR7 : 0y0 +0x003 Instrumented : 0y0 +0x003 Reserved2 : 0y0000 +0x003 UmsScheduled : 0y0 +0x003 UmsPrimary : 0y0 +0x003 DpcActive : 0 '' +0x000 Lock : 0x81 +0x004 SignalState : 0 +0x008 WaitListHead : _LIST_ENTRY [0xfffff8a000204020 - 0x0]
+0x000 Flink : 0xfffff8a000204020 _LIST_ENTRY [ 0x00000001fffff750 - 0xffffffff0000000f] +0x008 Blink : (null) The system crashes when the thread is putting itself into the event's wait list. As you can see the event's WaitListHead contains completely insane values of Flink and Blink pointers. I can see an address of a file object on the stack of the thread near the corrupted event. The file object belongs to our FSFD: !fileobj fffffa800378eac0

;LanmanRedirector;Z:000000000001c410\TestServer\Temp\myfolder\XXXTest\Win32_MoveDir\InTree\NDir\Dest\XRDir0N

Device Object: 0xfffffa8002492d10 \FileSystem\Mup
Vpb is NULL
Event signalled
Access: Read Delete SharedRead SharedWrite SharedDelete

Flags: 0x40002
Synchronous IO
Handle Created

File Object is currently busy and has 0 waiters.

FsContext: 0xfffffa8002764230 FsContext2: 0xfffffa80033de838
CurrentByteOffset: 0

XRDir0N is one of the test directories renamed by our automated tests.
In this case the directory contains encrypted files and it is moved
within the directory tree, but I have also seen this crash with empty
directories or with directories renamed in place.

As we are using the shadow FO technique, there is another FO pointed
from the mentioned FO’s FsContext2 structure.
We send this second file object to the lower device:

!fileobj 0xfffffa80`033de760

;Z:000000000001c410\TestServer\Temp\myfolder\XXXTest\Win32_MoveDir\InTree\NDir\Src\XRDir0N

Device Object: 0xfffffa8002492d10 \FileSystem\Mup
Vpb is NULL
Access: Read Delete SharedRead SharedWrite SharedDelete

Flags: 0x40002
Synchronous IO
Handle Created

FsContext: 0xfffff8a0029b3a90 FsContext2: 0xfffff8a0029b3e30
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffa8003b50cd8
Shared Cache Map: 00000000

File object extension is at fffffa80039b7270:

Lenka_Trochtova · November 15, 2011, 10:02am

> Regarding the checked build of the FSFD, the crashes don’t occur when

it is used. I haven’t verified that the deadlocks would stop to occur
too, but I will do it.

With the checked build of the FSFD, the deadlocks stopped to occur as well.

OSR_Community_User · November 15, 2011, 10:17pm

> Checked build of Windows? I don’t use it. Is it worth the effort which

is needed to install it? I have read about other people having trouble
installing it.

No, only checked kernel and HAL.

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Lenka_Trochtova · November 16, 2011, 3:32am

> No, only checked kernel and HAL.

No, I don’t use it, but I will give it a try. Thank you.

OSR_Community_User · November 16, 2011, 5:02am

> No, I don’t use it, but I will give it a try. Thank you.

Also note that this lock is acquired in NtXxxFile or IopSynchronousServiceTail, and released in IopCompleteRequest.

Probably you spoil the IRP in way so that IopCompleteRequest forgets to release the lock.

You can, for instance, register all in-progress IRPs for your code to some table. On dispatch entry, the IRP is registered. And, instead of IoCompleteRequest, you call MyIoCompleteRequest which removes the table entry and calls IoCompleteRequest.

This way you can, for instance, set breakpoint on your “move” IRP completion, you can check the lock state at this moment and the IRP state too, and can deduce something.

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Lenka_Trochtova · November 21, 2011, 6:46am

Hello,

I have had several free days and I have let the tests run with the
checked kernel and HAL meantime.
Unfortunatelly the crash has not occured. So it seems the crash occurs
only within the most debug-unfriendly conditions.

On Wed, Nov 16, 2011 at 9:31 AM, Lenka Trochtova
wrote:
>> No, only checked kernel and HAL.
>
> No, I don’t use it, but I will give it a try. Thank you.
>

Now, I am planning to do more experiments including the expetiment
with IRP’s you have suggested me. I will let you know when I know
something new.

Thank you.

Lenka

Lenka_Trochtova · December 1, 2011, 4:42am

Hello,

I have fixed it. It was a stack corruption caused by incorrect
TopLevelIrp handling.
The stack was corrupted in such a way that a boolean variable was
overwritten by the least significant byte of an address.
The boolean variable was telling the NtSetInformationFile function
whether the operation is synchronous or not and it should have been
true within our test. It worked most of the time, only occassionally
the least significant byte of the address was 0, in which case the
function started to process the rename operation as a synchronous
operation and switched to asynchronous processing after calling the
driver. The result was following then:
If the driver returned STATUS_PENDING the function tryed to wait on
the event on its stack and crashed, because the event had not been
initialized (

On Mon, Nov 21, 2011 at 12:45 PM, Lenka Trochtova
wrote:
> Hello,
>
> I have had several free days and I have let the tests run with the
> checked kernel and HAL meantime.
> Unfortunatelly the crash has not occured. So it seems the crash occurs
> only within the most debug-unfriendly conditions.
>
> On Wed, Nov 16, 2011 at 9:31 AM, Lenka Trochtova
> wrote:
>>> No, only checked kernel and HAL.
>>
>> No, I don’t use it, but I will give it a try. Thank you.
>>
>
> Now, I am planning to do more experiments including the expetiment
> with IRP’s you have suggested me. I will let you know when I know
> something new.
>
> Thank you.
>
> Lenka

Lenka_Trochtova · December 1, 2011, 4:48am

I am sorry, I hit send too soon.

On Thu, Dec 1, 2011 at 10:41 AM, Lenka Trochtova
wrote:
> Hello,
>
> I have fixed it. It was a stack corruption caused by incorrect
> TopLevelIrp handling.
> The stack was corrupted in such a way that a boolean variable was
> overwritten by the least significant byte of an address.
> The boolean variable was telling the NtSetInformationFile function
> whether the operation is synchronous or not and it should have been
> true within our test. It worked most of the time, only occassionally
> the least significant byte of the address was 0, in which case the
> function started to process the rename operation as a synchronous
> operation and switched to asynchronous processing after calling the
> driver. The result was following then:
> ?If the driver returned STATUS_PENDING the function tryed to wait on
> the event on its stack and crashed, because the event had not been
> initialized (

the event had not been initialized because it was actually synchronous
operation at the beginning.

If the driver returned other status the NtSetInformationFile function
didn’t release the file lock, because it thought it hadn’t been
acquired (it is not acquired for asynchronous operations).

So… this explains why the problem lead to crashes and when I
enforced synchronous processing of the operation within the driver the
crashes turned into deadlocks.

Thank you for help.

Lenka

OSR_Community_User · December 1, 2011, 6:23am

> I have fixed it. It was a stack corruption caused by incorrect

TopLevelIrp handling.
The stack was corrupted in such a way that a boolean variable was
overwritten by the least significant byte of an address.

Yes, this was the bug congratulations!

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com