We’ve been chasing down some really nasty BSOD problems that only seem to occur when we run our FS minifilter under very heavy load and only after running for hours and sometimes days.
At first it seemed the crashes were pretty much random with very little to no good information in the dumps to tie back to our filter. But after looking over a half dozen or so dumps and adding better backtrace capability to our filter we are starting to notice some similarity and common theme. Hoping that with posting here what we’ve gleaned so far someone might have some ideas of what we might do next to move closer to a resolution.
Gory details follow…I tend to err on the side of too-much-detail so sorry in advance :):
This problem only happens when we run our fs mini-filter in a proxy/shadow mode where we ‘borrow’ or take ‘ownership’ of some file objects that normally NTFS would be managing.
So basically, in our PostCreate:
-
We watch for STATUS_OBJECT_NAME/PATH_NOT_FOUND errors for files that don’t exist on the NTFS local drive/path that we are filtering.
-
When we see this case we do a check (using FltCreateFile) to see if the file in question actually exists on a remote file server\share.
-
If file DOES exists on remote server, we then take ownership of the failed TargetFileObject,setting up all of the required FO fields, linking the local TargetFileObject to our remote file handle/object and other context info (using StreamHandleContext and then return STATUS_SUCCESS.
So now that we’ve essentially hijacked a FO that NTFS would normally manage, our filter becomes much more complex and we have to do whatever it takes to prevent one of these hijacked FOs of ever making it’s way down to NTFS, so that we avoid just the kind of random crashes/BSODs that we are seeing in this case. (BTW I know this way of doing things if fraught with peril but it turns out for what we are trying to do and after much research and consultation this is the only way to do it).
So after many hassles we seem to have things working pretty well most of the time. As I mentioned earlier the BSODs we still see only happen when under extreme load and after running for quite a while, and only on some of our test servers.
I’ve copied the contents from a recent crash dump below. In this case it would appear that somehow one of the FO’s that we own (address fffffa8007445320, filename users\sluser49\file1.txt) got passed to NTFS - who as would be expected eventually crashed the system by assuming certain things about the FO/FsContexts that aren’t true.
In our filter trace buffer I can see that prior to this our filter did indeed take ownership of this FO and completed the Create successfully in our PostCreate function. But after that I see no calls to any of our Pre/Post functions.
In this case there are dozens of clients/users running various test scripts, mostly doing DIRs and dumping the contents of various file1.txt files from a mapped drive to the server running our mini-filter. In all cases the file1.txt files laying around in user-specific subdirs doesn’t actually reside on the server the clients are mapped to but actually reside on another remote server that we are redirecting to as described earlier.
Any ideas or suggestions where we should go from here are greatly appreciated. In some of the other crash dumps I’ve seen where it appears the FO’s that we had owned earlier and have completed/close somehow get resurrected with a Close on the FO being passed down to NTFS then we BSOD because NTFS attempts to free memory that has already been freed.
0: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffffc0000005, The exception code that was not handled
Arg2: fffffa60012cc021, The address that the exception occurred at
Arg3: fffffa60055da678, Exception Record Address
Arg4: fffffa60055da050, Context Record Address
Debugging Details:
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
FAULTING_IP:
Ntfs!NtfsCopyReadA+d1
fffffa60`012cc021 488b4850 mov rcx,qword ptr [rax+50h]
EXCEPTION_RECORD: fffffa60055da678 – (.exr 0xfffffa60055da678)
ExceptionAddress: fffffa60012cc021 (Ntfs!NtfsCopyReadA+0x00000000000000d1)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000050
Attempt to read from address 0000000000000050
CONTEXT: fffffa60055da050 – (.cxr 0xfffffa60055da050)
rax=0000000000000000 rbx=0000000000000e00 rcx=fffffa60055daa58
rdx=0000000000000000 rsi=fffffa8007445320 rdi=fffff8800f1dcb00
rip=fffffa60012cc021 rsp=fffffa60055da8b0 rbp=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000200
r11=fffffa60055da910 r12=0000000000000001 r13=0000000000000200
r14=fffffa80075cd650 r15=0000000000000001
iopl=0 nv up ei ng nz na pe nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010282
Ntfs!NtfsCopyReadA+0xd1:
fffffa60012cc021 488b4850 mov rcx,qword ptr [rax+50h] ds:002b:0000000000000050=???
Resetting default scope
PROCESS_NAME: System
CURRENT_IRQL: 0
ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.
EXCEPTION_PARAMETER1: 0000000000000000
EXCEPTION_PARAMETER2: 0000000000000050
READ_ADDRESS: 0000000000000050
FOLLOWUP_IP:
srv2!Smb2ExecuteRead+574
fffffa60`051c5254 413ac7 cmp al,r15b
BUGCHECK_STR: 0x7E
DEFAULT_BUCKET_ID: NULL_CLASS_PTR_DEREFERENCE
LAST_CONTROL_TRANSFER: from fffffa6000cc3248 to fffffa60012cc021
STACK_TEXT:
fffffa60055da8b0 fffffa6000cc3248 : 0000000000000001 fffffa60055da910 0000000000000000 0000000000000001 : Ntfs!NtfsCopyReadA+0xd1
fffffa60055daaa0 fffffa6000cc61d5 : fffffa60055dab80 0000000000000000 fffffa8007445303 0000000000000000 :
fltmgr!FltpPerformFastIoCall+0x88
fffffa60055dab00 fffffa6000ce0599 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 :
fltmgr!FltpPassThroughFastIo+0xb5
fffffa60055dab50 fffffa60051c5254 : fffffa8000000200 0000000000000000 fffffa800e3e0001 fffffa800bb74080 : fltmgr!FltpFastIoRead+0x1a9
fffffa60055dabf0 fffffa60051af657 : fffffa800e3eb010 fffffa800e3eb290 0000000000000004 fffffa800522b500 : srv2!Smb2ExecuteRead+0x574
fffffa60055dac80 fffffa60051d2539 : fffffa800e3eb010 fffffa800e3eb010 0000000000000004 fffffa800522b500 :
srv2!Smb2ExecuteProviderCallback+0xd7
fffffa60055dace0 fffffa60051d1db5 : fffffa800522b500 0000000000000003 fffffa800522b500 fffffa800e3eb020 : srv2!SrvProcessPacket+0x59
fffffa60055dad10 fffff80001a788b3 : fffffa800522e270 0000000000000080 fffffa800522b500 fffffa800398c3d0 :
srv2!SrvProcWorkerThread+0x175
fffffa60055dad50 fffff8000188f4f6 : fffffa60005ec180 fffffa800522e270 fffffa8005088bb0 0000000000000001 :
nt!PspSystemThreadStartup+0x57
fffffa60055dad80 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x16
SYMBOL_STACK_INDEX: 4
SYMBOL_NAME: srv2!Smb2ExecuteRead+574
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: srv2
IMAGE_NAME: srv2.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 4c1b8ce4
STACK_COMMAND: .cxr 0xfffffa60055da050 ; kb
FAILURE_BUCKET_ID: X64_0x7E_srv2!Smb2ExecuteRead+574
BUCKET_ID: X64_0x7E_srv2!Smb2ExecuteRead+574
Followup: MachineOwner
0: kd> !fileobj fffffa8007445320
users\sluser49\file1.txt
Related File Object: 0xfffffa8007be3670
Device Object: 0xfffffa800402ecc0 \Driver\volmgr
Vpb: 0xfffffa800402ec00
Access: Read SharedRead SharedWrite
Flags: 0x1040000
Handle Created
Remote Origin
FsContext: 0xfffff8800f1dcb00 FsContext2: 0xfffff8800f1dce98
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffa800bba5bf8
Shared Cache Map: 00000000