BSOD problem under load with FS mini-filter running in proxy/shadow file object mode

We’ve been chasing down some really nasty BSOD problems that only seem to occur when we run our FS mini-filter under very heavy load, and only after it has been running for hours, sometimes days.

At first the crashes seemed pretty much random, with little to no useful information in the dumps to tie back to our filter. But after looking over a half dozen or so dumps and adding better backtrace capability to our filter, we are starting to notice some similarities and a common theme. I’m hoping that by posting what we’ve gleaned so far, someone might have ideas about what we could do next to move closer to a resolution.

Gory details follow… I tend to err on the side of too much detail, so apologies in advance. :)

This problem only happens when we run our FS mini-filter in a proxy/shadow mode where we ‘borrow’ or take ‘ownership’ of some file objects that NTFS would normally be managing.

So basically, in our PostCreate (a simplified code sketch follows below):

  • We watch for STATUS_OBJECT_NAME_NOT_FOUND/STATUS_OBJECT_PATH_NOT_FOUND errors for files that don’t exist on the local NTFS drive/path that we are filtering.

  • When we see this case, we do a check (using FltCreateFile) to see if the file in question actually exists on a remote file server\share.

  • If the file DOES exist on the remote server, we take ownership of the failed TargetFileObject, setting up all of the required FO fields and linking the local TargetFileObject to our remote file handle/object and other context info (using a StreamHandleContext), and then return STATUS_SUCCESS.

So now that we’ve essentially hijacked a FO that NTFS would normally manage, our filter becomes much more complex, and we have to do whatever it takes to prevent one of these hijacked FOs from ever making its way down to NTFS, so that we avoid just the kind of random crashes/BSODs that we are seeing in this case. (BTW, I know this way of doing things is fraught with peril, but for what we are trying to do, and after much research and consultation, this turns out to be the only way to do it.)
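In rough outline, the PostCreate path looks like the sketch below (heavily simplified; ProxyOpenRemoteCopy and ProxySetupFileObject are stand-ins for our own helpers, and all error handling/cleanup is elided):

FLT_POSTOP_CALLBACK_STATUS
ProxyPostCreate(
    _Inout_ PFLT_CALLBACK_DATA Data,
    _In_ PCFLT_RELATED_OBJECTS FltObjects,
    _In_opt_ PVOID CompletionContext,
    _In_ FLT_POST_OPERATION_FLAGS Flags
    )
{
    NTSTATUS status = Data->IoStatus.Status;
    HANDLE remoteHandle;
    PFILE_OBJECT remoteFileObject;

    UNREFERENCED_PARAMETER(CompletionContext);
    UNREFERENCED_PARAMETER(Flags);

    //
    // Only interested in name/path-not-found failures on the local
    // NTFS volume we are filtering.
    //
    if (status != STATUS_OBJECT_NAME_NOT_FOUND &&
        status != STATUS_OBJECT_PATH_NOT_FOUND) {
        return FLT_POSTOP_FINISHED_PROCESSING;
    }

    //
    // Probe the remote server\share for the same relative path
    // (internally this uses FltCreateFile against the remote path).
    //
    if (!NT_SUCCESS(ProxyOpenRemoteCopy(FltObjects, Data,
                                        &remoteHandle,
                                        &remoteFileObject))) {
        return FLT_POSTOP_FINISHED_PROCESSING;
    }

    //
    // Take ownership of the failed TargetFileObject: set up the FO
    // fields NTFS never initialized, and attach a StreamHandleContext
    // linking this FO to the remote handle/object.
    //
    ProxySetupFileObject(Data->Iopb->TargetFileObject,
                         remoteHandle,
                         remoteFileObject);

    //
    // Convert the failed create into a success.
    //
    Data->IoStatus.Status = STATUS_SUCCESS;
    Data->IoStatus.Information = FILE_OPENED;

    return FLT_POSTOP_FINISHED_PROCESSING;
}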

So after many hassles we seem to have things working pretty well most of the time. As I mentioned earlier, the BSODs we still see happen only under extreme load, only after running for quite a while, and only on some of our test servers.

I’ve copied the contents of a recent crash dump below. In this case it would appear that somehow one of the FOs that we own (address fffffa8007445320, filename users\sluser49\file1.txt) got passed to NTFS, which, as would be expected, eventually crashed the system by assuming certain things about the FO/FsContexts that aren’t true.

In our filter trace buffer I can see that, prior to this, our filter did indeed take ownership of this FO and completed the Create successfully in our PostCreate function. But after that I see no calls to any of our Pre/Post functions.

In this case there are dozens of clients/users running various test scripts, mostly doing DIRs and dumping the contents of various file1.txt files from a drive mapped to the server running our mini-filter. In all cases the file1.txt files in the user-specific subdirs don’t actually reside on the server the clients are mapped to; they reside on the other remote server that we redirect to as described earlier.

Any ideas or suggestions on where we should go from here are greatly appreciated. In some of the other crash dumps, it appears that FOs we had owned earlier and already completed/closed somehow get resurrected: a Close on the FO gets passed down to NTFS, and we BSOD because NTFS attempts to free memory that has already been freed.

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

SYSTEM_THREAD_EXCEPTION_NOT_HANDLED (7e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Arguments:
Arg1: ffffffffc0000005, The exception code that was not handled
Arg2: fffffa60012cc021, The address that the exception occurred at
Arg3: fffffa60055da678, Exception Record Address
Arg4: fffffa60055da050, Context Record Address

Debugging Details:
------------------

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

FAULTING_IP:
Ntfs!NtfsCopyReadA+d1
fffffa60`012cc021 488b4850 mov rcx,qword ptr [rax+50h]

EXCEPTION_RECORD: fffffa60055da678 -- (.exr 0xfffffa60055da678)
ExceptionAddress: fffffa60012cc021 (Ntfs!NtfsCopyReadA+0x00000000000000d1)
ExceptionCode: c0000005 (Access violation)
ExceptionFlags: 00000000
NumberParameters: 2
Parameter[0]: 0000000000000000
Parameter[1]: 0000000000000050
Attempt to read from address 0000000000000050

CONTEXT: fffffa60055da050 -- (.cxr 0xfffffa60055da050)
rax=0000000000000000 rbx=0000000000000e00 rcx=fffffa60055daa58
rdx=0000000000000000 rsi=fffffa8007445320 rdi=fffff8800f1dcb00
rip=fffffa60012cc021 rsp=fffffa60055da8b0 rbp=0000000000000000
r8=0000000000000000 r9=0000000000000000 r10=0000000000000200
r11=fffffa60055da910 r12=0000000000000001 r13=0000000000000200
r14=fffffa80075cd650 r15=0000000000000001
iopl=0 nv up ei ng nz na pe nc
cs=0010 ss=0018 ds=002b es=002b fs=0053 gs=002b efl=00010282
Ntfs!NtfsCopyReadA+0xd1:
fffffa60012cc021 488b4850 mov rcx,qword ptr [rax+50h] ds:002b:0000000000000050=???
Resetting default scope

PROCESS_NAME: System

CURRENT_IRQL: 0

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1: 0000000000000000

EXCEPTION_PARAMETER2: 0000000000000050

READ_ADDRESS: 0000000000000050

FOLLOWUP_IP:
srv2!Smb2ExecuteRead+574
fffffa60`051c5254 413ac7 cmp al,r15b

BUGCHECK_STR: 0x7E

DEFAULT_BUCKET_ID: NULL_CLASS_PTR_DEREFERENCE

LAST_CONTROL_TRANSFER: from fffffa6000cc3248 to fffffa60012cc021

STACK_TEXT:
fffffa60055da8b0 fffffa6000cc3248 : 0000000000000001 fffffa60055da910 0000000000000000 0000000000000001 : Ntfs!NtfsCopyReadA+0xd1
fffffa60055daaa0 fffffa6000cc61d5 : fffffa60055dab80 0000000000000000 fffffa8007445303 0000000000000000 : fltmgr!FltpPerformFastIoCall+0x88
fffffa60055dab00 fffffa6000ce0599 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : fltmgr!FltpPassThroughFastIo+0xb5
fffffa60055dab50 fffffa60051c5254 : fffffa8000000200 0000000000000000 fffffa800e3e0001 fffffa800bb74080 : fltmgr!FltpFastIoRead+0x1a9
fffffa60055dabf0 fffffa60051af657 : fffffa800e3eb010 fffffa800e3eb290 0000000000000004 fffffa800522b500 : srv2!Smb2ExecuteRead+0x574
fffffa60055dac80 fffffa60051d2539 : fffffa800e3eb010 fffffa800e3eb010 0000000000000004 fffffa800522b500 : srv2!Smb2ExecuteProviderCallback+0xd7
fffffa60055dace0 fffffa60051d1db5 : fffffa800522b500 0000000000000003 fffffa800522b500 fffffa800e3eb020 : srv2!SrvProcessPacket+0x59
fffffa60055dad10 fffff80001a788b3 : fffffa800522e270 0000000000000080 fffffa800522b500 fffffa800398c3d0 : srv2!SrvProcWorkerThread+0x175
fffffa60055dad50 fffff8000188f4f6 : fffffa60005ec180 fffffa800522e270 fffffa8005088bb0 0000000000000001 : nt!PspSystemThreadStartup+0x57
fffffa60055dad80 0000000000000000 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000 : nt!KiStartSystemThread+0x16

SYMBOL_STACK_INDEX: 4

SYMBOL_NAME: srv2!Smb2ExecuteRead+574

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: srv2

IMAGE_NAME: srv2.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 4c1b8ce4

STACK_COMMAND: .cxr 0xfffffa60055da050 ; kb

FAILURE_BUCKET_ID: X64_0x7E_srv2!Smb2ExecuteRead+574

BUCKET_ID: X64_0x7E_srv2!Smb2ExecuteRead+574

Followup: MachineOwner

0: kd> !fileobj fffffa8007445320

users\sluser49\file1.txt

Related File Object: 0xfffffa8007be3670

Device Object: 0xfffffa800402ecc0 \Driver\volmgr
Vpb: 0xfffffa800402ec00
Access: Read SharedRead SharedWrite

Flags: 0x1040000
Handle Created
Remote Origin

FsContext: 0xfffff8800f1dcb00 FsContext2: 0xfffff8800f1dce98
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffa800bba5bf8
Shared Cache Map: 00000000

My first question would be: what are you indicating in the IsFastIoPossible field of your common FCB header structure? If that is set to FastIoIsNotPossible then these calls shouldn’t be sent via fast I/O in the first place (and thus you’re hitting an SMB2 implementation bug.) If that is set to FastIoIsPossible or FastIoIsQuestionable then you might try setting it to FastIoIsNotPossible.
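In other words, something along these lines when you set up the FCB behind one of your hijacked file objects (illustrative only; whatever structure your FsContext points at must begin with the common header):

//
// Illustrative only: whatever FsContext points at must begin with an
// FSRTL_COMMON_FCB_HEADER; marking it FastIoIsNotPossible should keep
// these requests off the fast I/O path entirely.
//
PFSRTL_COMMON_FCB_HEADER header =
    (PFSRTL_COMMON_FCB_HEADER)FileObject->FsContext;

header->IsFastIoPossible = FastIoIsNotPossible;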

My other question would be whether or not you have registered (in your mini-filter) for fast I/O operations. If so, you should be seeing this Fast I/O read call in your mini-filter before it gets passed to NTFS (assuming you are monitoring that instance.) If not, then it sounds like you have some peculiar issue with filter manager.
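To be clear about the mechanism: filter manager delivers the fast I/O read through the same IRP_MJ_READ pre/post callbacks you register, and you can distinguish the fast path inside the callback. A minimal registration would look like this (the callback names are placeholders):

//
// Minimal sketch: fast I/O reads arrive at the same IRP_MJ_READ
// callbacks; FLT_IS_FASTIO_OPERATION(Data) tells you which path you
// are on inside the pre-operation callback.
//
const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_CREATE, 0, NULL,         ProxyPostCreate },
    { IRP_MJ_READ,   0, ProxyPreRead, ProxyPostRead },
    { IRP_MJ_OPERATION_END }
};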

Can you tell if this file object was opened in relative mode? That is, were you given a RelatedFileObject when the file was first opened (or rather, when the open was first attempted)? Whether that is causing the issue is not clear, but it is definitely something else to check in such a case. I’m guessing the answer to this is “yes” because the name in the file object looks like a relative name. Again, that seems somewhat suspicious.

Did you look at how SMB2 is invoking the fast I/O function? I’m asking because in the past SMB had this habit of building a “table” of fast I/O entry points. If it inherited that table from the related file object (for example) then it might explain this behavior - in which case you have an SMB bug that you can report back to Microsoft. Still, it seems like it should be calling your mini-filter, assuming you’ve set up properly for fast I/O.

Tony
OSR

Some more questions:

  • Which operating system?
  • Are you setting up FileObject->SOP (SectionObjectPointers)? And if so, to
    what? This is a game-changer.
  • Equally what about the FileObject->Vpb?

Finally, you say “But after that I see no calls to any of our Pre/Post
functions”. How are you determining that? By FileObject address, or by
StreamHandleContext? Are you filtering all devices (just in case)?

Thanks for the feedback, Tony and Rod.

Tony - I’ll try disabling fast I/O as you’re describing and see what happens. Currently we DO register for fast I/O operations, but for file objects we are interested in (i.e. file objects to which we have attached a StreamHandleContext) we check for FLT_IS_FASTIO_OPERATION and return FLT_PREOP_DISALLOW_FASTIO.

After adding more backtrace info to our filter log, I know a little more now. I can see that normally, for the file1.txt reads, our fast I/O handler does get called on a fast read, we DO get our StreamHandle context and return DISALLOW_FASTIO, and then our handler for the non-fast/normal read gets called and all is well.

However, in the case when it crashes, for some reason the call to FltGetStreamHandleContext to retrieve our context (to see if we care about the operation) returns an ERROR, and we currently return FLT_PREOP_SUCCESS_NO_CALLBACK, which lets the I/O get down to NTFS and thus the crash.
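To make that concrete, our pre-read is shaped roughly like this (simplified; MY_STREAMHANDLE_CONTEXT is our own context type and the remote-read logic is elided):

FLT_PREOP_CALLBACK_STATUS
ProxyPreRead(
    _Inout_ PFLT_CALLBACK_DATA Data,
    _In_ PCFLT_RELATED_OBJECTS FltObjects,
    _Flt_CompletionContext_Outptr_ PVOID *CompletionContext
    )
{
    PMY_STREAMHANDLE_CONTEXT context = NULL;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(CompletionContext);

    status = FltGetStreamHandleContext(FltObjects->Instance,
                                       FltObjects->FileObject,
                                       (PFLT_CONTEXT *)&context);
    if (!NT_SUCCESS(status)) {
        //
        // No context, so we assume this FO is not one of ours and let
        // it through to NTFS. Under heavy load this assumption
        // occasionally turns out to be wrong for a hijacked FO, and
        // that is exactly when we crash.
        //
        return FLT_PREOP_SUCCESS_NO_CALLBACK;
    }

    FltReleaseContext(context);

    if (FLT_IS_FASTIO_OPERATION(Data)) {
        //
        // Force the read to come back around as a regular IRP.
        //
        return FLT_PREOP_DISALLOW_FASTIO;
    }

    //
    // Satisfy the read from the remote handle/object (elided)...
    //
    return FLT_PREOP_COMPLETE;
}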

So I guess my question is: Why, occasionally and apparently only under heavy load, are we NOT able to get our StreamHandle context? I’m not sure at this point what error code we are getting; all I know is that we don’t get STATUS_SUCCESS.

My fear is that even if I disable fast I/O, I’ll end up seeing the same type of crash in the non-fast read path if we are not reliably able to get our stream context.
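The next thing I plan to do is capture the exact status in that failure path, along these lines (a sketch; DbgPrint stands in for our trace buffer):

//
// Instrumentation sketch: record exactly what
// FltGetStreamHandleContext returned before we fall through.
//
if (!NT_SUCCESS(status)) {
    DbgPrint("ProxyPreRead: FltGetStreamHandleContext failed 0x%08X "
             "FO=%p Instance=%p FastIo=%d\n",
             status,
             FltObjects->FileObject,
             FltObjects->Instance,
             FLT_IS_FASTIO_OPERATION(Data) ? 1 : 0);
    return FLT_PREOP_SUCCESS_NO_CALLBACK;
}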

You should be able to get the StreamHandleContext regardless of the load on
the system. Could you please investigate some more around that area and
perhaps find out what the error is?

Thanks,
Alex.

> However, in the case when it crashes, for some reason the call to
> FltGetStreamHandleContext to retrieve our context (to see if we care
> about the operation) returns an ERROR, and we currently return
> FLT_PREOP_SUCCESS_NO_CALLBACK, which lets the I/O get down to NTFS and
> thus the crash.
>
> So I guess my question is: Why, occasionally and apparently only under
> heavy load, are we NOT able to get our StreamHandle context? I’m not
> sure at this point what error code we are getting; all I know is that
> we don’t get STATUS_SUCCESS.

This was what I was (obliquely) referring to in my questions. Has the
Instance changed? This is an interesting failure which crops up under load
in the presence of a format, dismount, or media removal: without FO->Vpb
being set up, FltMgr will collect the instance from FO->DO->Vpb, and that
may be a different Instance.
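One cheap way to confirm this is to stamp the Instance into your
StreamHandleContext at create time and compare it on every subsequent hit.
A sketch (names hypothetical):

//
// Sketch (names hypothetical): remember which instance owned the FO
// when it was hijacked, then watch for a mismatch later.
//
typedef struct _MY_STREAMHANDLE_CONTEXT {
    PFLT_INSTANCE CreatingInstance;   // stamped in PostCreate
    // ... remote handle/object and other fields elided ...
} MY_STREAMHANDLE_CONTEXT, *PMY_STREAMHANDLE_CONTEXT;

//
// In each pre-operation, after FltGetStreamHandleContext succeeds:
//
if (context->CreatingInstance != FltObjects->Instance) {
    DbgPrint("Instance changed: create=%p now=%p FO=%p\n",
             context->CreatingInstance,
             FltObjects->Instance,
             FltObjects->FileObject);
}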


Rod Widdowson
Consulting Partner
Steading System Software LLP
+44 1368 850217 +1 508 915 4790