Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results
The free OSR Learning Library has more than 50 articles on a wide variety of topics about writing and debugging device drivers and Minifilters. From introductory level to advanced. All the articles have been recently reviewed and updated, and are written using the clear and definitive style you've come to expect from OSR over the years.
Check out The OSR Learning Library at: https://www.osr.com/osr-learning-library/
We have a file protection minifilter that protects some files from getting written on. The problem is that on very few systems that have both kasperseky and our minifilter, sometimes the system "freezes", but its not a full freeze, mouse signals seems to still work because they can move the mouse, but thats the only thing that works, and for example cntrl+alt+delete doesnt work to bring up task manager. This happens both in Win7 and Win10 machines, And if they unload our driver, the machines no longer freeze.
The problem is that when the machine freezes like this, we can only move the mouse so the output of dbgview or procmon cannot be captured anymore.
So we set up the required keys in the kbdhid service's registry to force BSOD, and indeed that one worked and we managed to get a full kernel dump to work it and find out what is happening.
So I thought i ask the experts here on how should i go on about finding the root cause of this freeze? What windbg commands do you guys suggest i should use to find what is causing this?
Upcoming OSR Seminars | ||
---|---|---|
OSR has suspended in-person seminars due to the Covid-19 outbreak. But, don't miss your training! Attend via the internet instead! | ||
Kernel Debugging | 16-20 October 2023 | Live, Online |
Developing Minifilters | 13-17 November 2023 | Live, Online |
Internals & Software Drivers | 4-8 Dec 2023 | Live, Online |
Writing WDF Drivers | 10-14 July 2023 | Live, Online |
Comments
My go to behavior is pretty "brute force and ignorance" but FWIW:
.logappend
to a file (the next two commands product a lot of output will be long)!locks
!process 0 7
(this will take a fair chunk of time)Then open the log file in an editor where it’s easy to search.
9 times out of 10 the output of
!locks
will lead you to a deadlock then its just a matter of working out what you are holding that Kaspersky is taking umbrage to.From your description it feels like something is blocked while holding the system volume locked - all the other threads will be backed up behind that - it's a matter of working out why that thread isn't moving
If
!locks
doesn't help you are going to have to spelunk through all the threads (the output of the!process
command) looking for candidates for a deadlock -look for threads with your driver on the stack or with kaspersky on the stack.!stacks 2 <yourdrvier.sys>
just lists the threads where you are on the stack and can be helpful but I find that I usually need to be able to peruse all the threads at some stage and so doing it up front saves me time in the long run.Hi Rod, thank you for the reply.
I am looking through the output of the commands you suggested as i am writing this, but as a side note, i found our driver and kaspersky's klfle in the stack of one of the cpu cores, although i am not sure whether its related to the problem or not, but in this stack we are calling FltCreateFile and then kaspersky is calling FltGetFileNameInformation, probably on the filename that we just tried to open. And we are opening the file using FILE_SHARE_READ/WRITE/DELETE and also with the READ+WRITE+SYNCHRONIZE access and FILE_SYNCHRONOUS_IO_NONALERT flag, and the file is a local file and not a network file.
output of locks:
Looking at the output of !locks, it seems like there are no deadlocks, and looking through the output of !process 0 7, this is the only thread that i can find both our driver and kaspersky in the stack:
Interestingly, i just remembered a while ago we had a BSOD issue with another AV (manual BSOD, not a Windows BSOD), that also had a similar call stack:
So in both cases, this starts in our preclose callback in the minifilter when we try to open a handle to the backup of file, using the name of the backup file that we stored on the context (and the name is correct), and in both cases, the target AV is calling FltGetFileNameInformation on that file after we use the FltCreateFile, anyone has any idea what is happening? I explained the flags that we pass to FltCreateFile in the above post for anyone interested.
What is thread ffffb408e3bbc080 doing?
-scott
OSR
This is a chrome.exe thread, and it seems like its doing some stuff one of its tmp files in appdata, and its kernel time seem to be relatively high compared to other threads:
This is its callstack:
And after further investigating its IRP list and their corresponding FILE_OBJECT, it seems like all related to some tmp files in its cachestore, and all the IRPs (close,cleanup,create for \FileSystem\FltMgr) are pending?
or Cleanup was always involved, the other IO could be any.
There is always a deadlock, just not visible with !locks sometimes. Not our
locks, NTFS locls on the file.
It looks like some combo of file IO can cause it, but I also saw it with
some AVs only, never with just our filter. Iirc, Avast and Symantec SEP.
Not directly helpful, but you are not alone.
We moved some IO in the end, we had no other choice.
Dejan.
What status is NTFS raising there?
And I can tell you where to go from there...
Any chance you can put the dump somewhere? It would be faster than trying to do it over the forum.
-scott
OSR
I asked the customer and unfortunately we are not allowed to share the dump file.
So it seems like the status is 0xC0000005 = STATUS_ACCESS_VIOLATION:
(based on the fact that the EDI is the status value and its content and the rbx+10h both are 0xC0000005)
Rdi is 80000005 there and that is a normal error for Ntfs to cause an
exception on (stupid, but it does).
It's buffer overflow, quite common code for an Information Query request.
I apologize, indeed it is a buffer overflow error (0x80000005), i thought rdi was the classic 0xC0000005, seems like I'm going blind.
So can this be happening because of some sort of loop in the minifilter stack? How can i find who's fault is it? We are following the rules and using FltCreate instead of NtCreate, thus the request should only go below us and not loop back, and i don't see anyone not following the rule on the call stack.
So do you think this is because of windows running out of kernel stack limit? How can i check if this is what's causing it? Could this error even be related to why the system is freezing?
I might also try to reduce our altitude below the AV and see if that helps or not.
call (both file info query and directory listing query).
I don't see how it relates to a stack limit?
So how can we find the root cause of this problem? The problem still exists.
Is this a bug in chrome.exe or Kaspersky or our driver or..?
Considering that it seems like the error code is 0x80000005, can this be caused loop in chrome.exe or..? Anyway I can find the root cause of this?
just looks strange the first time you see the stack.
The issue is not there.
I wish I knew myself what it is, as it got pretty common.
My experience is that a Close on the file was (almost) always involved, and
my guess is that some parent folder lock + file lock hierarchy is wrong.
But that is all I can help with myself.
If you can't share the dump and can't run DebugView or ProcMon then you need to figure out a new plan of attack...Have you tried reproducing it locally? Also, are you using WPP tracing? If so you can extract your logs from the dump file. Otherwise you can try running !dbgprint on the crash dump, sometimes the last bit of the DbgPrint buffer is available. Maybe something in there useful for you.
-scott
OSR
Unfortunately it is not reproducible, it only happens on few particular machines of the customer, and even then it takes hours before it happens again, and there are no useful logs found in WPP tracing or !dbgprint.
But I am really sure that the root cause of this is related to that Resource @ 0xffffb408e35b8310 and those four threads that are using it. because:
chrome.exe
thread, and this is the one that is taking the most kernel time out of all the processes (by alot, KernelTime 00:03:28.390). its callstack ends with. :And this is for a preclose happening on a backup file that we have created in our minifilter (so we have created this file previously, not in this callstack, using a FltCreateFile. Here however we are trying to get some info about the file on disk so we have to first use FltCreateFile to open the file, and we are using share delete/read/write). Our product is a backup system that backs up some files on some particular folders.
(This is the only meaningful core call stack in the crash dump, and also note that in this time, the system was frozen, so it has to be related to this)
And inspecting this one, shows that this is also related to another file in the CacheStorage of chrome.
Sadly, I couldn't retrieve any file name from these threads
, but I bet they are related to some chrome caching file too.
And the following is the content of that resource :
But as I am no an expert in the NTFS and file system in kernel, I can't put all these things together to understand what is causing this system freeze.. !locks output doesn't seem to indicate a deadlock, but it sure as hell seems like it..
How can I find more info about that resource @ 0xffffb408e35b8310? Anyone has any idea what else can search for in my crash dump to find out the root cause of this?
As a side note, lowering our minifilter altitude and making ourself go below kaspersky's minifilter solved the issue, but this is not a optimal long term solution, I need to find the reason that this is happening..
So overall this seems like a NTFS related deadlock on a internal NTFS ERESOURCE, and one weird thing I just found, is that considering that kaspersky (in callstack of thread ffffb408e36d7800) is using FltGetDestinationFileNameInformation, which surely wouldn't write to the file and is a read only operation, then why is NTFS.sys using ExAcquireResourceExclusiveLite, instead of ExAcquireResourceExclusiveLite for that resource? Why would FltGetDestinationFileNameInformation need a exclusive lock ?!!
Bur FILE_OPEN_IF or any other do, though FltMgr should not be asking that
type of Disposition.
Do you have MEX extensions for Windbg? If not, get them from MS's site, and
run !mex.t on the threads and post it please.
Long shot, but I wonder exactly what time they spent waiting.
Here you go, so indeed this is really related to these chrome cache files. Note that from these three files, we are the one who have created the index.txt.tmp.backup using FltCreateFile (with share read/write/delete) + FltWriteFile. But the other two files are not created by us. Considering that the driver for index.txt.tmp is FltMgr, maybe the minifilter for Kaspersky is the one that has created that file?
Also as a reminder, in the following call stack which is in the pre close of the index.txt.tmp.backup, we are using FltCreateFile to get the file object corresponding to the file (so we are using FILE_OPEN, with share read/write/delete), so we can get more info related to that file.