Not clever at all !
contains 800000012994F963
the NX bit is set!
OK… Let’s all back up a minute.
If we may, let’s first determine exactly how the OP is building this driver.
Mr. Staniford: Starting with Windows 8, the WDK is integrated with Visual Studio. There are VS project types for drivers, including mini-filters. This allows the WDK (and VS) to control the compiler settings appropriately, and ensures everything “works right.”
In Windows 7 and prior WDK releases, drivers were built from the command line with a unique set of tools. SOME people, however, “rolled their own” VS projects to allow them to build drivers using VS. This was always a bad idea, but… you know… people do what they want (usually out of ignorance of the consequences) when it comes to such things.
What I’m trying to determine is if your mini-filter is being built with the standard driver development environment, or if your mini-filter is being built with some home-hacked VS project. If the latter (that is, the “roll your own project” solution), I’m not at all surprised something is going wrong, and the fix will be to create a proper new driver project in VS. If, on the other hand, your project is being built with the standard integrated VS+WDK project type… then we have another set of issues to be discussing.
Peter
OSR
@OSRDrivers
I’m not sure how the driver would have any impact on the NX state of the
kernel mode stack for the current thread.
OP, if you can put the dump somewhere I can try to poke at it and see if I
can pull anything else out of it (not promising a miracle, but I’d be
interested to see what’s going on).
-scott
OSR
@OSRDrivers
(Scott and I are not co-located at the moment… so we’ll talk to each other via NTDEV)
Sure… But I’m not entirely sure that this isn’t a red herring, which is why I’m asking about the build environment. It’s a kernel worker thread that’s loading the driver, after all. And so it’s got to be SOMEthing related to the driver, I figure.
Could the driver not be built and linked with some ancient, crufty, misbegotten stuff that’s using the wrong prologue or doing something that’s ridiculously up-level incompatible? That’s where I’m going with the queries about the build process.
Peter
OSR
@OSRDrivers
Thanks for all your help guys!
This is pretty puzzling (certainly to me). What’s really odd is that two guys from my QA team have demonstrated to me that this driver will sometimes load even if virtual code isolation is turned on and the related BIOS settings are on. They seem to be able to make this BSOD happen most frequently if they upgrade from a previous version of the driver, causing an unload and then a load of a newer version… I can’t account for this. Installing the driver fresh doesn’t seem to cause the issue… (I’ve heard one report to the contrary but that guy couldn’t reproduce it)
> Not clever at all !
> contains 800000012994F963
> the NX bit is set!
Apologies, I wasn’t very familiar with the !pte command and when I “.hh !pte” it makes no mention of the NX bit so I assumed that “WE” meant write-execute. Now that I’ve looked up the underlying structure and read about it in the Winternals book I see you were referring to the most significant bit, which is of course set. So the stack most certainly is NX as you say.
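The NX question above comes down to reading individual bits of the raw PTE value. Here is a minimal sketch in Python, assuming the standard x86-64 layout for a 4 KB page-table entry (bit 63 is no-execute, bits 12 to 51 hold the page frame number). The helper name decode_pte and the field names are made up for illustration; they are not part of WinDbg or any Windows API.

```python
# Architectural x86-64 page-table-entry flag bits for a 4 KB page.
# The dictionary keys are illustrative names, not debugger output.
PTE_FLAG_BITS = {
    "present": 0,
    "writable": 1,
    "user": 2,
    "write_through": 3,
    "cache_disable": 4,
    "accessed": 5,
    "dirty": 6,
    "global": 8,
    "no_execute": 63,
}


def decode_pte(pte: int) -> dict:
    """Split a raw 64-bit PTE into its flag bits and page frame number."""
    fields = {name: bool((pte >> bit) & 1) for name, bit in PTE_FLAG_BITS.items()}
    # Bits 12..51 hold the physical page frame number.
    fields["pfn"] = (pte >> 12) & ((1 << 40) - 1)
    return fields


# The PTE value quoted in the thread for the kernel stack page.
stack_pte = decode_pte(0x800000012994F963)
print("no_execute:", stack_pte["no_execute"])
print("writable:  ", stack_pte["writable"])
print("user:      ", stack_pte["user"])
print("pfn:       ", hex(stack_pte["pfn"]))
```

With the value from the thread, the most significant bit is set, so the page is marked no-execute, and it is a present, writable, kernel-only page. That is what you would expect for a kernel stack.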
So I guess that leaves me still trying to understand what caused the access violation. It sounds from what you’re suggesting that the page protections for both the read and write addresses involved in the mov are set up correctly and that there’s no reason why I should have had an access violation.
> What I’m trying to determine is if your mini-filter is being built with the standard driver development environment, or if your mini-filter is being built with some home-hacked VS project. If the latter (that is, the “roll your own project” solution), I’m not at all surprised something is going wrong, and the fix will be to create a proper new driver project in VS. If, on the other hand, your project is being built with the standard integrated VS+WDK project type… then we have another set of issues to be discussing.
Thank you. I recall that a couple of years ago we were building the driver with a makefile against the Windows 7 SDK. When we moved to VS2013 we switched to using a Visual Studio project (the standard way you mention) and we have continued to do so, upgrading to the version 10 driver kit just recently.
The main properties page in visual studio shows:
Target platform: 10.0.1586.0
Platform Toolset: WindowsKernelModeDriver10.0
Configuration Type: Driver
I also have a .inf file which contains the altitude and a few other bits which I can post if desired.
> OP, if you can put the dump somewhere
I’ve just checked with my boss and yes I can do this. I’d be really grateful if you can make anything of it…
https://dl.dropboxusercontent.com/u/31477495/BSOD%20Win%20Dump.zip
Here’s a public symbol file which goes with our driver:
https://dl.dropboxusercontent.com/u/31477495/PDB_Public_Release_x64_4.1.163.0.zip
> Linker:
> [snip]
> /MERGE:"_TEXT=.text;_PAGE=PAGE"
> /LIBPATH:"V:\4.1\CppUnit\Properties..\Source\lib\x64"
> /LIBPATH:"V:\4.1\Office2010DeveloperResources\Properties..\Source\x64\Win7 Release"
> /LIBPATH:"V:\4.1\detours\Properties..\Source\lib\x64"
The /MERGE might possibly be significant, but IMO the libpaths listed here are a little fishy.
CppUnit in a driver? Office2010DeveloperResources?
Is detours the kind of functionality which might need an executable stack?
> CppUnit in a driver? Office2010DeveloperResources?
> Is detours the kind of functionality which might need an executable stack?
Our global properties file contains all our lib paths… So the driver gets a rather over the top set of library include paths. It doesn’t however actually get linked to any of those things, since they’re all user mode. Detours is a user mode trampolining library and isn’t used by our driver.
Anything is possible when it comes to Device Guard. According to the output
from the debugger we’re dying on a very benign initial instruction though (a
write to the home space), so seems more subtle. I’ll update what I found in
the dump (which admittedly isn’t much) in another response.
-scott
OSR
@OSRDrivers
This kind of thing happens when the stack overflow lib and the entry point don’t match the rest of the libs.
DO NOT USE YOUR OWN BUILD SCRIPT.
The fact that it only happens on reload is really interesting. In fact, I
can see that was the case in the dump that you sent as the driver is also
sitting on the unloaded module list.
I wanted the dump because I was curious what fault led to the exception,
which we can see with the call stack:
0: kd> kc
00 nt!KeBugCheckEx
01 nt! ?? ::FNODOBFM::`string'
02 nt!_C_specific_handler
03 nt!RtlpExecuteHandlerForException
04 nt!RtlDispatchException
05 nt!KiDispatchException
06 nt!KiExceptionDispatch
07 nt!KiGeneralProtectionFault
08 DataNow_Driver!GsDriverEntry
09 nt!IopLoadDriver
0a nt!IopLoadUnloadDriver
0b nt!ExpWorkerThread
0c nt!PspSystemThreadStartup
0d nt!KiStartSystemThread
So, this exception starts off as a GP fault on the processor. Looking at the
faulting instruction, I just don’t see it:
0: kd> .frame /c 8
…
DataNow_Driver!GsDriverEntry:
fffff800`5c0263ac mov qword ptr [rsp+8],rbx ss:0018:ffffd000`20e91970=0000000000000000
We’re dereferencing both the instruction pointer and RSP+8 here, so I went
looking for the usual suspects:
* Both addresses are canonical (check)
* RIP protection is Kernel, Executable, Read only. (check)
* RSP protection is Kernel, Writable. (check)
* The instruction is 100% benign, no weird shenanigans going on with iret,
control registers, etc. (check)
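The first check in that list (canonical addresses) is easy to reproduce by hand. This is a minimal sketch, assuming 48-bit virtual addresses with 4-level paging, where an address is canonical when bits 63 through 47 are all equal. The function name is_canonical is illustrative, and the two addresses are the ones shown in the debugger output above.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """Return True if bits 63..(va_bits-1) of addr are all 0 or all 1."""
    top = addr >> (va_bits - 1)                 # keep only the upper bits
    all_ones = (1 << (64 - va_bits + 1)) - 1    # 17 one-bits for 48-bit VAs
    return top == 0 or top == all_ones


# Addresses taken from the faulting instruction in the thread.
fault_rip = 0xFFFFF8005C0263AC        # DataNow_Driver!GsDriverEntry
stack_write = 0xFFFFD00020E91970      # the rsp+8 target of the mov

print(is_canonical(fault_rip))
print(is_canonical(stack_write))
print(is_canonical(0x0000800000000000))  # a deliberately non-canonical value
```

Both addresses from the dump pass this check, which matches the conclusion in the post that a non-canonical address is not the source of the GP fault.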
Unfortunately the processor doesn’t leave a reason for the GP fault lying
around, so I’m stumped. My *guess* would be that it’s coming from the
hypervisor component, but I don’t have any further evidence.
What’s really interesting is the fact that your driver previously loaded.
This fits with the fact that everything here looks OK and normal: last time
it worked just fine. Now for the fun bit…The last time your driver loaded
it was at the same address:
0: kd> lm mDataNow*
Browse full module list
start end module name
fffff800`5c010000 fffff800`5c029000 DataNow_Driver (pdb symbols) e:\dumps\ntdev\DataNow_Driver.pdb
Unloaded modules:
fffff800`5c010000 fffff800`5c025000 DataNow_Driver.sys
The module was also previously shorter, which is weird. But I’m willing to
believe that’s just some artifact and not really interesting.
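For reference, the size difference can be worked out directly from the lm output. The sketch below is plain arithmetic on the addresses posted in this thread; the variable names are illustrative, and nothing here is a diagnosis.

```python
# Module ranges as reported by lm in the dump (end is exclusive).
new_start, new_end = 0xFFFFF8005C010000, 0xFFFFF8005C029000   # currently loading
old_start, old_end = 0xFFFFF8005C010000, 0xFFFFF8005C025000   # unloaded module
fault_rip = 0xFFFFF8005C0263AC                                # GsDriverEntry

new_size = new_end - new_start
old_size = old_end - old_start
growth_kib = (new_size - old_size) // 1024

# Offset of the faulting instruction from the (shared) load address.
entry_offset = fault_rip - new_start
beyond_old_image = entry_offset >= old_size

print(f"new image size: {new_size:#x}")
print(f"old image size: {old_size:#x}")
print(f"growth:         {growth_kib} KiB")
print(f"entry offset:   {entry_offset:#x}")
print(f"entry lies past the end of the old image: {beyond_old_image}")
```

By these numbers the new image is 16 KiB larger than the one that was unloaded, and the faulting instruction sits at an offset that falls past the end of the old image's range. Whether that is meaningful or a coincidence cannot be determined from the arithmetic alone.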
This leaves me with two possible explanations:
* We’re crashing in the INIT section, which is unique in that it’s
discardable. Maybe Device Guard is confused and thinks that this region is
not mapped? Seems a stretch, but that’s all I have at the moment… Does it
ever WORK to unload the driver and reload it? If yes, it would be
interesting to see if the driver ends up at a different load address.
* You’re not properly cleaning up on unload, which is leaving some dangling
reference around that causes Device Guard to lose its mind (somehow). I ran
through your unload quickly and didn’t see anything obvious missing, so I’m
putting this here for completeness.
It’s very strange. For fun, you might want to build your driver in VS2015 +
WDK 10586 and see if it makes things better. Comparing a binary that I built
I don’t see any difference in PE headers, so I suspect that this will just
waste your time. But it gives you something to do while you wait for a better
idea.
-scott
OSR
@OSRDrivers
Scott,
Thanks so much for all this analysis. It’s nice to have someone with a lot more knowledge than me helping me out! I’m also glad you guys find my little problem interesting. I’m going to try the new WDK as you suggest, but first, something you said as an aside really interested me and prompted me to test something in our lab.
> The module was also previously shorter, which is weird
This isn’t surprising to me. That dump was created by one of our testers when he was upgrading from the 4.0.102 version of our product to the 4.1.163 version of our product. Both versions are MS signed and both are compliant with the device guard rules as published by MS. The first one is ~15kb smaller than the second because we’ve added a new feature to the latest version of our driver.
Now here’s the weird part… I was interested to know if it was simply unloading/reloading the driver which caused the issue or if it was unloading one driver and loading the other. So I put both drivers on the test machine and did this test:
Loaded 102 driver, unloaded, loaded, unloaded x 4, all went fine
Swapped to the 163 driver which is bigger and immediate BSOD
Then, I rebooted, and repaired the damaged files (again all our dlls and .sys files were corrupt despite the fact that the system booted)
Loaded 163 driver, unloaded, loaded, unloaded x 4, all went fine
Swapped to the 102 driver which is smaller and immediate BSOD
So what’s happening here? Device Guard doesn’t like it if your executable gets bigger or smaller? Clearly it’s happy to reload exactly the same driver…
See if you have stray breakpoints. Clear all breakpoints and reboot.
> See if you have stray breakpoints. Clear all breakpoints and reboot.
I wasn’t live debugging, I was loading and unloading the drivers using “fltmc” with no debugger attached.
Also I did test with rebooting:
Load driver #1 - Reboot - Load driver #2 = Fine
Load driver #1 - Unload - Load driver #1 = Fine
Load driver #2 - Unload - Load driver #2 = Fine
Load driver #1 - Unload - Load driver #2 = BSOD
Load driver #2 - Unload - Load driver #1 = BSOD
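The five results above fit a simple rule, which can be written down and checked against the observations. This is a minimal sketch: the tuple layout and the predicted function are made up for illustration, and the rule only summarises these five data points rather than explaining the cause.

```python
# (first driver loaded, step between loads, second driver loaded, outcome)
observations = [
    (1, "reboot", 2, "fine"),
    (1, "unload", 1, "fine"),
    (2, "unload", 2, "fine"),
    (1, "unload", 2, "bsod"),
    (2, "unload", 1, "bsod"),
]


def predicted(first: int, step: str, second: int) -> str:
    """Hypothesis: a crash needs a different image loaded without a reboot."""
    return "bsod" if step == "unload" and first != second else "fine"


# Collect any observation the hypothesis fails to reproduce.
mismatches = [row for row in observations if predicted(*row[:3]) != row[3]]
print("observations not explained by the rule:", mismatches)
```

All five observations agree with the rule, so the crash appears to need both a different image and no reboot in between.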
Nice find! I totally assumed it was just noise and was too lazy to
investigate further. The results you’re getting are really spooky, either
there’s some mystic incantation that we’re missing or there’s a real bug
here.
I’ll be curious to hear if the new build tools change anything. I’m not sure
what that would be, but it’s worth a shot.
As a nature study I’d also be interested to know if changing the image
without changing the size also causes the problem. Or if beating the build
into making the INIT section non-discardable changes anything. These are for
my own curiosity though as I don’t think the result of either gets you a
solution.
-scott
OSR
@OSRDrivers
> As a nature study I’d also be interested to know if changing the image
> without changing the size also causes the problem.
Well, as fortune would have it, I was in a rush when I submitted the last EV signed driver to Microsoft and became impatient when they took almost 24 hours to sign the driver via the portal. In the end I decided their portal might be broken, so I submitted the same driver again just to see if that finished quickly. That means I happen to have two copies of the same driver which are signed. I thought it might be interesting to see if these two can be interchanged without a BSOD… (Although it’s possible that they’re binary identical, I’ll compare them first)
>Now for the fun bit…The last time your driver loaded it was at the same address:
This should not happen. ASLR (Address Space Layout Randomization) does not allow this.
But the /DYNAMICBASE linker option is active (by default) and explicitly turned on in your project. Another big question mark!
Your driver may be blacklisted for missing some security check and prevented from executing any instruction… Remember that your driver crashes at the very first instruction.
Review the project, remove any Windows XP and SDK related stuff. You also use a buffer overflow library that is not the default one for Windows 10.
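Whether /DYNAMICBASE actually made it into a given binary can be checked by reading the DllCharacteristics field of its PE optional header. Below is a minimal sketch in Python, assuming the documented PE layout (e_lfanew at file offset 0x3C, a 20-byte COFF header after the signature, and DllCharacteristics at offset 70 of the optional header for both PE32 and PE32+). The function name and the hand-built test header are hypothetical and are not taken from the OP's driver.

```python
import struct

# Flag values from the PE format (IMAGE_DLLCHARACTERISTICS_*).
DYNAMIC_BASE = 0x0040   # set by /DYNAMICBASE
NX_COMPAT = 0x0100      # set by /NXCOMPAT


def dll_characteristics(image: bytes) -> int:
    """Return the DllCharacteristics word from a PE image's optional header."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ image")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)   # e_lfanew
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    optional_header = pe_offset + 4 + 20                   # skip signature + COFF header
    (value,) = struct.unpack_from("<H", image, optional_header + 70)
    return value


# Hypothetical, hand-built header used only to exercise the parser.
fake = bytearray(0x200)
fake[0:2] = b"MZ"
struct.pack_into("<I", fake, 0x3C, 0x80)
fake[0x80:0x84] = b"PE\0\0"
struct.pack_into("<H", fake, 0x80 + 24 + 70, DYNAMIC_BASE | NX_COMPAT)

flags = dll_characteristics(bytes(fake))
print("DYNAMIC_BASE:", bool(flags & DYNAMIC_BASE))
print("NX_COMPAT:   ", bool(flags & NX_COMPAT))
```

To inspect a real driver, the same function could be called on the bytes read from the .sys file.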
There continue to be two possibilities, one of which is relatively easily eliminated:
* Your build environment is broken. If I were you, I’d create an empty WDM driver project, pull your existing files into it, and see if it builds and loads. I’d do this just to satisfy myself that this is not the problem. Our own Mr. Noone finds this possibility unlikely, IIUC. But… Having seen horrible things in the past due to fucked-up build environments, this is one of my “go to” answers when weird things happen.
* Device Guard is broken. Yesterday here at OSR we were discussing this at lunch. We all agreed that this was possible… Testing this scenario could potentially have fallen through the cracks, and we’re convinced there aren’t a ton of folks using it in the real world.
I’ve never played with Device Guard… I bet not a lot of folks have. If you attach the debugger can you load your driver without signing it? If you disable secure boot does Device Guard still work, and does your driver load?
Using a standard WDK sample that does nothing… Can you load, unload, load… Then update and load a new version, in the same scenario as your driver? Or does that also fail?
Peter
OSR
@OSRDrivers
Abdel:
> But the /DYNAMICBASE linker option is active (by default) and explicitly turned on in your project. Another big question mark!
So I’ve tested what happens and done some live debugging on a system without device guard and tried loading and unloading our driver using fltmc. I do appear to get ASLR:
Here’s the output of lm after 2 reloads:
fffff800`a49b0000 fffff800`a49d1000 DataNow_Driver (deferred)
fffff800`a49e0000 fffff800`a4a01000 DataNow_Driver (deferred)
fffff800`a4a50000 fffff800`a4a6a000 DataNow_Driver (deferred)
Unfortunately I can only run with device guard enabled on a real machine, not a VM. I’ve ordered a cable so that I can do this but right now I can’t see if the same test produces the same results on a device guard machine.
> I’ve never played with Device Guard… I bet not a lot of folks have. If you attach the debugger can you load your driver without signing it?
Unfortunately I know of no way to do this other than by turning Device Guard off (at which point no BSOD occurs). Attempting to load a standard cross-signed driver results in a load failure. So the only way I can test it is to go through the full EV signing process. For me, this involves gaining management approval to use our one Microsoft token, getting it from a safe on the other side of the building, going through a series of manual steps to pre-sign the build and then getting Microsoft to sign it before packaging it up ready to test. The time-consuming part is waiting for Microsoft; the last time I did this, they took 24 hours to sign the driver through their portal. It’s very difficult to simply try things… to see if they work…
However, I have a series of suggestions from this thread so it’s possible that I can perform several Microsoft requests at once in order to build a series of debugging drivers. I will try your suggestion of going back to a simplified build.
When it happens, do you just delete the driver1 and then copy the driver2 over and load it?
The BSOD may be happening because of a page-in error, which may happen if the file that backs the driver section is deleted but the handle is not closed.
It may be a consequence of yet another Microsoft shark-jumping exercise: backing the driver image with the driver file.
> When it happens, do you just delete the driver1 and then copy the driver2 over and load it?
Well, let’s see… Originally this problem occurred during an MSI upgrade of our whole product. The MSI would no doubt have done as you say, unloading one driver, deleting it. Then copying in and loading the newer driver once the new version of the product is installed.
However, when I was testing this, I had two different MS signed copies of our driver in the system32\drivers folder and was loading/unloading and then renaming the old one out of the way before trying the other.
So I’ve tried delete+copy and rename+rename…
> The BSOD may be happening because of page-in error, which may happen if the file that backs the driver section is deleted, but the handle is not closed.
Do you think this kind of thing would affect the driver only when code isolation is turned on though?