The free OSR Learning Library has more than 50 articles on a wide variety of topics about writing and debugging device drivers and Minifilters. From introductory level to advanced. All the articles have been recently reviewed and updated, and are written using the clear and definitive style you've come to expect from OSR over the years.
Check out The OSR Learning Library at: https://www.osr.com/osr-learning-library/
I am quite new to Windows driver development, but I am a fairly experienced C/C++ developer. I am trying to track down a stability issue on a system with some new hardware we are developing. When running our new system, we see BSOD quite regularly. In hooking up the kernel debugger, I know that the reported reason is always due to PCI express completion timeouts. From the !analyze -v output:
STACK_COMMAND: .cxr; .ecxr ; kb
OSNAME: Windows 10
This lines up with what we have determined thus far. I've added read/write counters in our Windows driver. I have counters before & after each transaction. What I find is that when a crash occurs, we always crash (seemingly randomly after a few 100k transactions) during reads. I see that the "before" read count is one higher than the "after" read count. We also have similar counters in our FPGAs which shows that we see the same number of read requests as the driver, but the FPGA also isn't getting a response back from another hardware component. So, we know we have an issue in our hardware. It seems as though the FPGA waits for a response from the hardware before issuing a PCIe response back to our PCI card/driver (as far as I can tell). So, the PCI timeout makes sense.
Obviously, the root of the problem lies in fixing our hardware such that we no longer get PCIe completion timeouts. However, I'm wondering whether there is something I can do in our driver to make us more stable, such that we don't BSOD every time this problem occurs. If nothing else, it will make debugging our hardware easier. The problem is that the BSOD seems to be coming from the GenuineIntel driver. The stack trace shows a call to KeBugCheckEx. In my research, it seems as though this is the call that issues the BSOD. I'm wondering if there is something I can do in our driver to override this behavior? I've been trying to learn more about WHEA. I see some structures that have masks for the PCI completion timeout, and other things in relation to recoverable vs unrecoverable errors (such as https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ns-wdm-_pci_express_uncorrectable_error_severity). However, I can find no examples of how to use them. Is there some way to define the PCIe completion timeout as a recoverable error such that we don't crash? Maybe this is just a bad idea?
For reference, The calls that the driver makes when issuing PCI transactions are the following:
Writes: WRITE_REGISTER_BUFFER_ULONG (from wdm.h)
Reads: READ_REGISTER_BUFFER_ULONG (from wdm.h)
BTW - I did not write the driver, I'm merely attempting to help debug (the original author is no longer available for consult).
Thank you in advance for any insights!
|Upcoming OSR Seminars|
|OSR has suspended in-person seminars due to the Covid-19 outbreak. But, don't miss your training! Attend via the internet instead!|
|Kernel Debugging||16-20 October 2023||Live, Online|
|Developing Minifilters||13-17 November 2023||Live, Online|
|Internals & Software Drivers||4-8 Dec 2023||Live, Online|
|Writing WDF Drivers||10-14 July 2023||Live, Online|