Hello,
I am quite new to Windows driver development, but I am a fairly experienced C/C++ developer. I am trying to track down a stability issue on a system with some new hardware we are developing. When running the new system, we see BSODs quite regularly. After hooking up the kernel debugger, I can see that the reported reason is always a PCI Express completion timeout. From the !analyze -v output:
MODULE_NAME: GenuineIntel
IMAGE_NAME: GenuineIntel.sys
STACK_COMMAND: .cxr; .ecxr ; kb
FAILURE_BUCKET_ID: 0x124_5_GenuineIntel_PCIEXPRESS_VENID_157D_DEVID_3151_COMPLETION_TIMEOUT_IMAGE_GenuineIntel.sys
OS_VERSION: 10.0.14393.0
BUILDLAB_STR: rs1_release
OSPLATFORM_TYPE: x64
OSNAME: Windows 10
FAILURE_ID_HASH: {293ef821-34fe-25dc-a682-28aa7c3857d8}
This lines up with what we have determined so far. I’ve added read/write counters to our Windows driver, one incremented before and one after each transaction. When a crash occurs (seemingly at random, after a few hundred thousand transactions), it is always during a read: the “before” read count is one higher than the “after” read count. We have similar counters in our FPGAs, which show that the FPGA sees the same number of read requests as the driver, but that the FPGA is not getting a response back from another hardware component. So we know we have an issue in our hardware. As far as I can tell, the FPGA waits for a response from that component before sending the PCIe completion back to our PCI card/driver, so the completion timeout makes sense.
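To make the instrumentation concrete, here is a simplified user-mode sketch of the before/after counter pattern. It is not our driver code: `mock_device_read`, `counted_read`, `reads_in_flight`, and the counter names are made up for illustration, and the mock read merely stands in for the real register access.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counters; the real driver keeps its own equivalents. */
static atomic_ulong g_reads_started;
static atomic_ulong g_reads_completed;

/* Stand-in for the real MMIO read. Unlike the hardware, it always returns. */
static uint32_t mock_device_read(uint32_t offset)
{
    return offset ^ 0xA5A5A5A5u;
}

/* Bump one counter before the read and another after it returns. */
static uint32_t counted_read(uint32_t offset)
{
    atomic_fetch_add(&g_reads_started, 1);
    uint32_t value = mock_device_read(offset);
    atomic_fetch_add(&g_reads_completed, 1);
    return value;
}

/* Reads that were issued but have not returned. */
static unsigned long reads_in_flight(void)
{
    return atomic_load(&g_reads_started) - atomic_load(&g_reads_completed);
}
```

In the crash case the real read never returns, so the started count ends up one ahead of the completed count, which is exactly what I see in the dump.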
Obviously, the real fix is to correct our hardware so that we no longer get PCIe completion timeouts. However, I’m wondering whether there is something I can do in our driver to make the system more stable, so that we don’t BSOD every time this problem occurs. If nothing else, it would make debugging our hardware easier. The problem is that the BSOD seems to be coming from the GenuineIntel driver. The stack trace shows a call to KeBugCheckEx, and from my research this appears to be the call that issues the BSOD. Is there something I can do in our driver to override this behavior? I’ve been trying to learn more about WHEA. I see some structures that have masks for the PCIe completion timeout, and other things relating to recoverable vs. unrecoverable errors (such as https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ns-wdm-_pci_express_uncorrectable_error_severity). However, I can find no examples of how to use them. Is there some way to define the PCIe completion timeout as a recoverable error so that we don’t crash? Or is this just a bad idea?
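For what it’s worth, here is how I currently understand the layout, written as a plain user-mode sketch rather than anything I have tried in the driver. Assumptions on my part: Completion Timeout is bit 14 of the AER Uncorrectable Error Severity register, with 1 meaning fatal and 0 meaning non-fatal. The macro and function names are made up. I do not know whether changing this bit would actually stop Windows from bugchecking.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the Completion Timeout bit in the AER
 * Uncorrectable Error Severity register (hypothetical macro name). */
#define AER_UNCOR_COMPLETION_TIMEOUT (1u << 14)

/* True if the severity value reports completion timeouts as fatal. */
static bool completion_timeout_is_fatal(uint32_t severity)
{
    return (severity & AER_UNCOR_COMPLETION_TIMEOUT) != 0;
}

/* Returns the severity value with the Completion Timeout bit cleared
 * (non-fatal). All other bits are left unchanged. */
static uint32_t mark_completion_timeout_nonfatal(uint32_t severity)
{
    return severity & ~AER_UNCOR_COMPLETION_TIMEOUT;
}
```

This only manipulates a value in memory; reading and writing the actual register in config space is the part I have no example for.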
For reference, the calls that the driver makes when issuing PCI transactions are the following:
Writes: WRITE_REGISTER_BUFFER_ULONG (from wdm.h)
Reads: READ_REGISTER_BUFFER_ULONG (from wdm.h)
BTW, I did not write the driver; I’m merely attempting to help debug it (the original author is no longer available for consultation).
Thank you in advance for any insights!