Can I avoid a BSOD due to a PCI Express completion timeout?

Hello,

I am quite new to Windows driver development, but I am a fairly experienced C/C++ developer. I am trying to track down a stability issue on a system with some new hardware we are developing. When running our new system, we see BSODs quite regularly. After hooking up the kernel debugger, I can see that the reported reason is always a PCI Express completion timeout. From the !analyze -v output:

MODULE_NAME: GenuineIntel

IMAGE_NAME: GenuineIntel.sys

STACK_COMMAND: .cxr; .ecxr ; kb

FAILURE_BUCKET_ID: 0x124_5_GenuineIntel_PCIEXPRESS_VENID_157D_DEVID_3151_COMPLETION_TIMEOUT_IMAGE_GenuineIntel.sys

OS_VERSION: 10.0.14393.0

BUILDLAB_STR: rs1_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {293ef821-34fe-25dc-a682-28aa7c3857d8}

This lines up with what we have determined so far. I've added read/write counters to our Windows driver: one counter is incremented before and one after each transaction. What I find is that when a crash occurs, we always crash during a read (seemingly at random, after a few hundred thousand transactions); the "before" read count is one higher than the "after" read count. We have similar counters in our FPGA, and they show that the FPGA sees the same number of read requests as the driver, but the FPGA also isn't getting a response back from another hardware component. So we know we have an issue in our hardware. As far as I can tell, the FPGA waits for the response from that component before issuing a PCIe completion back to our PCI card/driver, so the PCIe completion timeout makes sense.
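(For context, the instrumentation is conceptually like the sketch below; the names are simplified stand-ins, not our actual code.)

#include <wdm.h>

static volatile LONG g_ReadsIssued;     /* incremented before each read */
static volatile LONG g_ReadsCompleted;  /* incremented after each read  */

VOID InstrumentedRead(
    _In_ volatile ULONG *MappedRegister,   /* BAR mapping from MmMapIoSpace */
    _Out_writes_(Count) PULONG Buffer,
    _In_ ULONG Count)
{
    InterlockedIncrement(&g_ReadsIssued);

    /* When the device never returns the completion, the 0x124 bugcheck
       fires while this call is outstanding, which is why "before" ends up
       one higher than "after" in the crash dumps. */
    READ_REGISTER_BUFFER_ULONG(MappedRegister, Buffer, Count);

    InterlockedIncrement(&g_ReadsCompleted);
}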

Obviously, the root of the problem lies in fixing our hardware so that we no longer get PCIe completion timeouts. However, I'm wondering whether there is something I can do in our driver to make us more stable, so that we don't BSOD every time this problem occurs. If nothing else, it would make debugging our hardware easier. The problem is that the BSOD seems to be coming from the GenuineIntel driver: the stack trace shows a call to KeBugCheckEx, which is the routine that raises the bug check (the BSOD). I'm wondering if there is something I can do in our driver to override this behavior. I've been trying to learn more about WHEA. I see structures with masks for the PCIe completion timeout, and related fields that distinguish recoverable from unrecoverable errors (such as https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ns-wdm-_pci_express_uncorrectable_error_severity), but I can find no examples of how to use them. Is there some way to define the PCIe completion timeout as a recoverable error, so that we don't crash? Or is this just a bad idea?
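To make the question concrete, the only way I can picture using that structure is something like the following. The rawSeverity value here is hypothetical; I don't know how to actually read (let alone write back) the AER Uncorrectable Error Severity register from a function driver, which is essentially what I'm asking:

#include <wdm.h>

/* Illustration only: decoding a value shaped like the AER Uncorrectable
   Error Severity register using the wdm.h union linked above. A bit set
   to 1 means that error is reported as fatal; 0 means non-fatal. */
VOID ShowCompletionTimeoutSeverity(ULONG rawSeverity)
{
    PCI_EXPRESS_UNCORRECTABLE_ERROR_SEVERITY severity;

    severity.AsULONG = rawSeverity;

    if (severity.CompletionTimeout) {
        /* Clearing this bit would, in principle, demote completion
           timeouts from fatal to non-fatal, assuming the register could
           be written back and the platform honored it. */
        severity.CompletionTimeout = 0;
        KdPrint(("Demoted severity mask: 0x%08X\n", severity.AsULONG));
    }
}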

For reference, these are the calls the driver makes when issuing PCI transactions (a rough usage sketch follows the list):

Writes: WRITE_REGISTER_BUFFER_ULONG (from wdm.h)
Reads: READ_REGISTER_BUFFER_ULONG (from wdm.h)
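In context, the usage looks something like the sketch below. The names are illustrative, not from the actual driver; the BAR is assumed to be mapped once at start-device time from the translated resource list:

#include <wdm.h>

typedef struct _DEVICE_REGS {
    PUCHAR Base;     /* virtual address returned by MmMapIoSpace */
    SIZE_T Length;
} DEVICE_REGS;

NTSTATUS MapDeviceRegisters(
    _In_ PHYSICAL_ADDRESS BarPhysical,   /* translated BAR address */
    _In_ SIZE_T BarLength,
    _Out_ DEVICE_REGS *Regs)
{
    Regs->Base = (PUCHAR)MmMapIoSpace(BarPhysical, BarLength, MmNonCached);
    if (Regs->Base == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    Regs->Length = BarLength;
    return STATUS_SUCCESS;
}

VOID WriteBlock(
    _In_ DEVICE_REGS *Regs,
    _In_ ULONG Offset,
    _In_reads_(Count) PULONG Buffer,
    _In_ ULONG Count)
{
    /* WRITE_REGISTER_BUFFER_ULONG copies Count ULONGs from Buffer to the
       mapped device registers; READ_REGISTER_BUFFER_ULONG is symmetric. */
    WRITE_REGISTER_BUFFER_ULONG((volatile ULONG *)(Regs->Base + Offset),
                                Buffer, Count);
}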

BTW - I did not write the driver; I'm merely attempting to help debug it (the original author is no longer available to consult).

Thank you in advance for any insights!

This is a hardware problem. You are violating the PCIe bus specification, and that is not tolerated. There is nothing you can do about this in software.

It may be time for you to invest in a PCI Express bus analyzer. They cost as much as a house, but they are lifesavers for problems like this.

Moving this to the proper forum…

@Tim_Roberts - Thank you for your comment; I appreciate your insight.

So, just for clarification, is there really nothing that can be done? My hope was that if I caught the error, I could cancel the transaction, return an error from the driver, and then continue normal operation. Unfortunately, since the Intel driver seems to call KeBugCheckEx because the error is defined as unrecoverable, I never get a chance to handle it. I was hoping that during initialization I could make some WHEA calls to redefine the error as recoverable, and then tie a callback to the interrupt to properly stop the transaction and note the error.

Right now it's just really hard for us to do much testing/debugging of the system, since we blue screen, often within a couple of minutes (sometimes seconds) of running it. I don't mean to offend; I just thought it was worth double-checking. I admit that my knowledge of Windows drivers (and PCIe, for that matter) is quite limited. I'm learning about this stuff as I go. Thanks!

I agree with @Tim_Roberts that your device has a hardware problem. However, you can also adjust the completion timeout in the parent PCIe root port, via the Completion Timeout Value/Disable fields in the Device Control 2 register of its PCI Express capability. The problem is knowing which root port is 'yours'.
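For reference, here is a sketch of where those fields live, per the PCIe base spec (Device Control 2 sits at offset 0x28 inside the PCI Express capability; bits 3:0 are the Completion Timeout Value, bit 4 is Completion Timeout Disable). Caveat: a BUS_INTERFACE_STANDARD queried from your own device stack only reaches your own device's config space, and getting at the parent root port's registers is exactly the hard part:

#include <wdm.h>

#define PCIE_CAP_ID                 0x10
#define PCIE_DEVICE_CONTROL2_OFFSET 0x28
#define CPL_TIMEOUT_VALUE_MASK      0x000F   /* bits 3:0 */
#define CPL_TIMEOUT_DISABLE_BIT     0x0010   /* bit 4    */

NTSTATUS ReadCompletionTimeoutSettings(
    _In_ BUS_INTERFACE_STANDARD *BusIf,  /* from IRP_MN_QUERY_INTERFACE */
    _Out_ USHORT *DevCtl2)
{
    UCHAR capPtr = 0;
    UCHAR header[2];  /* header[0] = capability ID, header[1] = next ptr */

    /* The capabilities list head lives at config offset 0x34. */
    BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                      &capPtr, 0x34, sizeof(capPtr));

    while (capPtr != 0) {
        BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                          header, capPtr, sizeof(header));
        if (header[0] == PCIE_CAP_ID) {
            BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                              DevCtl2,
                              capPtr + PCIE_DEVICE_CONTROL2_OFFSET,
                              sizeof(*DevCtl2));
            return STATUS_SUCCESS;
        }
        capPtr = header[1];
    }
    return STATUS_NOT_FOUND;
}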