Can I avoid BSOD due to PCI express completion timeout?

rosie1canobe Member Posts: 2
edited August 2023 in NTDEV

Hello,

I am quite new to Windows driver development, but I am a fairly experienced C/C++ developer. I am trying to track down a stability issue on a system with some new hardware we are developing. When running our new system, we see BSODs quite regularly. After hooking up the kernel debugger, I can see that the reported reason is always a PCI Express completion timeout. From the !analyze -v output:

MODULE_NAME: GenuineIntel

IMAGE_NAME: GenuineIntel.sys

STACK_COMMAND: .cxr; .ecxr ; kb

FAILURE_BUCKET_ID: 0x124_5_GenuineIntel_PCIEXPRESS_VENID_157D_DEVID_3151_COMPLETION_TIMEOUT_IMAGE_GenuineIntel.sys

OS_VERSION: 10.0.14393.0

BUILDLAB_STR: rs1_release

OSPLATFORM_TYPE: x64

OSNAME: Windows 10

FAILURE_ID_HASH: {293ef821-34fe-25dc-a682-28aa7c3857d8}

This lines up with what we have determined so far. I've added read/write counters to our Windows driver, one before and one after each transaction. When a crash occurs, it is always during a read (seemingly at random, after a few hundred thousand transactions): the "before" read count is one higher than the "after" read count. We also have similar counters in our FPGAs, which show that the FPGA sees the same number of read requests as the driver, but the FPGA isn't getting a response back from another hardware component. So, we know we have an issue in our hardware. As far as I can tell, the FPGA waits for a response from that hardware before issuing a PCIe response back to our PCI card/driver. So, the PCI timeout makes sense.
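The before/after counters described above might look roughly like the following. This is a user-mode sketch and not the actual driver code: `xfer_counters`, `counted_read`, `read_in_flight`, and `sample_read` are hypothetical names, and `sample_read` merely stands in for the real register access (READ_REGISTER_BUFFER_ULONG in the driver).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping; none of these names come from the real driver. */
typedef struct {
    volatile uint64_t reads_started;   /* incremented just before the read  */
    volatile uint64_t reads_finished;  /* incremented just after it returns */
} xfer_counters;

/* Signature of whatever performs the actual device read. */
typedef uint32_t (*read_fn)(const volatile uint32_t *reg);

/* Stand-in for the device access so the sketch runs in user mode. */
uint32_t sample_read(const volatile uint32_t *reg)
{
    return *reg;
}

/* Wrap a single read with a "before" and an "after" counter. */
uint32_t counted_read(xfer_counters *c, read_fn do_read,
                      const volatile uint32_t *reg)
{
    assert(c != NULL && do_read != NULL && reg != NULL);
    c->reads_started++;
    uint32_t value = do_read(reg);   /* a read that never completes stops here */
    c->reads_finished++;
    return value;
}

/* True when a read was issued but has not returned yet. */
bool read_in_flight(const xfer_counters *c)
{
    return c->reads_started != c->reads_finished;
}
```

If the counters live somewhere the debugger can find them (a device extension, say), a crash dump in which `reads_started` is one ahead of `reads_finished` matches the symptom described here. In a real driver that issues reads from more than one thread, an interlocked increment such as InterlockedIncrement64 would be the safer choice for the counters.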

Obviously, the root of the problem lies in fixing our hardware such that we no longer get PCIe completion timeouts. However, I'm wondering whether there is something I can do in our driver to make us more stable, such that we don't BSOD every time this problem occurs. If nothing else, it will make debugging our hardware easier. The problem is that the BSOD seems to be coming from the GenuineIntel driver. The stack trace shows a call to KeBugCheckEx. In my research, it seems as though this is the call that issues the BSOD. I'm wondering if there is something I can do in our driver to override this behavior? I've been trying to learn more about WHEA. I see some structures that have masks for the PCI completion timeout, and other things in relation to recoverable vs unrecoverable errors (such as https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/ns-wdm-_pci_express_uncorrectable_error_severity). However, I can find no examples of how to use them. Is there some way to define the PCIe completion timeout as a recoverable error such that we don't crash? Maybe this is just a bad idea?

For reference, the calls that the driver makes when issuing PCI transactions are the following:

Writes: WRITE_REGISTER_BUFFER_ULONG (from wdm.h)
Reads: READ_REGISTER_BUFFER_ULONG (from wdm.h)

BTW - I did not write the driver, I'm merely attempting to help debug (the original author is no longer available to consult).

Thank you in advance for any insights!


Comments

  • Tim_Roberts Member - All Emails Posts: 14,837

    This is a hardware problem. You are violating the PCIe bus specification, and that is not tolerated. There is nothing you can do about this in software.

    It may be time for you to invest in a PCIExpress bus analyzer. They cost as much as a house, but they are lifesavers for problems like this.

    Tim Roberts, [email protected]
    Software Wizard Emeritus

  • Peter_Viscarola_(OSR) Administrator Posts: 9,160

    Moving this to the proper forum...

    Peter Viscarola
    OSR
    @OSRDrivers

  • rosie1canobe Member Posts: 2

    @Tim_Roberts - Thank you for your comment, I appreciate your insight.

    So, just for clarification, is there really nothing that can be done? My hope was that if I caught the error, I could cancel the transaction and return an error in the driver, but then continue normal operation. Unfortunately, since the Intel driver seems to call KeBugCheckEx because the error is defined as unrecoverable, I don't have a chance to resolve it. I was hoping that maybe during initialization, I could make some calls using WHEA to redefine the error as recoverable, and then tie a callback to the interrupt to properly stop the transaction and note the error.

    Right now it's just really hard for us to do much testing/debugging of the system since we blue screen, often within a couple minutes (sometimes seconds) of running the system. I don't mean to offend, I just thought it was worth double checking. I admit that my knowledge of Windows drivers (and PCIe for that matter) is quite limited. I'm learning about this stuff as I go. Thanks!

  • Mark_Roddy Member - All Emails Posts: 4,757

    I agree with @Tim_Roberts that your device has a hardware problem. However, you can also adjust the latency timeout in your parent PCIe root complex, in its PCIe capabilities. The problem is knowing which root complex is 'yours'.
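To make the last suggestion more concrete, here is a small sketch of the register arithmetic involved. It assumes the timeout in question is the Completion Timeout Value field in the Device Control 2 register of the PCI Express capability (bits 3:0, with Completion Timeout Disable in bit 4). The function and macro names are hypothetical, and the sketch only computes the new 16-bit register value. It does not show how to locate the right root port or how to write its configuration space, and the port has to advertise the chosen range in Device Capabilities 2 for the setting to be valid.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of Device Control 2 from the start of the PCI Express capability. */
#define PCIE_DEVCTL2_OFFSET          0x28u
#define PCIE_DEVCTL2_CTO_VALUE_MASK  0x000Fu  /* bits 3:0, Completion Timeout Value */
#define PCIE_DEVCTL2_CTO_DISABLE     0x0010u  /* bit 4, Completion Timeout Disable  */

/* Completion Timeout Value encodings defined by the PCIe specification. */
enum cto_range {
    CTO_DEFAULT_50US_50MS = 0x0,
    CTO_50US_100US        = 0x1,
    CTO_1MS_10MS          = 0x2,
    CTO_16MS_55MS         = 0x5,
    CTO_65MS_210MS        = 0x6,
    CTO_260MS_900MS       = 0x9,
    CTO_1S_3_5S           = 0xA,
    CTO_4S_13S            = 0xD,
    CTO_17S_64S           = 0xE
};

/* Return `current` with only the Completion Timeout Value field replaced. */
uint16_t devctl2_with_timeout(uint16_t current, enum cto_range range)
{
    uint16_t field = (uint16_t)((unsigned)range & PCIE_DEVCTL2_CTO_VALUE_MASK);
    return (uint16_t)((current & ~PCIE_DEVCTL2_CTO_VALUE_MASK) | field);
}

/* Extract the Completion Timeout Value field from a register value. */
unsigned devctl2_timeout_field(uint16_t value)
{
    return value & PCIE_DEVCTL2_CTO_VALUE_MASK;
}

/* Report whether the Completion Timeout Disable bit is set (1) or clear (0). */
int devctl2_timeout_disabled(uint16_t value)
{
    assert((PCIE_DEVCTL2_CTO_DISABLE & PCIE_DEVCTL2_CTO_VALUE_MASK) == 0);
    return (value & PCIE_DEVCTL2_CTO_DISABLE) != 0;
}
```

For example, `devctl2_with_timeout(0x0000, CTO_1S_3_5S)` yields `0x000A`. A longer timeout can only help when the completion does eventually arrive; if the device never responds, as the counters in the original post suggest, the timeout still fires, just later.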
