
Disabling PCIe relaxed ordering on the Root Complex

DBF_DGR Member Posts: 20

Hi all,

I'm facing some trouble with PCIe TLP re-ordering. Our device doesn't actually support TLP relaxed ordering, even though it announces support for it in its configuration space.
I've tried to disable it through a call to BUS_INTERFACE_STANDARD::SetBusData(), roughly as sketched below, and even though the value was correctly written to the PCI_EXPRESS_CAPABILITY, this was not effective. It looks like the capability is exposed for information only (read access).
Then I tried to disable this feature at the chipset/BIOS level, but so far I've found no way to do it (PCI.SYS, registry).
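
For reference, this is roughly what I did (a minimal sketch: the BUS_INTERFACE_STANDARD is assumed to have been obtained earlier via IRP_MN_QUERY_INTERFACE, and error handling is trimmed):

```c
#include <wdm.h>

// Minimal sketch: clear the Enable Relaxed Ordering bit in the Device
// Control register of the function's PCI Express capability, through a
// BUS_INTERFACE_STANDARD obtained earlier via IRP_MN_QUERY_INTERFACE.
static NTSTATUS ClearRelaxedOrdering(PBUS_INTERFACE_STANDARD BusIf)
{
    UCHAR capOffset;
    PCI_CAPABILITIES_HEADER capHeader;
    PCI_EXPRESS_CAPABILITY pcieCap;

    // The capabilities pointer lives at offset 0x34 of type-0 config space.
    BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG, &capOffset,
                      FIELD_OFFSET(PCI_COMMON_CONFIG, u.type0.CapabilitiesPtr),
                      sizeof(capOffset));

    // Walk the capability list looking for the PCI Express capability.
    while (capOffset != 0) {
        BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                          &capHeader, capOffset, sizeof(capHeader));
        if (capHeader.CapabilityID == PCI_CAPABILITY_ID_PCI_EXPRESS) {
            break;
        }
        capOffset = capHeader.Next;
    }
    if (capOffset == 0) {
        return STATUS_NOT_FOUND;
    }

    // Read the capability, clear the bit, and write Device Control back.
    BusIf->GetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                      &pcieCap, capOffset, sizeof(pcieCap));
    pcieCap.DeviceControl.EnableRelaxedOrder = 0;
    BusIf->SetBusData(BusIf->Context, PCI_WHICHSPACE_CONFIG,
                      &pcieCap.DeviceControl,
                      capOffset + FIELD_OFFSET(PCI_EXPRESS_CAPABILITY, DeviceControl),
                      sizeof(pcieCap.DeviceControl));
    return STATUS_SUCCESS;
}
```

The write itself goes through (reading back shows the new value), but the behavior doesn't change.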

So, is it possible to disable it?
Context: Windows 10 21H2, kernel-mode driver.

Thanks in advance,
Regards,
Eric.

Comments

  • Tim_Roberts Member - All Emails Posts: 14,719

    I assume you can understand why this doesn't work at an O/S level. TLP ordering is a protocol thing handled at the hardware level; the operating system is not involved. The only way to fix this is to modify your configuration space. Indeed, all of that has to work well before the operating system even loads.

    HOWEVER, TLP reordering will only occur if that bit is set AND your device sets the "reorder" bit in the TLP header. Are you setting that bit? Why? It is a protocol violation to set that bit in a TLP if the enable bit in config space is clear, so if you're doing that, you'll need more than a config space change.

    The Linux kernel does have a small list of devices that are known to violate the spec in this way. Windows does not.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • DBF_DGR Member Posts: 20

    Hi Tim,
    Thanks for your answer.
    I perfectly understand that TLP ordering is purely a PCIe controller matter. But I presumed that Windows could disable it through the PCIe controller's configuration, overriding the card's configuration space. After all, as you write, Linux does this in its PCI subsystem.

    About setting the reorder bit in the TLP: the device has such a setting, but it looks ineffective. When I disable TLP re-ordering this way, it simply loses packets. But in that case, what happens on the PCI root complex? Will it still send re-ordered TLPs, since the card claims to support them? The TLP re-order bit only applies to TLPs issued by the device, not by the PCIe root complex. That could explain the packet losses.

    Thanks for your help.

    Eric.

  • Tim_Roberts Member - All Emails Posts: 14,719

    TLP reordering should never cause packet losses. You might get data corruption if you try to do a read while a posted write is still queued, but data would not get lost. Reordering just allows some packets to get processed ahead of others; reordering of TLPs going in your direction should have no effect. It's all about the interaction of reads and writes. I assume you have read the somewhat murky articles about TLP reordering. Do you have a (horribly expensive) PCI Express bus analyzer?

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.

  • Don_Burn Member - All Emails Posts: 1,767

    "After all, as you write, Linux does it well in the PCI subsystem."

    Over the years I developed a lot of drivers for boards where "this has been well tested, it works fine on Linux". Of the probably two dozen boards, only one did not need firmware changes to pass PCI compliance. The biggest pain I had was getting the companies to rent a PCI test setup to see how far off the board was, or to accept that just because it worked on Linux did not mean it was compliant.

  • DBF_DGR Member Posts: 20

    Tim,

    Packet losses: that is exactly what I think: WHY such losses?? Actually, I'm not sure they really are packet losses: the output looks very strange, even more so than with re-ordered packets.
    So, as you said, simply disabling TLP re-ordering on the device will do the trick, even if it is not disabled on the PCIe root complex?

    PCIe analyzer: one of my job dreams at the moment ;) expensive and quite impossible to rent.

    But definitely required in such a situation.

    Best regards,
    Eric (lunch time here),
    Have a good day.

  • DBF_DGR Member Posts: 20

    Hi Don,

    Sidenote warning :wink:
    Here, this is not a matter of compliance, but of working around a bug. Linux does this in order to overcome an issue with some Intel CPUs: https://elixir.bootlin.com/linux/latest/source/drivers/pci/quirks.c#L4316.
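
    For reference, the quirk boils down to this (paraphrased from the linked quirks.c and trimmed to a single device ID; see the link for the authoritative version):

    ```c
    /* Paraphrased from drivers/pci/quirks.c (see the link above). The quirk
     * flags the affected *root port*, and the PCI core then refuses to enable
     * Relaxed Ordering on any endpoint whose path to the CPU crosses it. */
    static void quirk_relaxedordering_disable(struct pci_dev *dev)
    {
        dev->dev_flags |= PCI_DEV_FLAGS_NO_RELAXED_ORDERING;
        pci_info(dev, "Disable Relaxed Ordering Attributes to avoid PCIe Completion erratum\n");
    }

    /* One entry of the long per-device-ID list of affected Intel root ports: */
    DECLARE_PCI_FIXUP_CLASS_EARLY(PCI_VENDOR_ID_INTEL, 0x6f01, PCI_CLASS_NOT_DEFINED, 8,
                                  quirk_relaxedordering_disable);
    ```

    As Tim said, there is no equivalent quirk table in the Windows PCI driver.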

    Compliance is sometimes very subjective. Take ACPI: specified by Intel, Microsoft's implementation was not so close to the spec, yet manufacturers followed Microsoft.

    Relaxed ordering is not the only case where Linux and Windows differ in their hardware handling (interrupt handling is another one).

    Regards,
    Eric.

  • DBF_DGR Member Posts: 20

    Don,

    One question: do you remember some of the PCIe implementations on those boards (ASIC/FPGA/CPU)?
    Ours is based on a Xilinx IP (XDMA): https://docs.xilinx.com/v/u/en-US/pg195-pcie-dma

    Thanks for your help,
    Regards,
    Eric.

  • Peter_Viscarola_(OSR) Administrator Posts: 9,131
    edited May 2022

    I don't know about Don, but I've used the Xilinx XDMA multiple times, on Windows, with multiple (maybe 20?) MSI-X interrupts, and very high throughput loads, without any problem.

    If you suspect there’s a bug in the Xilinx IP, I’d suggest that’s not very likely. In my experience, those guys really know their FPGAs.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DBF_DGR Member Posts: 20

    Hi Peter,

    Good to read this. I'd rather have a software bug than one in the FPGA; I could handle the former much more easily than the latter.

    About the XDMA, did you use the AXI-Stream mode like we do here?
    Did you use any FIFOs on the board: PCIe <=> XDMA <=> FIFOs?
    At the moment, I've built a special driver that fills the shared buffers with a predefined pattern on each XDMA completion interrupt, to rule out other sources of perturbation. And the glitch still occurs.

    We found only TWO computers where it never happens. One is an old (2008) Pentium E2180 PC; the other is a Supermicro Core i5-3610ME with "NoSnoop" disabled in the BIOS (yes, one where that is possible) and Relaxed Ordering enabled (which is strange).
    As I understand it, NoSnoop is about cache coherency and is not required for a PCIe card.

    In the driver, I've tried changing the cache policy when calling AllocateCommonBuffer(), as shown below, for both the data buffers and all the descriptors: no luck.
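
    Concretely, the call looks like this (a simplified sketch; DmaAdapter comes from IoGetDmaAdapter() and bufferLength is ours):

    ```c
    // Simplified sketch of the common-buffer allocation: the last parameter,
    // CacheEnabled, is the cache-policy knob we experimented with.
    PHYSICAL_ADDRESS deviceAddress;
    PVOID va;

    va = DmaAdapter->DmaOperations->AllocateCommonBuffer(
             DmaAdapter,        // PDMA_ADAPTER from IoGetDmaAdapter()
             bufferLength,      // size of the data buffer or descriptor ring
             &deviceAddress,    // receives the address the XDMA engine will use
             FALSE);            // CacheEnabled: we tried both FALSE and TRUE
    if (va == NULL) {
        // allocation failed; bail out
    }
    ```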

    Thanks for your help,
    Regards,
    Eric.

  • MBond2 Member Posts: 629

    It is a complete guess on my part, but the fact that you don't see your problem on old slow hardware with few cores, but do on other hardware, suggests to me that you have a thread synch problem of some kind. Also, if changing CPU cache settings matters, that's another thing that will affect timing and not much else - which also points to a synchronization issue.

  • DBF_DGR Member Posts: 20

    Hi MBond2,

    We've already been down this path (single core & SMT settings in the BIOS) but got the same results.

    We've tried filling the DMA buffers directly in the DPC (no race condition with PASSIVE_LEVEL code there) and got the same results too.

    BTW, we've already hit an issue with DPC scheduling on multi-core systems: not so well documented in MSDN.

    Thanks for your help.

    Regards.
    Eric.

  • Peter_Viscarola_(OSR) Administrator Posts: 9,131

    BTW, we've already hit an issue with DPC scheduling on multi-core systems: not so well documented in MSDN.

    Care to share that with us?

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DBF_DGR Member Posts: 20

    Peter,

    About the DPC race condition: there are two DPCs (one per direction) with their associated DPC objects, and we observed that the DPCs can be scheduled in parallel (one per core). That's OK for us: it's protected by design, and the shared areas use a spinlock, as sketched below. I found nothing about DPC scheduling on multi-core CPUs. It's perfectly logical to use one core per DPC object, but it would be helpful to have this documented, or maybe I didn't search hard enough.
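
    Schematically, it looks like this (a simplified sketch; the names are hypothetical):

    ```c
    #include <wdm.h>

    // Simplified sketch: one DPC per DMA direction. Both can run at the
    // same time on different cores, so the shared area is protected by a
    // spinlock even though each routine already runs at DISPATCH_LEVEL.
    typedef struct _DEVICE_CONTEXT {
        KDPC       RxDpc;
        KDPC       TxDpc;
        KSPIN_LOCK SharedLock;    // guards SharedState against the peer DPC
        ULONG      SharedState;
    } DEVICE_CONTEXT, *PDEVICE_CONTEXT;

    VOID RxDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
    {
        PDEVICE_CONTEXT ctx = (PDEVICE_CONTEXT)Context;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(Arg1);
        UNREFERENCED_PARAMETER(Arg2);

        // Already at DISPATCH_LEVEL, so the "AtDpcLevel" flavor suffices;
        // it still provides the inter-processor exclusion we need.
        KeAcquireSpinLockAtDpcLevel(&ctx->SharedLock);
        ctx->SharedState |= 0x1;  // ... update shared state for the RX side ...
        KeReleaseSpinLockFromDpcLevel(&ctx->SharedLock);
    }
    // TxDpcRoutine is symmetrical.
    ```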

    Eric.

  • Peter_Viscarola_(OSR) Administrator Posts: 9,131

    OK... Thanks for sharing.

    Yes... that's a standard "feature" of how DPCs work on Windows.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • MBond2 Member Posts: 629

    You can't assess whether you have a thread sync problem by disabling cores in the BIOS. Even when Windows sees only a single CPU, corruption can still happen. Using "only" elevated IRQL has the same problem, since you eventually have to allow the dispatcher to run.

    But reading your question again, you are seeing data loss. Is it possible that you are not handling coalesced interrupts properly? Device timing and chipset hardware trigger different behaviours on different machines and different Windows versions.

  • DBF_DGR Member Posts: 20

    MBond2,

    Thanks for your comment. I agree; this is why it's called a "preemptive multitasking scheduler". About the coalesced interrupts, that was one of our hypotheses, but the problem happens at very low speed too. The interrupt handler is quite simple and disables the interrupt first; the DPC re-enables it once it has processed the event. I know it sounds strange, but this is what the IP's designer requests.
    But I'll keep it in mind, and I've just added a simple test: a counter, incremented in the ISR and decremented in the DPC (interlocked functions); it should never exceed ONE.
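
    Schematically, the test is (a rough sketch; the interrupt enable/disable helpers are hypothetical, and the context is the one from the sketch above):

    ```c
    // Rough sketch of the coalescing test: the counter should never exceed
    // one, because the ISR masks the device interrupt until the DPC has
    // processed the event and re-enabled it.
    static volatile LONG g_PendingEvents;

    BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
    {
        PDEVICE_CONTEXT ctx = (PDEVICE_CONTEXT)Context;
        LONG pending;

        UNREFERENCED_PARAMETER(Interrupt);

        DisableDeviceInterrupt(ctx);            // hypothetical helper
        pending = InterlockedIncrement(&g_PendingEvents);
        ASSERT(pending <= 1);                   // coalescing would fire this
        KeInsertQueueDpc(&ctx->RxDpc, NULL, NULL);
        return TRUE;
    }

    VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID A1, PVOID A2)
    {
        PDEVICE_CONTEXT ctx = (PDEVICE_CONTEXT)Context;

        // ... process the completed XDMA descriptor(s) ...

        InterlockedDecrement(&g_PendingEvents);
        EnableDeviceInterrupt(ctx);             // hypothetical helper
    }
    ```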

    Regards,
    Eric.

  • DBF_DGR Member Posts: 20

    Hi @Peter_Viscarola_(OSR),
    About your previous projects with the Xilinx XDMA, did you use it in streaming mode?
    This would be a major help for us.

    Regards,
    Eric.

  • Peter_Viscarola_(OSR) Administrator Posts: 9,131

    About the XDMA, did you use the AXI-Stream mode like we do here?
    Did you use any FIFOs on the board: PCIe <=> XDMA <=> FIFOs?

    Sorry to not answer your question more quickly... I've been sick for the past couple of days.

    The (incredibly not helpful) answer to your question is "I don't know" -- we were responsible for the DRIVER, but not for the FPGA programming. We used an (updated and customized) version of the (infamous) LibXDMA on the driver side, PLUS support for a ton of other features, such as transfers via the AXI Address Translator (formerly called the AXI "Slave Bridge").

    I'm sorry to make you wait, and then not be able to provide any useful information.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DBF_DGR Member Posts: 20

    @Peter_Viscarola_(OSR): I hope you're doing well now.

    About XDMA, let me tell you that I've re-developed a brand-new XDMA layer: no LibXDMA here (we've surely read the same comments on the forum).

    We did the same for the Linux driver (based on the DMAEngine).
    So there's no LibXDMA issue, but others are more than possible.

    Driver side: perfect, we are on the same page. Do you remember if it was streaming or memory-mapped mode?
    Here, we only use XDMA on BAR1 and some registers in BAR0, all with MSI interrupts.

    Regards,
    Eric.

  • Peter_Viscarola_(OSR) Administrator Posts: 9,131

    Hi Eric,

    Do you remember if it was streaming or memory-mapped mode

    I remember that we set up a set of S/G descriptors, and did a series of discrete transfers between host memory and FPGA memory (IOW, we didn't continually stream data into host memory using XDMA... when we did that, we used the Address Translator).

    I'm sorry I can't be more helpful.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • MBond2 Member Posts: 629

    That does not sound strange to me. It sounds like a common interrupt moderation scheme. In fact, many devices leave interrupts disabled while operating at their highest throughput, transitioning to a polling mode to improve overall system efficiency through a reduction in interrupts and context switches.

    I assume that you are aware that, in a design like this, your DPC should be prepared to handle multiple events, and must check the hardware once more after re-enabling the interrupt. Both of these are to avoid data loss (or long delays).
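
    Schematically, something like this (a rough sketch; the device-access helpers are invented):

    ```c
    // Rough sketch of the drain-then-recheck pattern: handle everything
    // pending, re-enable the interrupt, then look once more to close the
    // window where an event completes between the last check and the enable.
    VOID MyDpcRoutine(PKDPC Dpc, PVOID Context, PVOID A1, PVOID A2)
    {
        PDEVICE_CONTEXT ctx = (PDEVICE_CONTEXT)Context;

        for (;;) {
            // Drain every event the hardware has completed so far.
            while (DeviceHasCompletedEvent(ctx)) {
                ProcessOneEvent(ctx);
            }

            EnableDeviceInterrupt(ctx);

            // Recheck: an event may have completed just before the enable,
            // and its interrupt may already have been swallowed.
            if (!DeviceHasCompletedEvent(ctx)) {
                break;                    // truly idle; next event will interrupt
            }
            DisableDeviceInterrupt(ctx);  // more work arrived; go around again
        }
    }
    ```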

  • DBF_DGR Member Posts: 20

    Hi all,
    Sorry for my late reply: bug hunting on Friday afternoon, and I lost track of time ;-).

    So,
    @Peter_Viscarola_(OSR) thanks for your reply; it tells me that this was the memory-mapped mode.

    @MBond2: you're so right, I had forgotten this point for high-speed cards. Fortunately, this is not our case yet. At the moment, the minimum interrupt period is about 160 µs, which current CPUs (3 GHz and above) can handle correctly. But we experience trouble even at low speed (4 ms ISR period).
    BTW, I'll take note of your remark for our next release: a simple while() loop in the DPC, as you sketched, will do the trick. THANKS!

    Our latest tests/fixes are leading us to another hypothesis: the mix of DMA streams and register accesses. Each is on a separate BAR of the same card. If only the DMA streams are active, everything is fine. Trouble occurs when we mix DMA with register accesses.

    So it looks more like an issue with the IP itself.

    Has someone already heard of this?
    To be continued ...

    Regards,
    Eric.

  • Tim_Roberts Member - All Emails Posts: 14,719

    Now, THAT is a plausible explanation. Verilog makes it pretty easy to confuse "this changes state immediately" and "this changes state at the next clock tick", and the difference can trigger ugly errors.

    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.
