Disabling PCIe relaxed ordering on the Root Complex

Hi Tim,
Thanks for your answer.
I fully understand that TLP ordering is purely a matter for the PCIe controller. But I presumed that Windows could disable it through the PCIe controller's configuration, overriding the card's configuration space. After all, as you write, Linux does it well in the PCI subsystem.

Regarding the relaxed-ordering bit in the TLP: the device does have such a setting, but it looks ineffective. When I disable TLP re-ordering this way, packets are simply lost. But in that case, what happens on the PCIe root complex? Will it still send re-ordered TLPs because the card advertises support for them? The TLP re-order bit only applies to TLPs issued by the device, not by the PCIe root complex, which could explain the packet losses.

Thanks for your help.

Eric.

TLP reordering should never cause packet loss. You might get data corruption if you try to do a read while a posted write is still queued, but nothing would get lost. Reordering just allows some packets to be processed ahead of others; reordering of TLPs going in your direction should have no effect. It is all about the interaction of reads and writes. I assume you have read the somewhat murky articles about TLP reordering. Do you have a (horribly expensive) PCI Express bus analyzer?

“After all, as you write, Linux does it well in the PCI subsystem.”

Over the years I have developed a lot of drivers for boards that had supposedly been "well tested and work fine on Linux". Of the probably two dozen boards, exactly one did not need firmware changes to pass PCI compliance. The biggest pain was getting the companies to rent a PCI test setup to see how far off the board was, or to accept that just because it worked on Linux did not mean it was compliant.

Tim,

Packet losses: that is exactly what I am wondering: WHY such losses?? Actually, I'm not even sure they are packet losses: the output looks very strange, even more so than with re-ordered packets.
So, as you said, simply disabling TLP re-ordering on the device should do the trick, even if it is not disabled on the PCIe root complex?

PCIe analyzer: one of my job dreams at the moment :wink: expensive and quite impossible to rent.

But definitely required in such a situation.

Best regards,
Eric (lunch time here),
Have a good day.

Hi Don,

Sidenote warning :wink:
Here, this is not a matter of compliance, but of working around a bug. Linux does this to overcome an issue with some Intel CPUs: https://elixir.bootlin.com/linux/latest/source/drivers/pci/quirks.c#L4316.
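For what it's worth, on the Windows side the closest thing a function driver can do by itself is to clear the "Enable Relaxed Ordering" bit in its own device's PCIe Device Control register, which is what the Linux quirk ends up causing for devices below an affected root port (I know of no documented way to touch the Root Complex itself from a driver). A minimal, untested sketch (KMDF, via BUS_INTERFACE_STANDARD; the function name is mine and error handling is trimmed):

```c
#include <ntddk.h>
#include <wdf.h>
#include <wdmguid.h>    // GUID_BUS_INTERFACE_STANDARD

#define PCIE_CAP_ID                0x10    // PCI Express capability ID
#define PCIE_DEVCTL_OFFSET         0x08    // Device Control, relative to the capability
#define PCIE_DEVCTL_RELAXED_ORDER  0x0010  // "Enable Relaxed Ordering" (bit 4)

// Sketch only: clear the Relaxed Ordering Enable bit in our own function's
// Device Control register. Call at PASSIVE_LEVEL, e.g. from
// EvtDevicePrepareHardware.
static NTSTATUS XdmaDisableRelaxedOrdering(WDFDEVICE Device)
{
    BUS_INTERFACE_STANDARD busIf;
    NTSTATUS status;
    UCHAR capPtr = 0;
    UCHAR header[2];            // capability ID + next pointer
    USHORT devControl;
    ULONG guard;

    status = WdfFdoQueryForInterface(Device, &GUID_BUS_INTERFACE_STANDARD,
                                     (PINTERFACE)&busIf,
                                     (USHORT)sizeof(busIf), 1, NULL);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    // The capability list pointer is at offset 0x34 of a type-0 header.
    busIf.GetBusData(busIf.Context, PCI_WHICHSPACE_CONFIG,
                     &capPtr, 0x34, sizeof(capPtr));

    for (guard = 0; capPtr != 0 && guard < 48; guard++) {
        busIf.GetBusData(busIf.Context, PCI_WHICHSPACE_CONFIG,
                         header, capPtr, sizeof(header));

        if (header[0] == PCIE_CAP_ID) {
            ULONG ctlOffset = capPtr + PCIE_DEVCTL_OFFSET;

            busIf.GetBusData(busIf.Context, PCI_WHICHSPACE_CONFIG,
                             &devControl, ctlOffset, sizeof(devControl));
            devControl &= (USHORT)~PCIE_DEVCTL_RELAXED_ORDER;
            busIf.SetBusData(busIf.Context, PCI_WHICHSPACE_CONFIG,
                             &devControl, ctlOffset, sizeof(devControl));
            break;
        }
        capPtr = header[1];     // follow the Next Capability pointer
    }

    busIf.InterfaceDereference(busIf.Context);
    return STATUS_SUCCESS;
}
```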

Compliance is sometimes very subjective. Take ACPI: it was specified by Intel, and Microsoft's implementation was not so close to the spec, yet it is the one manufacturers followed.

Relaxed ordering is not the only case where Linux and Windows differ in how they handle the hardware (interrupt handling is another one).

Regards,
Eric.

Don,

One question: do you remember which PCIe implementations were used on those boards (ASIC/FPGA/CPU)?
Ours is based on a Xilinx IP (XDMA): https://docs.xilinx.com/v/u/en-US/pg195-pcie-dma

Thanks for your help,
Regards,
Eric.

I don’t know about Don, but I’ve used Xilinx XDMA multiple times, on Windows, with multiple (maybe 20?) MSI-X interrupts, and very high throughput loads, without any problem.

If you suspect there’s a bug in the Xilinx IP, I’d suggest that’s not very likely. In my experience, those guys really know their FPGAs.

Peter

Hi Peter,

Good to read this. I’d prefer a software bug to one in the FPGA; I can handle the former much more easily than the latter.

About the XDMA, did you use the AXI-Stream mode like we do here?
Did you use any FIFOs on the board: PCIe <=> XDMA <=> FIFOs?
At the moment, I’ve written a special driver that fills the shared buffers with a predefined pattern on each XDMA completion interrupt, to rule out any other source of perturbation. And the glitch still occurs.
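The fill itself is nothing fancy, roughly this (a simplified sketch; the context fields are placeholders and the rest of the completion path is elided):

```c
#include <ntddk.h>
#include <wdf.h>

// Placeholder device context: only the fields the diagnostic needs.
typedef struct _DEVICE_CONTEXT {
    PVOID  CommonBufferVa;      // KVA of the shared/common buffer
    size_t CommonBufferLength;  // its size in bytes
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;
WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext);

// Simplified sketch of the diagnostic: on every XDMA completion DPC the
// shared buffer is overwritten with a fixed pattern, so whatever the consumer
// sees afterwards cannot come from the normal data producer.
VOID XdmaEvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfInterruptGetDevice(Interrupt));

    UNREFERENCED_PARAMETER(AssociatedObject);

    RtlFillMemory(ctx->CommonBufferVa, ctx->CommonBufferLength, 0xA5);

    // ...then the normal completion path: hand the buffer over, re-arm the
    // next descriptor, re-enable the interrupt, etc.
}
```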

We have only found TWO computers where it never happens. One is an old (2008) PC with a Pentium E2180; the other is a Supermicro with a Core i5-3610ME where "NoSnoop" is disabled in the BIOS (yes, one where that is possible) and Relaxed Ordering is enabled (which is strange).
From what I understand, NoSnoop is about cache coherency and is not required for a PCIe card.

In the driver, I’ve tried changing the cache policy when calling AllocateCommonBuffer(), for both the data buffers and all the descriptors: no luck.
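To be precise, the knob I varied is the CacheEnabled argument of the underlying WDM AllocateCommonBuffer call, reached through the WDF DMA enabler. A stripped-down sketch of that allocation path (names are placeholders):

```c
#include <ntddk.h>
#include <wdf.h>

// Stripped-down sketch of the allocation whose cache policy was varied.
// The WDF DMA enabler exposes the underlying WDM DMA_ADAPTER, and its
// AllocateCommonBuffer routine takes an explicit CacheEnabled flag
// (the last argument), which is what was toggled between the tests.
static PVOID AllocateXdmaCommonBuffer(
    WDFDMAENABLER     DmaEnabler,
    ULONG             Length,
    BOOLEAN           CacheEnabled,
    PPHYSICAL_ADDRESS LogicalAddress)
{
    PDMA_ADAPTER adapter =
        WdfDmaEnablerWdmGetDmaAdapter(DmaEnabler, WdfDmaDirectionReadFromDevice);

    return adapter->DmaOperations->AllocateCommonBuffer(adapter,
                                                        Length,
                                                        LogicalAddress,
                                                        CacheEnabled);
}
```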

Thanks for your help,
Regards,
Eric.

It is a complete guess on my part, but the fact that you don’t see your problem on old slow hardware with few cores, but do on other hardware, suggests to me that you have a thread synch problem of some kind. Also, if changing CPU cache settings matters, that’s another thing that will affect timing and not much else - which also points to a synchronization issue.

Hi MBond2,

We’ve already gone down that path (restricting to a single core and disabling SMT in the BIOS), but we got the same results.

We’ve also tried filling the DMA buffers directly in the DPC (no race condition with PASSIVE_LEVEL code there) and got the same results.

BTW, we’ve already hit an issue with DPC scheduling on multi-core systems: it is not well documented in MSDN.

Thanks for your help.

Regards.
Eric.

BTW, we’ve already hit an issue with DPC scheduling on multi-core systems: it is not well documented in MSDN.

Care to share that with us?

Peter

Peter,

About the DPC race condition: there are two DPCs (one per direction) with their associated DPC objects, and we observed that the DPCs can be scheduled in parallel (one per core). That’s fine for us; it is protected by design, and the shared areas use a spinlock. I found nothing about DPC scheduling on multi-core CPUs. It’s perfectly logical to run one DPC object per core, but it would be helpful to have this documented, or maybe I didn’t search hard enough.
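For reference, the protection boils down to this kind of pattern (a simplified sketch; the structure and routine names are placeholders for the real ones):

```c
#include <ntddk.h>

// Simplified sketch: the H2C and C2H DPCs can run at the same time on two
// different cores, so anything they share is guarded by a single spin lock.
typedef struct _SHARED_STATE {
    KSPIN_LOCK Lock;
    ULONG      CompletedH2C;
    ULONG      CompletedC2H;
} SHARED_STATE, *PSHARED_STATE;

VOID H2cDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    PSHARED_STATE state = (PSHARED_STATE)Context;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    // Already at DISPATCH_LEVEL inside a DPC, so the AtDpcLevel variant is enough.
    KeAcquireSpinLockAtDpcLevel(&state->Lock);
    state->CompletedH2C++;
    KeReleaseSpinLockFromDpcLevel(&state->Lock);
}

VOID C2hDpcRoutine(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    PSHARED_STATE state = (PSHARED_STATE)Context;

    UNREFERENCED_PARAMETER(Dpc);
    UNREFERENCED_PARAMETER(Arg1);
    UNREFERENCED_PARAMETER(Arg2);

    KeAcquireSpinLockAtDpcLevel(&state->Lock);
    state->CompletedC2H++;
    KeReleaseSpinLockFromDpcLevel(&state->Lock);
}
```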

Eric.

OK… Thanks for sharing.

Yes… that’s a standard “feature” of how DPCs work on Windows.

Peter

You can’t assess whether you have a thread sync problem by disabling cores in the BIOS. Even when Windows sees only a single CPU, corruption can still happen. Using ‘only’ elevated IRQL has the same problem, since you eventually have to allow the dispatcher to run.

But reading your question again, you are seeing data loss. Is it possible that you are not handling coalesced interrupts properly? Device timing and chipset hardware trigger different behaviours on different machines and different Windows versions.

MBond2,

Thanks for your comment. I agree, that is why it is called a “preemptive multitasking scheduler”. About coalesced interrupts: it was one of our hypotheses, but the problem happens at very low speeds too. The interrupt handler is quite simple and disables the interrupt first; the DPC re-enables it once it has processed the event. I know it sounds strange, but this is what the IP’s designer requires.
But I’ll keep it in mind, and I’ve just added a simple test: a counter, incremented in the ISR and decremented in the DPC (using the interlocked functions), which should never go above ONE.
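The test itself is tiny, roughly this (a sketch; the mask/unmask and processing helpers are placeholders for the real register accesses):

```c
#include <ntddk.h>
#include <wdf.h>

// Device-specific interrupt mask/unmask and completion handling; placeholders.
VOID XdmaMaskDeviceInterrupt(WDFDEVICE Device);
VOID XdmaUnmaskDeviceInterrupt(WDFDEVICE Device);
VOID XdmaProcessCompletedDescriptors(WDFDEVICE Device);

// Sanity check: with the "ISR masks, DPC unmasks" scheme, at most one event
// should ever be in flight, so this counter must never exceed 1.
static volatile LONG g_PendingEvents = 0;

BOOLEAN XdmaEvtInterruptIsr(WDFINTERRUPT Interrupt, ULONG MessageId)
{
    WDFDEVICE device = WdfInterruptGetDevice(Interrupt);
    LONG pending;

    UNREFERENCED_PARAMETER(MessageId);

    XdmaMaskDeviceInterrupt(device);

    pending = InterlockedIncrement(&g_PendingEvents);
    NT_ASSERT(pending <= 1);

    WdfInterruptQueueDpcForIsr(Interrupt);
    return TRUE;
}

VOID XdmaEvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    WDFDEVICE device = WdfInterruptGetDevice(Interrupt);

    UNREFERENCED_PARAMETER(AssociatedObject);

    XdmaProcessCompletedDescriptors(device);

    // Decrement before unmasking, so a new ISR cannot see a stale count of 1.
    InterlockedDecrement(&g_PendingEvents);
    XdmaUnmaskDeviceInterrupt(device);
}
```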

Regards,
Eric.

Hi @“Peter_Viscarola_(OSR)” ,
About your previous projects with Xilinx XDMA, did you use it in streaming mode?
This would be a major help for us.

Regards,
Eric.

About the XDMA, did you use the AXI-Stream mode like we do here?
Did you use any FIFOs on the board: PCIe <=> XDMA <=> FIFOs?

Sorry not to answer your question more quickly… I’ve been sick for the past couple of days.

The (incredibly unhelpful) answer to your question is “I don’t know”. We were responsible for the DRIVER, but not for the FPGA programming. We used an (updated and customized) version of the (infamous) LibXDMA on the driver side, PLUS we supported a ton of other features, such as transfers via the AXI Address Translator (formerly called the AXI “Slave Bridge”).

I’m sorry to make you wait, and then not be able to provide any useful information.

Peter

@“Peter_Viscarola_(OSR)” : I hope you’re doing well now.

About XDMA, let me tell you that I’ve developed a brand-new XDMA layer from scratch: no libXDMA here (we’ve surely read the same comments on the forum).

We did the same for the Linux driver (based on the DMAEngine framework).
So there is no libXDMA issue, but other issues are more than possible.

Driver side: perfect, we are on the same page. Do you remember whether it was streaming or memory-mapped mode?
Here, we only use XDMA on BAR1 and some registers in BAR0, all with MSI interrupts.

Regards,
Eric.

Hi Eric,

Do you remember whether it was streaming or memory-mapped mode

I remember that we set up a set of S/G descriptors and did a series of discrete transfers between host memory and FPGA memory (IOW, we didn’t continually stream data into host memory using XDMA… when we needed that, we used the Address Translator).

I’m sorry I can’t be more helpful.

Peter

That does not sound strange to me. It sounds like a common interrupt moderation scheme. In fact many devices leave interrupts disabled while operating at their highest throughput by transitioning to a polling mode to improve overall system efficiency through a reduction in interrupts and context switches.

I assume that you are aware that, in a design like this, your DPC should be prepared to handle multiple events and must check the hardware once more after re-enabling the interrupt. Both of these are needed to avoid data loss (or long delays).
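Something along these lines, as a generic sketch that is not tied to the real XDMA register layout (all the register helpers are placeholders):

```c
#include <ntddk.h>
#include <wdf.h>

// Device-specific accessors; placeholders for whatever the real hardware needs.
ULONG ReadEventStatus(WDFDEVICE Device);
VOID  ProcessOneEvent(WDFDEVICE Device);
VOID  EnableDeviceInterrupt(WDFDEVICE Device);
VOID  DisableDeviceInterrupt(WDFDEVICE Device);

// Generic sketch of the "drain, re-enable, then look once more" pattern for a
// design where the ISR masks the interrupt and the DPC unmasks it.
VOID MyEvtInterruptDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
{
    WDFDEVICE device = WdfInterruptGetDevice(Interrupt);

    UNREFERENCED_PARAMETER(AssociatedObject);

    for (;;) {
        // Drain every event already reported, not just one, in case the
        // hardware coalesced several completions into a single interrupt.
        while (ReadEventStatus(device) != 0) {
            ProcessOneEvent(device);
        }

        // Unmask, then check one last time: an event that arrived in the gap
        // above would otherwise sit there until the next unrelated interrupt.
        EnableDeviceInterrupt(device);

        if (ReadEventStatus(device) == 0) {
            break;
        }

        // Something slipped in; mask again and go around once more.
        DisableDeviceInterrupt(device);
    }
}
```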
