PCIe DMA not working as expected

Guys… there’s GOT to be a simple explanation here. I can pretty much guarantee you that, whatever your problem is, it has nothing to do with cache, memory barriers, fences, neutrinos, left-spin vs right-spin, or anything else similarly esoteric.

If you were experienced Windows devs, and all of a sudden, you were seeing a problem… or if you were seeing a problem SOMEtimes… or you were on ARM64, which probably hasn’t been nearly as well tested (in terms of the Windows abstractions)… then MAYBE I’d buy that this is a memory barrier problem.

Even worrying about KeFlushIoBuffers is a bit of a stretch. Until just a few years ago, this function was a no-op on x86 and x64 architecture systems. From wdm.h:

#if (NTDDI_VERSION >= NTDDI_WINTHRESHOLD)

VOID
KeFlushIoBuffers (
    _In_ PMDL Mdl,
    _In_ BOOLEAN ReadOperation,
    _In_ BOOLEAN DmaOperation
    );

#else

#define KeFlushIoBuffers(Mdl, ReadOperation, DmaOperation)

#endif

Soooo… isn’t it much more likely that (a) there’s a bug in your FPGA, or (b) you’re making some simple error in your Windows API calls?

I just did this with a FPGA dev, four weeks ago. He SWORE he was doing a DMA to the memory segment I provided… but on closer inspection of his code, he saw… oooops… wrong address. So he was doing a DMA to some random place in physical memory. Ooopsie!

And presence or absence of an IOMMU has no bearing on anything. This is all cooked into calling GetAlignedLogicalAddress… the “Logical Address” is provided by the HAL and takes into account the IOMMU. This is why we don’t call MmGetPhysicalAddress, but instead WdfCommonBufferGetAlignedLogicalAddress.

SO, let’s go back to first principles, shall we?

  • Let’s be sure you’re programming your registers with the LogicalAddress – all 64 bits of it.
  • Let’s be sure the rest of the registers are set up right… For the guy using AXI CDMA… did you ever tell me whether this was simple mode or not? Regardless, see if you can get things working first with simple mode… then, if you need S/G and descriptors, you can worry about those.
  • Let’s be sure you’re looking at the data after a device-to-host memory transfer in the debugger, and NOT from some program you’ve written (too many chances for errors).
  • Let’s make sure that when you look at the data in the debugger, you look at it using the memory window, first using the kernel virtual address (that you get back from GetAlignedVirtualAddress) and then the “physical memory” address you get back from GetAlignedLogicalAddress.
  • Set up ChipScope or SignalTap or whatever… and see if you can monitor that DMA operation (easy for ME to say, never having actually used either one of these tools… I’m a host-side software guy, not an FPGA guy… though I sometimes masquerade as one).
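
For the first bullet, here’s a minimal sketch of what “all 64 bits” means when the device takes the address as two 32-bit registers. The struct, field, and function names are hypothetical (your register map will differ), and this is plain user-mode C for illustration, not driver code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register image: a DMA engine that takes its 64-bit
   destination address as two 32-bit registers. Names are illustrative. */
typedef struct {
    uint32_t dst_addr_lo;   /* bits 31..0 of the logical address  */
    uint32_t dst_addr_hi;   /* bits 63..32 of the logical address */
} dma_addr_regs;

/* Rebuild the 64-bit address from the two register values. */
static uint64_t join_logical_address(dma_addr_regs r)
{
    return ((uint64_t)r.dst_addr_hi << 32) | r.dst_addr_lo;
}

/* Split a 64-bit logical address into the two register values. */
static dma_addr_regs split_logical_address(uint64_t logical)
{
    dma_addr_regs r;
    r.dst_addr_lo = (uint32_t)(logical & 0xFFFFFFFFu);
    r.dst_addr_hi = (uint32_t)(logical >> 32);
    /* Cheap sanity check: nothing was lost in the split. */
    assert(join_logical_address(r) == logical);
    return r;
}
```

If the high register is left at zero and the logical address happens to be above 4 GB, the device will DMA somewhere other than your buffer, which looks exactly like “DMA not working.”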

Peter

Just for fun, here’s the code for KeFlushIoBuffers from Windows 20H1:

nt!KeFlushIoBuffers:
fffff804`3f52f7b0 48895c2410      mov     qword ptr [rsp+10h],rbx
fffff804`3f52f7b5 48896c2418      mov     qword ptr [rsp+18h],rbp
fffff804`3f52f7ba 56              push    rsi
fffff804`3f52f7bb 57              push    rdi
fffff804`3f52f7bc 4154            push    r12
fffff804`3f52f7be 4156            push    r14
fffff804`3f52f7c0 4157            push    r15
fffff804`3f52f7c2 4883ec60        sub     rsp,60h
fffff804`3f52f7c6 488b05f3c98e00  mov     rax,qword ptr [nt!_security_cookie (fffff804`3fe1c1c0)]
fffff804`3f52f7cd 4833c4          xor     rax,rsp
fffff804`3f52f7d0 4889442450      mov     qword ptr [rsp+50h],rax
fffff804`3f52f7d5 8b055db98e00    mov     eax,dword ptr [nt!KiSystemFullyCoherent (fffff804`3fe1b138)]
fffff804`3f52f7db 0f57c0          xorps   xmm0,xmm0
fffff804`3f52f7de 418ae8          mov     bpl,r8b
fffff804`3f52f7e1 448af2          mov     r14b,dl
fffff804`3f52f7e4 488bf9          mov     rdi,rcx
fffff804`3f52f7e7 0f11442430      movups  xmmword ptr [rsp+30h],xmm0
fffff804`3f52f7ec 0f11442440      movups  xmmword ptr [rsp+40h],xmm0
fffff804`3f52f7f1 85c0            test    eax,eax
fffff804`3f52f7f3 0f84d37c1500    je      nt!KeFlushIoBuffers+0x157d1c (fffff804`3f6874cc)
fffff804`3f52f7f9 488b4c2450      mov     rcx,qword ptr [rsp+50h]
fffff804`3f52f7fe 4833cc          xor     rcx,rsp
fffff804`3f52f801 e88a380900      call    nt!_security_check_cookie (fffff804`3f5c3090)
fffff804`3f52f806 4c8d5c2460      lea     r11,[rsp+60h]
fffff804`3f52f80b 498b5b38        mov     rbx,qword ptr [r11+38h]
fffff804`3f52f80f 498b6b40        mov     rbp,qword ptr [r11+40h]
fffff804`3f52f813 498be3          mov     rsp,r11
fffff804`3f52f816 415f            pop     r15
fffff804`3f52f818 415e            pop     r14
fffff804`3f52f81a 415c            pop     r12
fffff804`3f52f81c 5f              pop     rdi
fffff804`3f52f81d 5e              pop     rsi
fffff804`3f52f81e c3              ret

and (on my X64 VM):

0: kd> dd nt!KiSystemFullyCoherent
fffff804`3fe1b138  00000001 000032c9 00000001 02000504

So… put as many calls to KeFlushIoBuffers as you want, anywhere you want… :wink:
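
Read as C, the path through that listing looks roughly like the sketch below. This is only a model of the control flow: the variable stands in for nt!KiSystemFullyCoherent, the slow path (the target of the je) isn’t in the listing so it’s just a counter here, and every name is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for nt!KiSystemFullyCoherent; 1 matches the dd output above. */
static uint32_t ki_system_fully_coherent = 1;

/* Counts how many times the out-of-line slow path would have been taken. */
static unsigned slow_path_calls = 0;

/* Placeholder for the code at the je target, which the listing does not show. */
static void flush_slow_path(uint32_t coherent)
{
    assert(coherent == 0);   /* only reachable on a non-coherent system */
    slow_path_calls++;
}

/* Model of the branch: test eax,eax / je <slow path>, otherwise fall
   through to the epilogue and return without touching the MDL. */
static void model_ke_flush_io_buffers(void)
{
    uint32_t coherent = ki_system_fully_coherent;
    if (coherent == 0) {
        flush_slow_path(coherent);
        return;
    }
    /* Fully coherent: nothing to flush, just return. */
}
```

With the flag at 1, as on my VM, the call falls straight through and the slow path never runs.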

Let me hasten to add that the above is strictly aimed at x86/x64 architecture machines. ARM… is a different story. And Mr @Gregory_G._Dyess in his comments hasn’t been talking about cache flushing in any case, he’s been talking about instruction re-ordering.

Peter

@“Peter_Viscarola_(OSR)” thank you for your very detailed analysis. I have a feeling, at least in my case, I do not understand the Synopsys DMA controller programming model so there is a chance (pretty big) that I am not programming the DMA controller properly. I am still looking through the documentation. I suppose I should refrain from posting anything here until I get that sorted out.

We were able to resolve the problem!
Many thanks to everyone who helped both in and out of this thread.

The issue, in case it helps someone else, was that we are using a 32-bit address space on the Zynq FPGA. To map from the address space of the FPGA to the 64-bit address space of the host, we use the Address Translation registers of the AXI Memory Mapped to PCI Express core ( https://www.xilinx.com/support/documentation/ip_documentation/axi_pcie/v2_8/pg055-axi-bridge-pcie.pdf ). We correctly set the AXI Base Address Translation Configuration Registers. What we missed was that the lower address translation registers are limited by the address width of each AXI BAR, which is described on pg. 9 of XAPP1171 ( https://www.xilinx.com/support/documentation/application_notes/xapp1171-pcie-central-dma-subsystem.pdf ). In our case the AXI BARs are 64 MB, so we had to call WdfDeviceSetAlignmentRequirement with 0x3FFFFFF (64 MB minus 1) so that the buffer is aligned to the BAR size. Hopefully this helps you too, @Ramakrishna_Saripalli. Otherwise, I wish you good luck getting to the bottom of your bug.
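
To make the arithmetic concrete, here is a small sketch, assuming a power-of-two AXI BAR size. The helper names are made up for illustration; only the 64 MB size and the 0x3FFFFFF value come from our setup. The last helper is a simplified model (an assumption, not taken from the core’s source) of the low bits of the translation register being replaced by the AXI address bits:

```c
#include <assert.h>
#include <stdint.h>

/* Alignment mask for a power-of-two window: size minus one. */
static uint32_t alignment_mask_for_window(uint64_t window_bytes)
{
    /* The mask trick only works for non-zero powers of two. */
    assert(window_bytes != 0 && (window_bytes & (window_bytes - 1)) == 0);
    return (uint32_t)(window_bytes - 1);
}

/* True if a host logical address starts on a window boundary. */
static int is_aligned_to_window(uint64_t logical, uint64_t window_bytes)
{
    return (logical & (window_bytes - 1)) == 0;
}

/* Simplified model: the bridge keeps only the upper bits of the value
   written to the translation register and takes the lower bits from
   the AXI-side address within the window. */
static uint64_t model_translate(uint64_t xlate_reg, uint64_t axi_offset,
                                uint64_t window_bytes)
{
    uint64_t mask = window_bytes - 1;
    return (xlate_reg & ~mask) | (axi_offset & mask);
}
```

Under this model, a buffer at 0x10A000000 (not 64 MB aligned) ends up being accessed at 0x108000000, because the low 26 bits of the programmed address are dropped. A buffer at 0x10C000000 is aligned and translates unchanged.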

Thanks again everyone and enjoy the weekend.
-David

Nice follow-up… thanks for telling us the ultimate solution to your problem.

Yup… Horses.

Peter