PCIe DMA not working as expected

I am running into a similar problem. With or without VT-d, I start DMA on my device; apparently it uses a Synopsys DMA engine. The DMA controller does increment the source and target addresses by the transfer size, but the data does not show up (I am doing a device-to-memory transfer). I allocated an MDL for the common buffer (it is two pages long) and built the physical pages using MmBuildMdlForNonPagedPool during driver bringup.

Before starting the DMA, I am calling KeFlushIoBuffers(MDL, TRUE, TRUE), and the data still does not show up in my common buffer.

I am wondering if I have to transition to a packet DMA model, where I think the flushing of caches is done by the framework when I complete the transaction. Unfortunately, my DMA is not interrupt driven; it is just polled, but I think that should be OK.

@Ramakrishna_Saripalli …. Why are you building that MDL?

You don’t need to change to packet mode unless that’s what your user interface needs/wants.

Let’s start by you, also, answering the same questions I posed to the OP.

Peter

@“Peter_Viscarola_(OSR)” The WDK says that in order to use KeFlushIoBuffers I need an MDL. Regarding your questions: yes, I am getting the logical address and writing the lower 32 and upper 32 bits into the DMA controller registers. In fact, after I start the DMA I can see the destination and source addresses getting incremented by the transfer size. But the data does not show up.

Thought I would share my source code. This is one of the Synopsys DMA controllers (I do not have the specs for it); instead I have the Linux driver code, so I am trying to make the Windows driver work from it. One of these days I am going to boot Ubuntu 18.04, build the driver, and see if it works. But the Linux driver code was given to me, so I am hoping it is the reference model.

The following is done during initialization. I have removed the code for error checking (The driver has it though).
WdfDeviceSetAlignmentRequirement(
    device_ctxt->device,
    FILE_BYTE_ALIGNMENT);

WDF_DMA_ENABLER_CONFIG_INIT(
    &dmaConfig,
    WdfDmaProfilePacket64,   /* device is capable of addressing all 64 bits but has no scatter/gather */
    8192);

status = WdfDmaEnablerCreate(
    device_ctxt->device,
    &dmaConfig,
    WDF_NO_OBJECT_ATTRIBUTES,
    &device_ctxt->DmaEnablerHandle);

status = WdfCommonBufferCreate(
    device_ctxt->DmaEnablerHandle,
    device_ctxt->common_dma_buffer_size,   /* this is equal to 8192 */
    WDF_NO_OBJECT_ATTRIBUTES,
    &device_ctxt->h_common_dma_buffer);

device_ctxt->common_dma_buffer_kernel_va =
    WdfCommonBufferGetAlignedVirtualAddress(device_ctxt->h_common_dma_buffer);
device_ctxt->common_dma_buffer_device_la =
    WdfCommonBufferGetAlignedLogicalAddress(device_ctxt->h_common_dma_buffer);

RtlZeroMemory(....);   /* zero out the common buffer */

device_ctxt->common_buffer_mdl =
    IoAllocateMdl(device_ctxt->common_dma_buffer_kernel_va, 8192, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(device_ctxt->common_buffer_mdl);   /* build the physical page array for the nonpaged common buffer */

Before the DMA starts, the driver writes a 0xdeadbeef pattern to the DMA buffer using another ioctl. I have verified this works.

When I get the ioctl to start the DMA (from device memory to system memory), the driver runs the logic below. The ioctl is METHOD_BUFFERED and provides the size of the DMA and other parameters.

KeFlushIoBuffers(device_ctxt->common_buffer_mdl, TRUE, TRUE);

< I even threw in a __wbinvd() here to flush out the whole cache hierarchy. Did not make a difference >

regs->dest_addr_low = inputBuffer->internal_addr & 0xFFFFFFFF;
regs->dest_addr_high = inputBuffer->internal_addr >> 32;
regs->source_addr_low = device_ctxt->common_dma_buffer_device_la.LowPart;
regs->source_addr_high = device_ctxt->common_dma_buffer_device_la.HighPart;

regs->transfer_size = inputBuffer->num_dma_bytes;
< another write to another register in the DMA for some control operation>.
MemoryBarrier();
< write to the doorbell register to start the DMA>

After the above operation, I can see the dest_addr_low and source_addr_low being incremented by the num_dma_bytes.
The transfer_size register turns to zero.

But the data is not showing up in the common buffer. I have a DbgPrintEx after the DMA that prints the first DWORD of the common buffer, and it still shows 0xdeadbeef.

This almost sounds like a stale cache on the destination side of the DMA transfer. 

Try cleaning the cache on the destination buffer before starting the DMA transfer. 
Then do a memory barrier after the DMA completes.

Greg

@“Gregory_G._Dyess” I agree but the KeFlushIoBuffers() call should do that and that is being done right before the DMA operation.
This link https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/flushing-cached-data-during-dma-operations talks about KeFlushIoBuffers.

I also have a MemoryBarrier() just before writing to the doorbell to start the DMA so I know that all the operations before that have been completed before the DMA operation starts.

But you are recommending a MemoryBarrier() after the DMA operation completes. Is KeFlushIoBuffers() not good enough? I do not have any read or write operations on the DMA buffer between the KeFlushIoBuffers call and the DMA operation (which might cause the cache lines to be refilled).

I am suggesting the following sequence of operations:

  1. Write data to the source buffer
  2. Flush buffers and caches of the source buffer
  3. Memory barrier
  4. Invalidate the cache of the destination buffer
  5. Perform the DMA
  6. Memory barrier
  7. Enjoy your data

Keep in mind, I am primarily a kernel developer on Arm architectures (I even worked for Arm and taught software architecture classes while there). I saw a very similar issue with a PCIe driver on a Xilinx-based RFSoC (4x Cortex-A53), and it was out-of-order execution and caches that turned out to be the issue. The sequence I described above is recommended to ensure there are no stale cache lines, no data sitting in the processor’s hardware write buffers, and no out-of-order execution issues.
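Expressed against the Windows calls already in play in this thread, a minimal sketch of that ordering might look like the following (StartDeviceDma and DeviceDmaComplete are hypothetical stand-ins for the register writes and status polling shown earlier, not real APIs):

KeFlushIoBuffers(device_ctxt->common_buffer_mdl, TRUE, TRUE);  /* steps 2/4: flush or invalidate the buffer's cache lines (a no-op on cache-coherent platforms) */
KeMemoryBarrier();                                             /* step 3: keep the register writes below from being reordered ahead of the flush */

StartDeviceDma(device_ctxt, inputBuffer);                      /* step 5: program addresses/size and ring the doorbell */
while (!DeviceDmaComplete(device_ctxt))                        /* poll until transfer_size reaches zero */
    ;

KeMemoryBarrier();                                             /* step 6: order the completion check ahead of reading the buffer */
/* step 7: device_ctxt->common_dma_buffer_kernel_va now holds the transferred data */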

Intel might be different.

Greg
 

Thank you for the warm welcome,

To answer your questions, Peter:
Yes, I am giving the hardware the logical address from WdfCommonBufferGetAlignedLogicalAddress.
The configuration mode for the DMA enabler is WdfDmaProfileScatterGather64Duplex.
I do have WinDbg set up, and checking the memory shows the same data I initialized it to, both before and after the DMA.

I’ll give it a shot using KeFlushIoBuffers + memory barrier and let you all know if anything interesting happens

Thanks for the many responses so far.
-David

@DavidN I would love to see if you have any better luck with this than I have. Do you have the IOMMU enabled on your setup (assuming it supports one)? FYI, my results are the same whether VT-d is enabled or disabled.

Hmmm… I’m not sayin’ Mr @“Gregory_G._Dyess” isn’t correct… but,

  1. When I hear hoof beats, I tend to think horses not zebras

  2. I have never seen a case where a properly constructed Windows driver needs to manually add a memory barrier into the code. This is supposed to be all handled by the Windows abstractions. Note we don’t code memory barriers around register reads and writes (though, in theory at least, you might have to call KeFlushIoBuffers).

Which memory barrier, or fence, specifically are you recommending these guys add, Mr @“Gregory_G._Dyess”?

Peter

@“Peter_Viscarola_(OSR)” that is what I thought too. Given that PCIe memory register reads and writes are mapped to uncached regions, I did not think memory barriers were needed. The processor (at least x64) does not reorder around uncached regions. At least that is my understanding.

I can see why the barriers are needed if we access cached regions (normal memory).

Thanks,
RK

If I were on a Zebra Farm instead of a cattle ranch…

If it were an Arm core, it would be a memory write barrier (DMB) after writing the data and before triggering the DMA.  After the DMA completes, a memory read barrier (DMB).  A Data Synchronization Barrier would be too heavy handed and, if used too often, would kill system performance.
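As a rough sketch of that placement in C (assuming MSVC’s ARM64 __dmb intrinsic and the _ARM64_BARRIER_* constants from <intrin.h>; the device-specific helpers here are hypothetical):

fill_source_buffer(buf, len);      /* 1. write the data to the source buffer */
__dmb(_ARM64_BARRIER_ST);          /* write barrier: buffer writes complete before the doorbell write */
ring_doorbell(regs);               /* trigger the DMA */

while (!dma_complete(regs))        /* poll for completion */
    ;
__dmb(_ARM64_BARRIER_LD);          /* read barrier: completion status observed before the data is consumed */
/* now it is safe to read the destination buffer */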

Again, my experience is with Arm A-Class cores, not Intel.  Principles are similar but execution might be different. 

As I said, I’ve not written a desktop Windows driver in 20 years.  I write mostly Windows Embedded Compact (and Linux) kernel drivers now.  I’m not claiming the barriers are a fix-all.  Just the behavior being described sounded a lot like a stale cache and/or out of order execution.

Greg

Barriers are not for synchronizing caches.  That would be the cache maintenance instructions which, as Mr Viscarola pointed out, should be handled by the Windows-provided driver framework(s).  The barriers simply keep the processor from executing certain sequences of code out of order.

Greg

Unfortunately, no luck yet for me. IOMMU being enabled/disabled didn’t influence anything on my end either, nor did KeFlushIoBuffers. My initialization code looks very similar to what you posted, RK.
Though it’s possible I’m doing something wrong, I appear to be following the documentation on the use of KeFlushIoBuffers; just in case, I tried moving it around to different points in the code, with different configurations, to no avail. To be honest I’m at a bit of a loss for what’s even left to try, but I’ll keep at it and let you know of any breakthroughs.

Some extra info, if it helps at all: the destination address is a block RAM on the FPGA, so unless I’m mistaken it should not be subject to cache problems.

Thanks,
David

@Ramakrishna_Saripalli said:
@DavidN I would love to see if you have any better luck with this than I have. Do you have the IOMMU enabled on your setup (assuming it supports one)? FYI, my results are the same whether VT-d is enabled or disabled.

@DavidN My design is using the Synopsys DesignWare Cores PCI Express controller. I have not had much luck either, although I have a feeling I might be programming the DMA controller incorrectly. You are right: unless the design has a cache for the onboard RAM, it should not be subject to cache problems.

Guys… there’s GOT to be a simple explanation here. I can pretty much guarantee you that, whatever your problem is, it has nothing to do with cache, memory barriers, fences, neutrinos, left-spin vs right-spin, or anything else similarly esoteric.

If you were experienced Windows devs, and all of a sudden, you were seeing a problem… or if you were seeing a problem SOMEtimes… or you were on ARM64, which probably hasn’t been nearly as well tested (in terms of the Windows abstractions)… then MAYBE I’d buy that this is a memory barrier problem.

Even worrying about KeFlushIoBuffers is a bit of a stretch. Until just a few years ago, this function was a no-op on x86 and x64 architecture systems. From wdm.h:

#if (NTDDI_VERSION >= NTDDI_WINTHRESHOLD)

VOID
KeFlushIoBuffers (
    _In_ PMDL Mdl,
    _In_ BOOLEAN ReadOperation,
    _In_ BOOLEAN DmaOperation
    );

#else

#define KeFlushIoBuffers(Mdl, ReadOperation, DmaOperation)

#endif

Soooo… isn’t it much more likely that (a) there’s a bug in your FPGA, or (b) you’re making some simple error in your Windows API calls?

I just did this with a FPGA dev, four weeks ago. He SWORE he was doing a DMA to the memory segment I provided… but on closer inspection of his code, he saw… oooops… wrong address. So he was doing a DMA to some random place in physical memory. Ooopsie!

And presence or absence of an IOMMU has no bearing on anything. This is all cooked into calling GetAlignedLogicalAddress… the “Logical Address” is provided by the HAL and takes into account the IOMMU. This is why we don’t call MmGetPhysicalAddress, but instead WdfCommonBufferGetAlignedLogicalAddress.
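In code terms, the distinction looks roughly like this (a sketch reusing the names from the initialization code posted earlier in the thread):

/* CPU-relative physical address: ignores any IOMMU / DMA remapping the HAL may have set up.
   Don't hand this to the device. */
PHYSICAL_ADDRESS cpuPhysical = MmGetPhysicalAddress(device_ctxt->common_dma_buffer_kernel_va);

/* HAL-provided device ("logical") address: this is what the DMA engine's registers should be programmed with. */
PHYSICAL_ADDRESS deviceLogical = WdfCommonBufferGetAlignedLogicalAddress(device_ctxt->h_common_dma_buffer);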

SO, let’s go back to first principles, shall we?

  • Let’s be sure you’re programming your registers with the LogicalAddress – all 64 bits of it.
  • Let’s be sure the rest of the registers are set up right… For the guy using the AXI CDMA: did you ever tell me whether this was simple mode or not? Regardless, see if you can get things working in simple mode first… then you can worry about S/G and descriptors if you need to.
  • Let’s be sure you’re looking at the data after a device-to-host memory transfer in the debugger, and NOT from some program you’ve written (too many chances for errors)
  • Let’s make sure that when you look at the data in the debugger, you try looking at it in the memory window, first using the kernel virtual address (that you get back from GetAlignedVirtualAddress) and then the “physical memory” address you get back from GetAlignedLogicalAddress (see the WinDbg sketch after this list)
  • Setup ChipScope or SignalTap or whatever… and see if you can monitor that DMA operation (easy for ME to say, never having actually used either one of these tools… I’m a host-side software guy, not an FPGA guy… though I sometimes masquerade as one).
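For the debugger step, the two views look something like this in WinDbg (the addresses below are placeholders for whatever GetAlignedVirtualAddress and GetAlignedLogicalAddress returned; dd reads virtual memory, !dd reads physical memory):

0: kd> dd ffffd08a`12340000
0: kd> !dd 2a340000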

Peter

Just for fun, here’s the code for KeFlushIoBuffers from Windows 20H1:

nt!KeFlushIoBuffers:
fffff804`3f52f7b0 48895c2410      mov     qword ptr [rsp+10h],rbx
fffff804`3f52f7b5 48896c2418      mov     qword ptr [rsp+18h],rbp
fffff804`3f52f7ba 56              push    rsi
fffff804`3f52f7bb 57              push    rdi
fffff804`3f52f7bc 4154            push    r12
fffff804`3f52f7be 4156            push    r14
fffff804`3f52f7c0 4157            push    r15
fffff804`3f52f7c2 4883ec60        sub     rsp,60h
fffff804`3f52f7c6 488b05f3c98e00  mov     rax,qword ptr [nt!_security_cookie (fffff804`3fe1c1c0)]
fffff804`3f52f7cd 4833c4          xor     rax,rsp
fffff804`3f52f7d0 4889442450      mov     qword ptr [rsp+50h],rax
fffff804`3f52f7d5 8b055db98e00    mov     eax,dword ptr [nt!KiSystemFullyCoherent (fffff804`3fe1b138)]
fffff804`3f52f7db 0f57c0          xorps   xmm0,xmm0
fffff804`3f52f7de 418ae8          mov     bpl,r8b
fffff804`3f52f7e1 448af2          mov     r14b,dl
fffff804`3f52f7e4 488bf9          mov     rdi,rcx
fffff804`3f52f7e7 0f11442430      movups  xmmword ptr [rsp+30h],xmm0
fffff804`3f52f7ec 0f11442440      movups  xmmword ptr [rsp+40h],xmm0
fffff804`3f52f7f1 85c0            test    eax,eax
fffff804`3f52f7f3 0f84d37c1500    je      nt!KeFlushIoBuffers+0x157d1c (fffff804`3f6874cc)
fffff804`3f52f7f9 488b4c2450      mov     rcx,qword ptr [rsp+50h]
fffff804`3f52f7fe 4833cc          xor     rcx,rsp
fffff804`3f52f801 e88a380900      call    nt!_security_check_cookie (fffff804`3f5c3090)
fffff804`3f52f806 4c8d5c2460      lea     r11,[rsp+60h]
fffff804`3f52f80b 498b5b38        mov     rbx,qword ptr [r11+38h]
fffff804`3f52f80f 498b6b40        mov     rbp,qword ptr [r11+40h]
fffff804`3f52f813 498be3          mov     rsp,r11
fffff804`3f52f816 415f            pop     r15
fffff804`3f52f818 415e            pop     r14
fffff804`3f52f81a 415c            pop     r12
fffff804`3f52f81c 5f              pop     rdi
fffff804`3f52f81d 5e              pop     rsi
fffff804`3f52f81e c3              ret

and (on my X64 VM):

0: kd> dd nt!KiSystemFullyCoherent
fffff804`3fe1b138  00000001 000032c9 00000001 02000504

So… put as many calls to KeFlushIoBuffers as you want, anywhere you want… :wink:

Let me hasten to add that the above is strictly aimed at x86/x64 architecture machines. ARM… is a different story. And Mr @Gregory_G._Dyess in his comments hasn’t been talking about cache flushing in any case, he’s been talking about instruction re-ordering.

Peter

@“Peter_Viscarola_(OSR)” thank you for your very detailed analysis. I have a feeling, at least in my case, I do not understand the Synopsys DMA controller programming model so there is a chance (pretty big) that I am not programming the DMA controller properly. I am still looking through the documentation. I suppose I should refrain from posting anything here until I get that sorted out.

We were able to resolve the problem!
Many thanks to everyone who helped both in and out of this thread.

The issue, in case it helps someone else, was that we are using a 32-bit address space on the Zynq FPGA. To map from the FPGA’s address space to the 64-bit address space of the host, we use the Address Translation registers of the AXI Memory Mapped to PCI Express core that we are using ( https://www.xilinx.com/support/documentation/ip_documentation/axi_pcie/v2_8/pg055-axi-bridge-pcie.pdf ). We correctly set the AXI Base Address Translation Configuration Registers; however, what we missed was that the lower address translation registers are limited by the address width of each AXI BAR, which is described on pg. 9 of XAPP1171 ( https://www.xilinx.com/support/documentation/application_notes/xapp1171-pcie-central-dma-subsystem.pdf ). In our case we have 64 MB AXI BARs, so we had to call WdfDeviceSetAlignmentRequirement with 0x3FFFFFF (see the sketch below). Hopefully this helps you too, @Ramakrishna_Saripalli; otherwise I wish you good luck getting to the bottom of your bug.
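In concrete terms, the fix boils down to something like this (a minimal sketch; device here stands for the WDFDEVICE handle, and the comment reflects the BAR-width limitation described above):

/* 64 MB AXI BAR: the translation register only supplies the upper address bits,
   so the common buffer must be 64 MB aligned. The requirement is expressed as (alignment - 1). */
WdfDeviceSetAlignmentRequirement(device, 0x3FFFFFF);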

Thanks again everyone and enjoy the weekend.
-David

Nice follow-up… thanks for telling us the ultimate solution to your problem.

Yup… Horses.

Peter