PCIe DMA not working as expected

Hi all. I’m in the process of developing a KMDF PCIe driver and am having some trouble getting device-to-host/host-to-device DMA to work properly (it works perfectly fine when the device communicates with itself). As far as I can tell everything is being set up properly, but the data is not being transferred from src to dst.

In the driver initialization I’m using WdfCommonBufferCreate and writing data to the virtual address obtained from WdfCommonBufferGetAlignedVirtualAddress.
I then have an application that initiates the DMA: I initialize the dst buffer, write the src address, dst address, and transfer size to the AXI CDMA registers, and then wait for the signal that the transfer has completed. The status and control registers indicate that things should have worked; however, when I read from dst, nothing new has been written.
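
For concreteness, the register sequence looks roughly like the sketch below in simple mode. The offsets are the ones in Xilinx’s AXI CDMA product guide (PG034), but the register mapping, the helper functions, and the Idle-bit poll are illustrative assumptions rather than my exact code:

    /* Sketch of a simple-mode AXI CDMA transfer (offsets per PG034). */
    #include <stdint.h>

    #define CDMA_SR      0x04u  /* status register; bit 1 = Idle                  */
    #define CDMA_SA      0x18u  /* source address, low 32 bits                    */
    #define CDMA_SA_MSB  0x1Cu  /* source address, high 32 bits                   */
    #define CDMA_DA      0x20u  /* destination address, low 32 bits               */
    #define CDMA_DA_MSB  0x24u  /* destination address, high 32 bits              */
    #define CDMA_BTT     0x28u  /* bytes to transfer; writing this starts the DMA */

    static void reg_write(volatile uint8_t *base, uint32_t off, uint32_t val)
    {
        *(volatile uint32_t *)(base + off) = val;
    }

    static uint32_t reg_read(volatile uint8_t *base, uint32_t off)
    {
        return *(volatile uint32_t *)(base + off);
    }

    /* src: device logical address of the host buffer; dst: FPGA-side address */
    void cdma_simple_transfer(volatile uint8_t *cdma,
                              uint64_t src, uint64_t dst, uint32_t bytes)
    {
        reg_write(cdma, CDMA_SA,     (uint32_t)(src & 0xFFFFFFFF));
        reg_write(cdma, CDMA_SA_MSB, (uint32_t)(src >> 32));
        reg_write(cdma, CDMA_DA,     (uint32_t)(dst & 0xFFFFFFFF));
        reg_write(cdma, CDMA_DA_MSB, (uint32_t)(dst >> 32));
        reg_write(cdma, CDMA_BTT,    bytes);              /* kicks off the transfer */

        while ((reg_read(cdma, CDMA_SR) & 0x2u) == 0) {   /* wait for Idle */
            /* spin (or sleep) until the transfer completes */
        }
    }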

This is my first post and also my first experience with driver development (I’m a Co-op), so I’d greatly appreciate as much detail as possible in answers, and I apologize if I’ve not included enough information in this initial post.

Thanks in advance.

Welcome to the community, Mr. @DavidN – We’ll try to be gentle :wink:

Before we dive down into any more detail, can you confirm for me, please, that the Common Buffer address you’re giving to the hardware as the target of your DMA operations is the address you get from WdfCommonBufferGetAlignedLogicalAddress?

Are you using Simple DMA Mode or Scatter/Gather Mode for your transfers?

Also, please verify that you have WinDbg set up… and that you’re actually looking at the buffer in question via WinDbg. I assume that, setting up for a DMA write operation to memory (data coming FROM the device TO host memory), you initialize the buffer to something (just as a test, for example, you set it to all 0xFF), set up and do the DMA write to memory, then look at the memory buffer and see… that it hasn’t changed?

Peter

A common DMA gotcha I’ve seen, especially on ARM platforms, is the need for a memory barrier between the write to the buffer and the write that triggers the DMA controller. This bit me BIG TIME on a PCIe driver in the past.

Most modern processors have “out of order” execution capabilities. Without the memory barrier, the write to memory can sometimes still be incomplete by the time the DMA operation is triggered.
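
In code, the pattern is simply the following (a sketch; the regs layout and doorbell register are hypothetical placeholders, and KeMemoryBarrier() is the kernel’s full barrier):

    buffer[i] = data_to_send;                  /* fill the DMA buffer           */
    KeMemoryBarrier();                         /* complete the buffer writes... */
    WRITE_REGISTER_ULONG(&regs->doorbell, 1);  /* ...before triggering the DMA  */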

Greg

Great point. The docs are wildly remiss in not mentioning this for KMDF and Common Buffers. KeFlushIoBuffers is your friend.

If you’re using the packet-based interface (DMA Transactions), the flushing should be taken care of for you. Is that not your experience, Mr. @“Gregory_G._Dyess”?
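
Since KMDF doesn’t hand you an MDL for a Common Buffer, the usual recipe looks something like the sketch below (va and length come from the WdfCommonBufferGetXxx calls; error handling elided):

    PMDL mdl = IoAllocateMdl(va, length, FALSE, FALSE, NULL);
    if (mdl != NULL) {
        MmBuildMdlForNonPagedPool(mdl);      /* Common Buffer memory is nonpaged           */
        KeFlushIoBuffers(mdl, TRUE, TRUE);   /* ReadOperation == TRUE for device-to-memory */
    }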

I’ve yet to do any busmaster DMA on ARM. I’m looking forward to the chance, though.

Peter

I have to admit I’ve not done a lot of driver development for desktop Windows in a LONG time.  My entire Windows kernel development now is limited to Windows Embedded and Linux on ARM processors.  Windows Embedded does things completely differently when it comes to drivers (except for NDIS).

Greg

I am running into a similar problem. With or without VT-d, I start DMA on my device (apparently it is a Synopsys DMA engine). The DMA controller does increment the source and target addresses by the transfer size, but the data does not show up (I am doing device-to-memory). I allocated an MDL for the common buffer (it is 2 pages long) and built the physical pages using MmBuildMdlForNonPagedPool during driver bringup.

Before the DMA starts, I am calling KeFlushIoBuffers(MDL, TRUE, TRUE), and still the data does not show up in my common buffer.

I am wondering if I have to transition to a packet DMA model, where I think the flushing of caches is done by the framework when I complete the transaction. Unfortunately, my DMA is not interrupt driven, only polled, but I think that should be OK.
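
For reference, my understanding of that packet-based path is roughly the skeleton below (error handling elided; whether it plays well with polling is exactly what I am unsure about):

    WDFDMATRANSACTION txn;

    status = WdfDmaTransactionCreate(device_ctxt->DmaEnablerHandle,
                                     WDF_NO_OBJECT_ATTRIBUTES, &txn);

    status = WdfDmaTransactionInitializeUsingRequest(
                 txn, Request, EvtProgramDma,     /* EvtProgramDma programs the device */
                 WdfDmaDirectionReadFromDevice);  /* device-to-memory transfer         */

    status = WdfDmaTransactionExecute(txn, NULL);

    /* later, once the device reports completion (interrupt or poll): */
    if (WdfDmaTransactionDmaCompleted(txn, &status)) {
        WdfObjectDelete(txn);  /* the framework flushes its buffers at completion */
    }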

@Ramakrishna_Saripalli …. Why are you building that MDL?

You don’t need to change to packet mode unless that’s what your user interface needs/wants.

Let’s start by you also answering the same questions I posed to the OP.

Peter

@“Peter_Viscarola_(OSR)” the WDK says that in order to use KeFlushIoBuffers I need an MDL. Regarding your questions: yes, I am getting the logical address and writing the lower 32 and upper 32 bits into the DMA controller registers. In fact, after I start the DMA I can see the dest addr and src addr getting incremented by the transfer size. But the data does not show up.

Thought I would share my source code. This is one of the Synopsys DMA controllers (I do not have the specs for it); instead I have the Linux driver code, so I am trying to make the Windows driver work from that. One of these days I am going to boot Ubuntu 18.04, build the driver, and see if it works. But the Linux driver code was given to me, so I am hoping it is the reference model.

The following is done during initialization. I have removed the error-checking code (the driver does have it).
    WdfDeviceSetAlignmentRequirement(
        device_ctxt->device,
        FILE_BYTE_ALIGNMENT);

    WDF_DMA_ENABLER_CONFIG_INIT(
        &dmaConfig,
        WdfDmaProfilePacket64,  /* device can address all 64 bits but has no scatter/gather */
        8192);

    status = WdfDmaEnablerCreate(
        device_ctxt->device,
        &dmaConfig,
        WDF_NO_OBJECT_ATTRIBUTES,
        &device_ctxt->DmaEnablerHandle);

    status = WdfCommonBufferCreate(
        device_ctxt->DmaEnablerHandle,
        device_ctxt->common_dma_buffer_size,  /* equal to 8192 */
        WDF_NO_OBJECT_ATTRIBUTES,
        &device_ctxt->h_common_dma_buffer);

    device_ctxt->common_dma_buffer_kernel_va =
        WdfCommonBufferGetAlignedVirtualAddress(device_ctxt->h_common_dma_buffer);
    device_ctxt->common_dma_buffer_device_la =
        WdfCommonBufferGetAlignedLogicalAddress(device_ctxt->h_common_dma_buffer);

    RtlZeroMemory(....);  /* zero out the common buffer */

    device_ctxt->common_buffer_mdl = IoAllocateMdl(
        device_ctxt->common_dma_buffer_kernel_va, 8192, FALSE, FALSE, NULL);
    MmBuildMdlForNonPagedPool(device_ctxt->common_buffer_mdl);  /* nonpaged common buffer */

Before the DMA starts, the driver writes a pattern of 0xdeadbeef to the DMA buffer using another ioctl. I have verified this works.

When I get the ioctl to start the DMA (from device memory to system memory), the driver runs the logic below. The ioctl is METHOD_BUFFERED and provides the size of the DMA and other parameters.

    KeFlushIoBuffers(device_ctxt->common_buffer_mdl, TRUE, TRUE);

    /* I even threw in a __wbinvd() here to flush out the whole cache
       hierarchy; it did not make a difference. */

    regs->dest_addr_low    = inputBuffer->internal_addr & 0xFFFFFFFF;
    regs->dest_addr_high   = inputBuffer->internal_addr >> 32;
    regs->source_addr_low  = device_ctxt->common_dma_buffer_device_la.LowPart;
    regs->source_addr_high = device_ctxt->common_dma_buffer_device_la.HighPart;

    regs->transfer_size = inputBuffer->num_dma_bytes;
    /* another write to another register in the DMA controller for a control operation */
    MemoryBarrier();
    /* write to the doorbell register to start the DMA */

After the above operation, I can see dest_addr_low and source_addr_low being incremented by num_dma_bytes, and the transfer_size register goes to zero.

But the data is not showing up in the common DMA buffer. I have a DbgPrintEx after the DMA that prints the first DWORD of the common buffer, and it still shows 0xdeadbeef.

This almost sounds like a stale cache on the destination side of the DMA transfer. 

Try cleaning the cache on the destination buffer before starting the DMA transfer. 
Then do a memory barrier after the DMA completes.

Greg

@“Gregory_G._Dyess” I agree, but the KeFlushIoBuffers() call should do that, and it is being done right before the DMA operation.
This page, https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/flushing-cached-data-during-dma-operations, describes KeFlushIoBuffers.

I also have a MemoryBarrier() just before writing to the doorbell to start the DMA so I know that all the operations before that have been completed before the DMA operation starts.

But you are recommending a MemoryBarrier() after the DMA operation completes. Is the KeFlushIoBuffers() not good enough? I do not have any read or write operations on the DMA buffer between the KeFlushIoBuffers and the DMA operation (which might cause the cachelines to be refilled).

I am suggesting the following sequence of operations (sketched in code after the list):

  1. Write the data to the source buffer
  2. Flush the buffers and caches of the source buffer
  3. Memory barrier
  4. Invalidate the cache of the destination buffer
  5. Perform the DMA
  6. Memory barrier
  7. Enjoy your data
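
On Windows, and hedged by the fact that I no longer do desktop Windows work, that sequence would look something like the sketch below (dma_mdl describes the buffer; FillSourceBuffer and DmaIsComplete are hypothetical stand-ins for your own buffer fill and completion poll):

    FillSourceBuffer(buffer, length);           /* 1: write the data                      */
    KeFlushIoBuffers(dma_mdl, TRUE, TRUE);      /* 2 & 4: flush/invalidate the CPU caches */
    KeMemoryBarrier();                          /* 3: barrier before the kick             */
    WRITE_REGISTER_ULONG(&regs->doorbell, 1);   /* 5: start the DMA                       */
    while (!DmaIsComplete(regs)) {              /* poll the device for completion         */
        ;
    }
    KeMemoryBarrier();                          /* 6: barrier before touching the data    */
    /* 7: enjoy your data */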

Keep in mind, I am primarily a kernel developer on Arm architectures (I even worked for Arm and taught software architecture classes while there). I saw an issue with a PCIe driver that was very similar to this one on a Xilinx-based RFSoC (4x Cortex-A53), and it turned out to be out-of-order execution and caches. The sequence I described above is recommended to ensure there are no stale caches, no data sitting in the processor’s hardware write buffers, and no out-of-order execution issues.

Intel might be different.

Greg
 

Thank you for the warm welcome,

To answer your questions, Peter:
Yes, I am giving the logical address from WdfCommonBufferGetAlignedLogicalAddress to the hardware.
The configuration mode for the dma_enabler is WdfDmaProfileScatterGather64Duplex.
I do have WinDbg set up, and checking the memory shows the same data I initialized it to, both before and after the DMA.

I’ll give it a shot using KeFlushIoBuffers plus a memory barrier and let you all know if anything interesting happens.

Thanks for the many responses so far.
-David

@DavidN I would love to see if you have any better luck with this than I have. Do you have an IOMMU enabled on your setup (assuming it supports one)? FYI, my results are the same whether VT-d is enabled or disabled.

Hmmm… I’m not sayin’ Mr @“Gregory_G._Dyess” isn’t correct… but,

  1. When I hear hoof beats, I tend to think horses not zebras

  2. I have never seen a case where a properly constructed Windows driver needs to manually add a memory barrier to the code. This is all supposed to be handled by the Windows abstractions. Note that we don’t code memory barriers around register reads and writes (though, in theory at least, you might have to call KeFlushIoBuffers).
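
That’s because the READ_REGISTER_xxx/WRITE_REGISTER_xxx accessors are defined to include whatever fencing the platform needs. A doorbell sequence like the following (a sketch, with hypothetical register names) should already be properly ordered, with no explicit barrier required:

    WRITE_REGISTER_ULONG(&regs->transfer_size, num_dma_bytes);
    WRITE_REGISTER_ULONG(&regs->doorbell, 1);   /* accessor orders this after the size write */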

Which memory barrier, or fence, specifically are you recommending these guys add, Mr @“Gregory_G._Dyess”?

Peter

@“Peter_Viscarola_(OSR)” that is what I thought too. Given that PCIe memory register reads and writes are mapped to uncached regions, I did not think memory barriers were needed. The processor (at least x64) does not reorder around uncached regions. At least that is my understanding.

I can see why the barriers are needed if we access cached regions (normal memory).

Thanks,
RK

If I were on a Zebra Farm instead of a cattle ranch…

If it were an Arm core, it would be a memory write barrier (DMB) after writing the data and before triggering the DMA, and a memory read barrier (DMB) after the DMA completes. A Data Synchronization Barrier would be too heavy-handed and, if used too often, would kill system performance.

Again, my experience is with Arm A-Class cores, not Intel.  Principles are similar but execution might be different. 

As I said, I’ve not written a desktop Windows driver in 20 years. I write mostly Windows Embedded Compact (and Linux) kernel drivers now. I’m not claiming the barriers are a fix-all; it’s just that the behavior being described sounded a lot like a stale cache and/or out-of-order execution.
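
In compiler terms, on MSVC for ARM64 that would be something like the following (the __dmb intrinsic and barrier types are from arm64intr.h; the ST/LD flavors shown are my assumption about what would be sufficient here):

    #include <arm64intr.h>

    buffer[0] = data;           /* write the source data                               */
    __dmb(_ARM64_BARRIER_ST);   /* DMB ST: complete the stores before the kick         */
    regs->doorbell = 1;         /* trigger the DMA                                     */

    /* ... wait for the DMA to complete ... */

    __dmb(_ARM64_BARRIER_LD);   /* DMB LD: order the status read before the data reads */
    result = buffer[0];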

Greg

Barriers are not for synchronizing caches. That would be the cache-maintenance instructions, which, as Mr Viscarola pointed out, should be handled by the Windows-provided driver framework(s). The barriers simply keep the processor from executing certain sequences of code out of order.

Greg

Unfortunately, no luck yet for me. IOMMU enabled or disabled didn’t influence anything on my end either, nor did KeFlushIoBuffers. My initialization code looks very similar to what you posted, RK.
Though it’s possible I’m doing something wrong, I appear to be following the documentation on the use of KeFlushIoBuffers; just in case, though, I tried moving it around to different points in the code, with different configurations, to no avail. To be honest I’m at a bit of a loss for what’s even left to try, but I’ll keep at it and let you know of any breakthroughs.

Some extra info, if it helps at all: the destination addr is a block RAM on the FPGA, so unless I’m mistaken it shouldn’t be subject to cache problems.

Thanks,
David

@Ramakrishna_Saripalli said:
@DavidN I would love to see if you have any better luck with this than I have. Do you have an IOMMU enabled on your setup (assuming it supports one)? FYI, my results are the same whether VT-d is enabled or disabled.

@DavidN my design is using the Synopsys DesignWare Cores PCI Express controller. I have not had much luck either, although I have a feeling I might be programming the DMA controller incorrectly. You are right: unless the design has a cache in front of the onboard RAM, it should not be subject to cache problems.