Setting up DMA - User Buffer, Handing I/O

Hello Tim:

Thanks for the response. I won’t go into the reason for this approach, but I am looking into changing things around.

Forgive my lack of Driver knowledge: I did look at both the Transaction and DMA_ADAPTER methods before I posted, but none of them mentioned being able to obtain the Logical Address - nothing as straightforward as “WdfCommonBufferGetAlignedLogicalAddress”.

I do see methods related to getting SG lists - is that the approach I should be using? Get the S/G List based on the Direct IO out_buffer MDL, then iterate through the SG List to get the Logical Addresses. I would need to know the corresponding virtual address as SW will use the virtual address to access the data for processing.

Thank you, Tim.
Juan

Yes, S/G lists have logical addresses. MmGetSystemAddressForMdlSafe gets you the virtual address.

If the usermode app just calls VirtualAlloc, even with large pages, you’ll need 1024 pages to cobble together a 2GB buffer. These pages will not be contiguous, so you’ll need to juggle 1024 different logical addresses. That can be inconvenient and possibly slow things down.
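For reference, the page-count arithmetic behind that figure can be checked with a small user-mode sketch. The helper name `PagesNeeded` and the size constants are illustrative only and are not part of any Windows API.

```cpp
#include <cassert>
#include <cstdint>

// Size constants used in the discussion (illustrative names).
constexpr std::uint64_t kKiB = 1024;
constexpr std::uint64_t kMiB = 1024 * kKiB;
constexpr std::uint64_t kGiB = 1024 * kMiB;

// Number of pages of size `pageBytes` needed to back `bufferBytes`,
// rounding up so a partial page still counts as a whole page.
inline std::uint64_t PagesNeeded(std::uint64_t bufferBytes, std::uint64_t pageBytes)
{
    assert(pageBytes != 0);
    return (bufferBytes + pageBytes - 1) / pageBytes;
}
```

With 2 MB large pages, `PagesNeeded(2 * kGiB, 2 * kMiB)` gives 1024; with ordinary 4 KB pages the same buffer would need 524288 pages.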

Getting a contiguous buffer makes this way easier, and possibly faster. The downside is that 2GB of contiguous pages is pretty unlikely to just happen to be available, except early at boot. You could have your driver reserve 2GB of contiguous memory at boot, then use it once the device & usermode app are ready. This approach might be better, if your end-users are okay with permanently dedicating 2GB of RAM to your hardware, and with possibly needing to reboot the first time the hardware is installed. (I worked on a project a few years ago that did exactly this, and the customer preferred paying for extra RAM up-front to get deterministic and low execution time, versus worrying about the variable performance of allocating that RAM in small pieces dynamically later.)

Note that regardless of what approach you use, you must use a DMA API at some point. For contiguous memory, you’d ideally get it from AllocateCommonBuffer[Ex], although if the device doesn’t exist yet, you can use MmAllocateContiguousMemory and then later call BuildScatterGatherList[Ex] on it when the device is ready. If you’re going with VirtualAlloc’d memory, use BuildScatterGatherList[Ex] on it. You can’t do anything useful with MmGetPhysicalAddress or the PFN array in the MDL (unless ADAPTER_INFO_API_BYPASS explicitly says you can).

There’s no reverse mapping from a SG list back to a virtual address, because the HAL reserves the right to bounce the buffer. That is, it might have silently allocated a whole new buffer and copied the data there. At that point, what’s the virtual address? The original buffer (which you already know about) or the bounce buffer (which you shouldn’t need to touch)?

Thank you, Tim:

My plan is:

  1. During EvtDeviceAdd, create the DMA Enabler.
  2. Add a method, “ConfigureFpgaForDma”.
  3. Call ConfigureFpgaForDma via an EvtIoInCallerContext
  4. SW issues the IOCTL, providing info on the In_buffer and the VirtualAlloc buffer in the out_buffer.
  5. Inside the ConfigureFpgaForDma, call WdfRequestRetrieveOutputWdmMdl to get the MDL.
  6. Then call WdfDmaEnablerWdmGetDmaAdapter to get the DMA_ADAPTER (using WdfDmaDirectionReadFromDevice since the FPGA will write into the host buffer).
  7. Call BuildScatterGatherList.
  • The method requires an “Execution Routine”, which I don’t really need. So this will be an empty routine that enters and exits.
  • I also need to manually allocate a block of memory for the SG List itself.
  8. Now I have the SG List, so I will loop through the SG list and write each SG logical address to the appropriate FPGA addresses.

I have a question about the following:

The driver must call PutScatterGatherList (which flushes the list) before it can access the data in the list.
Does the Driver need to do this before User SW can access the data via the original pointer?

Regards,
Juan

Does the driver need to do this before User SW can access the data via the original pointer?

No, PutScatterGatherList says “I’m done with this list, you can free your resources”. However, you should call FlushAdapterBuffers when you know a transfer has completed. The need for that call depends on your CPU architecture; on x64 architectures, it often does nothing, but you’re still supposed to call it.

@“Jeffrey_Tippet_[MSFT]” said:

Thank you for the response!

If the usermode app just calls VirtualAlloc, even with large pages, you’ll need 1024 pages to cobble together a 2GB buffer. These pages will not be contiguous, so you’ll need to juggle 1024 different logical addresses.

The FPGA has a LUT that is programmed (by SW ) with all the logical addresses. The FPGA is legacy, so I don’t have the flexibility to change how it works. It is expecting that each LUT entry points to a single 64KB (contiguous) memory. The reason for using Large Pages is that I can then split the 2MB Page ( using the Logical Address and some math) into 32 (logical addresses).
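The splitting described above can be sketched in user mode as follows. This is only a model: `SgElement` is a hypothetical stand-in for the kernel's scatter/gather element (a logical address plus a length), `BuildLutEntries` is an illustrative name, and the check that every element is a whole multiple of 64 KB is an assumption about what the FPGA needs, not something the DMA API guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kLutEntryBytes  = 64 * 1024;        // one LUT entry covers 64 KB
constexpr std::uint64_t kLargePageBytes = 2 * 1024 * 1024;  // one large page is 2 MB

// Hypothetical user-mode stand-in for one scatter/gather element.
struct SgElement {
    std::uint64_t logicalAddress;  // device-visible (logical) address of the run
    std::uint64_t length;          // length of the contiguous run in bytes
};

// Splits every contiguous run into 64 KB pieces and returns the logical
// address of each piece, in buffer order. A 2 MB run yields 32 entries.
std::vector<std::uint64_t> BuildLutEntries(const std::vector<SgElement>& sgList)
{
    std::vector<std::uint64_t> lut;
    for (const SgElement& e : sgList) {
        // Assumption: each run must divide evenly into 64 KB LUT entries.
        assert(e.length % kLutEntryBytes == 0);
        for (std::uint64_t off = 0; off < e.length; off += kLutEntryBytes) {
            lut.push_back(e.logicalAddress + off);
        }
    }
    return lut;
}
```

In a real driver the loop body would write each address to the FPGA's LUT registers instead of appending to a vector.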

Getting a contiguous buffer makes this way easier, and possibly faster.

Indeed! And I have been successful, with a running system, allocating 2GB during boot via the kernel driver and a CommonBuffer. However, since I am exposing that kernel space memory to user SW, I was told that this is an unsafe practice; if one must go this way, one should take extra care to address the various side-effects and holes. The better approach, aside from using standard DMA practices, is to have user SW allocate the buffer and provide it to the driver.

If you’re going with VirtualAlloc’d memory, use BuildScatterGatherList[Ex] on it.
I will be attempting this approach, given the advice I’ve gotten so far, for my specific application.

There’s no reverse mapping from a SG list back to a virtual address, because HAL reserves the rights to bounce the buffer.
This may make things a little more tricky for me. My FPGA will be sending (DMA’ing) blocks of data. SW needs to know where that block of data resides within the 2GB buffer. Finding the virtual address is easy as it’s contiguous. The tricky part is mapping the virtual address to its corresponding logical address.

  • The SG logical addresses are not guaranteed to be contiguous, unlike the CommonBuffer approach, where they are contiguous.
  • The good thing is that I only need to do the “mapping” once (at startup)…
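One way to picture that one-time mapping is a table built from the S/G list at startup, then searched in either direction. The sketch below is a user-mode model under two assumptions: the elements describe the buffer in order from its first byte, and no bounce buffering has replaced the original pages. `SgElement`, `LogicalAddressForOffset`, and `OffsetForLogicalAddress` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical user-mode stand-in for one scatter/gather element.
struct SgElement {
    std::uint64_t logicalAddress;  // device-visible (logical) address of the run
    std::uint64_t length;          // length of the contiguous run in bytes
};

// Translates a byte offset from the start of the user buffer into the
// logical address the device would use. Returns false if the offset is
// past the end of the buffer described by the list.
bool LogicalAddressForOffset(const std::vector<SgElement>& sgList,
                             std::uint64_t offset,
                             std::uint64_t& logicalAddress)
{
    for (const SgElement& e : sgList) {
        if (offset < e.length) {
            logicalAddress = e.logicalAddress + offset;
            return true;
        }
        offset -= e.length;  // skip this run and keep looking
    }
    return false;
}

// Reverse direction: given a logical address reported by the device,
// find the byte offset inside the user buffer. Returns false if the
// address does not fall inside any run.
bool OffsetForLogicalAddress(const std::vector<SgElement>& sgList,
                             std::uint64_t logicalAddress,
                             std::uint64_t& offset)
{
    std::uint64_t base = 0;  // buffer offset at which the current run starts
    for (const SgElement& e : sgList) {
        if (logicalAddress >= e.logicalAddress &&
            logicalAddress - e.logicalAddress < e.length) {
            offset = base + (logicalAddress - e.logicalAddress);
            return true;
        }
        base += e.length;
    }
    return false;
}
```

User SW would then add the returned offset to the base pointer it got from VirtualAlloc to reach the data.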

Thanks again, Jeffrey.

Regards,
Juan

So it’s taken some time to get to this point, and I am stuck at the moment. :)

My call to CalculateScatterGatherList is returning “STATUS_INVALID_PARAMETER” and I can’t figure out which one it is and why it’s invalid.

Code snippet (checks and declarations omitted ).
I did trace all the parameters that I am passing into the call, and they seem OK, as best I understand the documentation.
Instead of “StartVa” I tried using “MappedSystemVa” but had the same error.

START

// Get the output buffer passed in by the IOCTL.
WdfRequestRetrieveOutputMemory( request, &outputBufferObj );
        
// Get the MemoryBuffer as it's easier to get the MDL.
PVOID outBuff_P = WdfMemoryGetBuffer( outputBufferObj, &outBuffByteSize );

// Get the MDL
status = WdfRequestRetrieveOutputWdmMdl( request, &bufferMdlP );

// Get the DMA Adapter.
PDMA_ADAPTER dmaAdapter = WdfDmaEnablerWdmGetDmaAdapter( myDmaEnabler, WdfDmaDirectionReadFromDevice );

// Get the SG List size.
CalculateScatterGatherList( dmaAdapter,
    bufferMdlP,
    bufferMdlP->StartVa,
    bufferMdlP->ByteCount,
    &sgByteSize,
    &numMapRegisters );

// Get a memory buffer for the SG List
WdfMemoryCreate( &buffAttributes,
    PagedPool,
    0,
    static_cast<size_t>( sgByteSize ),
    &scatterGatherMemoryObj,
    &scatterGatherListBuffP );

// Build the SG List
dmaAdapter->DmaOperations->BuildScatterGatherList( dmaAdapter,
    deviceObjectP,
    bufferMdlP,
    bufferMdlP->StartVa,
    static_cast<ULONG>( outBuffByteSize ),
    &(MyAdapterListControl),
    deviceContextP,
    FALSE,
    scatterGatherListBuffP,
    sgByteSize );

END

Thank you for any help!

Regards,
Juan

Presumably you meant dmaAdapter->DmaOperations->CalculateScatterGatherListSize instead of CalculateScatterGatherList.

Technically, you should be using MmGetSystemAddressForMdlSafe(bufferMdlP). It will fetch StartVa if it’s already set up, and will map the pages into memory if not already set up. Did you dump StartVa in a debugger to see if it made sense?

Hello Tim:

Thank you for the response.

Presumably you meant dmaAdapter->DmaOperations->CalculateScatterGatherListSize instead of CalculateScatterGatherList.
Actually, the call CalculateScatterGatherList maps to PCALCULATE_SCATTER_GATHER_LIST_SIZE in wdm.h’s _DMA_OPERATIONS struct.

Technically, you should be using MmGetSystemAddressForMdlSafe(bufferMdlP).
Thank you for that, I ended up using MmMapLockedPagesSpecifyCache since I don’t want the memory to be cached

Unfortunately I still get the same “STATUS_INVALID_PARAMETER” return code from “…->CalculateScatterGatherList”

On the SW side, it is allocating a string buffer and passing it in via the outbuffer of the IO Control call (overlapped, generic_read).
The IOCTL is OUT_DIRECT.
The DMA Enabler: flag: WDF_DMA_ENABLER_CONFIG_REQUIRE_SINGLE_TRANSFER, Profile: WdfDmaProfilePacket64, Len: 128

I will dump the KVA and see what I find.

(Note: Abridged; Some declarations and checks omitted. EvtIoInCallerContext does check the IOControlCode and calls a helper to perform the SGList building shown below.)

void EvtIoInCallerContext( IN WDFDEVICE  device, IN WDFREQUEST  request )
  // Get the outbuffer from the IOCTL request
  WDFMEMORY outputBufferObj{ nullptr };
  WdfRequestRetrieveOutputMemory( request, &outputBufferObj );

  // Get the outbuffer size. Note: outBuffP not actually needed, we just want the size.
  PVOID outBuffP = WdfMemoryGetBuffer( outputBufferObj, &outBuffByteSize );

  // Get the MDL from the outbuffer
  WdfRequestRetrieveOutputWdmMdl( request, &deviceContextP->hostMemoryMdlP );

  // Get the DMA adapter.
  PDMA_ADAPTER dmaAdapter = WdfDmaEnablerWdmGetDmaAdapter(
                                                      myDmaEnabler,
                                                      WdfDmaDirectionReadFromDevice );

  // Get the Kernel Virtual Address for the MDL
  PVOID hostMemoryKvaP = MmMapLockedPagesSpecifyCache(
                                                      hostMemoryMdlP, KernelMode, 
                                                      MmNonCached, NULL,
                                                      FALSE, HighPagePriority );

  // Get the SG List size
  dmaAdapter->DmaOperations->CalculateScatterGatherList( dmaAdapter,
                                                                    hostMemoryMdlP, hostMemoryKvaP,
                                                                    hostMemoryMdlP->ByteCount,
                                                                    &sgByteSize, &numMapRegisters );
  // Allocate memory for the SG List
  PVOID scatterGatherListBuffP{ nullptr };
  WdfMemoryCreate( &buffAttributes, PagedPool, 0,
                                       static_cast<size_t>( sgByteSize ),
                                       &scatterGatherMemoryObj, &scatterGatherListBuffP );

  // Get the Device Object
  PDEVICE_OBJECT deviceObjectP = WdfDeviceWdmGetDeviceObject( device );

  // Get the SG List
  dmaAdapter->DmaOperations->BuildScatterGatherList( dmaAdapter, deviceObjectP,
                                                                hostMemoryMdlP, hostMemoryKvaP, 
                                                                static_cast<ULONG>( outBuffByteSize ),
                                                                &(MyAdapterListControl), deviceContextP, FALSE,
                                                                scatterGatherListBuffP, sgByteSize );

Thank you, again!
Juan

Thank you for that, I ended up using MmMapLockedPagesSpecifyCache since I don’t want the memory to be cached

You don’t get to decide that. The memory has already been allocated, so its cache attribute has already been established. You are required to use the same attribute. Why do you think you want it to be uncached? Because your reason is almost certainly wrong.

WDF_DMA_ENABLER_CONFIG_REQUIRE_SINGLE_TRANSFER requires that you set WdmDmaVersionOverride to 3, but once again, you almost certainly don’t need that at all.

@Tim_Roberts said:
The memory has already been allocated, so its cache attribute has already been established. You are required to use the same attribute.

  • Understood.

Why do you think you want it to be uncached? Because your reason is almost certainly wrong.

  • Since the FPGA will be streaming data continuously to the host memory, I did not want the OS to deal with caching.
  • That memory needs to stay locked down as it always needs to be available for the FPGA to DMA to and for user SW to access for processing.
  • Given your comment, I decided to remove the “NoCache” from VirtualAlloc and am now using MmGetSystemAddressForMdlSafe.

WDF_DMA_ENABLER_CONFIG_REQUIRE_SINGLE_TRANSFER requires that you set WdmDmaVersionOverride to 3, but once again, you almost certainly don’t need that at all.
My thinking here is that I did not want the FPGA DMA transaction to be split into multiple transactions. But I suppose, since the FPGA is doing the DMA, that would be up to the FPGA??

Current State

  • I am still getting the invalid parameter error.
  • Just to try something different, instead of passing in the System Address from MmGetSystemAddressForMdlSafe to CalculateScatterGatherList, I am passing in the StartVa from the MDL - this address is the same address as used by user sw.
  • With this change, I get an “Insufficient resources” error - Though I am only allocating 4MB of Large Page memory.

Question

  • What parameters should I be passing into CalculateScatterGatherList?

CalculateScatterGatherList( dmaAdapter, // Pointer to DMA Adapter I get from WdfDmaEnablerWdmGetDmaAdapter
                        hostMemoryMdlP, // Pointer to MDL I get from WdfRequestRetrieveOutputWdmMdl
                        hostMemoryKvaP, // Kernel Virtual Address I get from MmGetSystemAddressForMdlSafe
                        hostMemoryMdlP->ByteCount, // Size of Host Memory Buffer ( Size used when user sw called VirtualAlloc)
                        &sgByteSize, // Out param, Byte size of ScatterGather List. What is needed when calling WdfMemoryCreate
                        &numMapRegisters //Out param, Will not be used.
                        );

DmaEnabler

I create the DmaEnabler as follows:

    WDF_DMA_ENABLER_CONFIG dmaEnablerConfig;
    WDF_DMA_ENABLER_CONFIG_INIT( &dmaEnablerConfig, WdfDmaProfilePacket64, 128 );

    dmaEnablerConfig.EvtDmaEnablerFill = NULL;
    dmaEnablerConfig.EvtDmaEnablerFlush = NULL;
    dmaEnablerConfig.EvtDmaEnablerDisable = NULL;
    dmaEnablerConfig.EvtDmaEnablerEnable = NULL;
    dmaEnablerConfig.EvtDmaEnablerSelfManagedIoStart = NULL;
    dmaEnablerConfig.EvtDmaEnablerSelfManagedIoStop = NULL;
    dmaEnablerConfig.AddressWidthOverride = 0;
    dmaEnablerConfig.WdmDmaVersionOverride = 3;
    dmaEnablerConfig.Flags = WDF_DMA_ENABLER_CONFIG_REQUIRE_SINGLE_TRANSFER;

    NTSTATUS status{ WdfDmaEnablerCreate( wdfDevice, &dmaEnablerConfig, WDF_NO_OBJECT_ATTRIBUTES, &( myDmaEnabler ) ) };
    if( !NT_SUCCESS( status ) )
    {
        TraceEvents( TRACE_LEVEL_ERROR, DBG_INIT, "WdfDmaEnablerCreate() failed with status=[%!STATUS!]", status );
    }

Quick Note

As a quick test: I modified user SW to write 0xDEADBEEF0BADC0DE to the VirtualAlloc memory location. The driver, using outBuffP (obtained from WdfMemoryGetBuffer - see above code snippet) traced the correct value. The driver then modified the value by writing all 0xFs. User SW saw this change.

@Tim_Roberts:

In addition to my previous question (last post) I have another for you:

Technically, you should be using MmGetSystemAddressForMdlSafe(bufferMdlP).

  • How do I unlock the pages? Since I got the MDL using WdfRequestRetrieveOutputWdmMdl, I do not know which method to use for unlocking the pages.
  • I get a BSOD, “PROCESS_HAS_LOCKED_PAGES” when shutting down, even though the request was completed due to the failed CalculateScatterGatherList call.

Thank you…

Since the FPGA will be streaming data continuously to the host memory, I did not want the OS to deal with caching.

Caching is not an OS thing at all. It is a hardware thing. x86 hardware handles it for you.

WDF_DMA_ENABLER_CONFIG_INIT( &dmaEnablerConfig, WdfDmaProfilePacket64, 128 );

You misunderstand the “maximum transfer length” parameter. What you’re giving here is the TLP size from PCIe config space. The operating system doesn’t give a whack about that. You need to pass the maximum buffer size you can do in a single DMA transfer. With most devices that’s at LEAST a page, and is often many megabytes.

Don’t zero out the callback fields in the dmaEnablerConfig structure. That’s what the _INIT is there for. You just need to override the things that aren’t the default.

The “…REQUIRE_SINGLE_TRANSFER” thing is only needed if the system has to allocate bounce buffers for you. For example, if your device cannot handle 64-bit addressing, then when a DMA request comes in beyond the 4GB mark, the system will allocate buffers below the 4GB mark, and copy the data to and fro. Those buffers are small (like 64kB), so if you get a request for a megabyte, the system will do that as 16 requests of 64kB each.

But that’s all handled transparently on your behalf. I STRONGLY suggest you eliminate the REQUIRE_SINGLE_TRANSFER flag and leave the WdmDmaVersionOverride set to its default value.
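To make the bounce-buffer arithmetic concrete, here is a small user-mode sketch of how one large request breaks into fixed-size pieces. The 64 KB limit is just the example figure from the post, and `Chunk` and `SplitTransfer` are illustrative names, not system APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One piece of a larger transfer (illustrative type).
struct Chunk {
    std::uint64_t offset;  // byte offset of this piece within the request
    std::uint64_t length;  // number of bytes in this piece
};

// Splits a request of `totalBytes` into pieces no larger than
// `maxChunkBytes`. The last piece may be shorter than the rest.
std::vector<Chunk> SplitTransfer(std::uint64_t totalBytes, std::uint64_t maxChunkBytes)
{
    assert(maxChunkBytes != 0);
    std::vector<Chunk> chunks;
    for (std::uint64_t off = 0; off < totalBytes; off += maxChunkBytes) {
        const std::uint64_t remaining = totalBytes - off;
        chunks.push_back({off, remaining < maxChunkBytes ? remaining : maxChunkBytes});
    }
    return chunks;
}
```

A 1 MB request with a 64 KB limit comes out as 16 pieces, matching the figure above.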

How do I unlock the pages?

You don’t have to. For METHOD_xx_DIRECT ioctls, the I/O system locks the memory on the way in and unlocks it on the way out. Do you still have calls to MmMapLockedPagesEtc in your code? Those have to be mated with corresponding MmUnmapLockedPages.

@Tim_Roberts said:

How do I unlock the pages?
You don’t have to. For METHOD_xx_DIRECT ioctls, the I/O system locks the memory on the way in and unlocks it on the way out.

That was my understanding. But doesn’t this mean that I don’t need to make a call to MmGetSystemAddressForMdlSafe as it locks the pages?

Do you still have calls to MmMapLockedPagesEtc in your code? Those have to be mated with corresponding MmUnmapLockedPages.

No, I haven’t introduced any.

  • The BSOD started happening when I introduced the call to MmGetSystemAddressForMdlSafe. If I take out the call, the BSOD doesn’t happen.

Thank you for all the great information! I will work on fixing the DmaEnabler creation. I know I’m stumbling through this; BIG Thanks for the continued patience and help.

MmGetSystemAddressForMdlSafe does not lock pages. The buffer that an MDL describes has already been locked. Otherwise, you couldn’t get physical page numbers, which is what the MDL contains.
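As a rough illustration of how many physical page numbers an MDL carries for a given buffer, the sketch below reproduces the rounding arithmetic of the WDK's ADDRESS_AND_SIZE_TO_SPAN_PAGES macro in user mode. It assumes 4 KB page granularity; `PagesSpanned` and `kPageSize` are illustrative names.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kPageSize = 4096;  // assumed page granularity

// Number of pages touched by a buffer that starts at virtual address `va`
// and is `size` bytes long. A buffer that starts mid-page can span one
// more page than its size alone would suggest.
inline std::uint64_t PagesSpanned(std::uint64_t va, std::uint64_t size)
{
    const std::uint64_t offsetInPage = va & (kPageSize - 1);
    return (offsetInPage + size + (kPageSize - 1)) / kPageSize;
}
```

For a page-aligned 4 MB buffer this gives 1024 pages, which is also the worst-case number of S/G elements if no two pages happen to be physically adjacent.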

@Tim_Roberts said:
MmGetSystemAddressForMdlSafe does not lock pages. The buffer that an MDL describes has already been locked. Otherwise, you couldn’t get physical page numbers, which is what the MDL contains.

Yes, you’re right Mr. Roberts. I had been googling and reading posts and Microsoft docs to try and figure the overall problem out; my brain is getting all jumbled.

Current State:

  • I am using MmGetSystemAddressForMdlSafe
  • I changed the DMA Config (Removed the zero’ing out of the callbacks, Updated the MaxTransferByteSize, removed the Flag and DmaVersionOverride).
  • Still, no cigar with a status of STATUS_INSUFFICIENT_RESOURCES (0xC000009A).
  • UserSW uses VirtualAlloc to allocate 4MB of Large Page memory.

I’m at a loss at the moment…it doesn’t seem like it should be complicated :)

Where do you get the STATUS_INSUFFICIENT_RESOURCES?

@Tim_Roberts said:
Where do you get the STATUS_INSUFFICIENT_RESOURCES?
The call to CalculateScatterGatherList

To close off this thread:

I haven’t had any luck making this work. I continue to get STATUS_INSUFFICIENT_RESOURCES from CalculateScatterGatherList.

  • For now, I will set aside the “Hanging Direct I/O” approach and go with exposing a Common_Buffer to UserSW, as that is working.
  • Still, it’s on the back burner, so if you have any suggestions / comments, please post away - thanks in advance.

Recap of my [failed] implementation (Note: Code snippets have been “simplified” for brevity):

User SW:

    // SW is allocating a data buffer using Large Pages (2MB). For this, SW needs to be running as Admin + have SE Privilege.
    tp.Privileges[ 0 ].Attributes = SE_PRIVILEGE_ENABLED;

      // enable privilege
    status = AdjustTokenPrivileges( hToken, FALSE, &tp, 0, ( PTOKEN_PRIVILEGES ) NULL, 0 );

   // For testing, allocate 4MB (2 pages) but will eventually need much more than this.
   pointerToMemory = VirtualAlloc( NULL, 
                                   memorySize,
                                   MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES,
                                   PAGE_EXECUTE_READWRITE );

    DeviceIoControl( pDevice->hFile,
                    IOCTL_SETUP_HOST_MEMORY_DMA,
                    inDataHostMemoryInfoP, sizeof( SetupHostMemoryInputBufferType ),
                    pointerToMemory , memorySize,
                    &bytesReturned, NULL );

Driver Device Add:

WDF_DMA_ENABLER_CONFIG dmaEnablerConfig;
WDF_DMA_ENABLER_CONFIG_INIT( &dmaEnablerConfig, WdfDmaProfileScatterGather64Duplex, 1024*4 );

dmaEnablerConfig.AddressWidthOverride = 0;

NTSTATUS status{ WdfDmaEnablerCreate( wdfDevice, &dmaEnablerConfig, WDF_NO_OBJECT_ATTRIBUTES, &( myDmaEnabler ) ) };
if( !NT_SUCCESS( status ) )
{
    TraceEvents( TRACE_LEVEL_ERROR, DBG_INIT, "WdfDmaEnablerCreate() failed with status=[%!STATUS!]", status );
}

Driver Device Process IOCTL:

void EvtIoInCallerContext( IN WDFDEVICE  device, IN WDFREQUEST  request )
  // Get the outbuffer from the IOCTL request
  WDFMEMORY outputBufferObj{ nullptr };
  WdfRequestRetrieveOutputMemory( request, &outputBufferObj );

  // Get the outbuffer size. Note: outBuffP not actually needed, we just want the size.
  PVOID outBuffP = WdfMemoryGetBuffer( outputBufferObj, &outBuffByteSize );

  // Get the MDL from the outbuffer
  WdfRequestRetrieveOutputWdmMdl( request, &hostMemoryMdlP );

  // Get the DMA adapter.
  PDMA_ADAPTER dmaAdapter = WdfDmaEnablerWdmGetDmaAdapter( myDmaEnabler,
                                                           WdfDmaDirectionReadFromDevice );

  // Get the Kernel Virtual Address for the MDL
  hostMemoryKvaP = MmGetSystemAddressForMdlSafe( hostMemoryMdlP, 
                                                                 HighPagePriority );

  // Get the SG List size
  dmaAdapter->DmaOperations->CalculateScatterGatherList( dmaAdapter,
                                                         hostMemoryMdlP, 
                                                         hostMemoryKvaP,
                                                         hostMemoryMdlP->ByteCount,
                                                         &sgByteSize, 
                                                         &numMapRegisters );
  // Allocate memory for the SG List
  PVOID scatterGatherListBuffP{ nullptr };
  WdfMemoryCreate( &buffAttributes, PagedPool, 0,
                   static_cast<size_t>( sgByteSize ),
                   &scatterGatherMemoryObj, &scatterGatherListBuffP );

  // Get the Device Object
  PDEVICE_OBJECT deviceObjectP = WdfDeviceWdmGetDeviceObject( device );

  // Get the SG List
  dmaAdapter->DmaOperations->BuildScatterGatherList( dmaAdapter, deviceObjectP,
                                                                hostMemoryMdlP, hostMemoryKvaP, 
                                                                static_cast<ULONG>( outBuffByteSize ),
                                                                &(MyAdapterListControl), deviceContextP, FALSE,
                                                                scatterGatherListBuffP, sgByteSize );