GetScatterGatherListEx() and ownership/lifetime semantics

Hi All,
So I’m slowly eking my way toward getting scatter/gather DMA working on my bus-master PCIe device. I’m trying to use GetScatterGatherListEx() with the DMA_SYNCHRONOUS_CALLBACK flag set. I’d prefer to use it without the callback (parameter ExecutionRoutine == NULL, parameter ScatterGatherList != NULL), but using it with the callback (parameter ExecutionRoutine != NULL, parameter ScatterGatherList == NULL) is OK too. I’ve got it working both with and without the callback, and I’ve verified that the SCATTER_GATHER_LIST returned makes sense for the MDL I’m passing in, for both cases.

The problem I’m having is that after using this function, my driver blue screens the system when it is unloaded (when I call PutDmaAdapter() to deallocate the adapter I created with IoGetDmaAdapter()). I presume that means I am leaking some resource. That’s not terribly surprising, as I’m finding the documentation for GetScatterGatherListEx() really confusing in terms of what I have to deallocate and what is and isn’t a singleton per DMA_ADAPTER object. So does anybody know (for the case of no callback, with callback, or both):

  • Do you need to call PutScatterGatherList()?
  • Do you need to call FreeAdapterObject(), and if so, with what value for the AllocationAction parameter?
  • Does using FreeAdapterObject() absolve you of the responsibility of calling PutDmaAdapter()? Or must you still call PutDmaAdapter()?
  • Does using FreeAdapterObject() reset the DMA_ADAPTER object such that it’s in the same usable state as right after calling IoGetDmaAdapter()?
  • Must you call InitializeDmaTransferContext() before each DMA transaction that reuses the DMA transfer context buffer, or just once to initialize it?
  • If you are using multiple DMA channels simultaneously (for example, I have 4 H2C and 4 C2H channels on my device), is it OK to call GetScatterGatherListEx() multiple times on the same DMA_ADAPTER before completing a previous DMA transaction started with GetScatterGatherListEx() (i.e., calling PutScatterGatherList() and/or FreeAdapterObject() to complete the transaction)?
  • If not, can you call IoGetDmaAdapter() multiple times to get a DMA_ADAPTER for each DMA channel (8 in my case) and use these separate DMA_ADAPTER objects simultaneously for multiple concurrent DMA transfers?

Thanks,
Omri

You really, really should be using WDF instead of WDM. Really. I promise you’ll eventually thank me for this advice.

Peter

Side comment 1: Apologies about the docs on the HAL’s DMA routines. I completely agree that they’re not easy to understand. The synchronous mode (DMA_SYNCHRONOUS_CALLBACK) was shoehorned into the existing async API, and it’s not as graceful as it could have been.

Side comment 2: Peter is right in general about WDF: it’s a lot easier to use, better documented, and you’ll waste less time chasing mysterious issues. However, I don’t believe that WDF supports the synchronous scatter-gather mode yet. So there’s a faint case to be made that a driver that needs ridiculously high IOPS (40G+ networking or possibly some of the fancy new storage controllers) would want to go down to the HAL layer to shave off a few more cycles per I/O. However, if you’re writing something that’s doing less than 100k IOPS, then WDF will be a great fit, and you’ll save your employer lots of money if you use WDF instead. Not to mention you’ll save yourself some grey hairs.

Anyway, here’s a recipe for maximizing IOPS using synchronous DMA. You must have this exact sequence of calls for each transfer (a code sketch follows the list):

  1. InitializeDmaTransferContext()
  2. BuildScatterGatherListEx(DMA_SYNCHRONOUS_CALLBACK)
  3. // copy SCATTER_GATHER_LIST to your hardware’s descriptor here
  4. FreeAdapterObject(DeallocateObjectKeepRegisters)
  5. // wait for your hardware to signal that it’s done using the mapped memory, e.g., interrupt or polling the descriptor for completion
  6. PutScatterGatherList()
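In rough code, the sequence looks like this (a sketch only, not compiled; DmaAdapter, DeviceObject, Mdl, the sglBuffer/sglBufferSize you manage yourself, and the ProgramMyHardware/WaitForMyHardware helpers are placeholders for your own driver’s pieces):

    UCHAR xferCtx[DMA_TRANSFER_CONTEXT_SIZE_V1];
    PSCATTER_GATHER_LIST sgl = NULL;
    NTSTATUS status;

    // Step 1: initialize the opaque DMA transfer context.
    status = DmaAdapter->DmaOperations->InitializeDmaTransferContext(
                 DmaAdapter, xferCtx);
    if (!NT_SUCCESS(status)) return status;

    // Step 2: build the SGL synchronously, into a buffer you own.
    status = DmaAdapter->DmaOperations->BuildScatterGatherListEx(
                 DmaAdapter,
                 DeviceObject,
                 xferCtx,
                 Mdl,
                 0,                          // Offset into the MDL
                 MmGetMdlByteCount(Mdl),     // Length
                 DMA_SYNCHRONOUS_CALLBACK,   // synchronous mode
                 NULL,                       // ExecutionRoutine: none needed
                 NULL,                       // Context
                 TRUE,                       // WriteToDevice (host-to-card)
                 sglBuffer,                  // your own SGL storage
                 sglBufferSize,
                 NULL,                       // DmaCompletionRoutine
                 NULL,                       // CompletionContext
                 (PVOID*)&sgl);              // receives the built SGL
    if (!NT_SUCCESS(status)) return status;

    // Step 3: copy sgl->Elements[] into your hardware's descriptor
    // format and start the transfer.
    ProgramMyHardware(DeviceObject, sgl);

    // Step 4: release the adapter but keep the map registers in use.
    DmaAdapter->DmaOperations->FreeAdapterObject(
        DmaAdapter, DeallocateObjectKeepRegisters);

    // Step 5: wait for the hardware to finish with the mapped memory.
    WaitForMyHardware(DeviceObject);

    // Step 6: release the map registers and the SGL.
    DmaAdapter->DmaOperations->PutScatterGatherList(DmaAdapter, sgl, TRUE);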

This looks a little bit baroque, because you could argue that it should be possible to do synchronous DMA in 2 function calls, not 4. If you made that argument, I’d agree with you :wink:

For step 2, I put BuildScatterGatherListEx instead of GetScatterGatherListEx. They do the same thing, but BuildSGLE lets you provide the memory buffer for the SGL, while GetSGLE causes HAL to allocate & cache SGL buffers internally. If you really care about IOPS enough to use DMA like this, you almost certainly also want to manage your own SGL buffers. (HAL’s SGL buffers aren’t somehow “better” than yours: it just goes to ExAllocatePool like you would. But HAL doesn’t know how to tune for your IO patterns as well as you do.)
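HAL will even tell you how big that buffer needs to be, via CalculateScatterGatherList. Something like this (a sketch; the pool tag is arbitrary and error handling is omitted):

    // Ask HAL for the worst-case SGL size for this MDL, then allocate a
    // reusable buffer for it.
    ULONG sglSize, mapRegs;
    NTSTATUS status = DmaAdapter->DmaOperations->CalculateScatterGatherList(
        DmaAdapter, Mdl, MmGetMdlVirtualAddress(Mdl),
        MmGetMdlByteCount(Mdl), &sglSize, &mapRegs);
    if (NT_SUCCESS(status)) {
        PVOID sglBuffer = ExAllocatePoolWithTag(NonPagedPoolNx, sglSize, 'lgSX');
        // Pass sglBuffer/sglSize as ScatterGatherBuffer/Length in step 2,
        // and cache the buffer for reuse across transfers.
    }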

Steps 2-4 cannot be parallelized. So once you call BuildScatterGatherListEx, you need to call FreeAdapterObject before you can call BuildScatterGatherListEx again. (This limit is enforced per DMA adapter, so you could circumvent it by getting multiple adapters.)

Below are explicit answers to your questions, in case the general description above doesn’t cover it all:

  • Yes.
  • Yes, DeallocateObjectKeepRegisters.
  • No. Yes.
  • Almost. There’s still a refcount on the DMA_ADAPTER until you call PutScatterGatherList. But otherwise, the DMA_ADAPTER is ready to start another transaction.
  • Yes.
  • Yes. In fact, pipelining multiple transactions is critical to reaching high IOPS. To do that, repeat steps 2-4 in a loop (see the sketch after this list). I suggest looping until you get back STATUS_INSUFFICIENT_RESOURCES (or run out of I/Os to transfer). Resume looping once you call PutSGL (or get more I/Os from above). If you want truly independent I/O paths, consider opening multiple DMA adapters, so there’s less contention on HAL’s internal spinlocks.
  • Yes.
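For the pipelining answer above, the loop looks roughly like this (a sketch; the PXFER type, the queue helpers, and BuildAndStartOneTransfer, which wraps steps 1-4 from the recipe, are all hypothetical):

    // Keep starting transfers until HAL runs out of resources or we run
    // out of work; resume when a transfer retires via PutScatterGatherList.
    PXFER xfer;
    NTSTATUS status;

    while ((xfer = GetNextQueuedTransfer(DevExt)) != NULL) {
        status = BuildAndStartOneTransfer(DevExt, xfer);   // steps 1-4
        if (status == STATUS_INSUFFICIENT_RESOURCES) {
            RequeueTransfer(DevExt, xfer);  // retry after the next PutSGL
            break;
        }
    }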

All this discussion is experience we gained building the DMA engine for NetAdapter. That means two things:

@“Jeffrey_Tippet_[MSFT]” said:

All this discussion is experience we gained building the DMA engine for NetAdapter. That means two things:

Great link. I absolutely missed this development at MS. Thanks for that. mp

Thanks for the advice guys. I did indeed switch to using WDF based DMA.
So I now create a pool of WDFCOMMONBUFFER objects that I reuse from transaction to transaction. I use these to populate the DMA descriptors my device needs for scatter gather based on the SCATTER_GATHER_LIST parameter passed into my EVT_WDF_PROGRAM_DMA callback. I made sure to use WdfCommonBufferGetAlignedLogicalAddress() to get the device side addresses to reference these buffers and set the required alignment both with WdfDeviceSetAlignmentRequirement() and WDF_COMMON_BUFFER_CONFIG::AlignmentRequirement.
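In code, the common buffer setup looks roughly like this (a simplified sketch; the 64-byte alignment and the XDMA_DESCRIPTOR type are stand-ins for my device’s real requirements, and dmaEnabler comes from an earlier WdfDmaEnablerCreate() call):

    WDF_COMMON_BUFFER_CONFIG cbConfig;
    WDFCOMMONBUFFER descBuffer;
    NTSTATUS status;

    // Tell WDF the device's alignment requirement, then create a common
    // buffer with matching alignment for the descriptor ring.
    WdfDeviceSetAlignmentRequirement(device, FILE_64_BYTE_ALIGNMENT);
    WDF_COMMON_BUFFER_CONFIG_INIT(&cbConfig, FILE_64_BYTE_ALIGNMENT);

    status = WdfCommonBufferCreateWithConfig(
                 dmaEnabler,
                 NUM_DESCRIPTORS * sizeof(XDMA_DESCRIPTOR),
                 &cbConfig,
                 WDF_NO_OBJECT_ATTRIBUTES,
                 &descBuffer);
    if (NT_SUCCESS(status)) {
        // The CPU writes descriptors through the VA; the device reads
        // them at the logical address.
        PVOID va = WdfCommonBufferGetAlignedVirtualAddress(descBuffer);
        PHYSICAL_ADDRESS la = WdfCommonBufferGetAlignedLogicalAddress(descBuffer);
    }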
At the moment my device fails with “incorrect magic number”. This magic number is a hardcoded constant that must appear in one of the data members of every DMA descriptor I create, as a hardware-based sanity check. My initial guess is that this is some cache coherency issue with the WDFCOMMONBUFFERs I’m using to store the DMA descriptors, which the driver must write to and the device must read from. Is there something you must do to ensure any CPU-side writes to the WDFCOMMONBUFFER are flushed from the cache before I write the registers on the device to actually start the DMA transfer?
Also, I noticed that SCATTER_GATHER_ELEMENT::Address is documented as a physical address, not a logical one. How do I convert it into a logical address that the DMA controller on my device can use?
Thanks!!!

noticed that SCATTER_GATHER_ELEMENT::Address is documented as a physical address

Quick answer in case you’re hung up on this (I’ll try to answer more tomorrow): this is just a documentation terminology issue. The addresses provided to you in the SGL are indeed directly usable for DMA. So they are actually Device Bus Logical Addresses.

Peter

Alright, I suspected as much, but thanks for the clarification, Peter. FYI, I did get my DMA working now (for the most part); there was a bug in how I filled in the DMA descriptors for the DMA controller. It still fails intermittently, but I haven’t looked into it deeply enough yet to tell whether it’s a timing issue, a misuse of the WDF APIs, a cache coherency issue, or another simple bug in how I populate the DMA descriptors.

I most heartily apologize for not being able to get back before now. I’ve been working furiously on a prototype for a client, plus in our spare time here everyone has been working on getting “live streaming” to work in the OSR Seminar Space so that (hopefully) starting next month people will be able to attend our seminar EITHER by coming here to visit us or remotely over the Internet.

But, be that as it may: I can provide you some comfort that your WDF driver does not need to worry about cache coherency and DMA. This is “taken care of” for you by WDF, assuming you’re fully using the WDF API set… and x86 and x64 architecture machines are inherently cache coherent when it comes to DMA in any case (NOT the case for ARM CPUs, though…).

Let me know if/when you get this solved or you have further difficulties… I’ll try to be more attentive.

Peter

Thanks Jeffrey_Tippet_[MSFT]. I switched away from using the WDF-based DMA back to the old-style BuildScatterGatherListEx() without a callback. The steps you mentioned work perfectly. Although I agree that the ownership semantics of WDF are simpler, it gives you less control, since it’s callback based rather than object based, which forgoes some potential performance improvements. I set up my user-space app to have three worker threads for doing DMA transfers now:

  • A pre-transfer thread to map the host memory to DMA from user to kernel space, build the scatter/gather list, and populate the common buffer with the scatter/gather list in the Xilinx DMA controller’s data structure format
  • An actual transfer thread to start the transfer and sleep until the WDFREQUEST for the DMA transfer completes via an IRQ signal from the DMA engine
  • A post-transfer thread to put/deallocate the scatter/gather list and unmap the DMAed buffer

This boosted the DMA data rate by 20%; it’s now close to the theoretical maximum my device’s PCIe link supports, and about the same performance I get from my Linux driver. The idea is that the Xilinx DMA engine never sits idle, because the relatively expensive memory mapping and scatter/gather list building for the next DMA operation happen concurrently in another thread (sketched below).
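A compact sketch of the pipeline (user mode, Win32 threads; the PIPELINE and XFER types, the blocking-queue helpers, and the DoXxx ioctl wrappers are hypothetical placeholders for my actual driver interface):

    // Three stages connected by blocking queues. While the hardware works
    // on transfer N, the next transfer is already being mapped and the
    // previous one is being torn down.
    DWORD WINAPI PreTransferThread(LPVOID Ctx) {
        PIPELINE *p = Ctx;
        for (;;) {
            XFER *x = QueuePop(&p->FreeQ);
            DoMapAndBuildSgl(x);       // ioctl: map user memory, build the
                                       // SGL, fill the device descriptors
            QueuePush(&p->ReadyQ, x);
        }
    }

    DWORD WINAPI TransferThread(LPVOID Ctx) {
        PIPELINE *p = Ctx;
        for (;;) {
            XFER *x = QueuePop(&p->ReadyQ);
            DoStartAndWait(x);         // ioctl: start the DMA, sleep until
                                       // the completion interrupt fires
            QueuePush(&p->DoneQ, x);
        }
    }

    DWORD WINAPI PostTransferThread(LPVOID Ctx) {
        PIPELINE *p = Ctx;
        for (;;) {
            XFER *x = QueuePop(&p->DoneQ);
            DoPutSglAndUnmap(x);       // ioctl: put the SGL, unmap buffer
            QueuePush(&p->FreeQ, x);
        }
    }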