Permanently mapped buffers with static scatter gather list

Hi all,

I have written and maintain a PCIe driver using WDF with DMA scatter/gather support for a PCIe Gen3 device, mainly on Windows 10 but also on Windows Server.
It works well; the user passes a virtual address into the driver which then creates the scatter/gather table and off we go.
However, I would like to optimise this process by not having to create the scatter/gather list each time for each DMA transfer.

This is how far I got:
I am getting the user to “register” their buffers (about 10 buffers at about 70MB each) at the start of their app (calling via IOCTL); the driver maps the buffer (WdfRequestRetrieveOutputWdmMdl) and leaves the request uncompleted after returning to the user, so the buffer stays mapped.
When the user wants to do a DMA I currently get the OS scatter/gather table for each DMA (and current buffer in question) by calling WdfDmaTransactionInitialize and WdfDmaTransactionExecute; but I’d like to avoid this overhead.
Can I call these two functions when the user registers their buffer and save the OS scatter/gather list for later DMA transactions, or am I going to run out of map registers keeping the OS scatter/gather lists active for all the buffers they’ve specified? Potentially I’d like to be able to increase the number of buffers even further (I know that this is an insane amount of memory, but it’s for a very specific use case).
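For context, here is roughly the flow I have today. This is only a sketch: the context layout (`devCtx`, `RegisteredMdl`, `PendingRegistration`) and `EvtProgramDma` are placeholder names, and error handling is trimmed.

```c
/* Registration IOCTL: retrieve the MDL and keep the request pending,
   so the user buffer stays probed/locked and mapped. */
PMDL mdl;
NTSTATUS status = WdfRequestRetrieveOutputWdmMdl(Request, &mdl);
if (NT_SUCCESS(status)) {
    devCtx->RegisteredMdl[i] = mdl;
    devCtx->PendingRegistration[i] = Request;   /* deliberately NOT completed */
}

/* Per DMA transfer: build and execute a transaction against the saved MDL.
   This is the WdfDmaTransactionInitialize/Execute overhead in question. */
WDFDMATRANSACTION txn;
status = WdfDmaTransactionCreate(devCtx->DmaEnabler,
                                 WDF_NO_OBJECT_ATTRIBUTES, &txn);
if (NT_SUCCESS(status)) {
    status = WdfDmaTransactionInitialize(
        txn,
        EvtProgramDma,                           /* receives the SGL */
        WdfDmaDirectionReadFromDevice,
        devCtx->RegisteredMdl[i],
        MmGetMdlVirtualAddress(devCtx->RegisteredMdl[i]),
        MmGetMdlByteCount(devCtx->RegisteredMdl[i]));
}
if (NT_SUCCESS(status)) {
    status = WdfDmaTransactionExecute(txn, WDF_NO_CONTEXT);
}
```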

The main reason for this optimisation is that creating the scatter/gather list these days is on the order of the time required to do the actual DMA transfer (considering a device with PCIe Gen3 x16 lanes, with PCIe Gen4 already available as well). So I’d like to get rid of as much overhead as possible to maximise DMA bandwidth (considering that the bandwidth the user sees is limited by the setup cost).
I am open to any suggestions if there are other approaches for optimisations.
I guess kernel memory might be another approach and mapping that to the user, but users prefer to do their own allocations, so I’d like to avoid that approach.

Also, I am not sure what the exact definition of pinned memory is, e.g. NVIDIA CUDA uses pinned memory.
Does that only mean the first step (keeping the MDL around) or does that include the scatter/gather list as well?

Thanks,
Timber

The main reason for this optimisation is that creating the scatter/gather list these days is on the order of the time required to do the actual DMA transfer

Are you just guessing, or have you done any actual timing? There is very little cost to constructing a scatter/gather list, and I find it difficult to imagine it is a significant bottleneck.

I second what Mr. Roberts said.

Let me see if I can answer your questions, though:

  1. Can I “statically” allocate the DMA mappings?

First, please be sure you understand the Windows DMA Abstraction. There is an ancient, but still correct, article from The NT Insider here. Which makes me wonder why that article wasn’t updated and moved to OSR.COM… but I digress.

Second, understand that the Abstraction isn’t always pure fantasy for Express devices that do busmaster DMA. We live in a world where there are IO MMUs. Because we have the general Windows DMA Abstraction, IO MMU implementation should be entirely painless (and transparent) for devices and their properly written drivers. So, while we know that the IO MMU is used to implement things like kernel DMA protection for Thunderbolt, we don’t necessarily know when/if it’s more universally enabled.

Windows does not offer any sort of option to “statically” allocate Map Registers. Map Register allocation is always done dynamically, per transfer. Can you deliberately bypass this if you know in advance you’ll be using the same buffers over-and-over again? It’s always been against the rules… but “back in the day” for 64-bit enabled busmaster DMA PCI family devices, you could do it with some confidence (and it did work… I’ve done it). HOWEVER now, with real IO MMU support… I’m not so sure it’s a safe thing to do.

  2. I am not sure what the exact definition of pinned memory is

The definition is in the view of the writer… but in general use, “pinned” in other operating systems (or in generic uses) simply means “probed and locked” in Windows parlance. It means the page in question can neither be moved nor removed from physical memory.
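In Windows terms, “pinning” corresponds to the MDL probe-and-lock step. A minimal kernel-mode sketch (error handling abbreviated; `UserVa` and `Length` are hypothetical inputs):

```c
/* Build an MDL for a user buffer and lock its pages into physical memory.
   After MmProbeAndLockPages succeeds, the pages can neither be moved nor
   paged out until MmUnlockPages is called. */
PMDL mdl = IoAllocateMdl(UserVa, Length, FALSE, FALSE, NULL);
if (mdl == NULL) {
    return STATUS_INSUFFICIENT_RESOURCES;
}

__try {
    MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
}
__except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(mdl);
    return GetExceptionCode();
}

/* ... later, when the buffer is unregistered ... */
MmUnlockPages(mdl);
IoFreeMdl(mdl);
```

Note that pinning ends here: locking the pages says nothing about the device-side (bus) mapping, i.e. the scatter/gather list, which is a separate step.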

I hope that long reply helps,

Peter

Thanks for your answers, much appreciated.

I did some benchmarking to find out where the “bottlenecks” are.
WdfDmaTransactionInitialize and WdfDmaTransactionExecute together take about 160 us (about 8 us for init and 152 us for exec).
The actual DMA takes about 360 us. The times are pretty small, but put in relation, the setup cost is about 44% of the DMA time from the user’s point of view.
The users tend to burst about 8 - 24 DMAs in parallel, which all need to be serialised, which means the setup cost of 160 us blows up to at least 1.28 ms even in the case of just 8 DMAs; that is, it adds a latency of 1.28 ms to the last DMA. It doesn’t sound like much, but in the video world, with frame rates of 120 or even 240 Hz, this does become a bottleneck.

I think I kind of knew the answer about creating and keeping a “static” scatter/gather list for all the buffers, i.e. it’s not going to work on all systems and therefore I shouldn’t attempt it. I wasn’t sure if the warning about “running out of map registers” still applies these days with 64-bit CPUs being the standard (for Win 10 anyway), but I guess some servers will have restrictions.

Thanks for clarifying “pinned memory”; I was under the impression it extended to keeping the SG list static as well: the buffer is locked and its physical pages will not change, so I assumed I could safely store the SG list.
I guess the only optimisation in this case would be to try to create the next DMA’s SG list while the previous DMA transfer is still running to hide the setup cost.

Do my conclusions make sense? Happy to be proven wrong :slight_smile:

Thanks,
Timber

therefore I shouldn’t attempt it.

I agree. I wouldn’t consider doing it myself, unless I was building a “closed” system where I could 100% qualify all the hardware and software components.

I guess the only optimisation in this case would be to try to create the next DMA’s SG list while the previous DMA transfer is still running to hide the setup cost.

Oh, if perf is really super important, I would definitely look to overlap the setup with the transfer. You can even do it for multiple pending requests, and keep a list of SGLs that are “ready to go”. If the device allows it, you can get super-clever and reprogram the device directly from within the DMA-complete ISR (thus avoiding the ISR-to-DPC latency before starting the next transfer). Look at ExInterlockedInsertTailList and friends to maintain the list of things to start in the ISR.
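A sketch of that “ready to go” queue, using the interlocked list routines mentioned above. The `PENDING_XFER` structure and `ProgramDevice` routine are hypothetical; only the list plumbing uses real kernel APIs.

```c
typedef struct _PENDING_XFER {
    LIST_ENTRY           Link;
    PSCATTER_GATHER_LIST Sgl;    /* built ahead of time, e.g. in EvtProgramDma */
} PENDING_XFER, *PPENDING_XFER;

LIST_ENTRY ReadyList;            /* InitializeListHead() at driver start */
KSPIN_LOCK ReadyLock;            /* KeInitializeSpinLock()               */

/* Producer (passive/dispatch level): queue a prepared SGL. */
ExInterlockedInsertTailList(&ReadyList, &xfer->Link, &ReadyLock);

/* Consumer (DMA-complete ISR): start the next transfer immediately,
   avoiding the ISR-to-DPC latency. */
PLIST_ENTRY entry = ExInterlockedRemoveHeadList(&ReadyList, &ReadyLock);
if (entry != NULL) {
    PPENDING_XFER next = CONTAINING_RECORD(entry, PENDING_XFER, Link);
    ProgramDevice(next->Sgl);    /* hypothetical hardware programming routine */
}
```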

Good luck!

Peter

Thanks, Peter.
For completeness’ sake: if the driver were to allocate kernel buffers, so that the “SG table” would only contain a single entry (the whole buffer being physically contiguous memory), do you think it would then be viable to keep the “SG tables” around for all the buffers?

Also, on a separate but related note, would keeping a memory buffer pinned (locked) have any impacts on the speed of a (userland) memcpy? Both the source and destination buffers being pinned?

if the driver was to allocate Kernel buffers and therefore the “SG table” would only contain a single entry

That’s only true if you allocate a common buffer. It’s perfectly valid to allocate a common buffer when your driver starts, have all your DMA be to and from the common buffer, and do copies to and from the user buffers. Modern processors do copies very, very quickly.
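A common buffer is contiguous from the device’s point of view by construction, so no scatter/gather list is needed at all. A minimal sketch of the allocate-once pattern (whether a single allocation of this size succeeds on a given platform is not guaranteed; `devCtx` fields are placeholders):

```c
/* Allocate once, e.g. in EvtDriverDeviceAdd after creating the DMA enabler. */
WDFCOMMONBUFFER commonBuffer;
NTSTATUS status = WdfCommonBufferCreate(devCtx->DmaEnabler,
                                        BUFFER_SIZE,           /* e.g. 70 MB */
                                        WDF_NO_OBJECT_ATTRIBUTES,
                                        &commonBuffer);
if (NT_SUCCESS(status)) {
    devCtx->CommonVa = WdfCommonBufferGetAlignedVirtualAddress(commonBuffer);
    devCtx->CommonLa = WdfCommonBufferGetAlignedLogicalAddress(commonBuffer);

    /* Program devCtx->CommonLa into the device once; memcpy user data
       to/from devCtx->CommonVa around each transfer. */
}
```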

would keeping a memory buffer pinned (locked) have any impacts on the speed of a (userland) memcpy?

It’s just memory. Memory doesn’t change speed by being locked. That is, assuming it isn’t paged out.

And, of course, kernel code and user code runs at the exact same speed.

Hi Tim,

I think I still have a misunderstanding about concepts and terminology, sorry.
I thought that when a memory buffer is locked, it means it can’t get paged out. Is that not correct? What else do I have to do to make sure it doesn’t get paged out?

The reason I brought up memcpy() was that if the user allocates virtual memory and it’s not pinned/locked it could get paged out, so the memcpy might sometimes take longer if the memory has to get paged in first; that’s why I think pinning/locking the involved memory might guarantee best performance.

Modern processors are quick at copying memory, but it will still take about 2-3 ms to copy (AVX512) a 30 MB buffer (a UHD video buffer), and if you are running at a frame rate of 120 or 240 Hz your frame period is about 8 or even just 4 ms; so in this case you do want to avoid any memory copies, as they would add latency. So when using kernel memory (a common buffer), the best-performance approach would be to map the kernel buffer into userland and give the user direct access. I don’t have a feel for how risky this is (if at all) from a security point of view. What do you think?

> Is that not correct?

Yes. That is correct.

> What else do I have to do to make sure it doesn’t get paged out?

Nothing. You’re good.

> So if using kernel memory (common buffer) to get the best performance would be to map the kernel buffer to userland and giving the user direct access. I don’t have a feeling for how risky this is (if at all) from a security point of view?

It’s a very common question. Search the forum for this topic. It’s actually rather complicated to implement because there ARE interesting security issues involved, and not all of them are obvious.

Peter

Thanks, Peter; that’s great.
I’ll dive into forums for the potential security issues if I decide to use kernel buffers.
For now I think I’ll stick with pinning memory and using SG DMA.
Thanks for all your help :slight_smile:

… if you are running a frame rate at 120 or 240 Hz …

240 Hz is a marketing buzzword with no practical use. That’s 58 Gbps. What wire are you going to run that across?

Tim, I agree, UHD 240 Hz is not practical yet (ever?).
I was just trying to make the point that speed is always relative and when deciding on creating an architecture it’s important to keep that in mind. For a lot of applications 1 ms is nothing and totally ok, but in some industries 1 ms is a long time and if you can eliminate this extra time then that’s a good thing. I guess NIC vendors are going for zero copy exactly for that reason.

Zero copy is overrated. I have seen some remarkably convoluted architectures created so some Vice President who fancies himself technical can say the net stack is zero copy.

Utterly fucking ridiculous. Copies haven’t been expensive since, oh, 2005 or so.

Peter

Ok, the network example was probably not a good example (just something I picked up during research and didn’t look too much into it).
My argument still stands, though, and I think you would have to agree that if you can avoid a memory copy then that’s a good thing; obviously only if the non-memory-copy implementation isn’t taxing in some other way.
I had colleagues arguing to use kernel memory and then copy out of it into userland memory, just because the firmware would be more optimised dealing with contiguous physical memory. I think avoiding kernel memory and the memory copy by using SG DMA directly into userland memory is better; or is that just preference?

I think your fundamental design of overlapping the SGL creation with I/O is a sound one. It’s how I’d most likely do it.

Peter