Allocate huge amounts of contiguous physical memory for DMA w/o scatter/gather

DriverThomas · December 7, 2021, 8:25pm

Thank you for your clear response.
Honestly, I have to admit that the legacy code uses MmGetPhysicalAddress. I’ve read the warning in the MS documentation too, and I also wondered about it (and was scared by it) during onboarding on the project. But it has worked for many years. I know, no excuses. I have that on my list…

So I will move on with using IoGetDmaAdapter, AllocateCommonBuffer on it and so on…

Peter_Viscarola_OSR · December 7, 2021, 9:15pm

I will move on with using IoGetDmaAdapter, AllocateCommonBuffer on it and so on.

Why not go straight into the 21st Century and use KMDF… and then you’ll be happier.

I recently did a project where I allocated a whole bunch of 2GB chunks of memory for “direct access” from an FPGA via an Address Translator (what we could call a Slave Bridge in non-politically-correct terms). Using WdfCommonBufferCreate worked perfectly.

(MmGetPhysicalAddress? Think about how that can’t work if DMAR (IOMMU) is enabled…)

Peter

Tim_Roberts · December 8, 2021, 7:51am

But I thought wouldn’t it be possible to allocate much more contiguous physical memory than the usual 4kB page size

Well, remember that 4kB is more than just “usual”. It’s part of the hardware. Intel CPUs support 4kB pages and 2MB pages, called “large pages”. You can actually request memory from the large page pool, but there’s little reason to do so.

DriverThomas · December 9, 2021, 1:11pm

I’ve played around with the KMDF common buffer functions with WdfDmaProfilePacket64 configured. With that I could allocate just about 1.5GB altogether. If I wanted more, WdfCommonBufferCreate returned an error. I’ve tried several chunk sizes (16MB down to 128KB) of contiguous memory obtained via WdfCommonBufferCreate… I could not get past approx. 1.5GB. I don’t have much experience with this topic, but I want to gain a deeper understanding.

Peter_Viscarola_OSR · December 9, 2021, 1:55pm

It’s all about how much physical memory you have on the machine, right? And how much physically contiguous memory can be allocated.

Like I said, in one recent project I allocate in 2GB chunks… and successfully allocated at least 16GB of memory (which was all that was supported in this project); ISTR testing it to see it allocated even 100GB, but I couldn’t swear that’s the case.

This was on a system with something like 300GB of physical memory…

Peter

Don_Burn · December 9, 2021, 2:02pm

I haven’t done this for many years, but the old approach to allocating contiguous memory or lots of memory was to create a helper driver that started early in boot time. I was successfully able to allocate a 10GB contiguous buffer on a 64GB system this way for a very dumb device.

Peter’s approach is better if the device can handle it (which it sounds like yours can), but I’m passing along the old approach in case there is a specific need.

DriverThomas · December 9, 2021, 2:29pm

I have 128GB of memory installed, and I just want the largest possible chunks of contiguous memory for each of the 8 DMA channels. It would be OK if one chunk is just, e.g., 4MB, or even only 512kB, but not 4kB as it is now. One DMA channel of my card handles 1GB; with 4kB pages these are 262144 entries in the SGT which is just too much.
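
To put numbers on that, here is a minimal sketch of the arithmetic in plain C. The helper name `sg_entries` is made up for illustration and is not part of any Windows API; it only counts how many table entries a buffer needs when each entry covers at most one chunk.

```c
#include <stdint.h>

/* Number of scatter/gather entries needed to describe `bytes` of buffer
 * when every entry covers at most `chunk` bytes (rounds up).
 * `chunk` must be nonzero. Hypothetical helper, for illustration only. */
static uint64_t sg_entries(uint64_t bytes, uint64_t chunk)
{
    return (bytes + chunk - 1) / chunk;
}
```

For one 1GB channel this gives 262144 entries with 4kB chunks, 2048 entries with 512kB chunks, and 256 entries with 4MB chunks.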

Peter_Viscarola_OSR · December 9, 2021, 2:49pm

@DriverThomas

AH! I just noticed something: Try using WdfDmaProfileScatterGather64Duplex, instead of WdfDmaProfilePacket64. Big difference there… You are ONLY doing your DMA to these common buffers, right?

Just as an experiment, try starting your driver early at Boot Time, to see if (as Don suggests) there’s MORE contiguous memory available at that time.

Try, as an experiment, adding another hundred or two gig of memory to the machine… to see if that makes a difference.

I can, however, tell you for certain: There’s no secret “limit” to the amount of memory you can allocate by WdfCommonBufferCreate – except the max is limited to (4GB - 4K)… and you can call WdfCommonBufferCreate multiple times.
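
As a rough illustration of "call it multiple times", the sketch below splits a total size into per-call sizes in plain C. It makes no WDF calls; the function name `plan_common_buffers` and the constant `MAX_COMMON_BUFFER_BYTES` are invented for this example, and the cap is simply the (4GB - 4K) figure quoted above, taken as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: per-call cap of (4GB - 4K), as stated in the post above. */
#define MAX_COMMON_BUFFER_BYTES ((4ULL << 30) - 4096ULL)

/* Splits total_bytes into per-call sizes no larger than chunk_bytes.
 * Writes the sizes into out[] and returns how many calls are needed.
 * Returns 0 if chunk_bytes is zero, exceeds the per-call cap, or if
 * out[] (max_out entries) is too small to hold the plan. */
static size_t plan_common_buffers(uint64_t total_bytes, uint64_t chunk_bytes,
                                  uint64_t *out, size_t max_out)
{
    size_t n = 0;

    if (chunk_bytes == 0 || chunk_bytes > MAX_COMMON_BUFFER_BYTES)
        return 0;

    while (total_bytes > 0) {
        uint64_t this_call = total_bytes < chunk_bytes ? total_bytes : chunk_bytes;
        if (n == max_out)
            return 0;               /* caller's array is too small */
        out[n++] = this_call;
        total_bytes -= this_call;   /* the last call takes the remainder */
    }
    return n;
}
```

With these assumptions, 16GB in 2GB chunks comes out as 8 calls, while asking for a full 4GB per call is rejected because it exceeds the cap.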

Peter

Tim_Roberts · December 9, 2021, 6:37pm

If you’re wondering about why this happens, remember that physical memory is an EXTREMELY dynamic resource. When you first boot, there are large chunks of “pure” untouched memory, which will naturally be contiguous. But even after just a few minutes of normal operation, physical memory is wildly fragmented. There’s not an awful lot the system can do to coalesce those pages.

with 4kB pages these are 262144 entries in the SGT which is just too much

Are you checking whether these can be consolidated? If I remember correctly, the system doesn’t automatically combine adjacent pages in the S/G list.
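
A minimal sketch of what such a consolidation pass could look like, in plain C. The `SgElement` type is hypothetical, only loosely modeled on the address/length pairs of a scatter/gather list, and `max_len` stands in for whatever per-descriptor length limit the device has (an assumption; check the hardware).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element type: one address/length pair per fragment. */
typedef struct {
    uint64_t address;  /* device-visible start address of the fragment */
    uint32_t length;   /* fragment length in bytes */
} SgElement;

/* Merges neighbouring elements in place when one ends exactly where the
 * next begins and the merged length stays within max_len.
 * Returns the new element count. */
static size_t sg_coalesce(SgElement *e, size_t count, uint32_t max_len)
{
    size_t out = 0;  /* index of the element currently being grown */

    if (count == 0)
        return 0;

    for (size_t i = 1; i < count; ++i) {
        int adjacent = (e[out].address + e[out].length == e[i].address);
        int fits = ((uint64_t)e[out].length + e[i].length <= max_len);

        if (adjacent && fits)
            e[out].length += e[i].length;  /* extend the current run */
        else
            e[++out] = e[i];               /* start a new run */
    }
    return out + 1;
}
```

Three 4kB fragments at 0x1000, 0x2000 and 0x5000 collapse to two entries: the first two are adjacent and merge, the third is not.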

DriverThomas · December 9, 2021, 7:09pm

I can imagine that ASLR foils my plan here – it can be deactivated, though. I also thought about reserving the memory with a very early starting “wingman” driver – need to find out how.

Tim_Roberts · December 9, 2021, 9:15pm

ASLR affects virtual addresses. It is completely irrelevant here.

DriverThomas · December 17, 2021, 9:33am

I’ve played around a lot with WdfCommonBufferCreate and tried several chunk sizes, multiple dmaEnablerHandles, and so on. The furthest I got was allocating approx. 1.5GB in total, not more… A little disappointing…

Peter_Viscarola_OSR · December 17, 2021, 1:54pm

Once you’ve used WdfDmaProfileScatterGather64Duplex, there should be no advantage to using more Enabler handles.

You have 128GB on the machine, but can allocate only 1.5GB Common Buffer? Hmmmm…. And you’re doing this at system startup time? That is, in a word, surprising.

Peter

DriverThomas · December 17, 2021, 2:32pm

Ahh… Thanks to your suggestions and some reconsidering, I have an interesting result. When I’m using WdfDmaEnablerSetMaximumScatterGatherElements with setting max elements to 512 or even just one (sic!) and using WdfDmaProfileScatterGather64Duplex, I can achieve my goal. But I’m a little suspicious that the scatter/gather table is really just one entry then. How can I check that?

Peter_Viscarola_OSR · December 17, 2021, 6:25pm

using WdfDmaEnablerSetMaximumScatterGatherElements with setting max elements to 512 or even just one (sic!)

I don’t know what you mean here. Why would you call WdfDmaEnablerSetMaximumScatterGatherElements at all?? What you really want is “WDF_DMA_ENABLER_UNLIMITED_FRAGMENTS”, which is the default… since you’re not going to use ANY scatter/gather elements, the maximum number that you can handle is irrelevant.

Are you saying if you don’t call this with “1” (for example) the Common Buffer allocation you can get is DIFFERENT? Because I find that hard to believe.

But I’m a little suspicious that the scatter/gather table is really just one entry then. How can I check that?

You’re allocating a Common Buffer, right? There IS no scatter/gather table… by definition. You get a single buffer, and WdfCommonBufferGetAlignedLogicalAddress gets you the address that you use for DMA for the entire expanse of that buffer. Period.

Peter

DriverThomas · December 17, 2021, 6:51pm

Now I’m confused. What’s the difference between WdfDmaProfileScatterGather64Duplex and WdfDmaProfilePacket64, if not handling the (full) transfer with or without scatter/gather?
So you say that the allocated 1GB common buffer is always contiguous in physical memory? Maybe I don’t have the whole picture yet and need to read more about the topic.

Peter_Viscarola_OSR · December 17, 2021, 7:16pm

WdfDmaProfileScatterGather64Duplex and WdfDmaProfilePacket64

It has to do with the Windows DMA Abstractions. If you’re using a Common Buffer approach, the Framework expects you to use the WdfDmaProfileScatterGatherXXX profiles, and not the PACKET profiles. That’s what they were created for.

For some inexplicable reason, we’ve yet to move this article to our modern web site… but you can read about the underlying Windows architecture DMA Abstractions here.

So you say that the allocated 1GB common buffer is always contiguous in physical memory

Effectively, yes. Any Common Buffer is effectively contiguous for the purposes of DMA. And it ensures that the allocated range of memory is usable if an IOMMU (DMA Relocation) is enabled as well. This is the entire purpose of a Common Buffer… so you can DMA to it, without worry about setting up an S/G list on the host side.

In some versions of Windows, the WDM function AllocateCommonBuffer simply calls MmAllocateContiguousMemory.

Peter

DriverThomas · December 18, 2021, 10:00am

Thanks! I have read a lot about it, especially your linked article, and things are much clearer now.

So the way to go to handle the common buffer in user space finally would be: WdfCommonBufferCreate → WdfCommonBufferGetAlignedVirtualAddress → IoAllocateMdl → MmBuildMdlForNonPagedPool → MmMapLockedPagesSpecifyCache (User mode) (with EvtIoInCallerContext callback), correct?

Peter_Viscarola_OSR · December 18, 2021, 2:30pm

That is correct. But (and this is a big but)… when, where, and how do you do the unmap operation?

There are reasonably serious security complications involved in handling the UN-mapping. Ordinarily, you’d just do it in cleanup and be happy, because cleanup is called in the context of the process that’s calling close on the handle. But this raises the issue of what happens if the handle has been duplicated into another process. NOW, if the original process doesn’t happen to be the last to call CloseHandle, you’re screwed. At best the system crashes, at worst the offending process has access to a large chunk of what is now “random” memory.

You must deal with this somehow. Ten years ago, I used to argue (here in NTDEV, even) that this problem was so esoteric that it really wasn’t worth fixing. Now, in the 21st Century, where we have no end to cyber-attacks… you really cannot just ignore it.

All the ways to deal with this are beyond the scope of this post (I’ll write a proper article someday)… but you can either (alternative A) take a reference on the “original” mapping process. Then check in cleanup to see if the original mapper is the one calling cleanup. If it’s not, call KeStackAttachProcess to get into the original mapper’s context and do the unmap there. Alternative B is to set a PsSetCreateProcessNotifyRoutine and do the unmap when the original process exits.

Either way you must handle this.

Peter

DriverThomas · December 21, 2021, 9:52am

OK, thank you for your help and hints. I will try to implement a fully working setup with my FPGA guys to get a loopback design running. It will take some time, though. I wish I had more FPGA colleagues at my site.