Allocate huge amounts of contiguous physical memory for DMA without scatter/gather

DriverThomas Member Posts: 12

Hi,

I have a PC equipped with 128GB of RAM. For a high-speed, high-bandwidth application that constantly transfers many gigabytes over a self-developed PCIe card, we've implemented a driver that allocates 8GB of kernel memory in total with MmAllocatePagesForMdlEx and friends. This memory is eventually mapped into user space for read/write access.
The actual DMA transfer to/from the PCIe card uses a two-level scatter/gather table, because a single-level table wouldn't fit in the FPGA's block RAM. This has worked quite well so far.

But I wondered whether it would be possible to allocate physically contiguous memory in chunks much larger than the usual 4kB page size, say 256MB or even 1GB (I hope that's not too naive). That would let us get rid of the resource-hungry two-level scatter/gather table implementation on the FPGA. I've found MmAllocateContiguousMemory for this purpose.

What do you think? Would it be a good idea?

Thanks, Thomas

Comments

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    In a word, no.

    But let's step back for a minute. How are you determining the Host's "physical" address to give to the FPGA? Please tell me you're not using MmGetPhysicalAddress, cuz the docs are pretty clear about that:

    Do not use this routine to obtain physical addresses for use with DMA operations.

    The way to do this is to use Common Buffers, which are contiguous for the purpose of DMA. The maximum size of a Common Buffer that you can create is (MAXULONG - PAGE_SIZE).

    Does that help?

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DriverThomas Member Posts: 12

    Thank you for your clear response.
    Honestly, I have to admit that the legacy code uses MmGetPhysicalAddress. I read the warning in the MS documentation too, and it worried me during onboarding on the project. But it has worked for many years. I know, no excuses. I have that on my list...

    So I will move on with using IoGetDmaAdapter, AllocateCommonBuffer on it and so on...

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    I will move on with using IoGetDmaAdapter, AllocateCommonBuffer on it and so on.

    Why not go straight into the 21st Century and use KMDF... and then you'll be happier.

    I recently did a project where I allocated a whole bunch of 2GB chunks of memory for "direct access" from an FPGA via an Address Translator (what we could call a Slave Bridge in non-politically-correct terms). Using WdfCommonBufferCreate worked perfectly.

    (MmGetPhysicalAddress? Think about how that can't work if DMAR (IOMMU) is enabled...)
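
    To make the IOMMU point concrete, here is a toy model in plain C (not driver code, and not how any real IOMMU is programmed). The names toy_iommu, toy_iommu_translate, and NO_MAPPING are invented for illustration. It assumes only that the device issues addresses which a translation table maps to physical pages, so a raw CPU physical address handed to the device is looked up as a device address and generally has no valid entry.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12            /* 4kB pages, as in the thread */
#define NO_MAPPING     UINT64_MAX    /* hypothetical "translation fault" marker */

/* Toy translation table: index = device-visible page number,
   value = physical page number (or NO_MAPPING). */
typedef struct {
    const uint64_t *phys_page;
    size_t          entries;
} toy_iommu;

/* Translate a device-visible (logical) address to a physical address.
   Returns NO_MAPPING if the device address has no table entry. */
uint64_t toy_iommu_translate(const toy_iommu *mmu, uint64_t device_addr)
{
    uint64_t page   = device_addr >> TOY_PAGE_SHIFT;
    uint64_t offset = device_addr & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (page >= mmu->entries || mmu->phys_page[page] == NO_MAPPING) {
        return NO_MAPPING;
    }
    return (mmu->phys_page[page] << TOY_PAGE_SHIFT) | offset;
}
```

    In this model the device must be given the logical address (page 0 here), which is what the DMA APIs hand back; the physical address itself is simply not a valid input on the device side.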

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Tim_Roberts Member - All Emails Posts: 14,295

    But I wondered whether it would be possible to allocate physically contiguous memory in chunks much larger than the usual 4kB page size

    Well, remember that 4kB is more than just "usual". It's part of the hardware. Intel CPUs support 4kB pages and 2MB pages, called "large pages". You can actually request memory from the large page pool, but there's little reason to do so.

    Tim Roberts
    Providenza & Boekelheide, Inc.

  • DriverThomas Member Posts: 12

    I've played around with the KMDF common buffer functions with WdfDmaProfilePacket64 configured. With that I could allocate only about 1.5GB altogether. If I asked for more, WdfCommonBufferCreate returned an error. I tried several chunk sizes (from 16MB down to 128KB) of contiguous memory obtained from WdfCommonBufferCreate, but I could not get past approx. 1.5GB. I don't have much experience with this topic, but I want to understand it more deeply.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    It's all about how much physical memory you have on the machine, right? And how much physically contiguous memory can be allocated.

    Like I said, in one recent project I allocated in 2GB chunks... and successfully allocated at least 16GB of memory (which was all that was supported in this project); ISTR testing it to see whether it could allocate even 100GB, but I couldn't swear that's the case.

    This was on a system with something like 300GB of physical memory...

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Don_Burn Member - All Emails Posts: 1,760

    I haven't done this for many years, but the old approach to allocating contiguous memory or lots of memory was to create a helper driver that started early in boot time. I was successfully able to allocate a 10GB contiguous buffer on a 64GB system this way for a very dumb device.

    Peter's approach is better if the device can handle it (which it sounds like yours can), but I'm passing along the old approach in case there is a specific need.

  • DriverThomas Member Posts: 12

    I have 128GB of memory installed, and I just want the largest possible chunks of contiguous memory for each of the 8 DMA channels. It would be OK if one chunk is just 4MB, or even only 512kB, but not 4kB as it is now. One DMA channel of my card handles 1GB; with 4kB pages that is 262144 entries in the SGT, which is just too much. :(
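
    The entry count quoted above is plain arithmetic, and the same calculation shows how quickly the table shrinks with larger chunks. A small sketch in portable C; the helper name sg_entries is made up for illustration:

```c
#include <stdint.h>

/* Number of scatter/gather entries needed to describe total_bytes when
   each entry covers at most chunk_bytes (rounded up). Returns 0 for a
   zero chunk size instead of dividing by zero. */
uint64_t sg_entries(uint64_t total_bytes, uint64_t chunk_bytes)
{
    if (chunk_bytes == 0) {
        return 0;
    }
    return (total_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

    For one 1GB channel this gives 262144 entries at 4kB, 2048 entries at 512kB, and 256 entries at 4MB.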

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    @DriverThomas

    AH! I just noticed something: Try using WdfDmaProfileScatterGather64Duplex, instead of WdfDmaProfilePacket64. Big difference there.... You are ONLY doing your DMA to these common buffers, right?

    Just as an experiment, try starting your driver early at Boot Time, to see if (as Don suggests) there's MORE contiguous memory available at that time.

    Try, as an experiment, adding another hundred or two gig of memory to the machine... to see if that makes a difference.

    I can, however, tell you for certain: There's no secret "limit" to the amount of memory you can allocate by WdfCommonBufferCreate -- except the max is limited to (4GB - 4K)... and you can call WdfCommonBufferCreate multiple times.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • Tim_Roberts Member - All Emails Posts: 14,295

    If you're wondering about why this happens, remember that physical memory is an EXTREMELY dynamic resource. When you first boot, there are large chunks of "pure" untouched memory, which will naturally be contiguous. But even after just a few minutes of normal operation, physical memory is wildly fragmented. There's not an awful lot the system can do to coalesce those pages.
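
    A toy illustration of the fragmentation effect, in portable C: it scans a made-up page-usage array and reports the longest run of free pages. The function name and the array representation are assumptions for the example; the real memory manager keeps very different data structures.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return the length (in pages) of the longest run of consecutive free
   pages. page_in_use[i] is true when page i is allocated. */
size_t largest_free_run(const bool *page_in_use, size_t page_count)
{
    size_t best = 0;
    size_t run = 0;

    for (size_t i = 0; i < page_count; i++) {
        if (page_in_use[i]) {
            run = 0;              /* an allocated page ends the current run */
        } else if (++run > best) {
            best = run;           /* extend the run and remember the longest */
        }
    }
    return best;
}
```

    With every other page in use, half the memory is still free but the largest contiguous run is a single page, which is the situation a contiguous allocation runs into on a long-running system.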

    with 4kB pages these are 262144 entries in the SGT which is just too much

    Are you checking whether these can be consolidated? If I remember correctly, the system doesn't automatically combine adjacent pages in the S/G list.
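
    Consolidating adjacent entries could look roughly like the following portable C sketch. The sg_entry type and sg_coalesce function are hypothetical stand-ins; a real driver would walk the elements of the scatter/gather list it received from the DMA API instead, and would also need to respect any maximum element length its hardware imposes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatter/gather element. */
typedef struct {
    uint64_t addr;   /* device-visible start address */
    uint64_t len;    /* length in bytes */
} sg_entry;

/* Merge entries in place whenever one entry ends exactly where the next
   begins. Returns the new number of entries. */
size_t sg_coalesce(sg_entry *list, size_t count)
{
    if (count == 0) {
        return 0;
    }

    size_t out = 0;   /* index of the entry currently being extended */
    for (size_t i = 1; i < count; i++) {
        if (list[out].addr + list[out].len == list[i].addr) {
            list[out].len += list[i].len;   /* adjacent: extend */
        } else {
            list[++out] = list[i];          /* gap: start a new entry */
        }
    }
    return out + 1;
}
```

    How much this helps depends entirely on how many of the pages happen to be adjacent, so it reduces the table size only opportunistically.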

    Tim Roberts
    Providenza & Boekelheide, Inc.

  • DriverThomas Member Posts: 12

    I can imagine that ASLR foils my plan here -- it can be deactivated, though. I also thought about reserving the memory with a very early starting "wingman" driver -- need to find out how. :)

  • Tim_Roberts Member - All Emails Posts: 14,295

    ASLR affects virtual addresses. It is completely irrelevant here.

    Tim Roberts
    Providenza & Boekelheide, Inc.

  • DriverThomas Member Posts: 12

    I've played around a lot with WdfCommonBufferCreate, trying several chunk sizes, multiple DMA enabler handles, and so on. The furthest I got was allocating approx. 1.5GB in total, not more... A little disappointing...

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    Once you’ve used WdfDmaProfileScatterGather64Duplex, there should be no advantage to using more Enabler handles.

    You have 128GB on the machine, but can allocate only 1.5GB Common Buffer? Hmmmm…. And you’re doing this at system startup time? That is, in a word, surprising.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DriverThomas Member Posts: 12

    Ahh... Thanks to your suggestions and some reconsidering I have an interesting result. When I'm using WdfDmaEnablerSetMaximumScatterGatherElements with setting max elements to 512 or even just one (sic!) and using WdfDmaProfileScatterGather64Duplex, I can achieve my goal. But I'm a little suspicious about whether the scatter/gather table is really just one entry then. How can I check that?

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    using WdfDmaEnablerSetMaximumScatterGatherElements with setting max elements to 512 or even just one (sic!)

    I don't know what you mean here. Why would you call WdfDmaEnablerSetMaximumScatterGatherElements at all?? What you really want is "WDF_DMA_ENABLER_UNLIMITED_FRAGMENTS", which is the default... since you're not going to use ANY scatter/gather elements, the maximum number that you can handle is irrelevant.

    Are you saying if you don't call this with "1" (for example) the Common Buffer allocation you can get is DIFFERENT? Because I find that hard to believe.

    But I'm a little suspicious about whether the scatter/gather table is really just one entry then. How can I check that?

    You're allocating a Common Buffer, right? There IS no scatter/gather table... by definition. You get a single buffer, and WdfCommonBufferGetAlignedLogicalAddress gets you the address that you use for DMA for the entire expanse of that buffer. Period.
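
    A minimal sketch of that path, assuming a KMDF driver built with the WDK. This is a kernel-mode fragment, so it only compiles inside a driver project. The function name SketchCreateCommonBuffer and the 256MB CHUNK_BYTES value are hypothetical; the WDF calls are the ones named in this thread.

```c
#include <ntddk.h>
#include <wdf.h>

/* Hypothetical chunk size for illustration; pick what the hardware needs. */
#define CHUNK_BYTES ((size_t)256 * 1024 * 1024)

NTSTATUS
SketchCreateCommonBuffer(
    _In_  WDFDEVICE         Device,
    _Out_ WDFDMAENABLER    *Enabler,
    _Out_ WDFCOMMONBUFFER  *Buffer,
    _Out_ PHYSICAL_ADDRESS *DeviceAddress
    )
{
    WDF_DMA_ENABLER_CONFIG config;
    NTSTATUS status;

    /* Scatter/gather duplex profile, as suggested earlier in the thread. */
    WDF_DMA_ENABLER_CONFIG_INIT(&config,
                                WdfDmaProfileScatterGather64Duplex,
                                CHUNK_BYTES);

    status = WdfDmaEnablerCreate(Device, &config,
                                 WDF_NO_OBJECT_ATTRIBUTES, Enabler);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    /* One call yields one buffer that is contiguous from the device's
       point of view; call again for additional chunks. */
    status = WdfCommonBufferCreate(*Enabler, CHUNK_BYTES,
                                   WDF_NO_OBJECT_ATTRIBUTES, Buffer);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    /* This logical address, not MmGetPhysicalAddress, goes to the FPGA. */
    *DeviceAddress = WdfCommonBufferGetAlignedLogicalAddress(*Buffer);
    return STATUS_SUCCESS;
}
```

    By default the framework parents both objects to the device, so they are released along with it; a real driver would typically create the enabler once and reuse it for every common buffer.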

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DriverThomas Member Posts: 12

    Now I'm confused. What's the difference between WdfDmaProfileScatterGather64Duplex and WdfDmaProfilePacket64, if not handling the (full) transfer with or without scatter/gather?
    So you're saying that the allocated 1GB common buffer is always contiguous in physical memory? Maybe I don't have the whole picture yet and need to read more about the topic.

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    WdfDmaProfileScatterGather64Duplex and WdfDmaProfilePacket64

    It has to do with the Windows DMA Abstractions. If you're using a Common Buffer approach, the Framework expects you to use the WdfDmaProfileScatterGatherXXX profiles, and not the PACKET profiles. That's what they were created for.

    For some inexplicable reason, we've yet to move this article to our modern web site... but you can read about the underlying Windows architecture DMA Abstractions here.

    So you're saying that the allocated 1GB common buffer is always contiguous in physical memory

    Effectively, yes. Any Common Buffer is effectively contiguous for the purposes of DMA. And it ensures that the allocated range of memory is usable if an IOMMU (DMA remapping) is enabled as well. This is the entire purpose of a Common Buffer: you can DMA to it without worrying about setting up an S/G list on the host side.

    In some versions of Windows, the WDM function AllocateCommonBuffer simply calls MmAllocateContiguousMemory.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DriverThomas Member Posts: 12
    edited December 2021

    Thanks! I have read a lot about it, especially your linked article, and things are much clearer now. :smile:

    So the way to handle the common buffer in user space would finally be: WdfCommonBufferCreate --> WdfCommonBufferGetAlignedVirtualAddress --> IoAllocateMdl --> MmBuildMdlForNonPagedPool --> MmMapLockedPagesSpecifyCache (User mode) (with EvtIoInCallerContext callback), correct?
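
    The sequence above can be sketched as follows, assuming a KMDF driver built with the WDK (a kernel-mode fragment, not a standalone program). The function name SketchMapToUser, the cached mapping type, and the choice to return NULL on failure are assumptions for illustration. It must run in the context of the process that should receive the mapping, which is why the EvtIoInCallerContext callback is mentioned, and it deliberately leaves out the unmap side.

```c
#include <ntddk.h>
#include <wdf.h>

/* Map an existing common buffer into the current process.
   Returns the user-mode address, or NULL on failure. On success the
   caller owns *MdlOut and must later call MmUnmapLockedPages and
   IoFreeMdl in the right process context. */
PVOID
SketchMapToUser(
    _In_  WDFCOMMONBUFFER Buffer,
    _In_  ULONG           Length,
    _Out_ PMDL           *MdlOut
    )
{
    PVOID kernelVa = WdfCommonBufferGetAlignedVirtualAddress(Buffer);
    PVOID userVa = NULL;
    PMDL mdl;

    *MdlOut = NULL;

    mdl = IoAllocateMdl(kernelVa, Length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return NULL;
    }

    /* Fill in the page frame numbers for the non-paged buffer. */
    MmBuildMdlForNonPagedPool(mdl);

    /* A UserMode mapping raises an exception on failure instead of
       returning NULL, so the call is wrapped in __try/__except. */
    __try {
        userVa = MmMapLockedPagesSpecifyCache(
                     mdl, UserMode, MmCached, NULL, FALSE,
                     NormalPagePriority | MdlMappingNoExecute);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        userVa = NULL;
    }

    if (userVa == NULL) {
        IoFreeMdl(mdl);
        return NULL;
    }

    *MdlOut = mdl;
    return userVa;
}
```

    Because Length is a ULONG, one MDL can describe only a bounded range, so several large common buffers would each need their own MDL and mapping.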

  • Peter_Viscarola_(OSR) Administrator Posts: 8,963

    That is correct. But (and this is a big but)… when, where, and how do you do the unmap operation?

    There are reasonably serious security complications involved in handling the UN-mapping. Ordinarily, you’d just do it in cleanup and be happy, because cleanup is called in the context of the process that’s calling close on the handle. But this raises the issue of what happens if the handle has been duplicated into another process. NOW, if the original process doesn’t happen to be the last to call CloseHandle, you’re screwed. At best the system crashes, at worst the offending process has access to a large chunk of what is now “random” memory.

    You must deal with this somehow. Ten years ago, I used to argue (here in NTDEV, even) that this problem was so esoteric that it really wasn’t worth fixing. Now, in the 21st Century, where we have no end to cyber-attacks… you really cannot just ignore it.

    All the ways to deal with this are beyond the scope of this post (I’ll write a proper article someday)… but you can either (alternative A) take a reference on the “original” mapping process, then check in cleanup whether the original mapper is the one calling cleanup. If it’s not, call KeStackAttachProcess to get into the original mapper’s context and do the unmap there. Alternative B is to register a process notify routine with PsSetCreateProcessNotifyRoutine and do the unmap when the original process exits.

    Either way you must handle this.

    Peter

    Peter Viscarola
    OSR
    @OSRDrivers

  • DriverThomas Member Posts: 12

    Ok thank you for your help and hints. I will try to implement a fully working composition with my FPGA guys to get a loopback design running. Will take some time, though. I wish I had more FPGA colleagues on my site. ;)
