DMA to SysMem using Large Pages

Newbie - so please bear with me. I have read some of the other discussions on this topic, but I feel they have not completely answered my question, at least.


Windows 10 system. I have user code that allocates 4GB of Large Page memory (VirtualAlloc). No problems there.
I have an FPGA that is DMA’ing data based on an MDL I program it with.

  • Data goes from FPGA → SysMem; no plan to go from SysMem → FPGA.
  • The FPGA is assuming 64KB page sizes.

Once the data is in SysMem, the user code will process that data.

  • Basically, the 4GB buffer is intended to be a type of circular buffer.
  • FPGA generated interrupts will signal user code via the FPGA kernel driver when new data is available.
  • But let’s put this piece aside for now; I want to focus on how to properly DMA the data into the 4GB buffer.

Since the 4GB is contiguous, and 64KB evenly divides 2MB, my plan (sketched just after this list) is to:

  1. Create a page-table of virtual addresses that are in increments of 64KB.
    – 4GiB / 64KiB = 65,536 entries.
  2. Have my FPGA kernel driver convert the virtual addresses to physical addresses (using MmGetPhysicalAddress).
  3. Program the FPGA with the “translated” physical addresses.
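
A bare sketch of what I have in mind, with made-up names, just to show the idea:

#include <ntddk.h>

#define CHUNK_SIZE (64 * 1024)

/* Made-up helper: walk the user-mode large-page buffer in 64KB steps and record
   the physical address of each chunk. This assumes we are running in the context
   of the process that owns the buffer, and that the MEM_LARGE_PAGES allocation is
   resident (large-page memory is not paged out), so MmGetPhysicalAddress has
   something valid to translate. */
VOID
BuildChunkTable(
    _In_  PVOID            UserBuffer,   /* VA returned by VirtualAlloc          */
    _In_  SIZE_T           BufferBytes,  /* 4GB in my case                       */
    _Out_ PHYSICAL_ADDRESS *ChunkTable)  /* BufferBytes / CHUNK_SIZE entries     */
{
    SIZE_T i;
    SIZE_T chunkCount = BufferBytes / CHUNK_SIZE;   /* 65,536 entries for 4GiB */

    for (i = 0; i < chunkCount; i++) {
        PUCHAR va = (PUCHAR)UserBuffer + (i * CHUNK_SIZE);
        ChunkTable[i] = MmGetPhysicalAddress(va);
    }
    /* ChunkTable[] is what I would then write into the FPGA's descriptor list. */
}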


    However, reading some other discussions on the topic of DMA, this is not the right way of doing this - even if I get lucky and this happens to work.


    Question: I have a pointer (virtual address) to my 4GB buffer of Large Page memory. What is the correct way of:
  • Breaking up the 4GB buffer into 64KB “chunks”.
  • Converting the virtual addresses into physical addresses.
  • Setting up the physical addresses to be DMA recipients.

Thank you, all, in advance.
OneKneeToe

this is not the right way of doing this - even if I get lucky and this happens to work.

Yes. What you’ve described might work, it might not work, but in any case it’s wrong architecturally.

Soooo… What do you mean that the FPGA is "assuming 64KB page sizes"? The size of a page is actually 4K, not 64K.

In any case, this can get a lot more complicated than what I’m likely to answer in a forum. But, I’ll try:

  1. You need to use the DMA APIs to do what you want. So, WdfDmaEnablerCreate, then create a WDF Transaction. Have the user-mode app send the region to you using Direct I/O. Keep that Direct I/O request “hanging” (in progress)… and use that segment that you’ve created with VirtualAlloc as the communication region between the app and the driver. (There’s a bare-bones sketch of this just after this list.)

  2. ALTERNATIVELY you could create a common buffer (in your driver) from which to do the DMA. This would guarantee you a segment of memory that’s contiguous for the purposes of DMA (which is not necessarily the same as being physically contiguous). You’ll get back both a Kernel Virtual Address and a “physical address” that’s usable for DMA (what we could call a Device Bus Logical Address). You can then “do the math” and program your device appropriately with physical page pointers using the Device Bus Logical Address… you can also map the segment into the virtual address space of the User Application using MmMapLockedPagesSpecifyCache. (This one is sketched below as well.)
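
Here’s a bare-bones sketch of approach 1 (the WDF DMA transaction path), just to give it some shape. The context type, callback names, and the maximum transfer length are placeholders I made up, and most error handling is elided; it’s the pattern that matters, not the particulars:

#include <ntddk.h>
#include <wdf.h>

typedef struct _MY_DEVICE_CONTEXT {
    WDFDMAENABLER DmaEnabler;
} MY_DEVICE_CONTEXT, *PMY_DEVICE_CONTEXT;

WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(MY_DEVICE_CONTEXT, MyGetDeviceContext);

EVT_WDF_IO_QUEUE_IO_READ MyEvtIoRead;
EVT_WDF_PROGRAM_DMA MyEvtProgramDma;

/* Called once, early in device setup, after the device object exists. */
NTSTATUS
MyCreateDmaEnabler(
    _In_ WDFDEVICE Device)
{
    WDF_DMA_ENABLER_CONFIG dmaConfig;
    PMY_DEVICE_CONTEXT ctx = MyGetDeviceContext(Device);

    /* 64-bit scatter/gather profile; the maximum transfer length is a design choice. */
    WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                                WdfDmaProfileScatterGather64,
                                64 * 1024 * 1024);

    return WdfDmaEnablerCreate(Device, &dmaConfig,
                               WDF_NO_OBJECT_ATTRIBUTES, &ctx->DmaEnabler);
}

/* EvtIoRead for a queue configured for Direct I/O: the read the app keeps
   "hanging" carries the locked-down user buffer we DMA into. */
VOID
MyEvtIoRead(
    _In_ WDFQUEUE Queue,
    _In_ WDFREQUEST Request,
    _In_ size_t Length)
{
    PMY_DEVICE_CONTEXT ctx = MyGetDeviceContext(WdfIoQueueGetDevice(Queue));
    WDFDMATRANSACTION transaction;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(Length);

    status = WdfDmaTransactionCreate(ctx->DmaEnabler,
                                     WDF_NO_OBJECT_ATTRIBUTES, &transaction);
    if (!NT_SUCCESS(status)) {
        WdfRequestComplete(Request, status);
        return;
    }

    /* Use the request's buffer as the DMA target (device-to-host). */
    status = WdfDmaTransactionInitializeUsingRequest(transaction, Request,
                                                     MyEvtProgramDma,
                                                     WdfDmaDirectionReadFromDevice);
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionExecute(transaction, NULL);
    }

    if (!NT_SUCCESS(status)) {
        WdfObjectDelete(transaction);
        WdfRequestComplete(Request, status);
    }
}

/* The framework hands you the device bus logical scatter/gather list here;
   this is where the FPGA's descriptors would actually get programmed. */
BOOLEAN
MyEvtProgramDma(
    _In_ WDFDMATRANSACTION Transaction,
    _In_ WDFDEVICE Device,
    _In_ WDFCONTEXT Context,
    _In_ WDF_DMA_DIRECTION Direction,
    _In_ PSCATTER_GATHER_LIST SgList)
{
    ULONG i;

    UNREFERENCED_PARAMETER(Transaction);
    UNREFERENCED_PARAMETER(Device);
    UNREFERENCED_PARAMETER(Context);
    UNREFERENCED_PARAMETER(Direction);

    for (i = 0; i < SgList->NumberOfElements; i++) {
        /* SgList->Elements[i].Address is what you give the device;
           SgList->Elements[i].Length is that fragment's length. */
    }
    return TRUE;  /* the transfer has been started on the hardware */
}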
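
And here’s the equivalent skeleton for approach 2 (the common buffer path). Again, the names are made up, and whether you can actually get a common buffer this large is its own question:

#include <ntddk.h>
#include <wdf.h>

/* Names and the (enormous) size are illustrative only. */
NTSTATUS
MyCreateCommonBuffer(
    _In_  WDFDMAENABLER    DmaEnabler,     /* from WdfDmaEnablerCreate           */
    _In_  size_t           BufferBytes,    /* e.g. 4ULL * 1024 * 1024 * 1024     */
    _Out_ WDFCOMMONBUFFER  *CommonBuffer,
    _Out_ PVOID            *KernelVa,      /* the driver's view of the buffer    */
    _Out_ PHYSICAL_ADDRESS *DeviceLogical) /* what the FPGA gets programmed with */
{
    NTSTATUS status;

    status = WdfCommonBufferCreate(DmaEnabler, BufferBytes,
                                   WDF_NO_OBJECT_ATTRIBUTES, CommonBuffer);
    if (!NT_SUCCESS(status)) {
        return status;
    }

    *KernelVa      = WdfCommonBufferGetAlignedVirtualAddress(*CommonBuffer);
    *DeviceLogical = WdfCommonBufferGetAlignedLogicalAddress(*CommonBuffer);

    /* "Do the math" from here: DeviceLogical->QuadPart + (n * 64 * 1024) is the
       device bus logical address of the n-th 64KB chunk. Mapping the buffer into
       the app (an MDL over KernelVa plus MmMapLockedPagesSpecifyCache) is a
       separate step with its own lifetime and security pitfalls. */
    return STATUS_SUCCESS;
}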

I would not recommend either of these approaches for a new Windows driver dev. But, I guess you have to start somewhere…

Peter

Thank you Peter:

@“Peter_Viscarola_(OSR)” said:
Soooo… What do you mean that the FPGA is "assuming 64KB page sizes"? The size of a page is actually 4K, not 64K.

The FPGA is setup to DMA data to a GPU, whose page size is 64KB.

  • To avoid changing the FPGA, allow the FPGA to continue to DMA using 64KB pages.
  • Instead of giving it addresses for the GPU, give it addresses for SysMem.

By using Windows Large Pages, the page size on Windows for that region of memory is 2MB.

  • We don’t take full advantage of the larger 2MB page size, but we are still more efficient than with the 4KB page size.

That is my, naive, thinking anyway…

With approach 1, at what point would I get the physical address to program the FPGA with?

  • I understand that user-code has to continue to use the virtual address.
  • I still need physical address to configure the FPGA so it can start DMA’ing.

Much respect for driver developers - this subject matter is quite an art-form.

Thank you, again, Peter.

OneKneeToe (Naive Novice)

I think we’re not on the same page (ha! Joke there!)… what does “page size” mean to you, please, and how does it impact this project in your view?

When we speak of page size, we typically talk about the allocation and memory-mapping granularity. So, for example, you have a 32K data buffer… that buffer could comprise as many as 9 discrete fragments, because the page size in Windows is 4K (and the buffer may not be aligned on an exact page boundary): a page-aligned 32K buffer spans exactly 32K / 4K = 8 pages, but one that starts mid-page spills into a ninth.

Large pages are fine… but they’re an aside, and not entirely relevant in this situation, except for the fact that if you allocate a buffer that’s within a single large page it will be physically contiguous. Which is nice, because when you get your scatter/gather list you’ll have one big fragment, right?

The things that are important here are the DMA capabilities of your device (does it support scatter/gather, is there a max number of fragments) and how you want your user to interact with your device.

Then you can choose your driver design.

Before we go much further down this road, a question: Is this a hobby or school project?

P

Hello Peter:

@“Peter_Viscarola_(OSR)” said:
I think we’re not on the same page (ha! Joke there!)… what does “page size” mean to you, please…

hahaha - Page Fault…

A Page is a contiguous block of memory. When memory is allocated (reserved), it is allocated in Pages (blocks). The Page size, then, defines how big these blocks are.

When memory that is larger than a page is allocated, that larger allocation may not necessarily be contiguous in physical memory, from page to page, due to fragmentation, for example. It will be contiguous in virtual memory, however.

…, and how does it impact this project in your view?

Here the page size matters since the FPGA is working with the understanding that its MDL is pointing to 64KB pages (contiguous blocks of memory). If, however, on Windows a page is 4KB, then even though enough memory has been allocated, it may not be contiguous from one page to another. The FPGA will attempt to DMA 64KB in consecutive order and bad things can start to happen.

Since I am using Large Pages, I have 2MB of contiguous memory. 64KB evenly divides 2MB (2MB / 64KB = 32 chunks per large page), so I can split up the 2MB into chunks of 64KB and the FPGA is safe to DMA into those smaller 64KB chunks (Pages to the FPGA).

The things that are important here are the DMA capabilities of your device (does it support scatter/gather, is there a max number of fragments) and how you want your user to interact with your device.

I am not an FPGA person, but I do know that it does support scatter/gather, and I believe this is how the MDL is being used - to populate the scatter/gather list (but I could be wrong).

  • (ready for a face-palm) In the context of just moving data, I realize the benefits of S/G over a common buffer, but why does the “method” of DMA’ing matter on the system side?
  • If the physical memory has already been allocated, the memory is there for the FPGA to DMA into.
  • Once the data is in Physical Memory, User SW, via Virtual Memory, has access.
  • Do we even need the driver in the mix, with regards to DMA, at this point?
  • The FPGA Driver will signal when new data is available and at what physical address.

Before we go much further down this road, a question: Is this a hobby or school project?

This is technically “work related” in that I am using work resources (software, hardware, computers, lab, etc.). But I am working, “on my own time”, on this mini side-project / hobby.

One additional note:

  • The FPGA generates data on its own.
  • Once I program it with an MDL and I set a flag in one of its registers, the FPGA starts DMA’ing test data.
  • It also signals an interrupt after DMA’ing 64MB (every 64MB, I get an interrupt).

Thank you, Peter.

OneKneeToe

So, for example, you have a 32K data buffer… that buffer could comprise as many as 9 discrete fragments, because the page size in Windows is 4K (and the buffer may not be aligned on an exact page boundary).

???

What’s wrong with you today, Peter???

First of all, the page size has absolutely nothing to do with software matters - it is architecturally defined by the CPU itself. The only thing that the OS software can decide upon is the particular page size that it may choose out of the options that the given CPU architecture offers. On x86 and x86_64 that would mean choosing between large and small pages.

Furthermore, a physical page is simply bound to be aligned on a page-size boundary. Just think about how the virtual-to-physical address translation actually works. For the sake of simplicity let’s assume a 32-bit x86-based system with no support for PAE, i.e. with only two levels of translation.

First the CPU gets the physical address of the page directory from the 32-bit CR3 register. Since only 20 bits of this register are used for storing the page directory address, it is obvious that the page directory has to be aligned on a 4096-byte boundary. Otherwise, if any of the 12 lower bits of this address were allowed to be non-zero, 20 bits would simply be insufficient for storing the address of the page directory.

Then the CPU reads the corresponding page directory entry, which is,again, a 32-bit value that stores the physical address of the page table in its higher 20 bits. Therefore, for the reasons explained above, the page table has to be aligned on the the 4096-byte boundary as well. Finally, the CPU reads the corresponding page table entry, which is, again, a 32-bit value that stores the physical address of the target page in its higher 20 bits. Therefore, again, the actual physical page has to be aligned on the 4096-byte boundary…

Anton Bassov

Why are you explaining this to me, Mr Bassov?

  1. When nothing you’re saying has anything to do with DMA
  2. When you clearly know that I must know how memory management works
  3. When nothing you wrote contradicts anything I said
  4. When nothing you wrote adds to the discussion?

We’re not discussing software or memory management. We’re discussing DMA. Go back to sleep and let the grown ups have a conversation.

Peter

the FPGA is working with the understanding that its MDL

Well, an MDL is a Windows OS concept, not something your FPGA has. So, I think you’re talking about the SGL (Scatter/Gather List) used by the FPGA… and that it assumes each pointer it receives points to a physically contiguous segment that’s 64K long? That’s a mighty unusual piece of hardware… and I’ve seen a few DMA devices in my time. How about fragments that are less than 64K? Are fragments assumed to start on a 64K boundary?

I have to apologize, but I’m afraid this is just too large a range of topics for me to work through in a forum with you. First, we need to agree on our terminology and on the basics of how DMA works in general. Then we need to decide how we want data to be shared between the app and the device, balancing this with how we might implement this in a Windows Driver and the various trade-offs involved. It’s particularly difficult to do this in short posts.

The best I can do at this point is to point you back to my first reply to you for the basics of how to approach the implementation question.

Maybe one of our other members here is better able to work you through these issues than I am.

If you wanna come up, buy me a coffee, and discuss this on my white board, my office is on the sixth floor, between Dan’s office and Scott’s office. :smiley: But that’s the best I can offer at this point.

Peter

When nothing you’re saying has anything to do with DMA

Hold on - if we are speaking strictly about DMA, why do we even have to make ANY reference to pages and their sizes in the first place??? Let’s face it - physical memory (i.e. something that DMA deals with) has no concept of a page and its size (for the sake of simplicity let’s keep the IOMMU out of this discussion), does it…

When you clearly know that I must know how memory management works

Actually, this is EXACTLY the reason why I was so shocked when I read your previous post. In fact, I just could not believe my own eyes when I was reading it…

When nothing you wrote contradicts anything I said

Well, as long as we are speaking about a 32K buffer that happens to be virtually aligned on the page boundary, the immediate implication of my previous post is that such a buffer may not be comprised of more than 8 fragments. Certainly, we can decide to be perverse enough to assume that our target 32K range is just a part of a larger one and that, for reasons that are not so obvious even to ourselves (apparently, we are just trying to show that a 32K buffer may potentially be comprised of 9 fragments, right?), we have decided to make it non-aligned. In such a case there seems to be nothing that stands in our way, at least from the technical standpoint - we can probe and lock the pages of an MDL that describes the entire buffer, map it to the system address space, and then use the IoBuildPartialMdl() - MmGetSystemAddressForMdlSafe() sequence in order to get a partial 32K buffer that is not aligned on the page boundary, effectively occupying 9 fragments…
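
Something along these lines - just a sketch, with made-up names and the cleanup/unmap path omitted:

#include <ntddk.h>

/* Build an MDL for the whole user buffer, lock it, and then describe a 32K
   sub-range that deliberately starts in the middle of a page, so that the
   partial MDL ends up spanning 9 physical pages. */
NTSTATUS
MapUnalignedPartial(
    _In_ PVOID UserBuffer,      /* user-mode VA of the entire buffer          */
    _In_ ULONG TotalLength,
    _In_ ULONG PartialOffset,   /* deliberately NOT a multiple of PAGE_SIZE   */
    _In_ ULONG PartialLength)   /* e.g. 32 * 1024                             */
{
    PMDL   fullMdl, partialMdl;
    PVOID  partialVa;
    PUCHAR partialStart = (PUCHAR)UserBuffer + PartialOffset;

    fullMdl = IoAllocateMdl(UserBuffer, TotalLength, FALSE, FALSE, NULL);
    if (fullMdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        MmProbeAndLockPages(fullMdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(fullMdl);
        return GetExceptionCode();
    }

    partialMdl = IoAllocateMdl(partialStart, PartialLength, FALSE, FALSE, NULL);
    if (partialMdl == NULL) {
        MmUnlockPages(fullMdl);
        IoFreeMdl(fullMdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    /* Describe the unaligned 32K sub-range of the already-locked buffer ... */
    IoBuildPartialMdl(fullMdl, partialMdl, partialStart, PartialLength);

    /* ... and map it into the system address space. */
    partialVa = MmGetSystemAddressForMdlSafe(partialMdl, NormalPagePriority);

    /* Cleanup (unmapping, IoFreeMdl, MmUnlockPages) omitted for brevity. */
    return (partialVa != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
}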

Anton Bassov

Peter_Viscarola_(OSR)

does it support scatter/gather, is there a max number of fragments
I saw one suspicious statement:

OneKneeToe

4GB is contiguous

What does that mean? It’s hard to have 4GB of contiguous physical memory.

And I suspect that we don’t have a classical PCI bus master with a scatter-gather model. Looks like old ISA DMA with a permanent ring-buffer model, as in old sound cards.

as long as we are speaking about a 32K buffer that happens to be virtually aligned on the page boundary

Nobody said anything about the buffer being aligned. You are holding forth at great length about things that are not relevant to this thread, apparently in a vain attempt to show us how much you know. Please stop. You’re ruining your record of several contiguous weeks with almost universally helpful posts (or math errors).

Peter

It’s hard to have 4GB of contiguous physical memory.

Not so much anymore. My desktop machine today has 32GB of RAM. How about on a server with 100GB+ of memory… which are increasingly common?

Looks like old ISA DMA with a permanent ring-buffer model, as in old sound cards.

Nah, he’s got an FPGA that does… something. The IP blocks in these cards can sometimes do pretty funky shit, limited only by the imagination of the designer. Note, also, that he mentioned the device was initially intended to support peer-to-peer transfers and he wants to change that to device-to-host. That’s why I suspect he’s conflating page size with DMA fragment size or fragment limit. You don’t worry about non-contiguous memory during peer-to-peer transfers. You do when you need to fill a host buffer.

Peter

Nobody said anything about the buffer being aligned.

Probably, just because it is assumed to be the case by default, and, hence, is not worth being mentioned explicitly?

You are holding forth at great length about things that are not relevant to this thread, apparently in a vain attempt
to show us how much you know.

Oh, come on…

Please stop.

OK, I’ll shut up now - I don’t really want to test the “scary” (at least for me) functionality of this hosting platform, do I…
Luckily for me, “The Hanging Judge’s” dog does not seem to have broken wind so far, but, according to the post below, it may happen at any moment, which, according to the same post, may have truly disastrous consequences as far as Mr. Bassov is concerned…

https://community.osr.com/discussion/comment/291454#Comment_291454

BTW, while we are at it, I am reading “The Book of Why” by Judea Pearl and Dana Mackenzie right now. The authors don’t seem to be NTDEV regulars, so they don’t seem to realise that a ‘cause-effect’ relationship can get THAT complex…

… (or math errors).

OMG…

I suspect that the “mathematical breakthrough” that I made on the “Increase Minimum Timer Resolution” thread as recently as yesterday is going to haunt me for quite a while…

Anton Bassov

Probably, just because it is assumed to be the case by default, and, hence, is not worth being mentioned explicitly?

Huh? When a user allocates a data buffer, and there’s a DMA operation that takes place using that data buffer, there is no “by default” page alignment. Users do all sorts of stuff.

Just cuz the OP said they allocated the buffer using VirtualAlloc, and specified MEM_LARGE_PAGES, doesn’t mean that he or she will send this buffer to their driver starting at offset zero.

Luckily for me, “The Hanging Judge’s” dog does not seem to have broken wind so far

they don’t seem to realise that a ‘cause-effect’ relationship can get THAT complex

Even more complex than you imagine! Said dog, since the time of the referenced posting, has reached the end of his journey here on earth. For him to create the criteria triggering your banning would be an interesting occurrence indeed.

Peter

Just cuz the OP said they allocated the buffer using VirtualAlloc, and specified MEM_LARGE_PAGES, doesn’t mean that he
or she will send this buffer to their driver starting at offset zero.

Well, they don’t have to send it at offset zero in order to ensure that the whole thing fits into 8 fragments - as long as the buffer is aligned on the page boundary everything is going to fit into 8 fragments no matter what and no matter how. The only situation where this does not apply is when the user specifies the start of the buffer somewhere in the middle of a page. In that particular case you may, indeed, need 9 fragments to describe a 32K buffer.

I know that userland code may do any kind of thing, but in this particular case the whole thing seems to be designed specifically for the purposes of the particular application. Therefore, everything is coupled pretty tightly, with the app and the driver being developed by exactly the same people, so some assumptions may be made. The OP made it clear that he breaks his 4G buffer into 64K-sized virtual chunks (in his words, “in 64K increments”), which means their virtual addresses are going to be aligned on the page boundary.

Said dog, since the time of the referenced posting, has reached the end of his journey here on earth.

Sorry, Peter - I just did not know it. My condolences…

Anton Bassov