DMA - Exposing the common buffer to user SW

@“Peter_Viscarola_(OSR)”
@Tim_Roberts
(Tim: I was in the middle of writing when you posted - Not a dumb question at all :smiley: )

I found my answer, after some lunchtime googling:

https://osr.com/blog/2014/04/15/evtioincallercontext-callback-called-even-io-operations-dont-queue/

… The other point of view was getting what are essentially unexpected (and, to your driver, unsupported) Requests in EvtIoInCallerContext was an annoyance…

As Tim pointed out above, EvtIoInCallerContext was being called for requests other than mine, so MmMapLockedPagesSpecifyCache could end up executing in a context that was not my User SW's.

To ensure MmMapLockedPagesSpecifyCache executes only for my IOCTL_GET_DMA_USER_ADDRESS request, I added the below at the beginning of EvtIoInCallerContext:

VOID
EvtIoInCallerContext(
    WDFDEVICE device,
    WDFREQUEST request
    )
{
    WDF_REQUEST_PARAMETERS requestParams;

    WDF_REQUEST_PARAMETERS_INIT( &requestParams );
    WdfRequestGetParameters( request, &requestParams );

    // Anything that is not our mapping IOCTL goes straight to the default
    // queue. (Checking the request type first matters: for non-IOCTL
    // requests the DeviceIoControl union member is not valid.)
    if( ( requestParams.Type != WdfRequestTypeDeviceControl ) ||
        ( IOCTL_GET_DMA_USER_ADDRESS != requestParams.Parameters.DeviceIoControl.IoControlCode ) )
    {
        WdfDeviceEnqueueRequest( device, request );
        return;
    }
    ...
    ... (see previous posts for more details)
    ...
    WdfDeviceEnqueueRequest( device, request );
}
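(For completeness: the callback itself gets registered on the WDFDEVICE_INIT before WdfDeviceCreate, along these lines:)

// In EvtDriverDeviceAdd, before the call to WdfDeviceCreate:
WdfDeviceInitSetIoInCallerContextCallback( deviceInit,
                                           EvtIoInCallerContext );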

I re-ran my driver and the trace prints are firing, now with data after the FPGA was configured to run (not sure about the quality of the data, but that’s a future topic).

Thank you Tim & Peter

Juan

On to the next “battle” and maybe one of the “hazard” areas of this approach - unmapping.
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-mmunmaplockedpages

Note that if the call to MmMapLockedPages or MmMapLockedPagesSpecifyCache specified user mode, the caller must be in the context of the original process before calling MmUnmapLockedPages. This is because the unmapping operation occurs in the context of the calling process, and, if the context is incorrect, the unmapping operation could delete the address range of a random process.

I now need to add an IOCTL to support unmapping. Without this, if I restart my User SW, the Virtual User Address is not valid for this new instance (as @Tim_Roberts pointed out previously).

And the hazard - What happens if my User SW crashes or closes without unmapping? Would I even be able to perform another mapping when a previous one already exists?

For now, I will march ahead and add an IOCTL to support unmapping. (Though, I will give VirtualAlloc with Large Pages another try later on.)

Juan

At the moment, I am looking at allocating 500MB, but may want to up that to 1GB or even 2GB later. The data size can vary,
but at the moment I have the FPGA configured to push 64MB packets, 2 packets a second. But the size and rate can also change.

I need the guaranteed contiguous logical addresses provided by a common buffer.
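For reference, the allocation itself is the standard WDF common-buffer pattern. A sketch of roughly what I am doing (assuming the DMA enabler is created in EvtDevicePrepareHardware; DMA_BUFFER_SIZE is my own constant):

// Inside EvtDevicePrepareHardware, once per device:
WDF_DMA_ENABLER_CONFIG dmaConfig;
WDFDMAENABLER          dmaEnabler;
WDFCOMMONBUFFER        commonBuffer;
NTSTATUS               status;

WDF_DMA_ENABLER_CONFIG_INIT( &dmaConfig,
                             WdfDmaProfileScatterGather64,
                             DMA_BUFFER_SIZE );            // e.g., 500MB

status = WdfDmaEnablerCreate( device, &dmaConfig,
                              WDF_NO_OBJECT_ATTRIBUTES, &dmaEnabler );
if( !NT_SUCCESS( status ) ) return status;

// One block, contiguous in logical (device-bus) address space:
status = WdfCommonBufferCreate( dmaEnabler, DMA_BUFFER_SIZE,
                                WDF_NO_OBJECT_ATTRIBUTES, &commonBuffer );
if( !NT_SUCCESS( status ) ) return status;

PVOID            kernelVa = WdfCommonBufferGetAlignedVirtualAddress( commonBuffer );
PHYSICAL_ADDRESS deviceLa = WdfCommonBufferGetAlignedLogicalAddress( commonBuffer );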

Before you proceed with your “conquest” any further, I would recommend you take the following points into consideration.

If the target system does not support an IOMMU, the ‘(logical_address==physical_address)’ statement is always going to
evaluate to TRUE. Don’t you see a potential problem with finding a physically contiguous buffer in the GB range???

Certainly, as long as the target machine is equipped with an IOMMU, the system is in a position to present a physically non-contiguous buffer as
a logically contiguous one to your device. However, you should bear in mind that not every machine in existence supports VT-d or AMD-Vi.

For example, if you check the link below, you will see that assuming IOMMU presence on the target machine is way too bold:

https://en.wikipedia.org/wiki/List_of_IOMMU-supporting_hardware

As you can see for yourself, the list is not THAT long.

Certainly, it is up to you to make the decision, but I think you may find it frustrating to discover that the design you have
spent so much time and effort on is, for all practical purposes, simply infeasible…

Anton Bassov

I now need to add an IOCTL to support unmapping. Without this, if I restart my User SW, the Virtual User Address is not valid for this new instance (as @Tim_Roberts pointed out previously).

And the hazard - What happens if my User SW crashes or closes without unmapping?

Right. This is why you have to handle EvtFileCleanup / EvtFileClose events, so you can clean up your dirty work when the app crashes or closes without cleanly shutting down. You CANNOT rely upon the application to clean up for you. You need to assume that all application writers are bozos, and malicious bozos at that.

Would I even be able to perform another mapping when a previous one already exists?

Not as the driver is currently written, but that’s a driver problem, not an architectural problem. The operating system doesn’t care how many times you map a piece of memory, but your driver certainly does.

Juan, just to be clear: you’re getting further and further down the road into a design that is going to wind up either having serious edge-condition issues with security implications… OR one that’s going to need some clever code to take these edge conditions into account and handle them. It’s a lot to ask to try to get this right one forum post at a time, without a good background in Windows kernel-mode software.

So… think about it. Do you really want to be doing this? You’re venturing into an area that I advise the students in my Advanced WDF seminar to avoid.

Having said that, Mr. Roberts is right on target: You need to do the Unmap operation in your EvtFileCleanup Event Processing Callback. This is called in the context of the process that called CloseHandle. You don’t need or want a separate IOCTL for this. If you do it right, it’ll handle the unmap “automatically” even during abnormal thread termination.
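In sketch form, assuming the mapping IOCTL stashed the user VA and the MDL in a context attached to the file object (the context layout and accessor names here are illustrative only, not from Juan’s driver):

// Per-file-object context; attach it via WdfDeviceInitSetFileObjectConfig.
typedef struct _FILE_CONTEXT {
    PVOID UserVa;   // user VA returned by MmMapLockedPagesSpecifyCache
    PMDL  Mdl;      // MDL built over the common buffer
} FILE_CONTEXT, *PFILE_CONTEXT;

WDF_DECLARE_CONTEXT_TYPE_WITH_NAME( FILE_CONTEXT, GetFileContext );

VOID
EvtFileCleanup(
    WDFFILEOBJECT fileObject
    )
{
    PFILE_CONTEXT ctx = GetFileContext( fileObject );

    if( ctx->UserVa != NULL )
    {
        // Cleanup arrives in the context of the process that owned the
        // handle, which satisfies the MmUnmapLockedPages rule quoted earlier.
        MmUnmapLockedPages( ctx->UserVa, ctx->Mdl );
        IoFreeMdl( ctx->Mdl );
        ctx->UserVa = NULL;
        ctx->Mdl    = NULL;
    }
}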

Peter

Hello @anton_bassov

Yes, being able to secure GB(s) of physically contiguous memory is a concern, and one reason for having the driver perform the allocation: I assume (maybe wrongly) that the driver will have a better chance at getting that memory during start-up than User SW.

The alternative is using VirtualAlloc Large Pages in User SW. The Large Page size would be a multiple of the FPGA page size. This way, even if the logical pages are not contiguous, one+ FPGA pages can be DMAd into one System Large Page.

I know I am not completely in the clear with Large Pages as the system will still need to find enough contiguous memory for all the Large Pages needed to make up my desired buffer size. Since this is all happening during system start-up, and with 32GB of physical RAM, I would hope this is not a problem.
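Something like this is what I have in mind for the user-mode side (a sketch; note that MEM_LARGE_PAGES requires SeLockMemoryPrivilege to be enabled on the process token first, which I am not showing):

#include <windows.h>

PVOID AllocateLargePageBuffer( SIZE_T desiredSize )
{
    // SeLockMemoryPrivilege must already be enabled, or VirtualAlloc
    // fails with ERROR_PRIVILEGE_NOT_HELD.
    SIZE_T largePage = GetLargePageMinimum();   // typically 2MB on x64

    // Size must be a multiple of the large-page minimum; round up.
    SIZE_T size = (desiredSize + largePage - 1) & ~(largePage - 1);

    return VirtualAlloc( NULL,
                         size,
                         MEM_RESERVE | MEM_COMMIT | MEM_LARGE_PAGES,
                         PAGE_READWRITE );
}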

FYI: when I attempted to allocate 1GB using the CommonBuffer approach, the allocation failed. I was successful in allocating a 750MB CommonBuffer, however.

Thanks, Anton.
Juan

Hello @“Peter_Viscarola_(OSR)” and @Tim_Roberts

Thank you both again for your help and patience.

Regarding the unmapping, I went ahead with the IOCTL approach, not because I disagreed with anything you two said, but because I decided to end my experiment with the CommonBuffer approach.

I had success allocating 500MB and having my FPGA DMA the data to both system memory and the second external device, ping-ponging back and forth between External and System Memory. I ended up having SW perform a copy to the external device, to allow that device to process the data.

The results were seamless, and I could not tell the difference between processed data coming directly from the FPGA or via the System Memory.

  • A little change in plans, but it was enough to prove that using System Memory as a landing spot is doable.
  • Not to mention the warnings you both gave about this approach.
  • BTW: performing a print-screen caused a BSOD - a corner case :wink:

My next step is to attempt allocating the buffer in User SW via VirtualAlloc and using Large Pages, then feed that address to the Driver. Still not the conventional approach that Peter pointed me to, but I suspect it is better than the CommonBuffer approach I am using now.

  • I suspect I would still need to use EvtFileCleanup/EvtFileClose if my SW were to end abruptly.

Thank you again! I am having a good time and learning a little bit as I go.
Juan

Hello All:

Just to close off this thread - the code snippets below are what I used in my driver to allocate a common buffer for DMA use and expose that common buffer to User SW for direct access.


[Mods: With his permission, and in no way intending to show any disrespect, we have removed the code example Mr. OneKneeToe provided here. We did this because, in our judgement, it was likely to create more issues than it solved for future devs who encounter this thread. We understand that Mr. OneKneeToe had a very specific need, and that the solution he posted here met that need for him.

Indeed, we are grateful to Mr. OneKneeToe for taking the time to “give back” to the Community by posting the code that worked for him, in his specific situation. We just don’t want people to copy it in the future without being aware of its limitations.]

The alternative is using VirtualAlloc Large Pages in User SW. The Large Page size would be a multiple of the FPGA page size.
This way, even if the logical pages are not contiguous, one+ FPGA pages can be DMAd into one System Large Page.

If you don’t mind, could you please expand on it a bit? It may simply be a case of mis-phrasing your idea here but, judging from the above statement as presented, you’ve got to learn quite a bit of system-level basics before even thinking about writing drivers…

Anton Bassov

Sigh! I’m really struggling with leaving your code examples inline in this thread… even given the time/effort you obviously put into posting them and formatting them properly.

The code you’ve posted is all really just prototype code that shows how to call some APIs… but is in no way production quality or ready for use outside a lab/testing setting.

In fact, you seem to have ignored just about every piece of advice I gave you in this thread.

I’m not encouraged.

Peter

Good Morning @“Peter_Viscarola_(OSR)”

Ignoring your advice:

When I read “Ignored”, I felt it carried a negative connotation. I like to think that I listen to your advice, look to see if I can make use of it given the task at hand, and, in this case, found that I could not take it. It definitely was not a case of “what you say doesn’t matter”. Quite the opposite. The above approach was a means to an end, if you will.

In fact, since I met my intended goal with the above, I’ve stopped work on this approach. I will attempt the VirtualAlloc approach with a special Direct I/O call that stays uncompleted for the life of the program (something along those lines).

Leave or Remove:

I would not be hurt if you were to remove the post. My intention was to give back. I did add a caution section echoing the concerns and asking readers to read the thread. Nevertheless, you have been at this far longer than I and, after all, you are a moderator.

Appreciation:

Your help and advice are appreciated. There really is no other resource out there. Documentation exists, but it usually doesn’t go into these details, and nothing beats having an experienced person to talk things through with. Thanks to you and @Tim_Roberts.

Best Regards,
Juan

I would not be hurt if you were to remove the post

Thanks for that. With your permission, I did ultimately decide that the code should be removed.

Maybe I’m overly concerned but, especially for this particular topic, which is such a commonly encountered one, I really didn’t want to have yet another solution posted that doesn’t really address any of the complex issues inherent in mapping memory into a user’s address space.

Thanks for your support,

Peter


I will attempt the VirtualAlloc approach

Well, judging from your latest post, you don’t seem to be able to tell the difference between the physical, logical and virtual addresses.

Therefore, what I would advise you to do is to take a step back, and to review the basics. I’ve got a weird feeling that, after having reviewed them, you may abandon the common buffer approach altogether, because you may realise that your target hardware is, in actuality, perfectly SG-capable.

If you do, I can assure you that you are not going to be the first poster in this NG to do so. I vaguely recall a thread where the OP was bullshitting us about his common buffer requirement for a couple of weeks or so, but then, after our explanations, eventually “discovered” that his target board was, in actuality, SG-capable…

Anton Bassov

There are very few modern devices (including FPGA IPs) that do not support S/G for moving user data, simply because physically fragmented user data buffers are “the way of the world” and IOMMUs are not universally available.

When I DO see requirements for logically contiguous blocks of memory for DMA, requiring the use of Common Buffers, this is most frequently due to the device using a “continuous” DMA scheme in which the Common Buffer holds a series of descriptors, each of which contains the logical address and length of a user data buffer fragment (and thus giving you a S/G capability). These descriptors are often grouped into one or more rings.
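To make that concrete, a purely hypothetical descriptor for such a ring might look like this (real devices define their own field widths and flags):

// Hypothetical descriptor layout for a "continuous" DMA ring.
typedef struct _DMA_DESCRIPTOR {
    UINT64 LogicalAddress;   // device-bus address of one user-buffer fragment
    UINT32 Length;           // fragment length in bytes
    UINT32 Flags;            // e.g., hardware-owned bit, end-of-packet, IRQ-on-done
} DMA_DESCRIPTOR;

// The ring itself lives in the common buffer; the fragments it points to
// can be ordinary, physically discontiguous pages of the user's buffer.
DMA_DESCRIPTOR* ring =
    (DMA_DESCRIPTOR*)WdfCommonBufferGetAlignedVirtualAddress( commonBuffer );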

When I see bus master DMA devices, FPGAs or otherwise, that do not support S/G for user data transfers they are almost always the result of a home-grown design (in which the device interface requirements somewhat exceeded the designer’s capabilities… either in terms of available time, insight, or experience). Or, sometimes, the device is a prototype or PoC that is never intended for use in production.

Peter

Hello @anton_bassov and @“Peter_Viscarola_(OSR)”

Maybe this deserves a new thread; I don’t see a DM option.

Get that dirty shoe ready:

The FPGA does support SG; however, the legacy external device the FPGA DMAs to does not. SW, by way of the external device’s driver, allocates a buffer and in return gets a Page List suitable for DMA. For the sake of discussion, let’s say the page size on the external device is 32KB. SW then passes that list on to the FPGA by programming a Page Table in the FPGA. The table has a max size of 64K entries, enough entries (pages) for 2GB. The FPGA DMAs data, a page at a time, going down the Page Table and circling back at the end. There is hand-shaking going on with SW via a Windows driver (interrupts), but that is the gist of it.

For this exercise, “all I wanted to do” (hahaha) is move this buffer from the external device to system memory. I cannot change how the FPGA works; The data path out of this FPGA is via that page table.

With the common buffer approach, since I am getting logical addresses that are guaranteed to be contiguous, it doesn’t matter that the physical page size is 4K: SW does the math from the base logical address to come up with a new list of logical addresses aligned to 32KB pages.

Going with the SG List approach, I suppose that if the list entries are contiguous and uniform, then it’s no different from the common buffer approach.

If, however, I go with VirtualAlloc, I have no guarantee that the logical addresses will be contiguous, so I would need to go with Large Pages (2MB). SW would then need to do the math to create a new list of logical addresses for the FPGA to use, splitting the logical address of each 2MB Page into 64 logical addresses for the 32KB pages the FPGA thinks the destination has.
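In code terms, the math would be something like this (a sketch; all the names are made up):

#define FPGA_PAGE_SIZE  (32 * 1024)                           // what the FPGA expects
#define LARGE_PAGE_SIZE (2 * 1024 * 1024)                     // x64 large page
#define ENTRIES_PER_LP  (LARGE_PAGE_SIZE / FPGA_PAGE_SIZE)    // = 64

// Expand one logical address per 2MB large page (from the driver's SG list)
// into the 32KB-granular table that gets programmed into the FPGA.
VOID BuildFpgaTable( const UINT64* largePageLa, ULONG largePageCount,
                     UINT64* fpgaTable )
{
    for( ULONG lp = 0; lp < largePageCount; lp++ )
    {
        for( ULONG i = 0; i < ENTRIES_PER_LP; i++ )
        {
            fpgaTable[(lp * ENTRIES_PER_LP) + i] =
                largePageLa[lp] + ((UINT64)i * FPGA_PAGE_SIZE);
        }
    }
}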

So, how far into the abyss have I fallen? How far can you throw that dirty shoe? :smiley:

Juan

Here’s an older article that talks about the two methods mentioned in this thread:

https://www.osronline.com/article.cfm?article=39.htm

That article is 20 years old… and refers to WDM.

Before editing this post, I went in to say that the article was not useful. However… I see, looking at it further, that it was updated at some point, and it at least briefly describes the duplicated handle problem. Which is good, I guess, as far as it goes.

Still, not what I would call a terrific reference in the 21st Century (having written both the original article and the update, I’m kinda in a unique position to be able to say this).

Peter

The FPGA does support SG, however, the legacy external device the FPGA DMAs to does not.

But, but, but, but, but… the “legacy external device” shouldn’t care, should it? It’s the FPGA that moves the data from the host to “internal” memory that the “legacy internal device” accesses, right? So… you only care that the buffers accessed by your “legacy external device” are contiguous. You don’t care about the host buffers, right?

Of course, you haven’t really given us an architectural description of what you’re trying to do… so we are left to guess.

Peter

If, however, I go with VirtualAlloc, I have no guarantee that the logical addresses will be contiguous, so I would need
to go with Large Pages (2MB). SW would then need to do the math to create a new list of logical addresses for the FPGA to use,
splitting the logical address of each 2MB Page into 64 logical addresses for the 32KB pages the FPGA thinks the destination has.

Assuming that your FPGA really, really, really needs to do the transfers in 32K chunks that are physically contiguous, this approach, indeed, makes perfect sense. At the end of the day, it is much easier to find a contiguous 2M buffer than a 2G one, right?
In fact, in order to improve your chances of success, you can even try splitting your allocation into multiple 2M chunks,
with each of them backed by a single large page, rather than allocating a single buffer.

However, for the reasons that have been already mentioned by Peter, I seriously doubt that “contiguous 32K chunks is an absolute must” requirement is justified in your case…

Anton Bassov

There is a limit on the page table in the FPGA. If the pages are 4K, then 2GB / 4K = 512K pages, which exceeds the 64K-entry capacity of the page table in the FPGA.

Also, the FPGA basically DMAs data continuously. The driver is not processing SW read requests for data (SW is not pulling data). The FPGA is pushing the data and SW needs to keep up; that’s the gist, anyway. There is a stop/start mechanism, and the FPGA is programmed to push data in blocks; the block size is configurable.

I don’t know if that helps.

When I read about DMA, it all seems to be from the point of view that my SW wants to do a write-to or read-from: SW sends a request to the Driver and the driver sets up a DMA Transaction to execute the write or read. But that is not what my SW is trying to do. SW needs to allocate a landing spot for the FPGA to write to. Once the FPGA is told to go, it is the FPGA performing the DMA writes, not the SW. I suppose my Windows driver doesn’t need to be involved in the DMA process, other than providing the logical addresses for the buffer SW allocated: the landing spot for the FPGA to write into.

Sorry to run you guys around in circles. I have said that I am thick-headed; it takes me a while :neutral:

Thanks @“Peter_Viscarola_(OSR)” and @anton_bassov (and @Erik_Dabrowsky for the link).