DMA - Exposing the commonbuffer to user sw

What changed in the last 10 years in the area which is discussed here? Seriously. Windows development hasn’t been my main focus for about 6 years now, so I wonder what I have missed.

There are no dramatic changes anywhere in sight, but there may be some relatively minor modifications and improvements that were unavailable in the earlier days of Windows. As a result, more options may be available in the more recent OS versions.

For example, with the advent of kernel sockets and the Windows Filtering Platform in Vista, you can be 100% sure that all your TDI experience can go down the drain. Furthermore, the advent of NDIS 6+ LWFs made NDIS 5 IM filter drivers sort of obsolete as well.

Another example is KMDF. If you check the WDK samples, you will see that practically all WDM samples have been removed. Therefore, if you advise someone to do things in a WDM-like fashion, you will be looked upon as a dinosaur brought back to life.

You should not forget about the progress in development tools and environments either. For example, you can be sure that you are not going to be called a “STUPID IDIOT” or requested to “PUBLISH YOUR NAME…(etc)” if you say that you want to build drivers with VC. What is more, these days you are sort of encouraged to use C++ in your drivers, so you are quite unlikely to hear Linus-like anti-C++ diatribes from our hosts. In fact, I would not be too surprised if they actually started promoting the use of C++.

In the context of Peter’s “argument”, the very obvious example is process creation callbacks. For example, NotificationRoutine() (i.e. the only option available under XP) does not allow you to block process creation on the spot in the context of a callback, but NotificationRoutineEx(), which turned up in more recent versions, offers this functionality. IIRC, I put my foot in my mouth once because of this particular improvement.

Anton Bassov

What changed in the last 10 years in the area which is discussed here? Seriously.

One very real and impactful change, aside from the bus drivers being totally rewritten to exhibit different practical behaviors, is the ever increasing presence of IOMMUs. There are other changes as well, including many, many, changes in power management, right? And that’s always impactful.

But, when it comes to recency of experience, it’s not a matter of “what’s changed” as much as it’s a matter of “how have I used these concepts lately… so I know/remember the true practicalities, the implementation ‘potholes’ as I like to call them, and not just some overall general concepts of how things might/should/did work.”

@anton_bassov You’re wrong, and your knowledge of Windows architecture is flawed. I’m not debating this with you.

Be aware that my patience with you is getting thin. You’re not providing helpful guidance for the questioners. And you’ve not amused me lately. You trolled this thread in your initial post and I, stupidly, rewarded you by taking your bait. If I had to guess, I would guess your time here grows short.

Peter

Peter,

You’re not providing helpful guidance for the questioners.

Well, judging from the OP’s post, he was well aware of the potential issues that may arise if you map kernel memory into user space, but was still going to proceed this way. Therefore, I just showed him the function that allowed him to reach his objective, and dropped an extra hint (although probably not in the best way - more on that below) that this practice may be frowned upon in the Windows world…

You trolled this thread in your initial post

This part was, indeed, totally unnecessary on my part. I’m sorry for that…

it’s not a matter of “what’s changed” as much as it’s a matter of “how have I used these concepts lately… so I know/remember the true practicalities, the implementation ‘potholes’ as I like to call them, and not just some overall general concepts of how things might/should/did work.”

True, but my NTDEV participation (apart from the “exciting” trolling side, of course) allows me not only to stay in shape but even to learn something new, or at least to “learn what I have to learn and re-learn” due to the OS changes…

You’re wrong, and your knowledge of Windows architecture is flawed. I’m not debating this with you.

Trolling issues aside, could you please explain to me what I have said wrong from the technical standpoint? I am neither trolling nor trying to prove anything to anyone - I just want to learn things for myself…

Anton Bassov

Hello Peter:

Thanks for bearing with me and for the good info:

I attempted the common buffer approach last night and ran into an issue with my User SW getting an access violation. I had done everything, from creating the DmaEnabler to calling MmMapLockedPagesSpecifyCache, in my EvtDeviceAdd, storing the addresses in the device context. Then I created an IOCTL for my User SW to call and retrieve the Virtual User Address.

Searching the forum, I found a post, by you actually, pointing out that the MmMapLockedPagesSpecifyCache call needs to be done in the correct Context and that a good location for this would be EvtIoInCallerContext and not EvtIoDeviceControl (which is where I was going to move it to next, as a part of the IOCTL).
https://community.osr.com/discussion/279797

@“Peter_Viscarola_(OSR)”
…EvtIoDeviceControl (in fact, EvtIoXxxx) is called in an arbitrary process and thread context. You need to use the EvtIoInCallerContext callback…

So, I added an EvtIoInCallerContext and moved the MmMapLockedPagesSpecifyCache to the method.
Unfortunately, I am still getting the access violation. Code snippets below (to give further context).
I will play around with this some more; my first guess is that it may be how I set up the enabler or the common buffer.

EvtDeviceAdd

    WDF_DMA_ENABLER_CONFIG dmaEnablerConfig;
    WDF_DMA_ENABLER_CONFIG_INIT( &dmaEnablerConfig, WdfDmaProfilePacket64, 128 );

    dmaEnablerConfig.EvtDmaEnablerFill = NULL;
    dmaEnablerConfig.EvtDmaEnablerFlush = NULL;
    dmaEnablerConfig.EvtDmaEnablerDisable = NULL;
    dmaEnablerConfig.EvtDmaEnablerEnable = NULL;
    dmaEnablerConfig.EvtDmaEnablerSelfManagedIoStart = NULL;
    dmaEnablerConfig.EvtDmaEnablerSelfManagedIoStop = NULL;
    dmaEnablerConfig.AddressWidthOverride = 0;
    dmaEnablerConfig.WdmDmaVersionOverride = 3;
    dmaEnablerConfig.Flags = WDF_DMA_ENABLER_CONFIG_REQUIRE_SINGLE_TRANSFER;

    NTSTATUS status{ WdfDmaEnablerCreate( wdfDevice, &dmaEnablerConfig, WDF_NO_OBJECT_ATTRIBUTES, &( deviceContextP->myDmaEnabler ) ) };
    if( NT_SUCCESS( status ) )
    {
        WDF_COMMON_BUFFER_CONFIG CommonBufferConfig;
        WDF_COMMON_BUFFER_CONFIG_INIT( &CommonBufferConfig, FILE_128_BYTE_ALIGNMENT );

        status = WdfCommonBufferCreateWithConfig( deviceContextP->myDmaEnabler,
                                    deviceContextP->myCommonBufferByteSize,
                                    &CommonBufferConfig,
                                    WDF_NO_OBJECT_ATTRIBUTES,
                                    &( deviceContextP->myCommonBuffer ) );
        if( NT_SUCCESS( status ) )
        {
            deviceContextP->virtualKernelAddr = WdfCommonBufferGetAlignedVirtualAddress( deviceContextP->myCommonBuffer );
            deviceContextP->LogicalAddr = WdfCommonBufferGetAlignedLogicalAddress( deviceContextP->myCommonBuffer );

            RtlZeroMemory( deviceContextP->virtualKernelAddr, deviceContextP->myCommonBufferByteSize );
            deviceContextP->pMdl = IoAllocateMdl( deviceContextP->virtualKernelAddr,
                                    ( ULONG ) deviceContextP->myCommonBufferByteSize,
                                    FALSE,
                                    FALSE,
                                    NULL );
            if( NULL == deviceContextP->pMdl )
            {
                status = STATUS_INSUFFICIENT_RESOURCES;
                TraceEvents( TRACE_LEVEL_ERROR, DBG_INIT, "IoAllocateMdl() failed with status=[%!STATUS!]", status );
            }
            else
            {
                // Describe the common buffer pages using the MDL allocated just above.
                MmBuildMdlForNonPagedPool( deviceContextP->pMdl );
            }
        }
        else
        {
            TraceEvents( TRACE_LEVEL_ERROR, DBG_INIT, "WdfCommonBufferCreateWithConfig() failed with status=[%!STATUS!]", status );
        }
    }
    else
    {
        TraceEvents( TRACE_LEVEL_ERROR, DBG_INIT, "WdfDmaEnablerCreate() failed with status=[%!STATUS!]", status );
    }

EvtIoInCallerContext

if( ( NULL != deviceContextP->myCommonBuffer ) && 
    ( NULL == deviceContextP->virtualUserAddr ) )
{
    __try
    {
        deviceContextP->virtualUserAddr = MmMapLockedPagesSpecifyCache( deviceContextP->pMdl,
                                    UserMode,
                                    MmCached,
                                    NULL,
                                    FALSE,
                                    HighPagePriority );
        if( NULL == deviceContextP->virtualUserAddr )
        {
            TraceEvents( TRACE_LEVEL_ERROR, DBG_IOCTLS, "MmMapLockedPagesSpecifyCache() failed.");
        }
        else
        {
            TraceEvents( TRACE_LEVEL_INFORMATION, DBG_IOCTLS, "VirtualUserAddr=[0x%p]", deviceContextP->virtualUserAddr );
        }
    }
    __except( EXCEPTION_EXECUTE_HANDLER )
    {
        TraceEvents( TRACE_LEVEL_ERROR, DBG_IOCTLS, "MmMapLockedPagesSpecifyCache() threw an exception!" );
    }
}
WdfDeviceEnqueueRequest( device, request );

EvtIoDeviceControl - IOCTL_GET_DMA_USER_ADDRESS

        if( NULL == deviceContextP->myCommonBuffer )
        {
            TraceEvents( TRACE_LEVEL_WARNING, DBG_IOCTLS, "NO COMMON BUFFER!!" );
        }
        else if( NULL == deviceContextP->virtualUserAddr )
        {
            TraceEvents( TRACE_LEVEL_WARNING, DBG_IOCTLS, "NO VIRTUAL ADDRESS" );
        }
        else
        {
            PDMA_USER_ADDRESS pDmaUserAddress{ nullptr };
            // Require room for the whole structure before writing to it.
            status = WdfRequestRetrieveOutputBuffer( request, sizeof( DMA_USER_ADDRESS ), ( PVOID* ) &pDmaUserAddress, NULL );
            if( NT_SUCCESS( status ) )
            {
                pDmaUserAddress->virtualAddr = reinterpret_cast<UINT64>( deviceContextP->virtualUserAddr ); 
                bytesTransferred = sizeof( DMA_USER_ADDRESS );
            }
            else
            {
                TraceEvents( TRACE_LEVEL_ERROR, DBG_IOCTLS, "WdfRequestRetrieveOutputBuffer() failed with status=[%!STATUS!]", status );
            }
        }
        ...
        ...
       WdfRequestCompleteWithInformation( request, status, bytesTransferred );

As always, thank you!

Juan

Hello Don:

I didn’t see your post before my reply to Peter.

@Don_Burn
How big is the buffer you need to allocate for the device? How big is the typical update to the data? And how frequent are the updates?
At the moment, I am looking at allocating 500MB, but may want to up that to 1GB or even 2GB later. The data size can vary, but at the moment I have the FPGA configured to push 64MB packets, 2 packets a second. But the size and rate can also change.
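As a quick sanity check of those numbers, here is a minimal sketch of the arithmetic. The helper names are made up for illustration, and "MB" is assumed to mean 2^20 bytes. With 64MB packets at 2 packets a second, the device produces 128MB per second, and a 500MB buffer holds 7 whole packets, i.e. roughly 3.5 seconds of data before it has to wrap.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "MB" in the post means 2^20 bytes.
constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Bytes per second produced by the device (hypothetical helper).
constexpr std::uint64_t BytesPerSecond(std::uint64_t packetBytes,
                                       std::uint64_t packetsPerSecond)
{
    return packetBytes * packetsPerSecond;
}

// Number of whole packets that fit in the buffer (hypothetical helper).
constexpr std::uint64_t WholePacketsInBuffer(std::uint64_t bufferBytes,
                                             std::uint64_t packetBytes)
{
    return bufferBytes / packetBytes;
}
```

Since the size and rate can change, the same two helpers can be re-run with the new values when sizing a 1GB or 2GB buffer.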

@Don_Burn
… while I am not in favor of mapping kernel memory to user space…
Yes, this is at the back of my mind as I am playing around with this. Even if I get this to work, what risks have I introduced, can I plug the holes, if not, how big are the risks and do they outweigh the rewards… etc.

It is clear from other threads and even here, that this approach is not ideal. Unfortunately, I need the guaranteed contiguous logical addresses provided by common buffer.

@Don_Burn
I have done it because a device needed it…
Unfortunately, I cannot change the way the FPGA is currently designed. The current design has the FPGA DMA data directly to another device. SW then uses other means of processing that data on that device.

I am changing things a little, so that SW can process the data on the system. In my mind, I simply want to move that destination memory buffer from the other device to system memory (move the landing spot). To the FPGA the change is transparent, as it only sees logical addresses.

Thanks for your interest and input Don.

Juan

One very real and impactful change, aside from the bus drivers being totally rewritten to exhibit different practical behaviors, is the ever increasing presence of IOMMUs. There are other changes as well, including many, many, changes in power management, right? And that’s always impactful.

Yes, but I didn’t mean general OS changes but changes related to the things discussed in this thread (sharing memory between user and kernel mode, and IOCTLs). To me it seems as if things work the same way as they did when I used them.

In context of Peter’s “argument” the very obvious example is process creation callbacks. For example, the NotificationRoutine() (i.e. the only option available under XP)

If I count correctly, 10 years ago we already had Win7, so XP is out of the question. Also, KMDF already existed, WDM drivers were outdated, and so on.

(Well, there still was legacy usbser driver not following even XP WDM rules and WDF version wasn’t available before Win10. Which fixed old bugs and introduced new ones. Real pain I had to handle recently…)

Michal

@“Peter_Viscarola_(OSR)”

As a test, I added trace code into the EvtIoInCallerContext to print the data at Virtual User Address. The print worked when EvtIoInCallerContext ran and performed MmMapLockedPagesSpecifyCache - the data being 0 as expected, due to RtlZeroMemory & FPGA not yet being configured.

However, subsequent calls to EvtIoInCallerContext show the print throwing an exception - I suspect an access violation, the same as in User SW.

EvtIoInCallerContext (added printing)

    // Check and if no Virtual User Address, call MmMapLockedPagesSpecifyCache (See previous post for details).
   ...
    if( NULL != deviceContextP->virtualUserAddr )
    {
        UINT64 *pBuffer = ( UINT64* ) deviceContextP->virtualUserAddr;
        for( auto x{ 0 }; x < 10; ++x )
        {
            __try
            {
                TraceEvents( TRACE_LEVEL_INFORMATION, DBG_IOCTLS
                    , "Addr[%u] @[0x%p] = [%llu]"
                    , x
                    , pBuffer
                    , *( pBuffer + x ) );
            }
            __except( EXCEPTION_EXECUTE_HANDLER )
            {
                TraceEvents( TRACE_LEVEL_ERROR, DBG_IOCTLS, "EXCEPTION in TraceEvent" );
            }
        }
    }

Dumb question: Is the second EvtIoInCallerContext from the exact same process as the first call? Remember that your virtualUserAddr is only valid for that one process. If you had a quick test app and then started another test app, that address is no longer valid.

And that points out a bug in your code. In your ioctl handler, you’re only doing the mapping if virtualUserAddr is null. That only works if the calling process never ends, unless you are zeroing out that field when the app exits. Personally, I’d just eliminate that check and ALWAYS do the mapping, even if the field already has a value. Windows won’t create a new mapping if one already exists.
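To make the "only valid for that one process" point concrete, here is a minimal sketch of the bookkeeping. It is a plain user-mode C++ model, not driver code: ProcessId, UserAddress and MappingTable are hypothetical names, and a real driver would need to pick its own key (the process or the file object, say) and add locking.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using ProcessId   = std::uint32_t;  // stand-in for a process identity
using UserAddress = std::uintptr_t; // stand-in for a user-mode virtual address

// Model only: remembers which user address belongs to which process,
// instead of one virtualUserAddr shared by every caller.
class MappingTable
{
public:
    // Record the address obtained when mapping in the context of 'pid'.
    void Insert(ProcessId pid, UserAddress address) { table_[pid] = address; }

    // Succeeds only if 'pid' itself owns a mapping; another process's
    // address is never handed out.
    bool Lookup(ProcessId pid, UserAddress& address) const
    {
        auto it = table_.find(pid);
        if (it == table_.end())
        {
            return false;
        }
        address = it->second;
        return true;
    }

    // Forget the mapping when the owning process goes away.
    bool Remove(ProcessId pid) { return table_.erase(pid) == 1; }

private:
    std::unordered_map<ProcessId, UserAddress> table_;
};
```

With a single cached field, a second test app would be handed the first app's address; with a per-process entry, the lookup for the second app simply fails until a mapping is made in its own context.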

@“Peter_Viscarola_(OSR)”
@Tim_Roberts
(Tim: I was in the middle of writing when you posted - Not a dumb question at all :smiley: )

I found my answer, after some lunchtime googling:

https://osr.com/blog/2014/04/15/evtioincallercontext-callback-called-even-io-operations-dont-queue/

… The other point of view was getting what are essentially unexpected (and, to your driver, unsupported) Requests in EvtIoInCallerContext was an annoyance…

As Tim pointed out above, the context of when EvtIoInCallerContext was called and MmMapLockedPagesSpecifyCache executed, was not the same context as my User SW.

To ensure MmMapLockedPagesSpecifyCache was executed for my IOCTL_GET_DMA_USER_ADDRESS request I added the below at the beginning of EvtIoInCallerContext:

EvtIoInCallerContext

WDF_REQUEST_PARAMETERS requestParams;
WDF_REQUEST_PARAMETERS_INIT( &requestParams );
WdfRequestGetParameters( request, &requestParams );

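// EvtIoInCallerContext sees every request type, so check the type before reading the IOCTL code.
if( ( WdfRequestTypeDeviceControl != requestParams.Type ) ||
    ( IOCTL_GET_DMA_USER_ADDRESS != requestParams.Parameters.DeviceIoControl.IoControlCode ) )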
{
    WdfDeviceEnqueueRequest( device, request );
    return;
}
...
... (see previous posts for more details)
...
WdfDeviceEnqueueRequest( device, request );

I re-ran my driver and the trace prints are printing, with data after the FPGA was configured to run (not sure on quality of the data, but that’s a future topic).

Thank you Tim & Peter

Juan

On to the next “battle” and maybe one of the “hazard” areas of this approach - unmapping.
https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/wdm/nf-wdm-mmunmaplockedpages

Note that if the call to MmMapLockedPages or MmMapLockedPagesSpecifyCache specified user mode, the caller must be in the context of the original process before calling MmUnmapLockedPages. This is because the unmapping operation occurs in the context of the calling process, and, if the context is incorrect, the unmapping operation could delete the address range of a random process.

I now need to add an IOCTL to support unmapping. Without this, if I restart my User SW, the Virtual User Address is not valid for this new instance (as @Tim_Roberts pointed out previously).

And the hazard - What happens if my User SW crashes or closes without unmapping? Would I even be able to perform another mapping when a previous one already exists?

For now, I will march ahead and add an IOCTL to support unmapping. (Though, I will give VirtualAlloc with Large Pages another try later on.)

Juan

At the moment, I am looking at allocating 500MB, but may want to up that to 1GB or even 2GB later. The data size can vary, but at the moment I have the FPGA configured to push 64MB packets, 2 packets a second. But the size and rate can also change.

I need the guaranteed contiguous logical addresses provided by common buffer.

Before you proceed with your “conquest” any further, I would recommend that you take the following points into consideration.

If the target system does not have an IOMMU, ‘(logical_address==physical_address)’ will always evaluate to TRUE. Don’t you see any potential problem with finding a physically contiguous buffer in the GB range???

Certainly, as long as the target machine is equipped with an IOMMU, the system is in a position to present a physically non-contiguous buffer as a logically contiguous one to your device. However, you should bear in mind that not every machine in existence is going to support VT-d or AMD-Vi.

For example, if you check the link below, you will see that assuming an IOMMU is present on the target machine is way too bold.

https://en.wikipedia.org/wiki/List_of_IOMMU-supporting_hardware

As you can see for yourself, the list is not THAT long.

Certainly, it is up to you to make a decision, but I think you may find it frustrating to find out that the design that you have spent so much time and effort on is, for all practical purposes, simply infeasible…

Anton Bassov

I now need to add an IOCTL to support unmapping. Without this, if I restart my User SW, the Virtual User Address is not valid for this new instance (as @Tim_Roberts pointed out previously).

And the hazard - What happens if my User SW crashes or closes without unmapping?

Right. This is why you have to handle EvtFileCleanup / EvtFileClose events, so you can clean up your dirty work when the app crashes or closes without cleanly shutting down. You CANNOT rely upon the application to clean up for you. You need to assume that all application writers are bozos, and malicious bozos at that.

Would I even be able to perform another mapping when a previous one already exists?

Not as the driver is currently written, but that’s a driver problem, not an architectural problem. The operating system doesn’t care how many times you map a piece of memory, but your driver certainly does.

Juan, just to be clear: You’re getting further and further down the road into a design that is going to wind-up either having serious edge-condition issues with security implications… OR one that’s going to need to include some clever code to take into account and handle these edge conditions. It’s a lot to ask to try to get this right one forum post at a time, without a good background in Windows kernel mode software.

So… think about it. Do you really want to be doing this? You’re venturing into an area that I advise the students in my Advanced WDF seminar to avoid.

Having said that, Mr. Roberts is right on target: You need to do the Unmap operation in your EvtFileCleanup Event Processing Callback. This is called in the context of the process that called CloseHandle. You don’t need or want a separate IOCTL for this. If you do it right, it’ll handle the unmap “automatically” even during abnormal thread termination.
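The shape of that cleanup logic can be sketched as follows. This is a user-mode C++ model under stated assumptions, not a WDF callback: FileMapping and CleanupMapping are hypothetical names, and the actual unmap call is deliberately left out. It only illustrates two properties: the release happens solely in the context of the process that owns the mapping, and running the cleanup twice is harmless.

```cpp
#include <cassert>
#include <cstdint>

using ProcessId   = std::uint32_t;  // stand-in for a process identity
using UserAddress = std::uintptr_t; // stand-in for a user-mode virtual address

// Hypothetical per-handle state: who owns the mapping and where it lives.
struct FileMapping
{
    ProcessId   owner;
    UserAddress address;
    bool        mapped;
};

// Model of the cleanup step. Returns true only if the mapping was released.
bool CleanupMapping(FileMapping& mapping, ProcessId currentProcess)
{
    if (!mapping.mapped)
    {
        return false; // nothing to release, so a second cleanup is a no-op
    }
    if (mapping.owner != currentProcess)
    {
        return false; // wrong process context: unmapping here would be unsafe
    }
    // A real driver would perform the unmap at this point.
    mapping.address = 0;
    mapping.mapped  = false;
    return true;
}
```

Because the state is released on handle cleanup rather than by a dedicated IOCTL, an application that crashes still goes through this path.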

Peter

Hello @anton_bassov

Yes, being able to secure GB(s) of physically contiguous memory is a concern, and one reason for having the driver perform the allocation, as I assume (maybe wrongly) that the driver will have a better chance at getting that memory during start-up than User SW.

The alternative is using VirtualAlloc Large Pages in User SW. The Large Page size would be a multiple of the FPGA page size. This way, even if the logical pages are not contiguous, one+ FPGA pages can be DMAd into one System Large Page.

I know I am not completely in the clear with Large Pages as the system will still need to find enough contiguous memory for all the Large Pages needed to make up my desired buffer size. Since this is all happening during system start-up, and with 32GB of physical RAM, I would hope this is not a problem.
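The two conditions in that plan are easy to check numerically. The sketch below is illustrative only: the helper names are made up, the 2MB large-page size is merely the value commonly seen on x64 (real code would ask GetLargePageMinimum()), and the FPGA page size is not stated in this thread, so any value plugged in is a placeholder.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "MB" means 2^20 bytes.
constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// True when a whole number of device pages fits exactly in one large page,
// so a device page never straddles two (possibly non-contiguous) large pages.
constexpr bool DevicePagesTileLargePage(std::uint64_t largePageBytes,
                                        std::uint64_t devicePageBytes)
{
    return devicePageBytes != 0 && largePageBytes % devicePageBytes == 0;
}

// Large pages needed to cover the buffer, rounding up.
constexpr std::uint64_t LargePagesNeeded(std::uint64_t bufferBytes,
                                         std::uint64_t largePageBytes)
{
    return (bufferBytes + largePageBytes - 1) / largePageBytes;
}
```

For a 500MB buffer and 2MB large pages this gives 250 large pages, each of which still has to be physically contiguous on its own.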

FYI: when I attempted to allocate 1GB using the CommonBuffer approach, the allocation failed. I was successful in allocating a 750MB CommonBuffer, however.

Thanks, Anton.
Juan

Hello @“Peter_Viscarola_(OSR)” and @Tim_Roberts

Thank you both again for your help and patience.

Regarding the unmapping, I went ahead with the IOCTL approach, but not because I disagreed with anything you two said, but because I decided to end my experiment with the CommonBuffer approach.

I had success allocating 500MB and having my FPGA DMA the data to both system memory and the second external device, ping-ponging back and forth between External and System Memory. I ended up having SW perform a copy to the external device, to allow that device to process the data.

The results were seamless and I could not tell the difference between processed data coming directly from the FPGA or via System Memory.

  • A little change in plans, but it was enough to prove that using System Memory as a landing spot is doable.
  • Not to mention the warning you both gave about this approach.
  • BTW: performing a print-screen caused a BSOD - a corner case :wink:

My next step is to attempt allocating the buffer in User SW via VirtualAlloc and using Large Pages, then feed that address to the Driver. Still not the conventional approach that Peter pointed me to, but I suspect better than the CommonBuffer approach I am using now.

  • I suspect I would still need to use EvtFileCleanup/EvtFileClose if my SW were to end abruptly.

Thank you again! I am having a good time and learning a little bit as I go.
Juan

Hello All:

Just to close off this thread - the code snippets below are what I used in my driver to allocate a common buffer for DMA use and expose that common buffer to User SW for direct access.


[Mods: With his permission, and in no way intending to show any disrespect, we have removed the code example Mr. OneKneeToe provided here. We did this because, in our judgement, it was likely to create more issues than it solved for future devs who encounter this thread. We understand that Mr. OneKneeToe had a very specific need, and that the solution he posted here met that need for him.

Indeed, we are grateful to Mr. OneKneeToe for taking the time to “give back” to the Community by posting the code that worked for him, in his specific situation. We just don’t want people to copy it in the future without being aware of its limitations.]

The alternative is using VirtualAlloc Large Pages in User SW. The Large Page size would be a multiple of the FPGA page size. This way, even if the logical pages are not contiguous, one+ FPGA pages can be DMAd into one System Large Page.

If you don’t mind, could you please expand on it a bit? This may simply be a case of mis-phrasing your idea, but, judging from the above statement as it has been presented, you’ve got to learn quite a bit of system-level basics before even thinking about writing drivers…

Anton Bassov

Sigh! I’m really struggling with leaving your code examples inline in this thread… even given the time/effort you obviously put into posting them and formatting them properly.

The code you’ve posted is all really just prototype code that shows how to call some APIs… but is in no way production quality or ready for use outside a lab/testing setting.

In fact, you seem to have ignored just about every piece of advice I gave you in this thread.

I’m not encouraged.

Peter

Good Morning @“Peter_Viscarola_(OSR)”

Ignoring your advice:

When I read “Ignored” I felt it carried a negative connotation. I like to think that I listened to your advice, looked to see if I could make use of it given the task at hand, and found that I could not take it. It definitely was not a case of “what you say doesn’t matter”. Quite the opposite. The above approach was a means to an end, if you will.

In fact, since I met my intended goal with the above, I’ve stopped work on this approach. I will attempt the VirtualAlloc approach with a special Direct I/O call that stays uncompleted for the life of the program (something along those lines).

Leave or Remove:

I would not be hurt if you were to remove the post. My intention was to give back. I did add a caution section echoing the concerns and asking readers to read the thread. Nevertheless, you have been at this far longer than I and, after all, you are a moderator.

Appreciation:

Your help and advice are appreciated. There really is no other resource out there. Documentation exists, but it usually doesn’t go into these details, and nothing beats having an experienced person to talk things through with. Thanks to you and @Tim_Roberts.

Best Regards,
Juan

I would not be hurt if you were to remove the post

Thanks for that. With your permission, I did ultimately decide that the code should be removed.

Maybe I’m overly concerned, but especially for this particular topic, which is such a commonly encountered one, I really didn’t want to have yet another solution posted that doesn’t really address any of the complex issues inherent in mapping memory into a user’s address space.

Thanks for your support,

Peter
