Allocate buffers for DMA

Hi, all experts:

I know there are many warmhearted experts here - I can see that from many threads.

I am using the DMA engine on a Xilinx ML505/ML555 demo board (PCIe), and I am writing the driver using WDF.

I have read lots of threads in NTDEV and the famous article “A Common Topic Explained - Sharing Memory Between Drivers and Applications”.

Here are some common methods that have been discussed many times:
ExAllocatePoolWithTag/MmAllocatePagesForMdl/MmAllocateContiguousMemory (in DeviceAdd) + MmMapLockedPagesSpecifyCache/MmMapLockedPages (in an IOCTL handler)

I know MmAllocateContiguousMemory+MmMapLockedPagesSpecifyCache is not good practice. Microsoft says: "This approach will not work on all hardware, and your device and driver will not be cross-platform compatible. Serious errors can result." (http://www.microsoft.com/whdc/driver/tips/DMA.mspx). Microsoft recommends IoGetDmaAdapter(IN PDEVICE_OBJECT PhysicalDeviceObject) + AllocateCommonBuffer instead. However, I am not sure AllocateCommonBuffer can work under WDF, because there is no PDEVICE_OBJECT in WDF - is that right?

Here is the problem: when I use MmAllocateContiguousMemory+MmMapLockedPagesSpecifyCache, it works well on Windows XP SP2. However, on Windows Server 2003 SP2, MmMapLockedPages crashes with a BSOD. WinDbg shows "MmMapLockedPages called when not at APC_LEVEL or below". I call MmMapLockedPages in an IOCTL handler (the same happens with MmMapLockedPagesSpecifyCache). I know the failure is caused by passing UserMode for the AccessMode parameter, but if I use KernelMode, the returned address cannot be used by the application.

If I insist on using MmAllocateContiguousMemory+MmMapLockedPagesSpecifyCache, what should I do?

Now I am trying to use WdfCommonBufferCreate + WdfCommonBufferGetAlignedVirtualAddress/WdfCommonBufferGetAlignedLogicalAddress. It seems WdfCommonBufferGetAlignedLogicalAddress returns the address to use as the physical address for the FPGA(?), but I don't know how to turn it into an address that can be used by the application.
Should this work: use the address returned by WdfCommonBufferGetAlignedLogicalAddress as the input to MmMapIoSpace, and then MmMapIoSpace + IoAllocateMdl + MmBuildMdlForNonPagedPool + MmMapLockedPages?

> MmAllocateContiguousMemory

Forget this function. It is provided only for implementers of the DMA adapter object, to implement ->AllocateCommonBuffer.

Use KMDF’s DMA facilities to allocate a common buffer.
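
To make that concrete, here is a minimal sketch of the KMDF sequence (assuming a WDFDEVICE is at hand, e.g. in EvtDevicePrepareHardware; the DevExt fields and COMMON_BUFFER_SIZE are hypothetical names, and error handling is abbreviated):

```c
// Minimal KMDF common-buffer sketch. DevExt fields and COMMON_BUFFER_SIZE
// are hypothetical names.
WDF_DMA_ENABLER_CONFIG dmaConfig;
NTSTATUS status;

// Describe the device's DMA capabilities: profile and maximum transfer length.
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather64,
                            COMMON_BUFFER_SIZE);

status = WdfDmaEnablerCreate(Device, &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES,
                             &DevExt->DmaEnabler);
if (!NT_SUCCESS(status)) {
    return status;
}

// KMDF allocates memory that the device can actually reach.
status = WdfCommonBufferCreate(DevExt->DmaEnabler, COMMON_BUFFER_SIZE,
                               WDF_NO_OBJECT_ATTRIBUTES,
                               &DevExt->CommonBuffer);
if (!NT_SUCCESS(status)) {
    return status;
}

// Kernel virtual address: what the driver's code reads and writes.
DevExt->CommonBufferVA =
    WdfCommonBufferGetAlignedVirtualAddress(DevExt->CommonBuffer);

// Logical (bus) address: what you program into the FPGA's DMA registers.
DevExt->CommonBufferLA =
    WdfCommonBufferGetAlignedLogicalAddress(DevExt->CommonBuffer);
```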

Also note that sharing memory is nearly always a bad idea - for instance, there is no way to protect its updates with locks. If you have any atomicity requirements for updates to the structures placed in this memory, then abandon the idea of sharing.

It will be better to transport the data between user and kernel mode via IRPs.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Thanks, Maxim S. Shatskih. I am trying out KMDF's DMA facilities.

The logical address is the address your device will put out on the PCI bus during its DMA operation. This is a physical address, not a virtual address that a program will access.

You can use the virtual address (which it seems you are already getting) to create an MDL and then map that.
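
In code, something along these lines - a sketch only, assuming the IOCTL is dispatched in the context of the requesting process and that CommonBufferVA/Length describe the common buffer allocated earlier:

```c
// Sketch: expose the nonpaged common buffer to the calling application.
PMDL  mdl;
PVOID userVa = NULL;

mdl = IoAllocateMdl(CommonBufferVA, Length, FALSE, FALSE, NULL);
if (mdl == NULL) {
    return STATUS_INSUFFICIENT_RESOURCES;
}

// The buffer is nonpaged, so no probing is needed - just fill in the MDL.
MmBuildMdlForNonPagedPool(mdl);

__try {
    // Mapping with UserMode can raise an exception on failure.
    userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                          NULL, FALSE, NormalPagePriority);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(mdl);
    return GetExceptionCode();
}

// Hand userVa back to the app. Keep the MDL around; at cleanup call
// MmUnmapLockedPages(userVa, mdl) and then IoFreeMdl(mdl).
```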

The logical address is the one you program into the device.

As others have pointed out, you should only use shared memory if you can't get your job done using read/write/IOCTL IRPs.

-p

Sent from my phone - please excuse typos and my bad grammar

> If you have any atomicity requirements for updates to the structures placed in this memory, then abandon the idea of sharing. It will be better to transport the data between user and kernel mode via IRPs.

Actually, these two ideas are not incompatible - you can make the memory available at any given moment to either the device or the application, but never to both at the same time, and use IOCTLs to decide when the memory should be available to the app, i.e. map and unmap the memory into/from the app's address space in response to IOCTLs.

Certainly, it makes sense only if we are speaking about HUGE buffers - otherwise, you will be better off with METHOD_BUFFERED, i.e. simply copying the data with the CPU…
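
A sketch of what that IOCTL-gated ownership could look like (the IOCTL codes and device-extension fields are hypothetical; the MDL is assumed to have been built once over the shared buffer):

```c
// Ownership toggling: the buffer belongs either to the app or to the
// device, never to both at once. Names here are hypothetical.
NTSTATUS HandleOwnershipIoctl(PMY_DEVICE_EXTENSION devExt, ULONG IoControlCode)
{
    NTSTATUS status = STATUS_INVALID_DEVICE_REQUEST;

    switch (IoControlCode) {
    case IOCTL_MYDRV_ACQUIRE_BUFFER:
        // The device must be idle before the app may touch the buffer.
        __try {
            devExt->UserVa = MmMapLockedPagesSpecifyCache(
                devExt->Mdl, UserMode, MmCached,
                NULL, FALSE, NormalPagePriority);
            status = STATUS_SUCCESS;
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            status = GetExceptionCode();
        }
        break;

    case IOCTL_MYDRV_RELEASE_BUFFER:
        // Revoke the app's mapping before handing the buffer to the device.
        MmUnmapLockedPages(devExt->UserVa, devExt->Mdl);
        devExt->UserVa = NULL;
        status = STATUS_SUCCESS;
        break;
    }
    return status;
}
```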

Anton Bassov

> Actually, these two ideas are not incompatible - you can make the memory available at any given moment to either the device or the application, but never to both at the same time, and use IOCTLs to decide…

Or: map read-only and use IOCTLs for updates.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> Or: map read-only and use IOCTLs for updates.

Doing so does not ensure that applications will always have a consistent view of the data in the buffer. The only thing it ensures is that an app is unable to screw up device operations. To ensure that the application always has a consistent view of the data in the buffer, some additional synchronization/notification mechanism (i.e. an IOCTL, an event, or, if you want to go for a rather perverse solution, polling of a condition variable/flag that gets modified by the driver and periodically checked by the app) is needed. This seems to defeat the purpose of the RO mapping…
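
One common shape for such a notification mechanism - a sketch, with hypothetical names, of the app handing the driver an event to signal (the handle capture must happen in the requesting process' context):

```c
// The app passes an event handle in an IOCTL; the driver takes a
// reference and signals the event whenever it finishes updating the
// buffer. Must run in the context of the requesting process.
NTSTATUS CaptureUserEvent(HANDLE UserEventHandle, PKEVENT *Event)
{
    return ObReferenceObjectByHandle(UserEventHandle,
                                     EVENT_MODIFY_STATE,
                                     *ExEventObjectType,
                                     UserMode,
                                     (PVOID *)Event,
                                     NULL);
}

// Later, e.g. from the DPC after an update is complete:
//     KeSetEvent(devExt->UserEvent, IO_NO_INCREMENT, FALSE);
// And at cleanup:
//     ObDereferenceObject(devExt->UserEvent);
```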

Anton Bassov

>> Or: map read-only and use IOCTLs for updates.

> Doing so does not ensure that applications will always have a consistent view of the data in the buffer.

A similar thing is IIRC implemented in user32/win32k for the window table. All the GetXxx routines in user32 use the mapped view.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> A similar thing is IIRC implemented in user32/win32k for the window table. All the GetXxx routines in user32 use the mapped view.

This is a bit of a different story - that is a subsystem specifically designed to work this way. Similarly, if Windows supported an mmap() method in drivers and invoked a driver-provided page fault handler whenever a page fault occurred, sharing a buffer would be much easier and more efficient than it is under the existing Windows model…

Anton Bassov

I’m probably going to regret this but …

Sharing memory between an app and a driver is pretty easy under Windows. You can map a section object if you want page-file-backed memory, or lock an arbitrary section of memory and then map it into the kernel. The same underlying physical pages (or prototype PTEs pointing to the backing store) are used in each address space. Easy, peasy.
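
For reference, a sketch of the section-object variant from kernel mode (SHARED_SIZE is a hypothetical constant; this assumes you are running in the context of the process that should receive the view, e.g. in an IOCTL handler):

```c
// Create a page-file-backed section and map a view of it into the
// current (requesting) process. Error handling abbreviated.
OBJECT_ATTRIBUTES oa;
LARGE_INTEGER     maxSize;
HANDLE            sectionHandle;
PVOID             base = NULL;
SIZE_T            viewSize = 0;
NTSTATUS          status;

maxSize.QuadPart = SHARED_SIZE;
InitializeObjectAttributes(&oa, NULL, OBJ_KERNEL_HANDLE, NULL, NULL);

status = ZwCreateSection(&sectionHandle, SECTION_ALL_ACCESS, &oa,
                         &maxSize, PAGE_READWRITE, SEC_COMMIT, NULL);
if (!NT_SUCCESS(status)) {
    return status;
}

// Because we are in the caller's context, this view lands in the app's
// address space; the driver can map its own view of the same section.
status = ZwMapViewOfSection(sectionHandle, ZwCurrentProcess(), &base,
                            0, SHARED_SIZE, NULL, &viewSize,
                            ViewUnmap, 0, PAGE_READWRITE);
```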

What is not in Windows is an easy way to memory-map a driver "file". However, I'm not sure how having a driver-specified page fault handler would make this easier or more efficient.

Isn't it more efficient to just have the driver and the app use the same page than to have the driver conjure up a page when the application doesn't happen to have it resident? As for easier… if you're using the same pages, you can use existing synchronization primitives to allow the two sides to maintain data consistency. If one side is doing writes that the other won't see until the page it currently has gets ejected from its working set… that seems harder to deal with.

Can you expand on how allowing mmap in drivers would make it “much easier and efficient”? I’m curious.

-p

Peter,

> I'm probably going to regret this but …

Don't worry - after all, if we reach the point where we are about to stray too far from the original topic, we can always move the discussion to NTTALK…

> Sharing memory between an app and a driver is pretty easy under Windows.

Well, "ease" is a relative thing. More on that below…

> You can map a section object if you want page-file-backed memory, or lock an arbitrary section of memory and then map it into the kernel. The same underlying physical pages (or prototype PTEs pointing to the backing store) are used in each address space. Easy, peasy.

This part is, indeed, easy. What you have not mentioned is that the app and the driver have to synchronize access to the buffer somehow. In addition, don't forget that shared memory is a potential security risk - you've got to do a fair amount of work to ensure that an app does not screw up your driver - for example, if it "forgets" to signal the event that your driver is waiting on (or just terminates abnormally). I don't want to say it is incredibly complex, but you still need to do some work that you could otherwise avoid…

> What is not in Windows is an easy way to memory-map a driver "file". However, I'm not sure how having a driver-specified page fault handler would make this easier or more efficient.

If you could handle the whole thing with MapViewOfFile() (and FlushViewOfFile() if you want to flush data to the device), don't you think it would be easier? Forget about the security concerns - you can make these mappings private to the target process if you wish. Forget about synchronization with the app - whenever you want to make a page unavailable to the app, you can simply mark the target page as not present in the PTE and handle synchronization 100% in the driver's code (for example, between the page fault handler that wants to map a page into user space and the DPC routine that gets invoked when DMA is complete)…
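
Purely for illustration, the user-mode side of that hypothetical model would look something like this (note: Windows does not actually let you create a file mapping over a device handle like this - the point is what the API *could* look like; BUF_SIZE and the device name are made up):

```c
// HYPOTHETICAL: what the model would look like to the app if Windows
// allowed memory-mapping a device object. This does not work today.
#include <windows.h>

HANDLE hDev = CreateFileW(L"\\\\.\\MyDmaDevice",
                          GENERIC_READ | GENERIC_WRITE,
                          0, NULL, OPEN_EXISTING, 0, NULL);

HANDLE hMap = CreateFileMapping(hDev, NULL, PAGE_READWRITE,
                                0, BUF_SIZE, NULL);

void *buf = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, BUF_SIZE);

/* ... produce data in buf ... */

// Would invoke the driver's "flush to device" path.
FlushViewOfFile(buf, BUF_SIZE);
```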

> Isn't it more efficient to just have the driver and the app use the same page than to have the driver conjure up a page when the application doesn't happen to have it resident?

Well, think about it. Let's say you've got a buffer of a few dozen megs, but you don't know to what extent it will actually be used, because the rate at which your app produces data may vary wildly. What do you think is more efficient - to allocate the memory in advance and lock it in RAM, or to allocate pages only when (and if) they are needed? After all, this is what the whole concept of demand paging is about…

Furthermore, using the same page and "conjuring up" one can be just two sides of the same thing. For example, if you don't want a page to be available to the app at a given moment (for example, while you are doing streaming DMA to the device), your driver can mark it as non-present in the PTE. In that case you will have to conjure it up if the app accesses it…

Anton Bassov

Ease:

Memory mapping is neat and has its place. However, trying to manage concurrent access & modification to a data structure using *only* page invalidation doesn't seem feasible. Without some form of mutual exclusion between the reader and the writer, the shared data can still be accessed in an inconsistent state. That's bad.

Perhaps you're thinking of a transactional memory model, where each side can attempt to commit its memory modifications and that commit can fail. However, "invalidate & replace" and "write and flush" are not the same thing.

What's the provision in Linux (I assume that's the basis for your example) for invalidating a page out of a driver-backed memory mapping? From what I've read (admittedly limited), the driver can supply a page when there's a fault, but I didn't see a way the driver could revoke a single page later on.

Efficiency:

One can avoid needing to allocate and lock all the needed memory up front by using section objects. If you're going to use shared memory between a driver and an app, I suspect you want one big slab (otherwise offsets/pointers to other objects are much more difficult to track), and even in pieces you need page-sized allocations to share so that you don't expose unnecessary data to the application.

In the case where you had a big paged-pool allocation that you wanted to map to an application without locking the entire buffer… there is one spot where mmap lets you do something new. However, if you have that in Windows, you could use a section object, and then it's mappable into user mode and doesn't have to be locked down at all.

-p

> If you could handle the whole thing with MapViewOfFile() (and FlushViewOfFile() if you want to flush…

Why is DeviceIoControl(IOCTL_MYDRIVER_MAP_MEMORY) worse?
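
For comparison, the user-mode side of the IOCTL approach is hardly more complex (the IOCTL code and its output format are hypothetical):

```c
// The driver maps the shared buffer into the caller and returns the
// user-mode address in the IOCTL output buffer. Names are hypothetical.
#include <windows.h>
#include <winioctl.h>

#define IOCTL_MYDRIVER_MAP_MEMORY \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

void *MapSharedBuffer(HANDLE hDev)
{
    void *userVa = NULL;
    DWORD bytes  = 0;

    if (!DeviceIoControl(hDev, IOCTL_MYDRIVER_MAP_MEMORY,
                         NULL, 0, &userVa, sizeof(userVa),
                         &bytes, NULL)) {
        return NULL;
    }
    return userVa;   // mapped into this process by the driver
}
```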

> whenever you want to make a page unavailable to an app you can simply mark the target page as not present in the PTE and handle synchronization 100% in the driver's code

How funny a hack.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> Memory mapping is neat and has its place. However trying to manage concurrent

access & modification to a data structure using *only* page invalidation doesn?t seem feasible.
Without some form of mutual exclusion between the reader and writer the data being shared can
still be accessed in an inconsistent state.

Under Windows this is, indeed, a bit of a problem, because Windows does not allow you to spot everyone who maps a given page into their address space - the only thing you may know is the number of references to a given page. Therefore, you just cannot mark a given page as non-present in all user PTEs, simply because you haven't got enough info. However, it does not necessarily have to be this way. More on that below…

> What's the provision in Linux (I assume that's the basis for your example) for invalidating a page out of a driver-backed memory mapping? From what I've read (admittedly limited), the driver can supply a page when there's a fault, but I didn't see a way the driver could revoke a single page later on.

Linux has a quite useful feature known as "reverse mapping" - given a mapped page's descriptor, you can quickly detect everyone who maps it into their address space and discover at which particular address it is mapped by a given process. This means that you can mark a page as non-present in all PTEs without actually invalidating the virtual address itself. The whole thing is designed for page frame reclaiming, but we can easily adjust it to our needs.

Let's say we want to flush a mapped buffer to a device. It is understandable that we would not want anyone to access the memory while DMA is in progress. Therefore, our fsync() handler can mark all pages of interest as non-present in all PTEs without actually invalidating the virtual addresses themselves… As a result, whenever an app that maps a given page tries to access the corresponding address, a page fault is raised. Since the address itself is valid, our page fault handler gets invoked. The page fault handler can block on a completion (a notification event, in Windows terminology) that gets signaled by the DPC when DMA completes, and at that point it knows it can safely map the page back into the address space, i.e. make it available to the app again. The whole thing can be done totally transparently to the app…
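
The fault-handler half of that scheme on Linux might look roughly like this (a sketch against the 2.6-era fault API; mydrv_state, dma_done and pages[] are hypothetical driver state):

```c
// Sketch: the fault handler blocks until DMA completes, then hands the
// backing page to the VM so the PTE becomes valid again.
#include <linux/mm.h>
#include <linux/completion.h>

static int mydrv_vm_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct mydrv_state *drv = vma->vm_private_data;

    // Signaled from the DMA-complete interrupt path (the DPC analogue).
    wait_for_completion(&drv->dma_done);

    vmf->page = drv->pages[vmf->pgoff];
    get_page(vmf->page);    // the VM layer drops this reference later
    return 0;
}

static const struct vm_operations_struct mydrv_vm_ops = {
    .fault = mydrv_vm_fault,
};
```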

> One can avoid needing to allocate and lock all the needed memory up front by using section objects.

It does not really matter whether you use a section or an anonymous range - once it is pageable, you have to probe and lock it anyway if you intend to use it in DMA transfers (ironically, the only situation in which you would normally consider sharing a buffer - otherwise you would be better off with METHOD_BUFFERED anyway). Therefore, using a section does not really offer a solution…
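
The probe-and-lock step being referred to - a sketch, assuming the user buffer address arrives via a request handled in the caller's context:

```c
// Pin a pageable user buffer before using it as a DMA target.
PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
if (mdl == NULL) {
    return STATUS_INSUFFICIENT_RESOURCES;
}

__try {
    // Faults the pages in and locks them for the duration of the I/O.
    MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(mdl);
    return GetExceptionCode();
}

/* ... run the scatter-gather DMA described by the MDL ... */

MmUnlockPages(mdl);
IoFreeMdl(mdl);
```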

Anton Bassov

> Let's say we want to flush a mapped buffer to a device. It is understandable that we would not want anyone to access the memory while DMA is in progress.

Why? If the DMA is a write - host to device - then it is OK :-)


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

>> whenever you want to make a page unavailable to an app you can simply mark the target page as not present in the PTE and handle synchronization 100% in the driver's code

> How funny a hack.

Well, under Windows it would indeed be a hack, but not a funny one at all - it is such a major one that even the OS itself cannot do it, safely or otherwise…

Anton Bassov

Of course, invalidating page validity on multiprocessor systems is not necessarily free. If the reason for using shared memory is to reduce overhead, you don't really want to be continually altering page translations, as that requires that TLBs be flushed globally and so forth. This is probably not something you want to require on each DMA transaction, if that is what you were referring to.

- S

> Why? If the DMA is a write - host to device - then it is OK :-)

Well, I thought it would be obvious that I was speaking about scatter-gather DMA - since we may allocate pages for the buffer dynamically, they are not guaranteed to be physically contiguous, don't you think…

Therefore, we have to ensure that the buffer is not available to both the CPU and the device at the same time - otherwise you may get a "surprise"…

Anton Bassov

> Under Windows this is, indeed, a bit of a problem, because Windows does not allow you to spot everyone who maps a given page into their address space - the only thing you may know is the number of references to a given page. Therefore, you just cannot mark a given page as non-present in all user PTEs, simply because you haven't got enough info.

This is actually possible on Windows 7 (for file-backed sections) - see CcCoherencyFlushAndPurgeCache.


Pavel Lebedinsky/Windows Kernel Test
This posting is provided “AS IS” with no warranties, and confers no rights.

> Of course, invalidating page validity on multiprocessor systems is not necessarily free. If the reason for using shared memory is to reduce overhead, you don't really want to be continually altering page translations, as that requires that TLBs be flushed globally and so forth. This is probably not something you want to require on each DMA transaction, if that is what you were referring to.

Please note that we perform DMA transactions that require page invalidation only in response to a user request to flush the device file…

Anton Bassov