KMDF DMA for continuous streaming operations

Delmondo_Giorgio · May 28, 2015, 9:37am

Hello,

I am designing a driver for a custom PCI FPGA board.
The user allocates big chunks of memory using malloc and passes the pointer and the length to the driver with an IoControl. The driver locks each page (MmProbeAndLockPages) and should send back a list of structures containing the physical address of each page.

The user then creates a DMA chain and sends it to the driver to start the transactions. The transaction could last forever (the typical situation) with no software intervention, as the last element could point back to the first one.
I know this approach is unsafe, but it is required to maintain full user-space compatibility with a very old board we are replacing. Also, the hardware has 32-bit-only addressing capability.

I cannot find any way to handle this with the KMDF DMA API. Do you have any suggestions?

Thanks,
Giorgio

Alex_Grig · May 28, 2015, 11:24am

This ain’t gonna work. There is no guarantee that the pages will have 32-bit physical addresses. You’ll have to always boot the OS with a memory limit.

You should be giving the user virtual addresses back instead of physical ones. The application will build the chain from these virtual addresses. Then the driver will check that the VAs are in the known buffer, translate them to physical addresses (from the known SGL for the buffer) and build the real request.
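
To make the 32-bit concern concrete, here is a minimal sketch in plain user-mode C (not driver code) of the check a driver would have to pass for every locked page before handing its address to a 32-bit-only device. The function names and the flat array of physical addresses are assumptions for illustration; they are not part of any Windows API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when [pa, pa + len) lies entirely below 4 GB, i.e. it can be
 * reached by a device that drives only 32 address bits. */
static bool dma32_reachable(uint64_t pa, uint64_t len)
{
    const uint64_t limit = UINT64_C(1) << 32;
    if (len == 0 || len > limit)
        return false;
    return pa <= limit - len;
}

/* Counts the pages of a locked buffer that the device could not reach.
 * page_pa[] is a hypothetical stand-in for the per-page physical
 * addresses the driver obtains after locking the buffer. */
static size_t count_unreachable_pages(const uint64_t *page_pa, size_t n,
                                      uint64_t page_size)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; i++) {
        if (!dma32_reachable(page_pa[i], page_size))
            bad++;
    }
    return bad;
}
```

If the count is ever non-zero, the pages cannot be given to the device directly, which is the situation the rest of the thread is about.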

Delmondo_Giorgio · May 28, 2015, 12:20pm

Thanks for your fast reply.

Unfortunately the API I must provide consists of an IOCTL where the user gives
me one descriptor at a time, with a 32-bit physical address as the identifier
of a page of the previously allocated buffer.

Is there any way to set up map registers without performing a DMA transfer?
I mean, I am looking at the MapTransfer and AllocateAdapterChannel WDM calls.
Both of them assume that a single Allocate*, Map* pair causes a transfer, waits
for its end, etc.
In my case the user allocates one or more buffers, pushes one or more descriptors
and then starts the DMA operations.

Sorry if my terminology is not correct. I am a Linux driver developer (and
just did the Linux driver for this device), and a Windows newbie.

Thanks,
G

Tim_Roberts · May 28, 2015, 1:35pm

xxxxx@gmail.com wrote:

> I am designing a driver for a custom PCI FPGA board.
> The user allocates big chunks of memory using malloc and passes the pointer and the length to the driver with an IoControl. The driver locks each page (MmProbeAndLockPages) and should send back a list of structures containing the physical address of each page.

That design is totally unsafe. Remember that user-mode processes are
not secure. You can write to essentially any user-mode process from
another process. That means a malicious user-mode process could alter
those physical addresses and cause no end of harm.

The user/driver exchange should deal with virtual addresses only.

> The user then creates a DMA chain and sends it to the driver to start the transactions. The transaction could last forever (the typical situation) with no software intervention, as the last element could point back to the first one.
> I know this approach is unsafe, but it is required to maintain full user-space compatibility with a very old board we are replacing. Also, the hardware has 32-bit-only addressing capability.
>
> I cannot find any way to handle this with the KMDF DMA API. Do you have any suggestions?

I suppose you could lie to the user. Remember the virtual address, then
send back “handles” instead of “physical addresses”. When you get the
request to start DMA, you can use the safe and secure DMA abstraction to
do the actual mapping, and substitute the actual physical addresses for
the handles you got back.

That doesn’t necessarily help with the 32-bit limitation, however. You
might be forced into allocating a common buffer to do the DMA, and then
copy the data back and forth.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Alex_Grig · May 28, 2015, 1:52pm

What kind of transfers does the application perform?

Delmondo_Giorgio · May 28, 2015, 2:59pm

The application is a frame grabber.
The main application allocates two big chunks of ~20MB and generates
descriptors so that it is woken up when one buffer has completed. The
application also implements some other transfers for self-test purposes
(both input and output, with the same mechanism).
No user intervention is required after the DMA has been started.
Unfortunately the software is certified and cannot be modified at all; I
have to adapt my driver to this model.

Tim, I know that this model is totally unsafe, but it is required and it is
an embedded application in a controlled environment.

Since the data rate is high, I would like to avoid using a big kernel buffer
and then a memcpy.

Tim_Roberts · May 28, 2015, 3:24pm

Giorgio Delmondo wrote:

> The application is a frame grabber.
> The main application allocates two big chunks of ~20MB and generates
> descriptors so that it is woken up when one buffer has completed. The
> application also implements some other transfers for self-test
> purposes (both input and output, with the same mechanism).

I have certainly worked with frame grabbers that did this. The
interface was designed in the MS-DOS days, when the line between kernel
and user processing was fuzzy. That doesn’t mean it is right.

> No user intervention is required after the DMA has been started.
> Unfortunately the software is certified and cannot be modified at
> all; I have to adapt my driver to this model.

How can a software certification possibly continue to be valid when the
driver it requires is rewritten?

> Tim, I know that this model is totally unsafe, but it is required and
> it is an embedded application in a controlled environment.
>
> Since the data rate is high, I would like to avoid using a big kernel
> buffer and then a memcpy.

How high? Most people dramatically overestimate the cost of memcpy. If
this is an old design, as it seems, then the data rates are probably not
that large.

Do not rule out the fake address/handle scheme.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Delmondo_Giorgio · May 29, 2015, 8:13am

> That doesn’t mean it is right.

I agree…

> How can a software certification possibly continue to be valid when the
> driver it requires is rewritten?

A good question. Also the hw is different. But the customer is happy…

> How high?

About 100 MB/s

Thanks for your help!

Peter_Viscarola_OSR · May 29, 2015, 9:07am

Ugh. That’s a terrible API… it’s terrible now and was terrible when it was designed. Malloc 20MB of data? Great.

The OP is almost certainly dealing with a piece of medical equipment… And it’s probably 32-bit only. The UM software is tested and certified, a lengthy process, and that process does not account for the drivers. That’s a separate testing and cert process.

I’m dealing with rewriting the (non-DMA) drivers for just such a system now, based on NT V4 drivers. It is an architectural “pleasure.”

OP: Mr. Roberts’ scheme of returning handles instead of PAs is a good one. Heck, use the VAs (byte swap them if you want to be covert) as the handles. ProbeAndLock the pages during the initial IOCTL. Will the transfer always be for a virtually contiguous range in the pre-allocated buffer? If so, then it should be easy to create a buffer descriptor of UVAs… which is a contiguous virtual memory range… which is just a data buffer for you to build an MDL for, in the caller’s process context, at EvtIoInCallerContext. After that, just do standard packet-based processing.

Peter
OSR
@OSRDrivers

Delmondo_Giorgio · May 29, 2015, 9:32am

Hi Peter,
I did as all of you suggested and my driver “looks” like it is working.
My only issue is with the 32-bit address limitation: without using the DMA
API, but manually locking the pages, I really don’t know how to remap them into
the lower 4GB (as they are user allocated). On the target environment it is
currently not a problem, as it does not have enough memory, but I want to avoid
issues in the future.

Under Linux (sorry…) I can use a DMA API call to set up mapping registers
to remap those pages on the bus, avoiding copies. I wonder if such a trick
can also be done on Windows…

Giorgio

Alex_Grig · May 29, 2015, 10:32am

Your best bet for fake physical addresses is to use virtual addresses. Any fudged page handle may not work, because the application may try to pass address + page offset back to the driver.
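
The offset problem can be shown with a small, self-contained C sketch. It compares two hypothetical handle encodings: the plain user virtual address, and a byte-swapped one as floated earlier in the thread. All names here are made up for illustration, and 32-bit user addresses are assumed.

```c
#include <stdint.h>

/* Reverses the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Scheme A (hypothetical): the handle is the page's user VA, unchanged. */
static uint32_t va_handle(uint32_t page_va)        { return page_va; }
static uint32_t decode_va_handle(uint32_t handle)  { return handle; }

/* Scheme B (hypothetical): the handle is the byte-swapped user VA. */
static uint32_t swapped_handle(uint32_t page_va)        { return bswap32(page_va); }
static uint32_t decode_swapped_handle(uint32_t handle)  { return bswap32(handle); }

/* Models an application that adds a byte offset to a handle it was given
 * before passing it back to the driver. */
static uint32_t app_add_offset(uint32_t handle, uint32_t offset)
{
    return handle + offset;
}
```

With scheme A, `decode_va_handle(app_add_offset(va_handle(va), off))` is still `va + off`, so the driver can locate the page and the offset. With scheme B the addition lands in the wrong byte and the decoded value no longer points into the buffer, so any opaque encoding is only safe if the application never does arithmetic on the values it receives.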

Tim_Roberts · May 29, 2015, 12:39pm

Giorgio Delmondo wrote:

> How high?
> About 100 MB/s

That’s trivial. Copying that much data is way less than 1% of a modern CPU.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tim_Roberts · May 29, 2015, 12:44pm

Giorgio Delmondo wrote:

> I did as all of you suggested and my driver “looks” like it is working.
> My only issue is with the 32-bit address limitation: without using the DMA
> API, but manually locking the pages, I really don’t know how to remap
> them into the lower 4GB (as they are user allocated).

Right. You can’t remap physical addresses. You have to copy the pages.

> On the target environment it is currently not a problem, as it does not
> have enough memory, but I want to avoid issues in the future.

Remember that, because of the PCI address hole, a machine with 4GB RAM
will always have some portion above the 4GB mark.

> Under Linux (sorry…) I can use a DMA API call to set up mapping registers
> to remap those pages on the bus, avoiding copies. I wonder if such a trick
> can also be done on Windows…

That only works if your bus hardware has remapping registers. Most do
not. Windows has the concept of “mapping registers”, but in most cases
that involves a behind-the-scenes copy to a region below 4GB. I’m
wondering if your Linux code is doing the same thing.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Peter_Viscarola_OSR · May 29, 2015, 1:46pm

Right. That won’t work. Which is why I said to “build an MDL … at
EvtIoInCallerContext, and after that just do standard packet based
processing.”

This will solve the 32-bit addressing issue (which I wasn’t aware you had, but never mind)…

Peter
OSR
@OSRDrivers

Maxim_S_Shatskih · May 31, 2015, 3:37pm

Surely it can be done on Windows, just build your own MDL.

Yes, fill the MDL tail (the page frame number array that follows the MDL header) by hand. The tricks your user app is playing with the physical addresses are already beyond any notion of “good architecture”, so filling the MDL tail by hand will not make things worse.

–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com


Delmondo_Giorgio · June 1, 2015, 4:08am

I am not sure I understand.
I see two ways to do it:

1. use the IOMMU
2. allocate pages in the lower memory and trick the MMU into moving the user
   virtual address to the new physical address

How can this be done by modifying the MDL tail?

Maxim_S_Shatskih · June 1, 2015, 6:33am

>How this can be done by modifying the MDL tails?

Your task is to a) accept the physical addresses from user mode b) run DMA over them.

So, just hand-craft the MDL tail with the physical addresses you’ve got from user mode.

Then pass this MDL to the standard DMA APIs. If an IOMMU is present (which is rare), it will be used; otherwise double-buffering will be used. The DMA adapter object will do all of this for you.

–
Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com


Peter_Viscarola_OSR · June 1, 2015, 9:11am

Windows will handle the copy from high memory to low memory for any pages that must be copied. It’ll provide you with the SG list that includes these segments.

There is no need to build the MDL by hand as Mr. Shatskih is suggesting IF the user data buffer is virtually contiguous in user space, and how can it not be? As I said last week, just build the MDL the conventional way in the context of the caller… and use the standard KMDF packet-based DMA functions.

Assuming that I understand your project, this really isn’t hard. What MAKES it hard is worrying about physical addresses and such in user mode… which is outdated and dangerous. Instead of returning PAs to the user, return VAs. The user-mode code won’t know the difference (32-bits is 32-bits, right?), but your DRIVER code will. Build the MDL from those VAs and you’ve got a very ordinary packet-model transfer.

Peter
OSR
@OSRDrivers

Delmondo_Giorgio · June 1, 2015, 11:32am

> IF the user data buffer is virtually contiguous in user space, and how can
> it not be.

Unfortunately it can be.

The API the user is given is the following:

ALLOCATE_BUFFER, where a virtually contiguous buffer is allocated. It can
be called as many times as the user requests. An array of “pages” is
returned.

PUSH_DESCRIPTOR, where the USER composes the scatter/gather list, referring
to the physical_address/descriptor/… given by the previous call.

START DMA/STOP DMA

Wait dma block

Using PUSH_DESCRIPTOR the user can potentially set up virtually
non-contiguous chains in blocks of PAGE_SIZE.

The other issue blocking me from using the proper DMA API is that once
the DMA is started it might never complete (loopback of the chain),
and I just get end-of-block interrupts, where a block is composed of
multiple pages.

The Windows DMA API, AFAIK, wants DMA transfers that have an end. The only
way I have found to trick it is to issue a new transfer each time a DPC
finds an end-of-block interrupt. This implies tracking the scatter/gather
list in the driver, which is a nightmare.

Am I missing something?

Tim_Roberts · June 1, 2015, 1:25pm

Giorgio Delmondo wrote:

> > IF the user data buffer is virtually contiguous in user space, and
> > how can it not be.
>
> Unfortunately it can be.
>
> The API the user is given is the following:
>
> ALLOCATE_BUFFER, where a virtually contiguous buffer is allocated.
> It can be called as many times as the user requests. An array of
> “pages” is returned.
>
> PUSH_DESCRIPTOR, where the USER composes the scatter/gather list,
> referring to the physical_address/descriptor/… given by the previous
> call.
>
> START DMA/STOP DMA
>
> Wait dma block
>
> Using PUSH_DESCRIPTOR the user can potentially set up virtually
> non-contiguous chains in blocks of PAGE_SIZE.

Yes, but do they actually DO that? Many APIs allow things that don’t
make sense, and because of that don’t get used. Any app writer with a
brain is just going to allocate a buffer, and use the pages in that
buffer, in order.

For the third time, I’d like to point out that you could solve many
of your problems here by returning handles or virtual addresses instead
of physical addresses. That leaves the physical address manipulation
entirely in the driver’s control. You remember in your driver context
the list of allocated buffers and their sizes. Then, after you collect
all the descriptors, you say

    for each descriptor
        for each buffer
            if this descriptor is within this buffer
                pull the physical address from the MDL based on the
                offset within the buffer

If they pass an address that was not in one of the allocated buffers,
then they violated the contract.
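
The loop above could be sketched roughly as follows. This is plain, self-contained C that models the lookup only; it is not driver code. The `locked_buffer` structure and `translate_descriptor` are hypothetical names, and the `page_pa` array stands in for whatever per-page information the driver keeps from the MDL it built when the buffer was locked (in a real driver that would be derived from the MDL's page frame numbers).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PAGE_BYTES (1u << PAGE_SHIFT)

/* One buffer registered through the (hypothetical) ALLOCATE_BUFFER IOCTL. */
struct locked_buffer {
    uint32_t base_va;         /* page-aligned user VA, handed out as the handle */
    uint32_t length;          /* size of the buffer in bytes */
    const uint64_t *page_pa;  /* physical address of each page (stand-in for the MDL) */
};

/* Translates one descriptor handle (a user VA) to a physical address.
 * Returns false when the handle is not inside any registered buffer,
 * i.e. the application violated the contract. */
static bool translate_descriptor(const struct locked_buffer *bufs, size_t nbufs,
                                 uint32_t handle, uint64_t *pa_out)
{
    for (size_t i = 0; i < nbufs; i++) {
        const struct locked_buffer *b = &bufs[i];
        if (handle < b->base_va)
            continue;
        uint32_t offset = handle - b->base_va;
        if (offset >= b->length)
            continue;
        /* Page index selects the page; the low bits are the offset in it. */
        *pa_out = b->page_pa[offset >> PAGE_SHIFT] + (offset & (PAGE_BYTES - 1u));
        return true;
    }
    return false;
}
```

Because the physical addresses never leave the driver, a bad or malicious handle can at worst be rejected; it cannot steer the device at arbitrary memory.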

> The other issue blocking me from using the proper DMA API is that
> once the DMA is started it might never complete (loopback of the
> chain), and I just get end-of-block interrupts, where a block is
> composed of multiple pages.
>
> The Windows DMA API, AFAIK, wants DMA transfers that have an end. The
> only way I have found to trick it is to issue a new transfer each time
> a DPC finds an end-of-block interrupt. This implies tracking the
> scatter/gather list in the driver, which is a nightmare.

Not necessarily. Not every DMA scenario fits the KMDF abstraction. You
could use WdfDmaEnablerWdmGetDmaAdapter to fetch a DMA_ADAPTER, and call
BuildScatterGatherList by hand.

However, if you are going to need mapping registers (bounce buffers) to
compensate for your 32-bit address limitation, you have a problem.
First, those bounce buffers are a severely limited resource – typically
only about 64k bytes. That means your transaction has to be chopped up
into 64k increments, so that the data can be copied to/from the bounce
buffers.
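
As a rough illustration of what that chopping means, the following self-contained C sketch splits a transfer into pieces no larger than a given limit. The 64 KB figure is the "typical" value quoted above, not a guaranteed constant, and `chunk_count` and `chunk_at` are hypothetical helpers rather than any Windows API.

```c
#include <stddef.h>
#include <stdint.h>

/* One piece of a larger transfer. */
struct chunk {
    uint64_t offset;  /* byte offset from the start of the transfer */
    uint32_t length;  /* bytes in this piece, never more than the limit */
};

/* Number of pieces needed to move `total` bytes at most `max_chunk` at a time. */
static size_t chunk_count(uint64_t total, uint32_t max_chunk)
{
    if (max_chunk == 0)
        return 0;
    return (size_t)((total + max_chunk - 1) / max_chunk);
}

/* Fills in piece number `index`. Returns 1 on success, 0 if out of range. */
static int chunk_at(uint64_t total, uint32_t max_chunk, size_t index,
                    struct chunk *out)
{
    if (index >= chunk_count(total, max_chunk))
        return 0;
    uint64_t offset = (uint64_t)index * max_chunk;
    uint64_t remaining = total - offset;
    out->offset = offset;
    out->length = remaining < max_chunk ? (uint32_t)remaining : max_chunk;
    return 1;
}
```

For one of the ~20MB buffers mentioned earlier, `chunk_count(20u * 1024 * 1024, 64u * 1024)` is 320, so a single pass over the buffer would already be 320 separate pieces.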

If you truly want uninterrupted long DMA transfers without worrying
about multiple transactions, then I don’t see that you have any choice
other than allocating a common buffer below 4GB and copying the data to
the user buffers. That allows you to do the looping you describe. If
you have to compensate for addresses above 4GB, then there is **ALWAYS**
going to be a copy involved. There is no other alternative. With the
abstraction, that copy is hidden in the DMA transaction. Without the
abstraction, you do it explicitly. The performance is the same.
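
A minimal user-mode model of the common-buffer approach might look like this. It is only a sketch under assumptions: the `bounce_ring` structure and `ring_block_done` are hypothetical names, the `common` array stands in for a common buffer allocated below 4GB that the device loops over forever, and `ring_block_done` models what the driver would do each time the device reports a completed block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical bookkeeping for a looping DMA into a low common buffer. */
struct bounce_ring {
    const uint8_t *common;  /* stand-in for the common buffer below 4 GB */
    size_t block_size;      /* bytes per end-of-block interrupt */
    size_t block_count;     /* blocks in the common buffer (device loops over them) */
    uint8_t *user;          /* stand-in for the locked user buffer */
    size_t user_size;       /* assumed to be a multiple of block_size */
    size_t user_pos;        /* next write offset in the user buffer */
};

/* Copies completed block `block` into the user buffer and advances the
 * write position, wrapping at the end. Returns the user offset written. */
static size_t ring_block_done(struct bounce_ring *r, size_t block)
{
    size_t dst = r->user_pos;
    const uint8_t *src = r->common + (block % r->block_count) * r->block_size;
    memcpy(r->user + dst, src, r->block_size);
    r->user_pos = (dst + r->block_size) % r->user_size;
    return dst;
}
```

The device never sees the user pages in this model, so the chain it walks can be built once and left looping, and the only per-block work is the copy.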

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.