DMA "streaming" in small pieces?

Hi all,

I am using KMDF to manage DMA from a PCI device to a userspace
ring buffer. The current implementation initializes the transaction
from a single MDL (ring buffer does not wrap) or a chained MDL (ring
buffer does wrap). The transaction creates a SCATTER_GATHER_LIST which is
programmed into the device. The transaction is over when the device
signals that fact with a single interrupt, upon which I call
WdfDmaTransactionDmaCompleted(). So far so simple.

As a “minor” modification of that scheme, I program the device with
smaller chunks. The SCATTER_GATHER_ELEMENT elements of the
SCATTER_GATHER_LIST are split up into pieces of, say, a maximum of 32
bytes, which are then programmed into the device. The device is told
to signal an interrupt after every such piece, to let userspace see
data with low latency.
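
The splitting step described above can be sketched in plain C. This is only an illustration, not driver code: `sg_element` is a hypothetical stand-in for SCATTER_GATHER_ELEMENT, and `split_sg_elements` is a made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SCATTER_GATHER_ELEMENT: a bus address plus a length. */
typedef struct {
    uint64_t address;
    uint32_t length;
} sg_element;

/*
 * Split each input element into pieces of at most max_piece bytes.
 * Writes at most out_cap pieces to out and returns the number of pieces the
 * full split needs, so the caller can detect an output array that is too small.
 */
size_t split_sg_elements(const sg_element *in, size_t in_count,
                         uint32_t max_piece,
                         sg_element *out, size_t out_cap)
{
    size_t produced = 0;

    assert(max_piece > 0);

    for (size_t i = 0; i < in_count; i++) {
        uint64_t address = in[i].address;
        uint32_t remaining = in[i].length;

        while (remaining > 0) {
            uint32_t piece = remaining < max_piece ? remaining : max_piece;

            if (produced < out_cap) {
                out[produced].address = address;
                out[produced].length = piece;
            }
            produced++;
            address += piece;
            remaining -= piece;
        }
    }
    return produced;
}
```

An 80-byte element followed by a 32-byte element, split at 32 bytes, yields pieces of 32, 32, 16 and 32 bytes; each piece would then be programmed into the device as its own descriptor.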

My question is: how do I flush DMA memory toward the CPU? Bounce
buffers are most likely involved all over (userspace memory). Can I
use WdfDmaTransactionDmaCompletedWithLength() to flush small pieces of
a larger DMA trans(fer|action)? Is EvtProgramDma() called repeatedly,
as I read in the MSDN doc?

Thank you in advance,
Joerg Faschingbauer

On Tue, Dec 15, 2015 at 7:55 AM, Jörg Faschingbauer <xxxxx@aon.at> wrote:

> WdfDmaTransactionDmaCompletedWithLength

The source code is online on GitHub. See
https://github.com/Microsoft/Windows-Driver-Frameworks/blob/master/src/framework/kmdf/src/dma/base/fxdmatransaction.cpp

WdfDmaTransactionDmaCompletedWithLength ends up calling
PutScatterGatherList which will do the right thing for map registers.
However by breaking large dma transactions up into many small dma
transactions you are going to add a lot of dma transaction overhead to your
IO operations.

Mark Roddy

On 12/15/2015 03:21 PM, Mark Roddy wrote:

> On Tue, Dec 15, 2015 at 7:55 AM, Jörg Faschingbauer wrote:
>
>> WdfDmaTransactionDmaCompletedWithLength
>
> The source code is online in github.
> See https://github.com/Microsoft/Windows-Driver-Frameworks/blob/master/src/framework/kmdf/src/dma/base/fxdmatransaction.cpp
>
> WdfDmaTransactionDmaCompletedWithLength ends up calling
> PutScatterGatherList which will do the right thing for map registers.
> However by breaking large dma transactions up into many small dma
> transactions you are going to add a lot of dma transaction overhead to
> your IO operations.

I’ll have to live with that.

PutScatterGatherList() gives back the entire list. My question was, can
I use WdfDmaTransactionDmaCompletedWithLength() to give back part of the
list, be it even half of a SCATTER_GATHER_ELEMENT?

Joerg

Jörg Faschingbauer wrote:

> As a “minor” modification of that scheme, I program the device with
> smaller chunks. The SCATTER_GATHER_ELEMENT elements of the
> SCATTER_GATHER_LIST are split up into pieces of, say, a maximum of 32
> bytes, which are then programmed into the device. The device is told
> to signal an interrupt after every such piece, to let userspace see
> data with low latency.

How do you expect that to work? How is the user app going to know data
is available? 32 bytes on a PCI bus takes about 300ns. If you’re
waiting on an event, getting a notification passed back through the I/O
system, switching to user mode, task-switching your process in, and
getting your thread fired up is going to take tens of microseconds.
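
The 300ns figure can be reproduced with a rough estimate. The sketch below assumes conventional 32-bit, 33 MHz PCI (about 30 ns per clock), one 4-byte data phase per clock, and two overhead clocks for the address phase and turnaround. Those parameters and the helper name `pci_burst_ns` are assumptions made for illustration, not figures taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rough duration of one PCI burst in nanoseconds: one data phase per
 * bus-width worth of bytes (rounded up), plus a fixed number of overhead
 * clocks. All parameters are supplied by the caller; nothing is measured.
 */
uint32_t pci_burst_ns(uint32_t bytes, uint32_t bus_width_bytes,
                      uint32_t clock_period_ns, uint32_t overhead_clocks)
{
    uint32_t data_phases;

    assert(bus_width_bytes > 0);

    data_phases = (bytes + bus_width_bytes - 1) / bus_width_bytes;
    return (data_phases + overhead_clocks) * clock_period_ns;
}
```

With these assumptions, `pci_burst_ns(32, 4, 30, 2)` gives 300, which is about two orders of magnitude below the tens of microseconds quoted for waking a user-mode thread.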

> My question is: how do I flush DMA memory toward the CPU? Bounce
> buffers are most likely involved all over (userspace memory).

Bounce buffers will only be involved if your hardware cannot handle
64-bit physical addresses.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

On 12/15/2015 07:53 PM, Tim Roberts wrote:

> Jörg Faschingbauer wrote:
>> As a “minor” modification of that scheme, I program the device with
>> smaller chunks. The SCATTER_GATHER_ELEMENT elements of the
>> SCATTER_GATHER_LIST are split up into pieces of, say, a maximum of 32
>> bytes, which are then programmed into the device. The device is told
>> to signal an interrupt after every such piece, to let userspace see
>> data with low latency.
>
> How do you expect that to work? How is the user app going to know data
> is available? 32 bytes on a PCI bus takes about 300ns. If you’re
> waiting on an event, getting a notification passed back through the I/O
> system, switching to user mode, task-switching your process in, and
> getting your thread fired up is going to take tens of microseconds.

Data is generated externally, received by the device (which is a high
precision measurement device), and delivered by the device over PCI.
Frames don’t come in at PCI speed, but at that of the external source
(CAN bus, for example).

The current procedure is to set up one 32-byte DMA transaction for every
frame, and to have one frame/transaction active at a time. This means
that I have to set up the transaction again every time a frame has come
in, which can hurt performance if frames come in at a high rate on
multiple devices.

The next idea is to set up a bigger transaction of, say, 100*32 bytes, and
to give back 32 bytes on each frame interrupt, using
WdfDmaTransactionDmaCompletedWithLength(32 bytes). The transaction
would only have to be set up again after all 100 frames are over.

Would that be possible?
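
The intended bookkeeping can be modeled in plain C. This is a sketch of the idea only: `frame_window` and its functions are hypothetical names, and nothing here calls KMDF. In the driver, the per-frame step would be the WdfDmaTransactionDmaCompletedWithLength() call being asked about.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one large DMA transaction handed back frame by frame. */
typedef struct {
    uint32_t total_bytes;     /* size of the whole transaction, e.g. 100 * 32 */
    uint32_t completed_bytes; /* bytes already handed back on frame interrupts */
} frame_window;

void frame_window_init(frame_window *w, uint32_t frame_bytes, uint32_t frame_count)
{
    w->total_bytes = frame_bytes * frame_count;
    w->completed_bytes = 0;
}

/*
 * Called once per frame interrupt. Returns 1 when the whole transaction has
 * been consumed and a new one has to be set up, 0 otherwise.
 */
int frame_window_complete(frame_window *w, uint32_t frame_bytes)
{
    assert(frame_bytes <= w->total_bytes - w->completed_bytes);

    w->completed_bytes += frame_bytes;
    return w->completed_bytes == w->total_bytes;
}
```

For 100 frames of 32 bytes, `frame_window_complete` returns 1 only on the last frame, so in this model the set-up cost is paid once per 100 frames instead of once per frame.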

>> My question is: how do I flush DMA memory toward the CPU? Bounce
>> buffers are most likely involved all over (userspace memory).
>
> Bounce buffers will only be involved if your hardware cannot handle
> 64-bit physical addresses.

It cannot.

Thanks,
Joerg

IIRC, WdfDmaTransactionDmaCompletedWithLength will do EXACTLY what you’re asking. I’ve never called this API myself, however. But from what I remember of reading the code, and as you should verify for yourself by trying it and by reading the code on GitHub as Mr. Roddy implied, this should do what you want.

Peter
OSR
@OSRDrivers

On 12/15/2015 08:57 PM, xxxxx@osr.com wrote:

>> My question was, can I use
>> WdfDmaTransactionDmaCompletedWithLength() to give back part of the
>> list, be it even half of a SCATTER_GATHER_ELEMENT?
>
> IIRC, WdfDmaTransactionDmaCompletedWithLength will do EXACTLY what
> you’re asking. I’ve never called this API myself, however. But from
> what I remember of reading the code, and as you should verify for
> yourself by trying it and by reading the code on GitHub as Mr. Roddy
> implied, this should do what you want.

Done.

FxDmaTransactionBase::DmaCompleted() (when called with
FxDmaCompletionTypePartial, which
WdfDmaTransactionDmaCompletedWithLength() does) appears to adjust all
necessary pieces based on the length that I give.

Then it calls FxDmaScatterGatherTransaction::TransferCompleted() to
free the current SGL, and
FxDmaScatterGatherTransaction::StageTransfer() which builds a new SGL.

Is that right?

If yes, then performance-wise this appears to do the same as my
current implementation: transaction cleanup and transaction re-setup
after every 32 bytes, I believe.
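
That conclusion can be checked with a small counting model. It encodes the reading of the source given above as an assumption (every partial completion frees the current SGL, and a new one is staged while bytes remain). It does not call the framework, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count scatter/gather list builds for total_bytes of data when every
 * completion covers chunk_bytes. ASSUMED behavior: each partial completion
 * releases the current list, and a new list is staged while bytes remain.
 */
uint32_t sgl_builds_with_partial_completion(uint32_t total_bytes, uint32_t chunk_bytes)
{
    uint32_t builds = 0;
    uint32_t remaining = total_bytes;

    assert(chunk_bytes > 0);

    while (remaining > 0) {
        builds++; /* a list is staged for the bytes that are left */
        /* the completion of this chunk then releases that list */
        remaining -= (remaining < chunk_bytes) ? remaining : chunk_bytes;
    }
    return builds;
}

/* Baseline: one independent transaction, and so one list build, per frame. */
uint32_t sgl_builds_one_transaction_per_frame(uint32_t frame_count)
{
    return frame_count;
}
```

Under that assumption, 3200 bytes completed in 32-byte steps cost 100 list builds, the same as 100 separate one-frame transactions. Only a completion covering the whole transfer gets by with a single build.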

Thanks,
Joerg

That’s how it looks, after a brief review, to ME.

So, with any luck, you’re “good to go” as they say.

Peter
OSR
@OSRDrivers