Efficient Continuous DMA

Hi all,

In the WDM/WDF DMA model, the recommended way to do continuous DMA is to
use a common buffer. But what if a driver needed to perform continuous
DMA to an externally-supplied circular buffer? Copying to a common buffer
could be unnecessarily expensive. One obvious solution would be to use a
repeating sequence of packet-based transfers, i.e. break the circular
buffer into segments and repeatedly call
GetScatterGatherList()/PutScatterGatherList() on each segment. The
problem with this approach is that the logical addresses returned by
GetScatterGatherList() (or MapTransfer() if you’re old-school) aren’t
guaranteed to be the same for successive calls on the same buffer segment.
This means you’ll have to reprogram your device for each buffer segment,
which can be expensive. And it can be even more expensive if the call to
GetScatterGatherList() involves programming physical->logical address
translations into an IOMMU.
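
To make the segmenting concrete, here is a minimal sketch in portable C of how the circular buffer might be cut into segments. It is not WDM code: `struct segment` and `ring_segment` are hypothetical names, and each segment it returns is simply the region a driver would hand to one GetScatterGatherList()/PutScatterGatherList() pair.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one slice of the circular buffer. */
struct segment {
    size_t offset;   /* byte offset of the slice within the buffer */
    size_t length;   /* slice length; the last slice may be shorter */
};

/* Return the n-th segment in transfer order, wrapping around the buffer. */
static struct segment ring_segment(size_t buffer_len, size_t segment_len, size_t n)
{
    assert(buffer_len > 0 && segment_len > 0);
    size_t count = (buffer_len + segment_len - 1) / segment_len; /* round up */
    struct segment s;
    s.offset = (n % count) * segment_len;
    s.length = buffer_len - s.offset < segment_len ? buffer_len - s.offset
                                                   : segment_len;
    return s;
}
```

With a 10-byte buffer and 4-byte segments, this yields segments at offsets 0, 4 and 8 (the last one 2 bytes long) and then wraps back to offset 0. The cost described above comes from having to remap and reprogram the device for every one of those segments on every pass.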

It seems like one thing the WDM/WDF DMA APIs could really use is a
separate routine that *only* does DMA synchronization on a buffer, without
regenerating the logical address mapping. Depending on when it’s called
relative to when the buffer segment is transferred, it could do any
necessary cache-flushing or bounce-buffer copying. So for a continuous
transfer, a driver could call GetScatterGatherList() only once to
establish the logical address mapping, and then just repeatedly call this
new synchronization routine on each buffer segment during the transfer,
finally calling PutScatterGatherList() to tear down the mapping when the
transfer is stopped.
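
As a rough illustration of what such a routine would do, here is a small portable C model. Nothing in it is a real WDM/WDF API: `struct dma_mapping`, `enum sync_dir` and `dma_sync_segment` are hypothetical names, and the bounce buffer is modeled as a plain second array. The point is only that the mapping is set up once and the per-segment call does nothing but copy (or, on a real system, flush caches) in the right direction.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not a WDM/WDF API. */
enum sync_dir { SYNC_FOR_CPU, SYNC_FOR_DEVICE };

struct dma_mapping {
    unsigned char *client;  /* externally supplied circular buffer */
    unsigned char *bounce;  /* device-visible copy; NULL if mapped directly */
    size_t length;          /* size of each buffer in bytes */
};

/* Synchronize one segment without regenerating the address mapping. */
static void dma_sync_segment(struct dma_mapping *m, size_t offset,
                             size_t len, enum sync_dir dir)
{
    assert(offset <= m->length && len <= m->length - offset);
    if (m->bounce == NULL)
        return;  /* mapped directly; a real OS would flush caches here if needed */
    if (dir == SYNC_FOR_CPU)
        memcpy(m->client + offset, m->bounce + offset, len); /* device wrote */
    else
        memcpy(m->bounce + offset, m->client + offset, len); /* CPU wrote */
}
```

In this model a driver would build the `dma_mapping` once at start, call `dma_sync_segment(..., SYNC_FOR_CPU)` after the device fills a segment (or `SYNC_FOR_DEVICE` before the device reads one), and discard the mapping only when the transfer stops.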

I understand that the packet-based APIs were designed to enforce equal
distribution of map registers among devices, and this approach could allow
a driver to reserve a set of map registers for a long time. But in my
experience map registers really aren’t *that* limited, at least with
32-bit HW on a 64-bit OS (32-bit HW on a 32-bit OS w/ PAE is a different
story, but I’m choosing to ignore that case here:). It’s also interesting
to note that other OSes already have this kind of de-coupled
synchronization routine:

Mac OS X: IODMACommand::synchronize()
*BSD: bus_dmamap_sync()
Linux: dma_sync_sg_for_cpu()
Solaris: ddi_dma_sync()

So is this a limitation that other people have encountered w/ WDM? If so,
is it just something I have to live with, or are there tricks I’m not
aware of for working around it?

Thanks,
Jason

Thx for your feedback.

I don’t know if this will meet your needs but someone before mentioned this article: http://www.osronline.com/article.cfm?id=39
I.e., you could expose your common buffer to your app.

Egi.

In a perfect world, yeah, we’d always be able to allocate our own buffer and map it. But there are cases where the buffer *has* to be allocated elsewhere, often in userspace. Even without those cases, the physically contiguous common buffer allocation could be a problem for larger buffers, and trying to stitch together multiple smaller common buffers can make things really complex.

Yeah… what you’re talking about is static map register allocation. Lots of operating systems have supported this in the past. In fact, historically, this is probably the more common approach.

In the world of IOMMUs, it probably makes a lot of sense for the Windows folks to re-think how the Windows DMA APIs can be used, and it probably makes sense to provide for some sort of static allocation. Doing this in a backwards-compatible way, where a device capable of 32-bit addressing can be used on systems with more than 4GB of memory, might not be such a great idea. But I suspect that can be managed.

Let’s look at the problem you’re posing, and see how it works out in “real life” TODAY:

Givens: Your driver is handed a data buffer which it needs to set up for continuous mode DMA with your device.

There are only two situations that I can think of that matter:

a) The buffer is outside the DMA addressing range of your device (for example, your device is 32-bit DMA capable and one or more fragments of the buffer are located at or above 4GB). In this case, you’re out of luck and it won’t work in any case. It’s NOT a packet-based transfer you’re seeking… so Windows bounce-buffers don’t solve the problem.

b) The buffer is WITHIN the DMA addressing range of your device (64-bit busmaster DMA device on a system with 4GB or more of memory). In this case, EXCEPT FOR THE IOMMU CASE, your device bus logical address will be the physical address of the memory segment… so you’re OK.
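
The distinction between the two cases comes down to a range check on each physical fragment of the buffer. A minimal sketch in portable C, assuming no IOMMU so that the bus logical address equals the physical address (`fragment_reachable` is a hypothetical helper, not a Windows API):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if [phys, phys + len) lies entirely below 2^addr_bits,
 * i.e. the device can address the whole fragment directly. */
static bool fragment_reachable(uint64_t phys, uint64_t len, unsigned addr_bits)
{
    assert(addr_bits >= 1 && addr_bits <= 64 && len > 0);
    if (addr_bits == 64)
        return len - 1 <= UINT64_MAX - phys;    /* fails only on wraparound */
    uint64_t limit = (uint64_t)1 << addr_bits;  /* first unreachable address */
    return phys < limit && len <= limit - phys;
}
```

A buffer falls under case (a) as soon as one of its fragments fails this check for the device's address width, and under case (b) when every fragment passes.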

So… as a practical matter, today what you’re proposing isn’t really an issue.

The real solution, I’m sorry to say, is to use a common buffer approach as already discussed. But you know this. This gets you the static allocation of “map registers” that you seek, with the gains of not having to re-program your hardware. Of course, it doesn’t allow you to randomly locate your data buffer, and this seems to be part of your overall requirement.

Peter
OSR

On a non-IOMMU machine case a) will incur *some* buffer copying regardless of the OS–it really comes down to only wanting to bounce the pages that are out-of-range as opposed to bouncing the whole buffer to a common buffer.

My hunch in case b) is that the SGL could be reused even in the IOMMU case, *provided* there aren’t any intermediary caches that would need to be flushed and KeFlushIoBuffers (no-op anyway) would be enough for cache synchronization. With an IOMMU, this would probably work even with 32-bit HW since bounce buffers should never be required on such a machine.

Not that this is really ever a supported use case for the packet-based API anyway:)

Btw, does anyone know if there’s any version of Windows yet that uses VT-d for DMA remapping? I’d heard a rumor that Server 2008 does.

> Yeah… what you’re talking about is static map register allocation.

Can you:

  • allocate the map registers in the adapter control routine
  • save MapRegisterBase somewhere and return DeallocateObjectKeepRegisters
  • then, when you need to actually map an MDL (a short enough MDL), just call IoMapTransfer with the saved MapRegisterBase, outside of the adapter control routine
  • then program the hardware and do the DMA
  • then call IoFlushAdapterBuffers
  • then reuse MapRegisterBase for the next MDL

Will this work?


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

We seem to be talking at cross-purposes.

Case A “Some buffer copy”?? No, it won’t work at all. You specifically said “continuous mode DMA”… and by definition, this means that there’s no way for the system (or anything outside of the peripheral DMA controller) to know when a DMA operation is started or finished. So, bounce buffers do not solve this problem. The ONLY thing that solves this problem (again, short of an IOMMU) is locating the buffer fragment within the addressing range of the device.

Case B “just works.”

I guess I’m confused as to what your actual issue is… Are we talking generically in operating system theory for some graduate school class, or are we talking about how to solve a real problem on a real Windows machine today?

Peter
OSR

> Btw, does anyone know if there’s any version of Windows yet that uses VT-d for DMA remapping? I’d heard a rumor that Server 2008 does.

Considering the hardware that S2K8 will run on, and when it was released, I kind of doubt it. I mean, it could special case it where available, but there are only a few publicly available chipsets that even support it at this point, and I believe that msft is waiting for ‘VT-d2’ or whatever, with its ability to remap interrupts (maybe).

mm

Maxim, I actually have the same question about the legacy DMA API. Repeatedly calling FlushAdapterBuffers should work to synchronize caches/bounce buffers for a read operation, but for a write (which I also have to support for my HW), MapTransfer would have to be called repeatedly since AFAIK that’s where bounce buffer synchronization is done for writes. And I’m not sure if a repeated MapTransfer call would even work, let alone return the same logical address.

Peter, you’re talking about what I’d call *unsynchronized* continuous DMA–in that case, you’re right that bounce buffers will never work.
What I’m talking about is a device that continuously transfers data to a circular buffer, but also at least has some mechanism for generating an interrupt after every n bytes have been transferred–synchronization of a buffer (or buffer segment as it were) is done in response to such an interrupt, as is notification of the controlling app that more data is available. Depending on the type of data being transferred, the device or even the driver may also have some flow-control mechanism for halting/suspending the transfer if buffer segments can’t be serviced quickly enough.
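
The bookkeeping this implies can be sketched in portable C. This is a hypothetical model, not driver code: `struct ring_state`, `ring_on_segment_done` and `ring_service_one` are made-up names, and the segments are assumed to be equal-sized and completed in order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a circular buffer split into segments. */
struct ring_state {
    size_t segment_count;  /* number of segments in the circular buffer */
    size_t next_done;      /* segment the device will complete next */
    size_t next_service;   /* oldest segment not yet handed to the app */
    size_t pending;        /* completed but not yet serviced segments */
};

/* Called from the "n bytes transferred" interrupt path. Returns false when
 * every segment is pending, i.e. the transfer must be paused (flow control). */
static bool ring_on_segment_done(struct ring_state *r)
{
    assert(r->pending < r->segment_count);
    r->next_done = (r->next_done + 1) % r->segment_count;
    r->pending++;
    return r->pending < r->segment_count;
}

/* Called once a segment has been synchronized and the app notified.
 * Returns the index of the segment that was serviced. */
static size_t ring_service_one(struct ring_state *r)
{
    assert(r->pending > 0);
    size_t seg = r->next_service;
    r->next_service = (r->next_service + 1) % r->segment_count;
    r->pending--;
    return seg;
}
```

The synchronization step itself (cache flush or bounce-buffer copy) would happen between the two calls, on the segment that `ring_service_one` is about to return.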

All the continuously-transferring devices I’m familiar with have such a feature–and yes, I’m dealing with real hardware on real Windows machines.

Really? I’ve never seen a device that works that way… network cards, storage controllers, etc usually maintain data structures in host memory that describe a set of ring buffers (pointers, state, etc). These data structures are randomly (and thus, effectively “continuously”) accessed by the bus-master adapter. As the buffers are filled by the adapter, the state of the buffers is indicated directly within the common buffer memory block. The adapter will generate an interrupt when it overruns the available buffers… but that’s not intended to be a synchronization mechanism.
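
For comparison, the ring-of-descriptors design can be modeled in a few lines of portable C. The descriptor layout and the `DESC_OWNED_BY_DEVICE` flag are hypothetical (every adapter defines its own), and a real driver would also need the appropriate memory barriers around the flag accesses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_OWNED_BY_DEVICE 0x1u  /* hypothetical flag layout */

struct rx_desc {
    uint64_t buffer_addr;  /* device-visible address of the data buffer */
    uint32_t length;       /* bytes written by the device */
    uint32_t flags;        /* DESC_OWNED_BY_DEVICE while the device may write */
};

/* Walk the ring from *head and collect descriptors the device has handed
 * back. Each one is returned to the device; the count harvested is returned. */
static size_t harvest_completed(struct rx_desc *ring, size_t count, size_t *head)
{
    size_t harvested = 0;
    assert(count > 0 && *head < count);
    while (harvested < count && !(ring[*head].flags & DESC_OWNED_BY_DEVICE)) {
        /* a real driver would consume ring[*head].length bytes here */
        ring[*head].flags |= DESC_OWNED_BY_DEVICE;
        *head = (*head + 1) % count;
        harvested++;
    }
    return harvested;
}
```

Here the buffer state lives in the shared descriptor memory itself, so the driver learns what is complete by reading flags rather than by counting interrupts.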

I’d call what you’re describing a hybrid packet-based / continuous DMA design.

OK, OK… now at least I have a clue as to what you’re talking about.

And yes… In THAT case, using Max’s approach should work. It USED to work, in any case. We know this because you COULD call MapTransfer multiple times using a different offset from which to start mapping, without having to call (gad! What is it??) AllocateAdapterChannel.

I’d have to look at some of my old DMA code to be certain, but I believe I’ve used (and even written about) this approach long ago in the dark ages of, like, NT V3.51 or something.

I have no idea whether it’ll pass driver verifier, or (heck) even if you can still use this pre-Windows 2000 (“classic model”) DMA approach. There was a conditional in NTDDK.H that defined these functions, long after the W2K release.

But, even if you COULD get these crufty old buddies to work, you have to wonder if it’s sensible to base a new driver design on functions that were outdated almost 10 years ago. Will they work on new buses? Will they work on 64-bit Windows? Will they work after the next service pack??

Peter
OSR

The MIPS and Alpha AXP ports of Windows NT used IOMMUs, and those were
supported from Windows NT version 3.1 through 4.0. So, to answer your
question very literally, yes, there are versions of Windows that use IOMMUs
(though not VT-d) for DMA remapping.

No version of Windows that you can buy today supports an IOMMU (VT-d or
otherwise.) We’re considering it.


Jake Oshins
Hyper-V I/O Architect
Windows Kernel Group

This post implies no warranties and confers no rights.



>buses? Will they work on 64-bit Windows? Will they work after the next service pack??

Are they officially deprecated?


Maxim S. Shatskih

There’s the difference–I’ve never dealt w/ network/storage, but I’ve done a lot of work w/ video capture & data acquisition, though these devices can be made to do unsynchronized DMA also.

And that’s the biggest reason I’d be hesitant to use them. I’m even beginning to worry that new driver model features will start to be *WDF*-only.

That’s a reasonable concern, really. At least for the future. Because, we ARE already seeing that requirement (for USB, for example).

Max’s point (again!) is valid, however: MapTransfer is NOT officially deprecated. IoMapTransfer is deprecated.

So these functions might be old and crufty, but the prototypes DO appear in WDM.H, they’re not marked deprecated, the documentation appears in the WDK (and doesn’t say anything about not using them), and the code in the various HALs for this function and its friends looks to be functional even on x64 (at least through Vista, which is what I checked).

If you feel like you really need the performance, this might wind up being your best approach. Some quick prototyping should lead you in the right direction.

Peter
OSR

> Max’s point (again!) is valid, however: MapTransfer is NOT officially deprecated. IoMapTransfer is deprecated.

IoMapTransfer is a stupid wrapper around ->MapTransfer


Maxim S. Shatskih