What is DMA?
DMA is a way for you to offload the work of transferring data between
main memory and the device onto the device itself. This is in contrast
to programmed I/O (PIO), where the processor copies data between main
memory and the device.
PIO can achieve high data rates, but it wastes significant amounts of
CPU time since you’re effectively running memcpy() for every transfer.
It’s
better to offload this mundane work to some other processor which can
move the data from A to B and then interrupt (or signal in some other
way) when the transfer is done. The CPU is then left free to do the
things you paid for it to do - math, comparisons, branches, etc…
There are two flavors of DMA - slave-mode and bus-mastering. In
Slave-Mode DMA you would use a device built onto the motherboard (these
days it’s somewhere in the chipset) to transfer from main memory to
device memory. The slave-mode DMA controller has a number of
limitations, the two biggest being that routing every transfer through
a third party is slow, and that it’s shared across all slave-mode
devices, so you have this reserve/release system which makes things
more expensive. Slave-mode seems to be more-or-less obsolete,
particularly on PCI devices, but you may still see it on older
ISA/PCMCIA devices.
In bus-mastering DMA you place some logic on your device to allow it to
run the DMA cycles itself. The device will steal some bus time and
initiate a memory transfer as if it were another CPU. Data is
transferred directly from main-memory into the device’s memory ranges
(this is all a simplification). You can have multiple bus-masters
running independently of each other, so this is more efficient than
having all your devices fight over a single transfer agent (whether it’s
the CPU (PIO) or a separate DMA controller (slave-mode)).
What does DMA look like on a device?
From the driver’s point of view there are two aspects to DMA. The first
is how you prepare your data for DMA transfers. The second is how you
program the device to initiate the transfers & how you notice that a
transfer is done.
Let’s talk about the second part first. There are an infinite number of
models for programming your device to start a DMA. Each introduces its
own limitations. I’ll go over a few of the common ones I’ve seen:
1 - The device takes a single physical address base and
a length for an operation. This is very simple to program, but requires
the transfer to be physically contiguous, which is unlikely for anything
other than the smallest transfers (physical memory is often very
fragmented, so the chance of two adjoining virtual pages using adjoining
physical pages is pretty small). The device will usually interrupt when
the DMA transfer is complete.
2 - The device takes a single physical address base & a
length for each fragment of an operation. It interrupts when it’s done
transferring each fragment, allowing your driver to program in the next
one. This is going to be slow because of the latency between each
fragment, but is still easy to implement.
3 - The device takes a sequence of (physical-address,
length) pairs which describe all the fragments of the transfer. This
sequence is called a “scatter-gather list” (SG List). The device can
then transfer each fragment on its own without the need to interrupt the
CPU until all sections are done. In the simplest version of this, the
driver programs the SG list to the controller through its
registers/ports - writing each element into the device’s internal
memory. The device will only have a limited space for the SG list, so
you may only be able to handle 16 fragments in a given transfer.
4 - In the more complex version of 3, the SG list itself
is stored in DMA accessible system memory and the device is programmed
with the physical address and length of the scatter-gather list itself.
The device can then use DMA to transfer the SG list entries into its own
internal buffers. This can reduce the limitations on the length of the
SG list, but requires more complex logic in the DMA controller to handle
it. However this would require the memory holding the SG list to be
physically contiguous.
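To make options 3 & 4 concrete, here’s a rough user-mode sketch. The register and field layouts below are completely made up - every real controller defines its own - but the programming sequence is the part that matters:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SG entry layout - real devices each define their own. */
typedef struct {
    uint64_t addr;    /* physical address of one fragment */
    uint32_t length;  /* bytes at that address */
    uint32_t flags;   /* e.g. a "last entry" marker */
} sg_entry;

#define SG_FLAG_LAST 0x1u

/* Hypothetical registers for option 4: the device is given only the
 * physical address and count of the SG list, and fetches the entries
 * itself by DMA.  The entry array must therefore be physically
 * contiguous. */
typedef struct {
    uint64_t sg_list_phys;
    uint32_t sg_list_count;
    uint32_t ctrl;            /* bit 0 = go */
} fake_sg_regs;

/* Mark the final entry and hand the list to the device. */
static void start_sg_transfer(volatile fake_sg_regs *regs,
                              sg_entry *list, uint32_t count,
                              uint64_t list_phys)
{
    list[count - 1].flags |= SG_FLAG_LAST;
    regs->sg_list_phys  = list_phys;
    regs->sg_list_count = count;
    regs->ctrl         |= 0x1u;   /* device interrupts when all done */
}
```

In the simpler option-3 variant you’d loop over the same entries and write each one into the device’s internal slots instead of handing over a single list address.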
Many of these options can be changed to use “common buffer” (a
physically-contiguous, DMA ready buffer that’s shared between the device
and the system). For some devices you cannot just give them an
arbitrary physical address for a scatter gather list, but instead
require that you give them an offset into a “common buffer” that’s
previously been configured for the device. Sometimes you allocate a
“command block” in the common buffer, put a pointer to the SG list in
that, and then write the address of the command block to a register on
the controller. For devices that can handle multiple requests at a
time, this can reduce the amount of internal memory that the controller
needs & reduce the limitations on the number of requests that can be
run at one time.
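A hypothetical command-block layout might look like the following. Again, the fields are invented for illustration; the controller’s spec dictates the real layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command block living in the device's common buffer. */
typedef struct {
    uint32_t opcode;        /* e.g. 1 = read, 2 = write */
    uint32_t sg_count;      /* number of SG entries */
    uint64_t sg_list_phys;  /* SG list, also inside the common buffer */
    uint64_t byte_count;    /* total transfer length */
} command_block;

/* Fill in a command; the driver would then write the block's address
 * (or common-buffer offset) to a single doorbell register on the
 * device, which pulls in everything else by DMA. */
static void build_command(command_block *cb, uint32_t opcode,
                          uint64_t sg_phys, uint32_t sg_count,
                          uint64_t bytes)
{
    cb->opcode       = opcode;
    cb->sg_count     = sg_count;
    cb->sg_list_phys = sg_phys;
    cb->byte_count   = bytes;
}
```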
Now that we have a vague idea of how you might program DMA on a device,
how do you prepare the buffers? This might seem simple - you just get
the physical addresses of the buffer and send them to the controller.
But what if you have a 32-bit controller on a 36- or 64-bit system? Or
if the PCI bridge between your device and system memory is broken and
must be run in 32-bit mode even when you have a 64-bit controller? Or
if the buffer is in memory on a NUMA node that your device can’t see
(this last one is hypothetical at the moment)? In these cases your
buffer contains pages which your device cannot see, and you need to
“bounce” them into auxiliary buffers which the device can see.
Now you may think you could take care of this bouncing yourself if you
knew all the limitations on your device. But since we can’t enumerate
all future restrictions today, WDM doesn’t expose them to you. This
makes some sense - everyone expects your driver from today to run for
the next 5+ years, but we know the underlying system hardware will
morph between now and then.
One option is to bounce all of your transfers into common-buffer. This
might even make sense if your transfer consists of many small pieces
(like a network packet, which might have several headers added to it).
Since common-buffer is always visible to your device this is always a
safe place to bounce a transfer. However it can be hard to allocate, so
you probably need to get one big slab initially and then split it up
yourself, which becomes complicated.
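To make the visibility check concrete, here’s a minimal sketch for a hypothetical device that can only address 32 bits of physical memory:

```c
#include <assert.h>
#include <stdint.h>

/* A device that can only address 32 bits of physical memory. */
#define DEV_ADDR_LIMIT 0xFFFFFFFFull

/* A fragment must be bounced if any byte of it lies above the
 * device's addressing limit; the data is then copied through an
 * auxiliary buffer the device *can* reach. */
static int needs_bounce(uint64_t phys, uint32_t len)
{
    return (phys + len - 1) > DEV_ADDR_LIMIT;
}
```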
This is where the WDM DMA functions come into play. Functions like
GetScatterGatherList can do the transformation for you - copying only
the pages that are out of range into bounce buffers and giving you back
the scatter gather list to program to your device. You can then
initiate the transfer, and when you’re done call PutScatterGatherList
to free the SG list and its associated resources. If your device is a
bus-master but doesn’t support scatter gather, you can request
“packet-based” DMA in which case the WDM DMA functions will compose a
single contiguous buffer for your transfer.
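As a sketch of what your driver sees: the structures below are user-mode stand-ins modeled on the WDK’s SCATTER_GATHER_ELEMENT / SCATTER_GATHER_LIST (the real list is variable-length), and the function shows the shape of a typical callback passed to GetScatterGatherList:

```c
#include <assert.h>
#include <stdint.h>

/* User-mode stand-ins modeled on the WDK's SCATTER_GATHER_ELEMENT /
 * SCATTER_GATHER_LIST (the real list is variable-length). */
typedef struct {
    uint64_t Address;  /* device-visible physical address */
    uint32_t Length;   /* bytes at that address */
} sg_element;

typedef struct {
    uint32_t   NumberOfElements;
    sg_element Elements[8];
} sg_list;

/* The shape of a typical AdapterListControl callback body: walk the
 * list the system built for you, programming each (Address, Length)
 * pair into the device.  Here we just total the bytes to show the
 * walk. */
static uint64_t program_from_sg_list(const sg_list *sgl)
{
    uint64_t total = 0;
    for (uint32_t i = 0; i < sgl->NumberOfElements; i++)
        total += sgl->Elements[i].Length;
    return total;
}
```

The point is that by the time your callback runs, any bouncing has already happened - every Address in the list is one the device can reach.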
Each single bounced page corresponds to a “map register”. So if your
maximum transfer size is 256KB, you need (256KB / PAGE_SIZE) + 1 map
registers available (the +1 because a transfer that doesn’t start on a
page boundary spills into one extra page) to ensure that you can bounce
the entire thing into visible memory. When you first request the
DMA_ADAPTER structure from WDM, you need to provide information on how
many map registers you require.
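The arithmetic is simple; this sketch mirrors the idea behind the WDK’s ADDRESS_AND_SIZE_TO_SPAN_PAGES macro, assuming 4KB pages:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  /* assuming x86's 4KB pages */

/* Worst-case map registers for a transfer: the number of pages it
 * spans.  A buffer that doesn't start on a page boundary spills into
 * one extra page. */
static uint32_t span_pages(uint64_t va, uint64_t length)
{
    uint64_t first = va / PAGE_SIZE;
    uint64_t last  = (va + length - 1) / PAGE_SIZE;
    return (uint32_t)(last - first + 1);
}
```

So a page-aligned 256KB transfer spans 64 pages, but the same 256KB starting at an unaligned address spans 65.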
There are a number of other issues with DMA. Hopefully this will get
you started.
thanks,
-p
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Alex Farber
Sent: Monday, February 27, 2006 7:22 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] What is DMA?
I have a lot of information how to implement DMA in WDM and WDF drivers,
but I am missing general DMA concept. What is DMA from driver
developer’s point of view, and how is it supported in hardware?