What is DMA?

I have a lot of information about how to implement DMA in WDM and WDF drivers, but I am missing the general DMA concept. What is DMA from a driver developer’s point of view, and how is it supported in hardware?


Direct Memory Access. It allows peripheral devices with the proper
hardware to read or write a specific “chunk” of memory while the CPU does
something else.

Gary G. Little


Alex Farber wrote:

> What is DMA from a driver developer’s point of view, and how is it supported in hardware?

DMA, as the term is used today, is a loose term for what is more
properly called “bus mastering”. In the original ISA bus, there was
exactly one bus master: the processor. Everything that happened on the
bus was initiated by the processor. With PCI, any device on the bus can
temporarily become the “bus master” and generate its own bus traffic.

Let’s say you want to blit a bitmap from system memory to a graphics
card. You could do it by using RtlMoveMemory in your driver, and that’s
nearly as fast as you can possibly do that transfer, but the processor
can’t do anything else while it is copying. With DMA, I can tell the
graphics chip to go do that transfer. I have to tell it (somehow; it
varies by device) where to copy from, where to copy to, and how far to
go. Then, I just tell it to start the transfer. Now, I can go do
something else in the CPU while the graphics card generates bus traffic
to read from memory.
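
For illustration only, here is roughly what “telling the device” might look like for a purely hypothetical chip with source, destination, length and go registers (the register names and layout are invented; any real device’s programming model will differ):

    // Hypothetical register block, mapped with MmMapIoSpace() from the
    // memory resource PnP assigned to the device.
    typedef struct _BLT_REGS {
        ULONG SrcAddr;     // where to copy from (bus/physical address)
        ULONG DstOffset;   // where to copy to in card memory
        ULONG ByteCount;   // how far to go
        ULONG Control;     // bit 0 = start the transfer
    } BLT_REGS, *PBLT_REGS;

    VOID StartBlit(PBLT_REGS Regs, ULONG SrcAddr, ULONG DstOffset, ULONG Length)
    {
        WRITE_REGISTER_ULONG(&Regs->SrcAddr, SrcAddr);
        WRITE_REGISTER_ULONG(&Regs->DstOffset, DstOffset);
        WRITE_REGISTER_ULONG(&Regs->ByteCount, Length);
        WRITE_REGISTER_ULONG(&Regs->Control, 0x1);   // go
        // The card interrupts when the copy is done; the CPU is free to do
        // other work in the meantime.
    }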

The transfer itself might take a little longer in real-time, but because
the processor can go on to do other things, overall system throughput is
improved.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.


DMA is when the hardware device tells the CPU to hold off and accesses the RAM itself, in much the same way the CPU does.

As for the basics of DMA support in Windows - that is a large chapter in Walter Oney’s book and in the DDK docs, not a subject for a short forum message.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

What is DMA?

DMA is a way for you to offload the work of transferring data between
main memory and the device onto your device. This is in contrast to
programmed I/O (PIO) where you have the processor copying data between
main memory and the device.

PIO can deliver high data rates, but it wastes significant amounts of CPU
time since you’re effectively running memcpy() for every transfer. It’s
better to offload this mundane work to some other processor which can
move the data from A to B and then interrupt (or signal in some other
way) when the transfer is done. The CPU is then left free to do the
things you paid for it to do - math, comparisons, branches, etc…

There are two flavors of DMA - slave-mode and bus-mastering. In
Slave-Mode DMA you would use a device built onto the motherboard (these
days it’s somewhere in the chipset) to transfer from main memory to
device memory. The slave-mode DMA controller has a number of
limitations, the two biggest being that it’s slow to have all these
transfers going through a third party, and that you have to share it
across all slave-mode devices so you have this reserve/release system
which makes things more expensive. Slave-mode DMA seems to be more-or-less obsolete, particularly on PCI devices, but you may still see it on older ISA/PCMCIA devices.

In bus-mastering DMA you place some logic on your device to allow it to
run the DMA cycles itself. The device will steal some bus time and
initiate a memory transfer as if it were another CPU. Data is
transferred directly from main memory into the device’s memory ranges (this is all a simplification). You can have multiple bus-masters running independently of each other, so this is more efficient than having all your devices fight over a single transfer agent (whether it’s the CPU (PIO) or a separate DMA controller (slave-mode)).

What does DMA look like on a device?

From the driver’s point of view there are two aspects to DMA. The first
is how you prepare your data for DMA transfers. The second is how you
program the device to initiate the transfers & how you notice that a
transfer is done.

Let’s talk about the second part first. There are an infinite number of models for programming your device to start a DMA. Each introduces its own limitations. I’ll go over a few of the common ones I’ve seen:

1 - The device takes a single physical address base and
a length for an operation. This is very simple to program, but requires
the transfer to be physically contiguous, which is unlikely for anything
other than the smallest transfers (physical memory is often very
fragmented, so the chance of two adjoining virtual pages using adjoining
physical pages is pretty small). The device will usually interrupt when
the DMA transfer is complete.

2 - The device takes a single physical address base & a
length for each fragment of an operation. It interrupts when it’s done
transferring each fragment, allowing your driver to program in the next
one. This is going to be slow because of the latency between each
fragment, but is still easy to implement.

3 - The device takes a sequence of (physical-address,
length) pairs which describe all the fragments of the transfer. This
sequence is called a “scatter-gather list” (SG List). The device can
then transfer each fragment on its own without the need to interrupt the
CPU until all sections are done. In the simplest version of this, the
driver programs the SG list to the controller through its
registers/ports - writing each element into the device’s internal
memory. The device will only have a limited space for the SG list, so you may only be able to handle 16 fragments in a given transfer (see the sketch after this list).

4 - In the more complex version of 3, the SG list itself
is stored in DMA accessible system memory and the device is programmed
with the physical address and length of the scatter-gather list itself.
The device can then use DMA to transfer the SG list entries into its own
internal buffers. This can reduce the limitations on the length of the
SG list, but requires more complex logic in the DMA controller to handle
it. However this would require the memory holding the SG list to be
physically contiguous.
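
To make 3 and 4 a little more concrete, here is a rough sketch. The descriptor layout, register names and variables are all made up - every device defines its own - but the shape is typical:

    // Hypothetical in-memory SG descriptor; check your hardware spec for
    // the real layout.
    typedef struct _HW_SG_ENTRY {
        ULONGLONG Address;   // physical (bus) address of this fragment
        ULONG     Length;    // byte count for this fragment
        ULONG     Flags;     // e.g. bit 0 = last entry in the list
    } HW_SG_ENTRY, *PHW_SG_ENTRY;

    // Model 3: feed each element to the device through its registers.
    for (i = 0; i < FragmentCount; i++) {
        WRITE_REGISTER_ULONG(&Regs->SgAddrLow,  (ULONG)Fragment[i].Address);
        WRITE_REGISTER_ULONG(&Regs->SgAddrHigh, (ULONG)(Fragment[i].Address >> 32));
        WRITE_REGISTER_ULONG(&Regs->SgLength,   Fragment[i].Length);
    }

    // Model 4: the list itself lives in DMA-visible memory (e.g. a common
    // buffer); the device is handed only its physical address and size.
    WRITE_REGISTER_ULONG(&Regs->SgListAddr, SgListPhysical.LowPart);
    WRITE_REGISTER_ULONG(&Regs->SgListSize, FragmentCount * sizeof(HW_SG_ENTRY));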

Many of these options can be changed to use “common buffer” (a
physically-contiguous, DMA ready buffer that’s shared between the device
and the system). For some devices you cannot just give them an
arbitrary physical address for a scatter gather list, but instead
require that you give them an offset into a “common buffer” that’s
previously been configured for the device. Sometimes you allocate a
“command block” in the common buffer, put a pointer to the SG list in
that, and then write the address of the command block to a register on
the controller. For devices that can handle multiple requests at a time, this can reduce the amount of internal memory that the controller needs & reduce the limitations on the number of requests that can be run at one time.

Now that we have a vague idea of how you might program DMA on a device,
how do you prepare the buffers? This might seem simple - you just get
the physical addresses of the buffer and send them to the controller.
But what if you have a 32-bit controller on a 36- or 64-bit system? Or if the PCI bridge between your device and system memory is broken and must be run in 32-bit mode even when you have a 64-bit controller? Or if the buffer is in memory on a NUMA node that your device can’t see (this last one is hypothetical at the moment)? In these cases your buffer contains pages which your device cannot see, and you need to “bounce” them into auxiliary buffers which the device can see.

Now you may think you could take care of this bouncing yourself if you knew all the limitations of your device. But since we can’t enumerate all future restrictions today, WDM doesn’t tell you what they are. This makes some sense - everyone expects your driver from today to run for the next 5+ years, but we know the underlying system hardware will morph between now and then.

One option is to bounce all of your transfers into common-buffer. This
might even make sense if your transfer consists of many small pieces
(like a network packet, which might have several headers added to it).
Since common-buffer is always visible to your device this is always a
safe place to bounce a transfer. However it can be hard to allocate, so
you probably need to get one big slab initially and then split it up
yourself, which becomes complicated.
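
For what it’s worth, the common buffer itself comes from the adapter object. Roughly (error handling omitted; COMMON_BUFFER_SIZE and how you carve up the slab are your own affair):

    PHYSICAL_ADDRESS deviceAddress;   // address the device uses on the bus
    PVOID            cpuAddress;      // address the driver uses

    cpuAddress = DmaAdapter->DmaOperations->AllocateCommonBuffer(
                     DmaAdapter,
                     COMMON_BUFFER_SIZE,   // one big slab, split up later
                     &deviceAddress,
                     FALSE);               // CacheEnabled
    if (cpuAddress == NULL) {
        // Physically contiguous memory is scarce; this can fail, especially
        // after the system has been up for a while.
    }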

This is where the WDM DMA functions come into play. Functions like
GetScatterGatherList can do the transformation for you - copying only
the pages that are out of range into bounce buffers and giving you back
the scatter gather list to program to your device. You can then initiate the transfer, and when you’re done call PutScatterGatherList to free the SG list and its associated resources. If your device is a
bus-master but doesn’t support scatter gather, you can request
“packet-based” DMA in which case the WDM DMA functions will compose a
single contiguous buffer for your transfer.
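
In rough outline the calls look like this (a sketch only - error handling omitted, and ProgramSgEntry/StartTransfer/MyContext stand in for your own hardware-specific code):

    // The system calls this back once the SG list (including any bounced
    // pages) has been built.  The prototype matches DRIVER_LIST_CONTROL.
    VOID MyListControl(PDEVICE_OBJECT DeviceObject, PIRP Irp,
                       PSCATTER_GATHER_LIST SgList, PVOID Context)
    {
        ULONG i;
        for (i = 0; i < SgList->NumberOfElements; i++) {
            // Program each (physical address, length) pair into the device.
            ProgramSgEntry(Context, SgList->Elements[i].Address,
                           SgList->Elements[i].Length);
        }
        StartTransfer(Context);
        // Hang on to SgList; it goes back to the system when the transfer
        // completes (typically from your DPC).
    }

    // Building the list (caller must be at DISPATCH_LEVEL):
    status = DmaAdapter->DmaOperations->GetScatterGatherList(
                 DmaAdapter, DeviceObject, Mdl,
                 MmGetMdlVirtualAddress(Mdl), TransferLength,
                 MyListControl, MyContext,
                 TRUE);    // TRUE = write to the device (read from memory)

    // ...and in the completion DPC:
    DmaAdapter->DmaOperations->PutScatterGatherList(DmaAdapter, SgList, TRUE);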

Each single bounced page corresponds to a “map register”. So if your
maximum transfer size is 256KB, you need (256KB / PAGE_SIZE) bounce
buffers available to ensure that you can bounce the entire thing into
visible memory. When you first request the DMA_ADAPTER structure from
WDM, you need to provide information on how many map registers you
require.
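
Getting the adapter and the map register count looks roughly like this (a sketch; PhysicalDeviceObject is whatever PDO you were given, and which DEVICE_DESCRIPTION fields you set depends on your hardware):

    DEVICE_DESCRIPTION devDesc;
    ULONG              mapRegisters;   // filled in by the system
    PDMA_ADAPTER       dmaAdapter;

    RtlZeroMemory(&devDesc, sizeof(devDesc));
    devDesc.Version           = DEVICE_DESCRIPTION_VERSION;
    devDesc.Master            = TRUE;        // bus-master, not slave-mode
    devDesc.ScatterGather     = TRUE;        // device can walk an SG list
    devDesc.Dma32BitAddresses = TRUE;
    devDesc.Dma64BitAddresses = FALSE;       // e.g. a 32-bit-only controller
    devDesc.InterfaceType     = PCIBus;
    devDesc.MaximumLength     = 256 * 1024;  // largest single transfer

    dmaAdapter = IoGetDmaAdapter(PhysicalDeviceObject, &devDesc, &mapRegisters);
    // mapRegisters comes back as the number of map registers you may use at
    // once; compare it with ADDRESS_AND_SIZE_TO_SPAN_PAGES(0, 256 * 1024) to
    // see whether a maximum-sized transfer can be mapped in one shot.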

There are a number of other issues with DMA. Hopefully this will get
you started.

thanks,
-p


Peter,

This is the best description of Windows DMA I’ve ever read.

At driver initialization, is there a *reliable* way to tell whether or not bounce buffers will be used, assuming the device can DMA from/to the entire 64-bit address space?

Thanks,
Calvin


See the last paragraph (or so). I’m not aware of anything that tells
you whether you’ll end up using bounce buffers.

-p


Thank you very much for the explanations, this helps me a lot.


That’s too bad; this could be useful information - especially when measuring and optimizing DMA performance. If you see slower than expected performance, you would be able to check whether bounce buffers were used, and that could explain some of the difference in speed. If bounce buffers weren’t used, then some other issue is causing the slowdown, and you may need to investigate other parts of your code (interrupt and DPC handling, maybe) or ask your hardware folks to address it.

Beverly


The classic NT DMA model makes the assumption that the driver knows the start and end of a DMA transaction, so that bounce buffers can be employed by doing FlushXxx, GetXxx, PutXxx, etc. As long as you are doing what the DDK says, the OS will handle this for you and you’ll be fine.
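
(For reference, the packet-based flavor of that sequence looks roughly like this - a sketch with made-up variable names and error handling omitted; MapRegisterBase is what AllocateAdapterChannel’s callback handed you:)

    KeFlushIoBuffers(Mdl, FALSE, TRUE);            // before starting the DMA
    deviceAddress = DmaAdapter->DmaOperations->MapTransfer(
                        DmaAdapter, Mdl, MapRegisterBase,
                        MmGetMdlVirtualAddress(Mdl), &length, TRUE);
    // ...program deviceAddress/length into the device, wait for its
    // interrupt, then in the DPC:
    DmaAdapter->DmaOperations->FlushAdapterBuffers(
        DmaAdapter, Mdl, MapRegisterBase,
        MmGetMdlVirtualAddress(Mdl), length, TRUE);
    DmaAdapter->DmaOperations->FreeMapRegisters(DmaAdapter, MapRegisterBase,
                                                mapRegisterCount);
    // (assuming the AllocateAdapterChannel callback returned
    //  DeallocateObjectKeepRegisters, as bus-master drivers usually do)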

This assumption is not correct for certain types of devices.

There ought to be a way to tell if bounce buffers will be used. The HAL and bus driver should have that knowledge. Windows should not hide that fact, since it is important for such a driver to refuse to load if that is the case.

Calvin Guan (DDK MVP)


Agreed. Instead you have to profile and see if your driver is spending a lot of time copying memory underneath GetScatterGatherList.

This is likely one of those cases of a few bad eggs spoiling it for everyone. We’ve spent a lot of time in the past supporting people who thought they knew how DMA worked on x86 and just grabbed physical addresses directly. I suspect that makes us a little gun-shy when it comes to exposing more of the internals.

I’ll check with the DMA guy - there might be some perf counters that I’m not aware of that you could use to get a handle on this.

-p


From: xxxxx@lists.osr.com on behalf of Brown, Beverly
Sent: Tue 2/28/2006 6:31 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] What is DMA?

That’s too bad, this could be useful information - especially when
measuring and optimizing DMA performance. If you see slower than
exepcted performance you would be able to check to see if bounce buffers
were used and that could explain some of the difference in speed. If
bounce buffers weren’t used, then some other issue is causing the slow
down and you may need to investigate other parts of your code (interrupt
and DPC handling maybe) or ask your hardware folks to address it.

Beverly

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
Sent: Monday, February 27, 2006 10:02 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] What is DMA?

See the last paragraph (or so). I’m not aware of anything that tells
you whether you’ll end up using bounce buffers.

-p

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Calvin Guan
Sent: Monday, February 27, 2006 6:48 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] What is DMA?

Peter,

This is the best description on Windows DMA I’ve ever read.

At driver initialization, is there a *reliable* way to tell whether nor
not bounce-buffer will be used, assuming the device can DMA from/to the
entire 64-bit address space?

Thanks,
Calvin

— Peter Wieland
wrote:

> What is DMA?
>
> DMA is a way for you to offload the work of transferring data between
> main memory and the device onto your device. This is in contrast to
> programmed I/O (PIO) where you have the processor copying data between

> main memory and the device.
>
> PIO results in high data-rates, but it wastes significant amounts of
> CPU time since you’re effectively running memcpy() for every transfer.

> It’s better to offload this mundane work to some other processor which

> can move the data from A to B and then interrupt (or signal in some
> other
> way) when the transfer is done. The CPU is then left free to do the
> things you paid for it to do - math, comparisons, branches, etc…
>
> There are two flavors of DMA - slave-mode and bus-mastering. In
> Slave-Mode DMA you would use a device built onto the motherboard
> (these days it’s somewhere in the chipset) to transfer from main
> memory to device memory. The slave-mode DMA controller has a number
> of limitations, the two biggest being that it’s slow to have all these

> transfers going through a third party, and that you have to share it
> across all slave-mode devices so you have this reserve/release system
> which makes things more expensive. Slave-Mode seems to be
> more-or-less obsolete, particuarly on PCI devices, but you may still
> see it on older ISA/PCMCIA devices.
>
> In bus-mastering DMA you place some logic on your device to allow it
> to run the DMA cycles itself. The device will steal some bus time and

> initiate a memory transfer as if it were another CPU. Data is
> transferred directly from main-memory into the device’s memory ranges
> (this is all a simplficiation). You can have multiple bus-masters
> running independently of each other, so this is more effiicent than
> having all your devices fight over a single transfer agent (whether
> it’s the CPU (PIO) or a separate DMA controller (slave-mode)).
>
> What does DMA look like on a device?
>
> From the driver’s point of view there are two aspects to DMA. The
> first is how you prepare your data for DMA transfers. The second is
> how you program the device to initiate the transfers & how you notice
> that a transfer is done.
>
> Let’s talk about the second part first. There are an infinite number
> of models for programming your device to start a DMA.
> Each introduces its
> own limitations. I’ll go over a few of the common ones i’ve seen:
>
> 1 - The device takes a single physical address base and
a length for
> an operation. This is very simple to program, but requres the
> transfer to be physically contiguous, which is unlikely for anything
> other than the smallest transfers (physical memory is often very
> fragmented, so the chance of two adjoining virtual pages using
> adjoining physical pages is pretty small). The device will usually
> interrupt when the DMA transfer is complete.
>
> 2 - The device takes a single physical address base & a
length for
> each fragment of an operation. It interrupts when it’s done
> transferring each fragment, allowing your driver to program in the
> next one. This is going to be slow because of the latency between
> each fragment, but is still easy to implement.
>
> 3 - The device takes a sequence of
> (physical-address,
> length) pairs which describe all the fragments of the transfer. This
> sequence is called a “scatter-gather list” (SG List). The device can
> then transfer each fragment on its own without the need to interrupt
> the CPU until all sections are done. In the simplest version of this,

> the driver programs the SG list to the controller through its
> registers/ports - writing each element into the device’s internal
> memory. The device will only have a limited space for the SG list, so

> you may only be able to handle 16 fragments in a given transfer.
>
> 4 - In the more complex version of 3, the SG list itself
is stored
> in DMA accessible system memory and the device is programmed with the
> physical address and length of the scatter-gather list itself.
> The device can then use DMA to transfer the SG list entries into its
> own internal buffers. This can reduce the limitations on the length
> of the SG list, but requires more complex logic in the DMA controller
> to handle it. However this would require the memory holding the SG
> list to be physically contiguous.
>
> Many of these options can be changed to use “common buffer” (a
> physically-contiguous, DMA ready buffer that’s shared between the
> device and the system). For some devices you cannot just give them an

> arbitrary physical address for a scatter gather list, but instead
> require that you give them an offset into a “common buffer” that’s
> previously been configured for the device.
> Sometimes you allocate a
> “command block” in the common buffer, put a pointer to the SG list in
> that, and then write the address of the command block to a register on

> the controller. For devices that can handle multiple requests at a
> time, this can reduce the amount of internal memory that the
> controller needs & reduce the limitiations on the number of requests
> that can be run at one time.
>
> Now that we have a vague idea of how you might program DMA on a
> device, how do you prepare the buffers? This might seem simple - you
> just get the physical addresses of the buffer and send them to the
> controller. But what if you have a 32-bit controller on a 36- or
> 64-bit system? Or if the PCI bridge between your device and system
> memory is broken and must be run in 32-bit mode even though you have
> a 64-bit controller? Or if the buffer is in memory on a NUMA node
> that your device can’t see (this last one is hypothetical at the
> moment)? In these cases your buffer contains pages which your device
> cannot see, and you need to “bounce” them into auxiliary buffers
> which the device can see.
>
> Now you may think you could take care of this bouncing yourself if
> you knew all the limitations on your device. But since we can’t
> enumerate all future restrictions today, WDM doesn’t tell you whether
> bouncing will happen. This makes some sense - everyone expects your
> driver from today to run for the next 5+ years, but we know the
> underlying system hardware will morph between now and then.
>
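
The way a WDM driver stays out of the bouncing business is to ask its DMA
adapter for device-visible addresses rather than using raw physical
addresses. A minimal sketch using GetScatterGatherList and
PutScatterGatherList; the MY_DMA_CONTEXT fields and the
ProgramDeviceAndStart helper are placeholders, the DMA_OPERATIONS calls
are real:

    #include <wdm.h>

    typedef struct _MY_DMA_CONTEXT {
        PDEVICE_OBJECT DeviceObject;
        PDMA_ADAPTER   DmaAdapter;          /* from IoGetDmaAdapter */
    } MY_DMA_CONTEXT;

    /* Placeholder: writes SgList->Elements[] to the hardware and starts the
       transfer, e.g. as in the register-programming sketch above.           */
    VOID ProgramDeviceAndStart(MY_DMA_CONTEXT *Ctx, PSCATTER_GATHER_LIST SgList);

    /* Called back by the HAL at DISPATCH_LEVEL once the SG list is ready.
       Every element is already device-visible; if bouncing was needed, the
       HAL has quietly substituted map-register (bounce) pages.              */
    VOID AdapterListControl(PDEVICE_OBJECT DeviceObject, PIRP Irp,
                            PSCATTER_GATHER_LIST SgList, PVOID Context)
    {
        UNREFERENCED_PARAMETER(DeviceObject);
        UNREFERENCED_PARAMETER(Irp);

        ProgramDeviceAndStart((MY_DMA_CONTEXT *)Context, SgList);
    }

    /* Call at DISPATCH_LEVEL with the MDL describing the locked-down buffer. */
    NTSTATUS StartDmaForIrp(MY_DMA_CONTEXT *Ctx, PIRP Irp, BOOLEAN WriteToDevice)
    {
        PMDL Mdl = Irp->MdlAddress;

        return Ctx->DmaAdapter->DmaOperations->GetScatterGatherList(
            Ctx->DmaAdapter,
            Ctx->DeviceObject,
            Mdl,
            MmGetMdlVirtualAddress(Mdl),
            MmGetMdlByteCount(Mdl),
            AdapterListControl,
            Ctx,                            /* handed back as Context above */
            WriteToDevice);                 /* TRUE: memory -> device       */
    }

    /* When the device signals completion, release the list (and any bounce
       pages) with:
         Ctx->DmaAdapter->DmaOperations->PutScatterGatherList(
             Ctx->DmaAdapter, SgList, WriteToDevice);                        */
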
> One option is to bounce all of your transfers into common-buffer.
> This might even make sense if your transfer consists of many small
> pieces (like a network packet, which might have several headers added
> to it).
> Since common-buffer is always visible to your device this is always a
>
=== message truncated ===


I forgot to add that a common buffer is not an option for such a device.
The buffer has to be allocated by the user-mode application.

Calvin Guan (DDK MVP)


Which particular counter-example are you thinking of?

-p


From: xxxxx@lists.osr.com on behalf of Calvin Guan
Sent: Tue 2/28/2006 10:20 AM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] What is DMA?

The classic NT DMA model assumes that the driver knows the start and end
of a DMA transaction, so that bounce buffers can be employed by doing
FlushXxx, GetXxx, PutXxx etc. As long as you do what the DDK says, the OS
will handle this for you and you’ll be fine.

That assumption is not correct for certain types of devices.

There ought to be a way to tell whether bounce buffers will be used. The
HAL plus the bus driver should have that knowledge. Windows should not
hide that fact, since it is important for such a driver to refuse to load
if that is the case.

Calvin Guan (DDK MVP)

— “Brown, Beverly” wrote:

> That’s too bad, this could be useful information - especially when
> measuring and optimizing DMA performance. If you see slower than
> expected performance, you would be able to check whether bounce
> buffers were used, and that could explain some of the difference in
> speed. If bounce buffers weren’t used, then some other issue is
> causing the slowdown and you may need to investigate other parts of
> your code (interrupt and DPC handling, maybe) or ask your hardware
> folks to address it.
>
> Beverly
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of Peter Wieland
> Sent: Monday, February 27, 2006 10:02 PM
> To: Windows System Software Devs Interest List
> Subject: RE: [ntdev] What is DMA?
>
> See the last paragraph (or so). I’m not aware of anything that tells
> you whether you’ll end up using bounce buffers.
>
> -p
>
> -----Original Message-----
> From: xxxxx@lists.osr.com
> [mailto:xxxxx@lists.osr.com] On Behalf Of Calvin Guan
> Sent: Monday, February 27, 2006 6:48 PM
> To: Windows System Software Devs Interest List
> Subject: RE: [ntdev] What is DMA?
>
> Peter,
>
> This is the best description of Windows DMA I’ve ever read.
>
> At driver initialization, is there a reliable way to tell whether or
> not bounce buffers will be used, assuming the device can DMA from/to
> the entire 64-bit address space?
>
> Thanks,
> Calvin

Graphics drivers and the AGP bridge. The D3D/DDraw rendering application
could start a DMA without the graphics driver’s knowledge. Do a dumpbin on
some of the most popular graphics drivers and see how many of them are
using MmAllocateContiguousXxx and/or MmGetPhysicalAddress.

RDMA (Remote DMA) and “kernel bypass” for the next generation of
clustering connection fabric devices. The existing IB (InfiniBand) device
drivers already have this problem if they support kernel bypass in their
driver.

I can discuss the RDMA topic with you offline in more detail if you are
interested, because it’s getting into the area of my current employer’s
business.

Thanks,
Calvin Guan (DDK MVP)

— Peter Wieland wrote:

> Which particular counter-example are you thinking of?
>
> -p

Regarding bounce buffer use, I actually looked into this a few months ago
for a PCI card we are working on. I never found a way to determine in
software whether Windows was going to use bounce buffers. But I think you
can figure this out a priori, if you know your hardware.

In my case, I found that if the PCI card could access all of memory
space, and it supported HW scatter/gather, then there would never be a
need for “bounce buffers”, and Windows wouldn’t use them.

However, our PCI card doesn’t have scatter/gather support, so we get a
bounce buffer from Windows (16k last I checked). Practically, this was
transparent to our driver (because we followed the DMA model), and
despite using bounce buffers it’s still blazing fast (when the driver
isn’t doing DbgPrint()s in the ISR :-) ).

There is a really good paper at:
http://www.microsoft.com/whdc/driver/kernel/default.mspx
describing DMA in wonderful / painful detail, which I’ve been using as
one of my references. But don’t look for bounce buffers, look for “Map
Registers”.

-Mike
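
For the "know your hardware" part: what the driver actually tells Windows
is a DEVICE_DESCRIPTION, and the HAL decides from that (plus what it
knows about the path to memory) whether the map registers it hands back
will ever be backed by bounce buffers. A minimal sketch for a
hypothetical bus-master PCI device; the structure members are real, the
values are illustrative:

    #include <wdm.h>

    NTSTATUS SetupDmaAdapter(PDEVICE_OBJECT Pdo,
                             PDMA_ADAPTER *Adapter,
                             PULONG MaxMapRegisters)
    {
        DEVICE_DESCRIPTION dd;

        RtlZeroMemory(&dd, sizeof(dd));
        dd.Version           = DEVICE_DESCRIPTION_VERSION;
        dd.Master            = TRUE;          /* bus-mastering device            */
        dd.ScatterGather     = TRUE;          /* hardware SG support             */
        dd.Dma32BitAddresses = TRUE;
        dd.Dma64BitAddresses = TRUE;          /* device claims full 64-bit reach */
        dd.InterfaceType     = PCIBus;
        dd.MaximumLength     = 1024 * 1024;   /* largest single transfer         */

        /* The HAL returns the adapter object plus the number of map registers
           it is willing to grant per transfer.  Whether any of those map
           registers end up backed by bounce buffers is its business.          */
        *Adapter = IoGetDmaAdapter(Pdo, &dd, MaxMapRegisters);

        return (*Adapter != NULL) ? STATUS_SUCCESS : STATUS_UNSUCCESSFUL;
    }

With ScatterGather and Dma64BitAddresses both set, the HAL will normally
not need bounce-buffer-backed map registers; as the discussion below
notes, though, a DAC-challenged bridge can still force them.
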


Yeah - I realize I use the terminology a little differently from the
WDK. I actually use both terms.

To me, Map Registers are the logical resources that I allocate up front
when I configure DMA. These could represent any physical entity - a
single-page bounce buffer, a free slot in a DMA MMU (which is where Map
Registers originally came from), or something else entirely.

The bounce buffers are the physical pages that are allocated to
double-buffer portions of a transfer. Each Map Register corresponds to
one (or potentially more) bounce buffers.

Currently the only use for Map Registers is to abstract out the bounce
buffers. In the future they could control hardware DMA permission maps,
hardware DMA MMUs (again), or some combination of all three.

When I think about it in the abstract, I think about Map Registers. When
I really want to think about how my data is flowing around, I think of
bounce buffers.

-p
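
For completeness, here is roughly what the classic packet-based sequence
(the MapXxx/FlushXxx calls mentioned earlier in the thread) looks like,
where the Map Register abstraction is visible to the driver only as the
MapRegisterBase cookie. It assumes the adapter from the IoGetDmaAdapter
sketch above and a model-1 style device, with the device-programming step
left as a placeholder:

    #include <wdm.h>

    /* Placeholder: write DeviceAddress/Length to the hardware and start it. */
    VOID ProgramAndStartHypotheticalDevice(PHYSICAL_ADDRESS DeviceAddress, ULONG Length);

    /* AdapterControl routine passed to AllocateAdapterChannel; runs at
       DISPATCH_LEVEL once the map registers have been reserved.            */
    IO_ALLOCATION_ACTION MyAdapterControl(PDEVICE_OBJECT DeviceObject, PIRP Irp,
                                          PVOID MapRegisterBase, PVOID Context)
    {
        PDMA_ADAPTER Adapter = (PDMA_ADAPTER)Context;  /* sketch only; real code
                                                          keeps this in its
                                                          device extension     */
        PMDL  Mdl    = Irp->MdlAddress;
        ULONG Length = MmGetMdlByteCount(Mdl);
        PHYSICAL_ADDRESS DeviceAddress;

        UNREFERENCED_PARAMETER(DeviceObject);

        /* MapTransfer hands back the device-visible ("logical") address for
           this chunk.  If a bounce buffer is involved, this is where the HAL
           copies the data (for a write) behind your back.                    */
        DeviceAddress = Adapter->DmaOperations->MapTransfer(
            Adapter, Mdl, MapRegisterBase,
            MmGetMdlVirtualAddress(Mdl), &Length,
            TRUE /* WriteToDevice */);

        ProgramAndStartHypotheticalDevice(DeviceAddress, Length);

        return DeallocateObjectKeepRegisters;  /* keep the map registers until
                                                  the transfer completes      */
    }

    /* After the completion interrupt/DPC, the driver flushes and frees:
         Adapter->DmaOperations->FlushAdapterBuffers(Adapter, Mdl, MapRegisterBase,
             MmGetMdlVirtualAddress(Mdl), Length, TRUE);
         Adapter->DmaOperations->FreeMapRegisters(Adapter, MapRegisterBase,
             NumberOfMapRegisters);                                            */
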


— Michael Becker wrote:

> In my case, I found that if the PCI card could access all of memory
> space, and it supported HW scatter/gather, then there would never be
> a need for “bounce buffers”, and Windows wouldn’t use them.

This is not always true. Let’s say your device has a 32-bit bus, or it’s
a 64-bit PCI-X card plugged into a 32-bit PCI slot, and the system has
more than 4GB of memory; your device also supports SGL and DAC. Based on
that, you may believe bounce buffers will not be used, since your device
is designed to be capable of DMA to the entire 64-bit memory space. But
your driver does not know whether or not all the bridges between your
card and the memory controller are DAC-capable. Unfortunately, there are
stupid bridges that don’t do DAC. In such a case, Windows would have to
use bounce buffers.

Calvin Guan (DDK MVP)


Calvin Guan wrote:

> Unfortunately, there are stupid bridges that don’t do DAC. In such a
> case, Windows would have to use bounce buffers.

I never realized that! Thanks for the info (and the warning)! Obviously
the test machine I’m using doesn’t have any stupid bridges. One of the
mistakes I made initially during development was telling Windows that the
HW did have scatter/gather, and I got a full SG list for the user-mode
buffer - no bounce buffers at all.

Hello,

Peter Wieland wrote:

What is DMA?

[…and very good explanations…]

In case you did not notice until now: Peter started a blog. Of course,
his first topic is DMA:

http://blogs.msdn.com/peterwie/

Thanks to Doron for mentioning it on his blog.

Regards,
Spiro.


Spiro R. Trikaliotis http://cbm4win.sf.net/
http://www.trikaliotis.net/ http://www.viceteam.org/