Philosophical question: DMA vs. Programmed I/O

OSR_Community_User · March 17, 2003, 12:24pm

Devs,

I develop drivers for a family of PCI/PMC data acquisition boards based on a
PLX PCI controller chipset. All of our products have input and output
buffers, as required. As compared to the PCI bus, the data transfer rates
in and out of the cable side of the card are fairly slow. Buffers usually
run about 32k bytes, max.

Those at the home office chant the mantra, “DMA good. PIO bad.” I can see
situations where DMA is clearly the best choice, but I question the why with
our cards. The people at the home office have never written device drivers,
and therefore know a hell of a lot more about these things than I
do…oops - was I being sarcastic?

The PIO sequence of events (for receive) is:

1) Get an interrupt when the buffer is getting full.
2) Spin off a DPC.
3) In the DPC, copy from the buffer location (a single address) directly
into the user buffer until the buffer status reads empty.
4) Complete the IRP, or set up to wait for another data interrupt, as
required.
5) Get on with life, not worrying about timeouts, DMA aborts, etc.
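
A minimal sketch of the drain loop in the PIO sequence above, written against a simulated FIFO so it runs outside a driver. The sim_fifo type and all function names are hypothetical and are not the board's real interface. In an actual driver, fifo_read would be a register read such as READ_REGISTER_ULONG on the mapped FIFO address, and fifo_empty would test the board's status register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated single-address FIFO (hypothetical; stands in for the board). */
typedef struct {
    const uint32_t *contents; /* words queued in the FIFO */
    size_t count;             /* total words queued */
    size_t next;              /* index of the next word to pop */
} sim_fifo;

/* Stand-in for reading the "buffer empty" status bit. */
static int fifo_empty(const sim_fifo *f)
{
    return f->next >= f->count;
}

/* Stand-in for one 32-bit read of the FIFO data register. */
static uint32_t fifo_read(sim_fifo *f)
{
    assert(!fifo_empty(f));
    return f->contents[f->next++];
}

/* Copy words until the FIFO reads empty or dst is full.
 * Returns the number of words copied. */
size_t pio_drain(sim_fifo *f, uint32_t *dst, size_t dst_words)
{
    size_t n = 0;
    while (n < dst_words && !fifo_empty(f))
        dst[n++] = fifo_read(f);
    return n;
}
```

The loop stops on whichever comes first: the FIFO reading empty or the destination filling up. The second case is the one where the driver has to wait for another IRP before it can drain the rest.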

The DMA sequence is:

1) Get an interrupt when the buffer is getting full.
2) Spin off a DPC.
3) In the DPC, make an educated guess about how much data (minimum) is in
the buffer (there is no register giving the exact number of bytes in the
buffer).

4a) for non-scatter-gather:

4a1) get the mapping registers.
4a2) initiate the transfer.
4a3) Start a timer and wait for DMA complete interrupt.
4a4) spin off DPC.
4a5) In the DPC, clean up (stop timer, release map registers, etc.)
4a6) initiate the next stage (if not enough registers available, or IRP is
not full).

4b) for scatter-gather:

4b1) set up the scatter-gather list.
4b2) initiate the transfer.
4b3) start a timer and wait for DMA complete interrupt.
4b4) spin off DPC.
4b5) In the DPC, clean up (stop timer, release SGL, map registers, etc.)
4b6) initiate the next stage (IRP is not full, etc.).

I assume that for some magic size of transfer, there is greater CPU overhead
for doing DMA vs. just copying the data from the buffer at full PCI speed.
The question is, what is that size, and how does one figure it out?

Thanks,

Evan Hillman

OSR_Community_User · March 17, 2003, 2:19pm

Standard ‘rule of thumb’: for multiples of PAGE_SIZE, use DMA. Exactly which
multiplier would that be? I give up. Implement both approaches and measure
it.

However, dedicated systems may get better data-transfer performance with
PIO by hogging a CPU. Obviously, other programs on the system will suffer.

Peter_Viscarola_OSR · March 17, 2003, 6:40pm

“DMA good. PIO bad.” I used to chant that myself. It’s an old
mini-computer thing. Sorta like line-frequency clocks and current-loop
terminal interfaces.

This is a VERY hard equation to evaluate, because of the potential for
variations. Not to mention that everything depends on how you define
“goodness.” As Mark indicated in his reply, you can sometimes buy
throughput at the cost of CPU time. However, your note SPECIFICALLY asked
about CPU overhead, which I’ll take to mean instruction cycles.

So, let’s confine our analysis to how many instructions will get executed.
For PIO: Assuming your FIFO is 32 bits wide, if you’re doing a 32KB
transfer, that means you’ve got 8,192 register accesses (32,768 bytes at
4 bytes per access), each one of which serializes your I/O bus. And that
doesn’t even count the overhead needed either to copy the data using
METHOD_BUFFERED (another 8000+ instructions) or to set up and tear down the
mapping in kernel mode with MmMapLockedPages/MmUnmapLockedPages (and its
attendant TLB flushing).
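
The PIO arithmetic above can be sketched as a few lines of C. The function names are made up for illustration. Counting one instruction per FIFO read and one more per word for a buffered copy is a simplifying assumption that follows the rough estimate in this post; it is not a measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Number of FIFO reads needed to move `bytes` through a port that is
 * `width_bytes` wide, rounding up for a partial final word. */
size_t pio_accesses(size_t bytes, size_t width_bytes)
{
    assert(width_bytes > 0);
    return (bytes + width_bytes - 1) / width_bytes;
}

/* Very rough instruction count for a PIO transfer (assumption: one
 * instruction per register read, plus one per word when the data is
 * copied again, as with METHOD_BUFFERED). */
size_t pio_instructions(size_t bytes, size_t width_bytes, int buffered_copy)
{
    size_t reads = pio_accesses(bytes, width_bytes);
    return buffered_copy ? 2 * reads : reads;
}
```

For a 32KB transfer through a 32-bit FIFO, pio_accesses(32 * 1024, 4) gives 8192 reads, and the buffered-copy variant doubles that to 16384. Neither figure includes the bus stall on each read.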

Next, let’s look at DMA: Your device uses the PLX chip, so it can do h/w
scatter/gather, and let’s assume that you support 64-bit addressing or that
you’re never running on a system with more than 4GB of main memory. If you
have to intermediate-buffer, all bets are off here, OK? But, let’s go on
and also assume that you supply the buffer into which the s/g list is
returned and that the transfer is going to a physically contiguous user data
buffer (don’t worry, I’ll add slop in below to account for this
simplification. You’ve only got a maximum of 7 fragments to deal with, and
the overhead of walking the MDL will be proportional to the overhead
required to do the same in MmMapLockedPagesSpecifyCache, so they should wash
anyhow).

Given the above, if you walk into the code for GetScatterGatherList(…) in
the debugger, I think you’ll find the code path builds the s/g list and is
just about a direct call to your execution routine. Let’s say absolutely no
more than 200 instructions (I’m guessing here, I don’t happen to have WinDbg
running this second… but you could pretty easily do the count). Now your
execution routine gets called, which needs to set up your hardware. Let’s
say that’s another 200 instructions, plus a half dozen register accesses.
Your DMA finishes. You get an interrupt. At least another 200 instructions
within your ISR to read the status and queue your DpcForIsr? Probably. And
maybe another 500 to get to your ISR from the interrupt vector?? Who knows,
but let’s say so. Plus let’s say another half-dozen register accesses.
Your DpcForIsr runs, you clean up, blah blah. Let’s ignore calling
IoCompleteRequest, which you have to do in either case. Let’s say, oh,
another 500 instructions.

So, sure, there are a lot of functions to call and a lot of just plain crap
to do… but none of it is that complex or time consuming.

So, with this SWAG you get: 1600 instructions, plus a dozen register
accesses in the DMA path. You think I’m a bit too optimistic? Multiply by
two and round up. No, tell you what, MULTIPLY BY THREE and round up to the
next highest thousand. So, we’ll call it 5000 instructions, OK?
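
The SWAG above, written out as arithmetic. The per-stage numbers are the guesses from this post; the function and variable names are only illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of step. */
size_t round_up(size_t n, size_t step)
{
    assert(step > 0);
    return ((n + step - 1) / step) * step;
}

/* Sum of the guessed per-stage instruction counts for one DMA transfer. */
size_t dma_swag_instructions(void)
{
    const size_t build_sg_list      = 200; /* GetScatterGatherList path */
    const size_t program_hardware   = 200; /* execution routine */
    const size_t isr                = 200; /* read status, queue the DPC */
    const size_t interrupt_dispatch = 500; /* vector to ISR */
    const size_t dpc_cleanup        = 500; /* DpcForIsr */
    return build_sg_list + program_hardware + isr
         + interrupt_dispatch + dpc_cleanup;
}

/* Pessimistic version: triple the guess, round up to the next thousand. */
size_t dma_padded_instructions(void)
{
    return round_up(3 * dma_swag_instructions(), 1000);
}
```

dma_swag_instructions() comes to 1600, and tripling it gives 4800, which rounds up to the 5000 used in the rest of the argument.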

I think you’ll agree, it’s pretty hard to beat DMA when it comes to lower
CPU utilization. And don’t forget that you’ve only touched your device
registers perhaps a dozen times.
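
One rough way to turn these estimates into the break-even size the original question asked about, counting instructions only. This is a sketch under loud assumptions: DMA is treated as a fixed cost per transfer, PIO as a fixed number of instructions per 32-bit word, and a register read is priced like any other instruction. That last assumption understates the PIO cost, so the real break-even point is likely smaller than this model suggests. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Transfer size (in bytes) at which PIO's per-word instruction cost
 * catches up with DMA's fixed per-transfer cost. Instruction counts
 * only; bus stalls are deliberately ignored. */
size_t break_even_bytes(size_t dma_fixed_instr,
                        size_t pio_instr_per_word,
                        size_t word_bytes)
{
    assert(pio_instr_per_word > 0);
    return (dma_fixed_instr / pio_instr_per_word) * word_bytes;
}
```

With the padded 5000-instruction DMA figure and a 4-byte FIFO, break_even_bytes(5000, 1, 4) is 20000 bytes for a direct copy, and break_even_bytes(5000, 2, 4) is 10000 bytes when each word is copied a second time. Both are below the 32KB buffer size discussed here.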

Overall, I’ve made a bunch of assumptions and broad estimates here. Make
whatever assumptions and estimates YOU like based on your experience.

This would be a great area for somebody with an ICE to take some
measurements.

Once again, this explicitly assumes that all we’re interested in is CPU
time. In MY calculus, the amount of CPU time used is a factor, but I
generally believe customers’ CPUs are there to be used, and that they
generally sit idle far too much. But that’s not the question you asked.

HTH,

Peter
OSR

Maxim_S_Shatskih · March 18, 2003, 2:59pm

> Those at the home office chant the mantra, “DMA good. PIO bad.” I can see

DMA utilizes the PCI bus in burst mode, while the CPU’s access to the
device memory cannot do this.
The CPU overhead necessary to build the DMA structures is far
smaller than the CPU stall cycles which the PCI bus (running at
33 MHz) will introduce on each access to the device memory or
registers.
Also, if your device is capable of chained DMA, then you can feed it
with lots of pending I/O requests, and the DMA will run from one of
them to the next. This relaxes the IRQ latency requirements for a
device a lot, due to the huge “buffer” formed by the pending IRPs. For
instance, digital video is a real-time thing, and it runs fine without
strict latency requirements because the OHCI 1394 controller uses smart
DMA techniques.
MS has published PCI hardware design recommendations which, if followed,
will make your device undemanding in terms of IRQ latency.

Of course, I’m speaking of a significant amount of traffic here, not of HID
devices.
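
A back-of-the-envelope sketch of the stall-cycle point above. A 33 MHz PCI clock has a period of roughly 30 ns. How many clocks a single non-burst read costs depends on the target device and any bridges in the path, so the clocks-per-access value is a caller-supplied assumption here, not a property of the bus. The names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_33MHZ_CLOCK_NS 30u /* approximate period of a 33 MHz PCI clock */

/* Bus time in nanoseconds for `data_phases` 32-bit transfers, each
 * assumed to occupy `clocks_per_phase` PCI clocks. */
uint64_t pci_transfer_ns(uint64_t data_phases, unsigned clocks_per_phase)
{
    assert(clocks_per_phase > 0);
    return data_phases * clocks_per_phase * PCI_33MHZ_CLOCK_NS;
}
```

If each individual read is assumed to cost 8 clocks (a made-up figure), 8,192 reads come to about 2 ms of bus time during which the CPU is stalled. An ideal burst at one clock per word would move the same data in about 0.25 ms without the CPU waiting on each word.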

Max

OSR_Community_User · March 18, 2003, 4:58pm

Maxim S. Shatskih wrote:
> DMA utilizes the PCI bus in burst mode, while the CPU’s access to the
> device memory cannot do this.

That is not a true statement. I have seen PIO writes to consecutive
register addresses on PCI devices combined into a burst on the PCI bus.
This actually unveiled a bug in the device I was writing the driver for
at the time, which turned out to ignore the PCI byte lane enable bits.
I did 5 32-bit writes to 5 consecutive 32-bit registers as part of
programming the device. The data was burst as 6 writes (actually, it
first became a burst of 3 64-bit writes across a 64-bit cPCI backplane,
which was then converted back to 6 32-bit writes by the 64-bit to 32-bit
PCI-PCI bridge on the PMC carrier where the device was installed). The
6th 32-bit write was a null PCI cycle (a valid PCI cycle where all byte
lane enable bits are inactive). Since the hardware ignored the byte
lane enables, this 6th write corrupted the next consecutive register on
the device, which just happened to be the device’s configuration
register. I found a software solution around the problem that prevented
the burst from occurring (do the 5 writes to the 5 registers in a
different order), but my point is that you certainly can get PCI bursts
when using PIO as opposed to DMA. Interestingly enough, it only
intermittently burst the sequence of consecutive writes, so the
corruption of the configuration register only happened after a few
minutes of successful operation. Needless to say, it required a cPCI
bus analyzer to track down the conditions that caused the problem. The
vendor of the device in question ultimately admitted to the fact that
their device ignored the byte lane enable bits, in violation of the PCI
spec.
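
A toy model of the reordering workaround described above. It counts how many back-to-back writes in a sequence target the next ascending 32-bit register, which is the pattern that was being combined into a burst in this case. Real combining behavior depends on the chipset and bridges. The function name and the example write order are hypothetical, since the post does not say which order was actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count adjacent write pairs where the second write goes to the register
 * 4 bytes above the first. Zero means no two back-to-back writes in the
 * sequence are ascending neighbors. */
size_t burst_candidates(const uint32_t *offsets, size_t n)
{
    size_t pairs = 0;
    for (size_t i = 1; i < n; i++) {
        if (offsets[i] == offsets[i - 1] + 4)
            pairs++;
    }
    return pairs;
}
```

Writing the five registers in order (offsets 0, 4, 8, 12, 16) yields four candidate pairs, while an interleaved order such as 0, 8, 16, 4, 12 yields none.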

Jay

Jay Talbott
Principal Consulting Engineer
SysPro Consulting, LLC
3519 E. South Fork Drive
Suite 201
Phoenix, AZ 85044
(480) 704-8045
xxxxx@sysproconsulting.com
http://www.sysproconsulting.com

OSR_Community_User · March 19, 2003, 11:13am

A well-tuned SIMD loop can be pretty fast, and so can an effective
common-buffer DMA with a healthy set of big buffers. It all depends on
whether we need that extra CPU bandwidth, and whether we’re talking
throughput or response time. And the weakest link isn’t the CPU, but memory
bandwidth and bus utilization.

Alberto.
