A well-tuned SIMD loop can be pretty fast, and so can an effective common-buffer
DMA with a healthy set of big buffers. It all depends on whether we need
that extra CPU bandwidth, and whether we’re talking throughput or response
time. And the weakest link isn’t the CPU, but memory bandwidth and bus
utilization.
Alberto.
-----Original Message-----
From: Peter Viscarola [mailto:xxxxx@osr.com]
Sent: Monday, March 17, 2003 6:49 PM
To: NT Developers Interest List
Subject: [ntdev] Re: Philosophical question: DMA vs. Programmed I/O
“DMA good. PIO bad.” I used to chant that myself. It’s an old
mini-computer thing. Sorta like line-frequency clocks and current-loop
terminal interfaces.
This is a VERY hard equation to evaluate, because of the potential for
variations. Not to mention that everything depends on how you define
“goodness.” As Mark indicated in his reply, you can sometimes buy
throughput at the cost of CPU time. However, your note SPECIFICALLY asked
about CPU overhead, which I’ll take to mean instruction cycles.
So, let’s confine our analysis to how many instructions will get executed.
For PIO: Assuming your FIFO is 32 bits wide, if you’re doing a 32KB
transfer, that means you’ve got 8000 and some odd register accesses. Each
one of which is serializing your I/O bus. And that doesn’t even count the
overhead needed to either copy the data using METHOD_BUFFERED (another 8000+
instructions) or to set up and tear down the mapping in kernel mode with
MmMapLockedPages/MmUnmapLockedPages (and its attendant TLB flushing).
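
To make the "8000 and some odd" figure concrete, here is a minimal sketch of the arithmetic in C. The function names are made up for illustration, and the copy cost assumes one 32-bit move per word with no loop overhead, so treat the result as a lower bound rather than a measurement.

```c
#include <stddef.h>

/* Number of FIFO register accesses needed to move `bytes` bytes through a
 * FIFO that is `fifo_width_bits` wide. A trailing partial word still costs
 * one full access, so the division rounds up.
 * Assumption: fifo_width_bits is a non-zero multiple of 8. */
size_t pio_register_accesses(size_t bytes, unsigned fifo_width_bits)
{
    size_t bytes_per_access = fifo_width_bits / 8u;
    return (bytes + bytes_per_access - 1u) / bytes_per_access;
}

/* Rough lower bound on PIO work: the register accesses themselves plus,
 * when the data is also copied (the METHOD_BUFFERED case), one more move
 * per word. Assumption: one move per word, loop overhead ignored. */
size_t pio_min_operations(size_t bytes, unsigned fifo_width_bits,
                          int buffered_copy)
{
    size_t accesses = pio_register_accesses(bytes, fifo_width_bits);
    return buffered_copy ? accesses * 2u : accesses;
}
```

For the 32KB transfer and 32-bit FIFO discussed here, pio_register_accesses(32 * 1024, 32) comes to 8192, and the buffered-copy case doubles that to 16384.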
Next, let’s look at DMA: Your device uses the PLX chip, so it can do h/w
scatter/gather, and let’s assume that you support 64-bit addressing or that
you’re never running on a system with more than 4GB of main memory. If you
have to intermediate-buffer, all bets are off here, OK? But, let’s go on
and also assume that you supply the buffer into which the s/g list is
returned and that the transfer is going to a physically contiguous user data
buffer (don’t worry, I’ll add slop in below to account for this
simplification. You’ve only got a maximum of 7 fragments to deal with, and
the overhead of walking the MDL will be proportional to the overhead
required to do the same in MmMapLockedPagesSpecifyCache, so they should wash
anyhow).
Given the above, if you walk into the code for GetScatterGatherList(…) in
the debugger, I think you’ll find the code path builds the s/g list and is
just about a direct call to your execution routine. Let’s say absolutely no
more than 200 instructions (I’m guessing here, I don’t happen to have WinDbg
running this second… but you could pretty easily do the count). Now your
execution routine gets called, which needs to set up your hardware. Let’s
say that’s another 200 instructions, plus a half dozen register accesses.
Your DMA finishes. You get an interrupt. At least another 200 instructions
within your ISR to read the status and queue your DpcForIsr? Probably. And
maybe another 500 to get to your ISR from the interrupt vector?? Who knows,
but let’s say so. Plus let’s say another half-dozen register accesses.
Your DpcForIsr runs, you clean up, blah blah. Let’s ignore calling
IoCompleteRequest which you have to do in either case. Let’s say, oh,
another 500 instructions.
So, sure, there are a lot of functions to call and a lot of just plain crap
to do… but none of it is that complex or time consuming.
So, with this SWAG you get: 1600 instructions, plus a dozen register
accesses in the DMA path. You think I’m a bit too optimistic? Multiply by
two and round up. No, tell you what, MULTIPLY BY THREE and round up to the
next highest thousand. So, we’ll call it 5000 instructions, OK?
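
The SWAG above can be written out as a few lines of C, which makes it easy to plug in your own guesses. This is only a sketch of the arithmetic: the per-stage numbers are the estimates from this message, and the enum and function names are invented for illustration.

```c
/* Per-stage instruction guesses taken from the estimate above.
 * These are guesses, not measurements. */
enum {
    SG_LIST_BUILD      = 200, /* GetScatterGatherList up to the execution routine */
    EXECUTION_ROUTINE  = 200, /* setting up the hardware */
    INTERRUPT_DISPATCH = 500, /* interrupt vector to the ISR */
    ISR_BODY           = 200, /* read status, queue the DPC */
    DPC_FOR_ISR        = 500  /* cleanup in the DPC */
};

/* Sum of the raw per-stage guesses. */
unsigned dma_raw_estimate(void)
{
    return SG_LIST_BUILD + EXECUTION_ROUTINE + INTERRUPT_DISPATCH
         + ISR_BODY + DPC_FOR_ISR;
}

/* Round n up to the next multiple of 1000 (exact multiples are unchanged). */
unsigned round_up_to_thousand(unsigned n)
{
    return ((n + 999u) / 1000u) * 1000u;
}

/* Raw estimate multiplied by a pessimism factor, then rounded up. */
unsigned dma_padded_estimate(unsigned safety_factor)
{
    return round_up_to_thousand(dma_raw_estimate() * safety_factor);
}
```

With these inputs the raw sum is 1600, a factor of two pads it to 4000, and a factor of three pads it to 5000. Even the padded figure sits below the 8192 register accesses that PIO needs for the same 32KB transfer.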
I think you’ll agree, it’s pretty hard to beat DMA when it comes to lower
CPU utilization. And don’t forget that you’ve only touched your device
registers perhaps a dozen times.
Overall, I’ve made a bunch of assumptions and broad estimates here. Make
whatever assumptions and estimates YOU like based on your experience.
This would be a great area for somebody with an ICE to take some
measurements.
Once again, this explicitly assumes that all we’re interested in is CPU
time. In MY calculus, the amount of CPU time used is a factor, but I
generally believe customers’ CPUs are there to be used, and that they
generally sit idle far too much. But that’s not the question you asked.
HTH,
Peter
OSR
> -----Original Message-----
> From: Evan Hillman [mailto:xxxxx@attbi.com]
> Sent: Monday, March 17, 2003 12:25 PM
> To: NT Developers Interest List
> Subject: [ntdev] Philosophical question: DMA vs. Programmed I/O
>
>
> Devs,
>
>
> Those at the home office chant the mantra, “DMA good. PIO
> bad.” I can see situations where DMA is clearly the best