I did point out that the formula for naive copy is number of bytes *
(source access cost + target access cost), because in MMIO, the access to
memory on the PCI bus is slower than access to main memory, and is often
mapped uncached. Nonetheless, such copies are usually done in DWORD
chunks for the main copy, with any unaligned bytes at the head or tail
being moved explicitly. So it is not as fast as RAM-to-RAM copy in main
memory, but it is almost certainly faster than using an external DMA
mechanism, particularly one that is not actually intended for this
purpose.
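To make the point above concrete, here is a minimal sketch (mine, not from the original post) of the DWORD-chunk style of MMIO copy being described: the aligned middle moves in 32-bit units, and ragged head/tail bytes move one at a time. The function name `mmio_copy` is hypothetical; a real driver would obtain `dst` from a mapped BAR and might need platform-specific access routines rather than raw pointer stores.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch of a DWORD-chunked copy into an MMIO window.
 * Per the cost formula above (bytes * (source cost + target cost)),
 * fewer, wider accesses to the slow, often-uncached side are the win:
 * one 32-bit bus write moves four bytes instead of four one-byte writes. */
static void mmio_copy(volatile uint8_t *dst, const uint8_t *src, size_t len)
{
    /* Head: move leading bytes until dst is 4-byte aligned. */
    while (len > 0 && ((uintptr_t)dst & 3) != 0) {
        *dst++ = *src++;
        len--;
    }
    /* Main copy in DWORD chunks. */
    while (len >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);             /* safe unaligned read from RAM */
        *(volatile uint32_t *)dst = w;  /* one bus write per DWORD */
        dst += 4;
        src += 4;
        len -= 4;
    }
    /* Tail: move any trailing bytes. */
    while (len > 0) {
        *dst++ = *src++;
        len--;
    }
}
```

On normal memory this is just a slow `memcpy`; the structure only pays off when each access to the target has a fixed per-transaction cost, as on an uncached PCI mapping.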
When CMU acquired its IBM 360/67 (the virtual-memory version of the 360/65)
we did not buy a paging drum, but instead bought 8MB of bulk core memory,
and a special memory-to-memory DMA processor. We also had a special-order
750K main memory, having determined via simulation that the stock 512K
memory would be too small. So instead of paging to the drum, we paged to
bulk memory. Main memory was 750ns, bulk memory was 8us, an order of
magnitude slower. After a year of performance measurement, we did two
things: (a) did the paging via the equivalent of RtlCopyMemory (b) didn’t
bother to do paging at all for most user pages, but executed directly from
bulk memory.
We supported 60 users concurrently on a machine with memory and computing
power comparable to a 286.
When people reminisce about the “good ol’ days” of computing, I point out
that the best feature of the good ol’ days is that they are in the past.
It turns out what killed performance was the overhead of setting up and
responding to the DMA transfer. This overhead was so high that it cost
more to bring a page into fast main memory than to execute it directly in
the order-of-magnitude-slower bulk memory.
This machine had no caching, an 8-slot TLB, no instruction prefetch or
pipeline, no speculative execution, essentially none of the cool features
that make rep movsd run screamingly fast, and it was STILL faster to use 8
MVC instructions (the equivalent of rep movsb but with an upper bound of
256 bytes in a transfer) or execute directly out of 8us bulk memory.
It took a year of careful instrumentation and analysis to determine the
correct solution. So I tend to look with great skepticism on discussions
of the form “I have to move an undefined number of bytes between RAM and
MMIO, and {my gut tells me; I feel; a trusted friend told me; I saw in a
Web search; my manager heard; …} that I should try to use some kind of
DMA transfer mechanism to make this more efficient”. The OP rarely posts
any critical information essential to evaluating the proposal. This
discussion started the same way. And it has ended in the expected way:
don’t waste your time solving non-problems.
joe
> Processors today copy memory really danged fast. Even the simple
> “rep movsd” instruction moves 4 bytes per cycle, which is 8 GB/s on a 2
> GHz machine. Copying 100 megabytes a second is less than 2% CPU load.
>
Classic DMA is not designed for RAM-to-RAM copy. It’s mainly used to move
data between a device and RAM without the CPU’s involvement. Hence a fairer
comparison is the time to move a DWORD between device and RAM: over a PCI
bus, a single-DWORD target memory cycle without any retries; over a PCIe
bus, a 4-byte MRd or MWr TLP. Of course it is still much faster than
system DMA on ISA, but it’s not going to be nearly as impressive as
RAM-to-RAM “rep movs” on a modern CPU.
NTDEV is sponsored by OSR