Writing to Overlay from DPC is slower than from app

QUESTION:

Why is writing to a DirectDraw overlay buffer slower from a driver DPC
than it is from an application?

DETAILS:

I am writing a specialized PCI video capture driver for an industrial
vision product. No DirectShow or Kernel Streaming here, just capture
frames to contiguous buffers that are permanently allocated when the
driver loads. These buffers are also mapped into the app’s user virtual
space. It’s really pretty simple. (Good, 'cause I’m kinda new at this.)

Sometimes we want to DMA the incoming video to one of these buffers
continuously, and also copy each frame to a DirectDraw overlay so live
video can be seen. The driver provides a way for an app thread to
synchronize with the end of each frame grab. The app thread takes a
certain amount of CPU time to copy each frame from the RAM buffer to the
overlay. (The app created the overlay.) This works fine.
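
In case the setup helps, the app-side loop looks roughly like this (the
IOCTL code and names here are illustrative, not our real interface):

    #include <windows.h>
    #include <winioctl.h>
    #include <string.h>

    // Hypothetical IOCTL; our real interface differs.
    #define IOCTL_GRAB_WAIT_FRAME \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_NEITHER, FILE_ANY_ACCESS)

    void CopyLoop(HANDLE hDriver, void *overlayPtr,
                  const void *frameBufPtr, size_t frameSize)
    {
        DWORD bytes;
        for (;;) {
            // Block until the driver signals end-of-frame.
            if (!DeviceIoControl(hDriver, IOCTL_GRAB_WAIT_FRAME,
                                 NULL, 0, NULL, 0, &bytes, NULL))
                break;
            // Copy the just-completed frame into the overlay surface.
            memcpy(overlayPtr, frameBufPtr, frameSize);
        }
    }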

Sometimes we want the driver to DMA directly to the overlay. So the app
passes the overlay buffer pointer to the driver. The driver creates an MDL
from that, then does a dirty trick, looking at the physical page addresses
in the MDL to get a hardware DMA address so the frame grabber can DMA
directly to the overlay. This works fine. (Please no lectures about how
this won’t work on a platform that really needs an adapter to utilize
mapping hardware. I KNOW I’m bad! That’s another discussion…)
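
For what it’s worth, the trick looks roughly like this (a sketch with
made-up names and error handling, not the exact driver code):

    #include <ntddk.h>

    // Sketch: build an MDL over the app's overlay pointer (called in
    // the app's process context) and lock the pages for device access.
    NTSTATUS LockOverlayForDma(PVOID UserVa, ULONG Length, PMDL *OutMdl)
    {
        PMDL Mdl = IoAllocateMdl(UserVa, Length, FALSE, FALSE, NULL);
        if (Mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        __try {
            // IoWriteAccess: the device will write into this buffer.
            MmProbeAndLockPages(Mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(Mdl);
            return GetExceptionCode();
        }

        // MmGetMdlPfnArray(Mdl) then yields the physical page numbers
        // that get turned into bus addresses for the frame grabber;
        // that step is what breaks on platforms with mapping hardware.
        *OutMdl = Mdl;
        return STATUS_SUCCESS;
    }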

The problem came when we tried to DMA to a RAM buffer and then have a DPC
(queued from the interrupt that fires after each frame capture completes)
copy it to the overlay instead of having the app do the copy. The driver
uses MmMapLockedPagesSpecifyCache (one time, when the app told the driver
about the overlay) to get its own kernel-mode pointer to the overlay. I do
specify MmCached here. Yet it takes considerably more CPU time for the DPC
to copy data from the RAM buffer to the overlay than it takes the app to
do it. Experimentation shows the time is mainly eaten up by the writes to
the overlay, not the reads from the RAM buffer.
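
Concretely, the mapping step is along these lines (again a sketch; the
names are illustrative):

    #include <ntddk.h>

    // Sketch: the one-time kernel mapping of the locked-down overlay
    // pages. The cache type is the third parameter; MmCached is what I
    // currently pass.
    NTSTATUS MapOverlayForDpc(PMDL OverlayMdl, PVOID *OutKernelVa)
    {
        PVOID KernelVa = MmMapLockedPagesSpecifyCache(
            OverlayMdl,          // MDL built from the app's overlay pointer
            KernelMode,          // mapping is for driver (DPC) use
            MmCached,            // the cache attribute in question
            NULL,                // let the system pick the address
            FALSE,               // don't bug-check on failure
            NormalPagePriority);
        if (KernelVa == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        *OutKernelVa = KernelVa;
        return STATUS_SUCCESS;
    }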

Why is the DPC copy so much slower than the same copy done from the app?

BTW, READING from the overlay is excruciatingly slow; I think the hardware
just isn’t designed to do that fast.

Thanks for any ideas,
Paul Braun

> BTW, READING from the overlay is excruciatingly slow; I think the hardware
> just isn’t designed to do that fast.

This may be the key. I can’t think of any particular reason a ring 0 thread
should be better or worse than a ring 3 thread at accessing a hunk of
memory.

Is it possible that you have some sort of alignment problem in the DPC code
that you don’t have in the app code, such that the memory subsystem has to
read from the adapter to merge your write data? This might be the case if
you are doing byte or word copies instead of dword copies, or if the copy
starts on a word rather than a dword boundary.
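
To illustrate what I mean, the difference is roughly between these two
loops (a rough sketch, obviously not your code):

    #include <stddef.h>

    // Two versions of the same copy. "dst" is the mapped overlay,
    // "src" is the RAM frame buffer.

    // Pathological on device memory: one bus transaction per byte.
    static void copy_bytes(volatile unsigned char *dst,
                           const unsigned char *src, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            dst[i] = src[i];
    }

    // Better: full 32-bit writes from a dword-aligned start address.
    // Assumes dst, src, and len are all dword-aligned/multiples of 4.
    static void copy_dwords(volatile unsigned long *dst,
                            const unsigned long *src, size_t len)
    {
        for (size_t i = 0; i < len / sizeof(unsigned long); i++)
            dst[i] = src[i];
    }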

Write combining might be another possibility. If it is turned off on a
memory region and many small writes are done, the result is piggish in the
extreme. Of course, not all memory subsystems support write combining, so
you have to be careful here.

Loren

Thanks for the reply, Loren.

Both buffers have nice round addresses, and I deliberately wrote the code
to use 32-bit pointers to guarantee 32-bit accesses. Frankly, I had never
heard of write combining, but I guess it means holding on to smaller writes
and combining them into one big one. If this means combining, say, 16-bit
writes into a 32-bit write, it would not seem to apply. If it means
combining multiple 32-bit writes into a PCI burst, then that’s exactly the
kind of thing I suspect might be happening. The way the memory is mapped
into kernel space and the way it is mapped into the app’s space must be
different, much like the way you can specify cached or not.

Paul

> Why is writing to a DirectDraw overlay buffer slower from a driver DPC
> than it is from an application?

Have you specified MmWriteCombined while mapping the buffer to kernel memory?
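
If not, the change is just the cache-type argument in the mapping call,
something along these lines (just a sketch):

    // Same call as before, but asking for a write-combined mapping so
    // the small writes get batched into larger bus transactions.
    PVOID KernelVa = MmMapLockedPagesSpecifyCache(
        OverlayMdl,
        KernelMode,
        MmWriteCombined,     // instead of MmCached
        NULL,
        FALSE,
        NormalPagePriority);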

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com