Re: Writing to Overlay from DPC is slower than from a pp

WC is a processor function, while I/O bus cycle negotiation is a bridge
function. If you submit sequential addresses to the AGP bus using the Fast
Write protocol, the transfer will run at very high speed; in fact, the
limitation may end up being those whiny non-video driver writers who
complain when video drivers hog the bus! :-)

The Intel MCH chip documentation states that the transfer speed of the
system bus is 3.2 GB/sec at a 400 MHz clock, or 6.4 GB/sec at an 800 MHz
clock. That gives you 8 bytes per cycle, which figures, because the P4 data
bus has 64 pins. The caches are reportedly much faster, but that's not
relevant in this picture: data will be going out to main memory or to the
North Bridge at the rate of 8 bytes every clock cycle. Now, the MCH claims
to be able to burst 64 bytes at a time to main memory, and that's
independent of caching. As long as the MCH works correctly, AGP transfers
should operate at maximum speed. The MCH spec says that AGP 8X and Fast
Writes are supported, and gives a 2 GB/sec maximum data throughput to the
AGP bus. Well, that should be enough to keep it busy! As long as the Bridge
supports Fast Writes, the bus should be driven at very high speed, because
Fast Write is a block-oriented protocol rather than a word-oriented one.
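
To make the arithmetic above explicit, here is a minimal sketch in C. The
function names are made up for illustration, and the inputs are just the
peak figures quoted from the MCH documentation, not measured numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes moved per bus clock, given a peak bandwidth in bytes/sec and a
 * clock rate in Hz. Hypothetical helper, for illustration only. */
uint64_t bytes_per_clock(uint64_t bytes_per_sec, uint64_t clock_hz)
{
    assert(clock_hz > 0);
    return bytes_per_sec / clock_hz;
}

/* Bus clocks needed to move `len` bytes at `width` bytes per clock,
 * rounded up. Ignores arbitration and negotiation overhead. */
uint64_t clocks_for_transfer(uint64_t len, uint64_t width)
{
    assert(width > 0);
    return (len + width - 1) / width;
}
```

With the quoted figures, bytes_per_clock(3200000000, 400000000) and
bytes_per_clock(6400000000, 800000000) both come out to 8, and a 64-byte
burst takes clocks_for_transfer(64, 8) = 8 bus clocks, before any
negotiation overhead.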

The good thing about write combining is that you can write data out of
sequence to the system bus, and the WC hardware will reorder it for you, so
that data flows to the I/O bus sequentially and doesn't break the bursting
achieved by the fast protocol. However, by the time that data gets to the
system bus, it's already write-combined! There's also a bit of a limit
here: the WC buffers are 64 bytes wide, and that's not much for graphics.
And if we're DMAing a buffer down to the bus, chances are that the data
will be moving out in increasing addresses, so write combining may not
help. It's only when we have lots of data out of sequence that we have a
hope of WC improving performance, and that's why I asked for more details!
Write combining shouldn't help much a well-designed video


-----Original Message-----
From: Loren Wilton []
Sent: Friday, October 03, 2003 12:27 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Re: Writing to Overlay from DPC is slower than from a pp

I see it works, wow, but I don't quite understand why, although I'm rusty
at this kind of thing. Why do they do byte updates at all? At 32 bits/pixel,
every pixel's a dword anyway.

I/O space memory has to be uncached to work properly, so things can't be
handled in the cache. With WC turned off on uncached space, each write,
even if it happens to be a dword, ends up being a separate bus transaction,
with all of the startup negotiation overhead of a new transaction. With WC,
the bus or memory subsystem (or the processor itself, through its on-chip
write buffering) can combine a series of adjacent writes into a larger
write, and do a larger burst on the bus. This cuts down on the burst
negotiation overhead, since you transfer more per packet.
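
As a rough illustration of the point above, the following C sketch counts
bus transactions for a stream of dword writes, with and without combining.
It is a toy model under stated assumptions, not a description of any real
chipset: it assumes a single 64-byte WC buffer (the width mentioned earlier
in the thread; real processors have several), and both function names are
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WC_LINE 64u  /* assumed WC buffer width in bytes */

/* Uncached, no WC: every write is its own bus transaction. */
size_t transactions_uncached(const uint32_t *addrs, size_t n)
{
    (void)addrs;  /* the addresses do not matter in this case */
    return n;
}

/* Toy model with one WC buffer: writes that fall in the currently open
 * 64-byte line are merged, in any order; a write to a different line
 * flushes the buffer as one burst and opens a new line. */
size_t transactions_write_combined(const uint32_t *addrs, size_t n)
{
    size_t bursts = 0;
    uint32_t open_line = 0;

    assert(WC_LINE > 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t line = addrs[i] / WC_LINE;
        if (i == 0 || line != open_line) {
            bursts++;          /* a new line means one more burst */
            open_line = line;
        }
    }
    return bursts;
}
```

In this model, sixteen dword writes covering one 64-byte line cost 16
transactions uncached but a single burst with WC, whether they arrive in
order or not. Writes that ping-pong between two lines (say 0, 64, 4, 68)
defeat the single buffer and cost one burst each.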

