Not to play the devil’s advocate, but my driver uses extensive direct i/o
and dma and it does not create adapter objects. In fact, it doesn’t use the
ddk dma scheme at all. I handle it all myself. The overriding reason here
is, I want the largest possible portion of my code to be portable between
Windows, Solaris, MacOS and Linux, so, OS-specific APIs are out - Occam reigns
supreme. My driver has a huge common and portable library, and a set of small
OS-dependent portions which mostly interface to the OS's APIs and data
movement standards. I inherited this architecture from the people who
preceded me, and I found it to be a robust and easy-to-handle way of doing
things.
About snooping, you can buy Mindshare’s Pentium II book - yes, Pentium II -
which has a mighty good explanation of how the “bus” works, including
snooping and the MESI protocol. The “bus” is the front-side bus, also called
the “system” bus, which connects processors and the North Bridge. All
processors listen to the bus; this is called “snooping”. All processor i/o
accesses go through the front-side bus, and all memory accesses beyond the
cache also go through it. When a processor or a bridge generates a bus
transaction, other processors read that transaction through their snooping,
and act accordingly.
The snooping per se doesn’t load the bus, because all it does is read
what’s already on the bus. Processors snoop to maintain coherence in their
caches. Each cache line has a state associated with it, which can be M
(modified), E (exclusive), S (shared) or I (invalid): hence the MESI name
for the protocol. If a cache line does not have meaningful data, it’s in the
I state; if one processor reads a memory line, its cache line image may be
in the E or S state, depending on whether the line is held exclusively by that
processor or is also cached in some other processor; or it can be
in the M state if the processor has modified it but not yet written it back to
main memory: this happens with Writeback caches.
Now, if P0 reads a cache line and no other processor holds the same memory
line in its cache, the line goes into the E state: P0 holds it exclusively. P0 is
snooping the bus, so, if P1 reads the same memory line, P0 changes its
state to S: shared. If P2 now writes to that memory location, both P0 and P1
snoop the write cycle on the bus and mark their corresponding cache lines as
I - invalid. As for P2, its action depends on whether the cache is in
Writeback or Writethrough mode. In Writeback mode the line is written to
the cache and not to memory, so the line is marked M: modified. In
Writethrough mode, the line is written directly to main memory and not to
the cache, so that cache line’s state doesn’t change. Consequently,
Writeback caches use all four M, E, S, I states, while a Writethrough cache
only has the E, S and I states.
Note that all of this happens inside the processor and it doesn’t require
bus cycles. There may be situations in the protocol where a processor must
pass its data to other processors, or issue an invalidate cycle on the bus,
in which case a bus cycle will happen. Note also that interrupts may be
signalled on the Front-side bus as well, so, any cache protocol overhead may
influence your i/o throughput. I may be wrong, but I believe that in SMP
machines, except for compute-bound scientific applications, front-side bus
bandwidth is more important than processor cycle time.
The lock prefix locks the front-side bus for the duration of the operation,
guaranteeing indivisible operation. That hogs the bus for a short period of
time, but if that operation is repeated over and over again, that may place
some pressure on the bus.
If a processor is looping on a test-and-set instruction, it will be doing
locked writes to the bus at pretty high speed. Every such write will cause
other processors to invalidate their corresponding cache lines. Hence, if
both P0 and P1 are spinning on the same test-and-set instruction, one of
them, or both, will be re-reading that variable from main memory, through
the front-side bus, at a very high speed. If you loop issuing locked writes
to a memory location, you bet you’re going to put pressure on the front-side
bus.
As I pointed out, the alternative is to spin on a plain vanilla test, which
will read the variable at most once from main memory, and then loop on the
cache line until the variable gets modified by some other processor. That
will generate one Invalidate snoop followed by one main memory read into the
cache. At that point, the test will succeed, and you follow by issuing a
test-and-set instruction, which will do one write to the front-side bus. If
the test-and-set succeeds, you’re in the critical section; if not, you go
back to the normal test, which, again, doesn’t put pressure on the
front-side bus.
Again, read the Mindshare Pentium II book, it has a nice description of the
protocol and of its ins and outs! Hope this helps,
Alberto.
----- Original Message -----
From:
To: “Windows System Software Devs Interest List”
Sent: Thursday, October 25, 2007 5:43 AM
Subject: RE:[ntdev] Allocating non-cacheable memory for DMA
> My main question is related to the WDK quote that says (as mentioned
> above) “Any driver that uses direct I/O and DMA must create an adapter
> object…”
> Do you know the implication of using
> MmAllocateContiguousMemorySpecifyCache() instead of
> AllocateCommonBuffer()?
> You did write that the performance won’t be improved and that “Snooping
> does not consume the bus bandwidth” - can you explain? Which BUS are you
> referring to?
> as far as we know cache coherency protocol consumes FSB cycles, isn’t it?
>
> —
> NTDEV is sponsored by OSR
>
> For our schedule of WDF, WDM, debugging and other seminars visit:
> http://www.osr.com/seminars
>
> To unsubscribe, visit the List Server section of OSR Online at
> http://www.osronline.com/page.cfm?name=ListServer