Allocating non-cacheable memory for DMA

Hi,

My driver allocates DMA buffers. Some of them are accessible only by the device, others by both the driver and the device.
For performance reasons, I would like to allocate some of these buffers as non-cacheable, because I know that these buffers will not benefit from cache locality, and I’d like to eliminate the (redundant) overhead of bus coherence transactions (snooping).

I have some questions regarding memory caching and the use of the DDI functions mentioned below:

According to the WDK documentation, the “BOOLEAN CacheEnabled” parameter of AllocateCommonBuffer() is ignored on x86-based, x64-based, and Itanium-based processors (http://msdn2.microsoft.com/en-us/library/ms806255.aspx).
Strangely, another function, MmAllocateContiguousMemorySpecifyCache(), takes a CacheType argument that can be set to MmNonCached even on x86 machines (there is nothing in the documentation that forbids this).
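
To make the comparison concrete, here is roughly the kind of call I have in mind (just a sketch; the buffer size and the physical address limits below are placeholders, not our real values):

    #include <ntddk.h>

    /* Sketch only: ask the memory manager directly for a physically
       contiguous, non-cached buffer. The address limits are placeholders. */
    PVOID AllocateNonCachedBuffer(SIZE_T Length)
    {
        PHYSICAL_ADDRESS low, high, boundary;

        low.QuadPart      = 0;
        high.QuadPart     = 0xFFFFFFFF;   /* assume a 32-bit capable device */
        boundary.QuadPart = 0;            /* no boundary restriction */

        return MmAllocateContiguousMemorySpecifyCache(Length,
                                                      low,
                                                      high,
                                                      boundary,
                                                      MmNonCached);
    }
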
However, I read in the WDK in “Introduction to Adapter Objects” (http://msdn2.microsoft.com/en-us/library/ms795327.aspx) the following:

  • quote start -
    Any driver that uses direct I/O and DMA must create an adapter object. The adapter object represents either a DMA controller channel or port, or a bus-master device.
    Two kinds of lowest-level drivers must use adapter objects:
    · Drivers for devices that use the system DMA controller. Such devices are called slave devices and are said to “use system (or slave) DMA.”
    · Drivers for devices that are bus-master adapters. Such devices arbitrate with the system for use of the I/O bus, and thus use bus-master DMA.
  • quote end -

Since MmAllocateContiguousMemorySpecifyCache() is not part of the adapter structure, does the above quote imply that the latter function may not be called by our driver?

My questions are:

  1. Can I use MmAllocateContiguousMemorySpecifyCache() to allocate non-cacheable DMA buffers?
  2. If not, then what is the reason for this limitation?
  3. Why is the CacheEnabled boolean ignored for the function AllocateCommonBuffer()?

Thanks

Alon.

> Since MmAllocateContiguousMemorySpecifyCache() is not part of the adapter structure,

It is intended to be used internally in the adapter object implementation to implement ->AllocateCommonBuffer.

> Why is the CacheEnabled boolean ignored for the function AllocateCommonBuffer()?

Because on these architectures there is no real need to use noncached memory for DMA.

Snooping does not consume bus bandwidth.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

That is very unlikely to improve your performance.


My main question is related to the WDK quote that says (as mentioned above) “Any driver that uses direct I/O and DMA must create an adapter object…”
Do you know the implications of using MmAllocateContiguousMemorySpecifyCache() instead of AllocateCommonBuffer()?
You did write that the performance won’t be improved and that “Snooping does not consume bus bandwidth” - can you explain? Which bus are you referring to?
As far as we know, the cache coherency protocol consumes FSB cycles, doesn’t it?

If you step through AllocateCommonBuffer you end up in a call to MmAllocateContiguousMemorySpecifyCache, and you get back a cached region of physically contiguous memory. Along the way the platform and bus specific software within the OS has made the decision about how to allocate that memory as appropriate for the bus and the platform. Currently you could get away with not using AllocateCommonBuffer and calling MmAllocateContiguousMemorySpecifyCache directly, but that could change with some OS release or on some platform. It is very unlikely that walking around the defined interfaces is going to improve your performance.
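
For comparison, the documented path looks roughly like the sketch below (WDM-style; the DEVICE_DESCRIPTION values are illustrative assumptions, not a complete or correct setup for any particular device):

    #include <wdm.h>

    /* Sketch: get a DMA adapter for a bus-master device and allocate a
       common buffer through it. Field values are illustrative only. */
    PVOID AllocateCommonBufferViaAdapter(PDEVICE_OBJECT Pdo,
                                         ULONG Length,
                                         PPHYSICAL_ADDRESS LogicalAddress,
                                         PDMA_ADAPTER *DmaAdapter)
    {
        DEVICE_DESCRIPTION desc;
        ULONG mapRegisters;

        RtlZeroMemory(&desc, sizeof(desc));
        desc.Version           = DEVICE_DESCRIPTION_VERSION;
        desc.Master            = TRUE;     /* bus-master device assumed */
        desc.ScatterGather     = TRUE;     /* assumption */
        desc.Dma32BitAddresses = TRUE;     /* assumption */
        desc.InterfaceType     = PCIBus;
        desc.MaximumLength     = Length;

        *DmaAdapter = IoGetDmaAdapter(Pdo, &desc, &mapRegisters);
        if (*DmaAdapter == NULL) {
            return NULL;
        }

        /* CacheEnabled = FALSE is a request, not a guarantee: on x86, x64
           and IA64 the parameter is ignored and the buffer comes back
           cached. */
        return (*DmaAdapter)->DmaOperations->AllocateCommonBuffer(
                   *DmaAdapter, Length, LogicalAddress, FALSE);
    }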


Not to play the devil’s advocate, but my driver uses extensive direct i/o
and dma and it does not create adapter objects. In fact, it doesn’t use the
ddk dma scheme at all. I handle it all myself. The overriding reason here
is, I want the largest possible amount of my code to be portable between
Windows, Solaris, MacOS and Linux, so, APIs are out - Occam reigns supreme.
My driver has a huge common and portable library, and a set of small
OS-dependent portions which mostly interface to the OS’s APIs and data
movement standards. I inherited this architecture from the people who
preceded me, and I found it to be a robust and easy to handle way of doing
things.

About snooping, you can buy Mindshare’s Pentium II book - yes, Pentium II -
which has a mighty good explanation of how the “bus” works, including
snooping and the MESI protocol. The “bus” is the front-side bus, also called
the “system” bus, which connects processors and the North Bridge. All
processors listen to the bus; this is called “snooping”. All processor i/o
accesses go through the front-side bus, and all memory accesses beyond the
cache also go through it. When a processor or a bridge generates a bus
transaction, other processors read that transaction through their snooping,
and act accordingly.

The snooping per se doesn’t load the bus, because all it does is read
what’s already on the bus. Processors snoop to maintain coherence in their
caches. Each cache line has a state associated with it, which can be M
(modified), E (exclusive), S (shared) or I (invalid): hence the MESI name
for the protocol. If a cache line does not have meaningful data, it’s in the
I state; if one processor reads a memory line, its cache line image may be
in the E or S state, depending on whether the line is exclusively held by that
processor or whether it’s also cached in some other processor; or it can be
in the M state if the processor has modified it but not yet sent it out to
main memory: this happens with Writeback caches.

Now, if P0 reads a cache line and no other processor holds the same memory
line in its cache, it goes into the E state: P0 holds it exclusively. P0 is
snooping the bus, so, if P1 reads the same memory line, P0 now changes its
state to S: shared. If P2 now writes to that memory location, both P0 and P1
snoop the write cycle on the bus and mark their corresponding cache lines as
I - invalid. As for P2, its action depends on whether the cache is in
Writeback or Writethrough mode. In Writeback mode the line is written to
the cache and not to memory, so the line is marked M: modified. In
Writethrough mode, the line is written directly to main memory and not to
the cache, so that cache line’s state doesn’t change. Consequently,
Writeback caches use all four M, E, S and I states, while a Writethrough cache
only uses the E, S and I states.
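
If it helps to see those transitions written down, here is a toy model of what I just described (illustrative only; it ignores real-world details such as cache-to-cache transfers and writebacks on snoop hits):

    /* Toy model of the MESI transitions described above. */
    typedef enum { MESI_M, MESI_E, MESI_S, MESI_I } MesiState;

    /* This processor reads the line. */
    MesiState OnLocalRead(MesiState s, int lineCachedElsewhere)
    {
        if (s == MESI_I)
            return lineCachedElsewhere ? MESI_S : MESI_E;
        return s;                      /* M, E, S: cache hit, no change */
    }

    /* This processor writes the line (Writeback cache). */
    MesiState OnLocalWrite(MesiState s)
    {
        return MESI_M;                 /* line is now dirty in this cache */
    }

    /* Another processor's read is snooped on the bus. */
    MesiState OnSnoopRead(MesiState s)
    {
        return (s == MESI_M || s == MESI_E) ? MESI_S : s;
    }

    /* Another processor's write is snooped on the bus. */
    MesiState OnSnoopWrite(MesiState s)
    {
        return MESI_I;                 /* our copy is now stale */
    }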

Note that all of this happens inside the processor and it doesn’t require
bus cycles. There may be situations in the protocol where a processor must
pass its data to other processors, or issue an invalidate cycle on the bus,
in which case a bus cycle will happen. Note also that interrupts may be
signalled on the Front-side bus as well, so, any cache protocol overhead may
influence your i/o throughput. I may be wrong, but I believe that in SMP
machines, except for compute-bound scientific applications, front-side bus
bandwidth is more important than processor cycle time.

The lock prefix locks the front-side bus for the duration of the operation,
guaranteeing indivisible operation. That hogs the bus for a short period of
time, but if that operation is repeated over and over again, that may place
some pressure on the bus.

If a processor is looping on a test-and-set instruction, it will be doing
locked writes to the bus at pretty high speed. Every such write will cause
other processors to invalidate their corresponding cache lines. Hence, if
both P0 and P1 are spinning on the same test-and-set instruction, one of
them, or both, will be re-reading that variable from main memory, through
the front-side bus, at a very high speed. If you loop issuing locked writes
to a memory location, you bet you’re going to put pressure on the front-side
bus.

As I pointed out, the alternative is to spin on a plain vanilla test, which
will read the variable at most once from main memory, and then loop on the
cache line until the variable gets modified by some other processor. That
will generate one Invalidate snoop followed by one main memory read into the
cache. At that point, the test will succeed, and you follow by issuing a
test-and-set instruction, which will do one write to the front-side bus. If
the test-and-set succeeds, you’re in the critical section; if not, you go
back to the normal test, which, again, doesn’t put pressure on the
front-side bus.
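
In code, that “test first, then test-and-set” pattern looks roughly like this (a sketch built on the Interlocked functions, which exist in both user and kernel mode; the names are made up):

    #include <windows.h>    /* InterlockedCompareExchange, YieldProcessor */

    /* Test-and-test-and-set spinlock: spin on a plain read, which stays in
       the local cache, and only issue the locked write when the lock looks
       free. */
    typedef struct {
        volatile LONG Locked;          /* 0 = free, 1 = held */
    } TTAS_LOCK;

    void TtasAcquire(TTAS_LOCK *Lock)
    {
        for (;;) {
            /* Plain test: no bus traffic while the lock stays held. */
            while (Lock->Locked != 0) {
                YieldProcessor();      /* PAUSE hint on x86 */
            }
            /* Test-and-set: one locked bus transaction. */
            if (InterlockedCompareExchange(&Lock->Locked, 1, 0) == 0) {
                return;                /* acquired */
            }
            /* Lost the race; go back to the plain test. */
        }
    }

    void TtasRelease(TTAS_LOCK *Lock)
    {
        InterlockedExchange(&Lock->Locked, 0);
    }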

Again, read the Mindshare Pentium II book, it has a nice description of the
protocol and of its ins and outs! Hope this helps,

Alberto.


First of all, thank you for your detailed and very helpful response.

Your response deals with the Intel architecture. What about the AMD architecture, where there is no snooping but a coherency protocol that runs on top of HyperTransport?
PCIe “no-snoop” is translated into an HT “coherent” bit, which presumably generates HT transactions to make sure that the data is not cached in one of the CPUs.

Additionally, we thought of another reason why asking for non-cacheable buffers can increase performance: if a certain region of memory is not accessed too frequently, keeping it out of the cache leaves more room for other, more frequently accessed data. This latter issue is not directly related to I/O and DMA.

Thanks