How to map memory as uncached

Hello,

Let’s say I have a PCI device with memory mapped I/O. I have a device driver which currently obtains the memory regions, maps them into logical memory and when a user-mode process opens the device, it maps that memory into user-mode land.

How do you tell the HAL to map the memory without caching? I would like all accesses to the memory regions to go directly to the board, without any read/write caching. If caching were to take place, we would be required to explicitly flush the cache… In reading the Intel documentation, we have found that the CPUID instruction can serialize the CPU and also possibly flush the cache. The CPUID instruction is available from user mode. I have used CPUID for identification, but have never used it to serialize the CPU. Does anyone have any feedback about using CPUID from user mode?

  1. How to map memory uncached?
  2. Advice on using CPUID for serialization and cache flush…

Thank you,
Michael Uman

>How do you tell the HAL to map the memory without caching? I would like
>all accesses to the memory regions to go directly to the board, without
>any read/write caching. If caching were to take place, then we would be
>required to explicitly flush the cache… In reading the Intel
>documentation, we have found that the CPUID instruction can serialize the
>CPU and also possibly flush the cache. The CPUID instruction is available
>from user mode land. I have used the CPUID for identification, but have
>never used it to serialize the CPU. Does anyone have any feedback about
>using CPUID from user mode?
>
>  1. How to map memory uncached?
>  2. Advice on using CPUID for serialization and cache flush…

When you map the device memory in the driver, one of the parameters to the
mapping API selects a cached or uncached mapping (MmMapIoSpace, for
example, takes a caching argument). Note that caching and instruction
stream serialization are different things. Also, to get PCI bursts larger
than 1 DWORD to a target PCI device, you may need the mapping to be cached
(especially for PCI target reads).

Instruction stream synchronization may be needed if, for example, accesses
to system memory shared with a bus master device must be ordered with
respect to your uncached bus accesses. If you don’t synchronize the
instruction stream, you could write to device memory and then read from
memory that is written by bus mastering, and because the processor may
execute instructions out of order, the read of shared memory may actually
happen BEFORE the device memory write. The Intel processor manual has a
large section on exactly how reads/writes are ordered to different types of
addresses. You may also want the mapping to be write buffered/combined to
get better PCI target burst performance.
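
As a rough illustration of where such a barrier goes, here is a minimal
sketch in portable C11. The names `device_doorbell` and `dma_status` are
hypothetical stand-ins for a mapped device register and a status word that
a bus master writes; here they are ordinary variables so the code runs
anywhere. The C11 full fence is an assumption made for portability; a real
Windows driver would more likely use the kernel's own barrier or register
access routines.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins: in a driver these would be a mapped device
 * register and a status word in system memory written by the bus master. */
static volatile uint32_t device_doorbell;
static volatile uint32_t dma_status = 0x1234;

/* Write the device register, then read the shared status word.
 * The full fence between them asks the compiler and the CPU not to let
 * the later load be performed ahead of the earlier store. Strictly, C11
 * defines fences in terms of atomic objects; in practice compilers emit
 * a hardware full barrier here (on x86 typically MFENCE or a locked
 * instruction). */
static uint32_t ring_then_read(uint32_t command)
{
    device_doorbell = command;                 /* store to "device memory" */
    atomic_thread_fence(memory_order_seq_cst); /* full barrier */
    return dma_status;                         /* load from "shared memory" */
}
```

Calling `ring_then_read(1)` stores 1 to the stand-in register and returns
the current status word. Without the fence, nothing in the language or the
hardware rules out the load being satisfied first.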

Actually FLUSHING the memory cache can be VERY expensive (40,000+ clocks,
as I remember), and processor performance also suffers while the cache
reloads. Some versions of Windows also don’t manage the MTRRs very well,
so getting the exact caching/write combining behavior you want can be
tricky.

Also note that bugs caused by instruction stream synchronization and
caching issues can be very tricky to debug.

Also note that, depending on your hardware design, mapping a device directly
into user mode space has the potential to seriously disrupt the whole
system. Specifically, if your PCI device may issue PCI bus transaction
retries for a long time to stall an access, the processor can be FROZEN on
the read/write until it completes, and interrupts will NOT be serviced
while the processor waits. For PCI 2.1 and later devices, bus master
activity should interleave with the retry attempts. Some video cards have
produced interrupt latencies of > 10 milliseconds because of this, causing
other devices to malfunction. Kernel mode code can cause this latency
disruption too, but kernel mode developers are supposed to know better.

  • Jan

>How do you tell the HAL to map the memory without caching?

IIRC some bit in the card’s BAR register specifies that this memory cannot
be cached. HalAssignSlotResources sets the appropriate flag in
CM_RESOURCE_LIST if this BAR bit is set.

Max

>IIRC some bit in the card’s BAR register specifies that this memory cannot
>be cached. HalAssignSlotResources sets the appropriate flag in
>CM_RESOURCE_LIST if this BAR bit is set.

I thought the bit in the BAR indicated the ability to prefetch the memory.
An example would be that you read one DWORD and the bus controller actually
does a large burst read. Subsequent consecutive reads then come from
buffers, which I think can actually be different buffers than the normal
cache. This would be prefetchable uncached memory. These buffers may be in
the system chipset or even a PCI bridge. To get good target read
performance, you have to have bursts bigger than 1 DWORD.

When I think about really caching ANY device memory in the normal processor
cache, it sounds really ugly. Say you write a DWORD that’s not cache line
aligned: that implies the cache controller should first read the line
(assuming you’re doing cache line allocation on write, which modern Intel
processors do).

Write combining, which is different from caching, seems more appropriate
for many devices. If a region is set for non-prefetched uncached but write
combined, each read gets translated into a single DWORD read on the PCI
bus. Writes to consecutive addresses may be deferred and merged into a
large burst write. For video memory, this gives much better performance.
I’m not quite sure how to force flushing the write combining buffers
though. Ideally, a video card would use something like prefetchable
uncached write combining for the frame buffer, and non-prefetchable
uncached non-write combining for its control registers.
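
To make the BAR point concrete, here is a small sketch in C that decodes
the low bits of a 32-bit PCI memory BAR and picks a mapping policy along
the lines suggested above. The bit positions follow the PCI specification
(bit 0 clear for memory space, bit 3 for prefetchable). The `map_policy`
enum, the helper names, and the `is_frame_buffer` flag are hypothetical
and exist only for this illustration; they are not Windows types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy names for this sketch; these are NOT the Windows
 * MEMORY_CACHING_TYPE values. */
typedef enum {
    POLICY_UNCACHED,       /* every access goes to the bus as issued */
    POLICY_WRITE_COMBINED  /* uncached reads, writes may merge into bursts */
} map_policy;

/* Bit 0 of a BAR is 0 for memory space and 1 for I/O space. */
static bool bar_is_memory(uint32_t bar)
{
    return (bar & 0x1u) == 0;
}

/* Bit 3 of a memory BAR is the prefetchable bit. */
static bool bar_is_prefetchable(uint32_t bar)
{
    return bar_is_memory(bar) && (bar & 0x8u) != 0;
}

/* The low 4 bits of a memory BAR are flags; the rest is the base address
 * (for a 64-bit BAR, only the low half of it). */
static uint32_t bar_base(uint32_t bar)
{
    return bar & ~0xFu;
}

/* Use write combining only for a prefetchable region that the driver
 * knows is a frame buffer; keep everything else (for example control
 * registers) plain uncached. */
static map_policy choose_policy(uint32_t bar, bool is_frame_buffer)
{
    if (bar_is_prefetchable(bar) && is_frame_buffer)
        return POLICY_WRITE_COMBINED;
    return POLICY_UNCACHED;
}
```

For example, a BAR value of 0xE0000008 decodes as prefetchable memory at
base 0xE0000000, so a frame buffer there would get the write combined
policy, while a register BAR such as 0xF0000000 stays uncached.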

  • Jan