Access speed of memory allocated with 'MmAllocateNonCachedMemory'

Hello all,

I have not made exact measurements yet, but at first sight it seems that access to memory allocated with ‘MmAllocateNonCachedMemory()’ is about 4 times slower than to memory allocated with “ExAllocatePoolWithTag” (paged or non-paged doesn’t matter). Does anyone have an idea about the cause of this?

Christiaan

Could it be to do with the ‘NonCached’ bit?

The exact ratio in access speed between non-cached and cached memory will be
somewhere in the order of 2x to 100x, depending on many factors, such as the
speed of the processor, cache sizes, write-combining facilities used, etc.
This is for the obvious reason: the non-cached memory isn’t cached internally
in the processor, so every access forces the processor to go out to external
memory. The processor is also unlikely to do burst accesses to non-cached
memory, but that’s not a guarantee; it definitely depends on the chipset
(memory controller) involved. However, it is guaranteed that if you do a write
to memory followed by a read, the write must finish before the read is allowed
to complete.
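If you want to put an exact number on the ratio for a given machine, a coarse
kernel-mode timing loop along these lines will show it (my sketch, not measured
data; point it at each kind of buffer in turn). A serious measurement would
also pin the thread to one CPU, raise IRQL and average many runs:

    #include <ntddk.h>

    #define BUF_BYTES 4096

    /* Sum-read a buffer 'Passes' times and return the elapsed
     * performance-counter ticks. Coarse illustration only. */
    ULONGLONG TimeReads(volatile ULONG *Buf, ULONG Passes)
    {
        LARGE_INTEGER start, end;
        ULONG pass, i, sink = 0;

        start = KeQueryPerformanceCounter(NULL);
        for (pass = 0; pass < Passes; pass++) {
            for (i = 0; i < BUF_BYTES / sizeof(ULONG); i++) {
                sink += Buf[i];
            }
        }
        end = KeQueryPerformanceCounter(NULL);

        (void)sink;   /* keep the sum from being optimized away */
        return (ULONGLONG)(end.QuadPart - start.QuadPart);
    }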

If the chipset supports write-combining, and the memory type is set to
write-combine rather than “No caching”, then the processor/chipset is
allowed to combine several writes, out of order, so that they appear as one
or more larger writes. Imaginary example:

mov dword ptr [2000], eax
mov dword ptr [2008], ebx
mov dword ptr [2004], ecx
mov dword ptr [200c], edx

These writes are not guaranteed to reach memory in program order, but could
well come out in the more sequential order of:
2000+2004 as one write
2008+200C as one write

However, for completely non-cacheable memory, the writes HAVE TO be performed
in the order in which the processor completes the instructions (and completion
of instructions must be strongly ordered with respect to memory accesses).

The memory access pattern is configured through a series of Model Specific
Registers called MTRRs (Memory Type Range Registers). There are several of
these MTRRs. Through them, memory can be configured as “No caching”, “Write
Combine” or “Cacheable”.
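As a rough illustration (not something a driver normally needs to do), the
MTRR defaults can be inspected from kernel mode. A minimal sketch, assuming
x86/x64 and a compiler that provides the __readmsr intrinsic; the MSR indices
are the architecturally defined IA32_MTRRCAP (0xFE) and IA32_MTRR_DEF_TYPE
(0x2FF):

    #include <ntddk.h>
    #include <intrin.h>

    #define IA32_MTRRCAP        0xFE    /* VCNT, fixed-range and WC support bits */
    #define IA32_MTRR_DEF_TYPE  0x2FF   /* default memory type + enable bits     */

    /* Print the MTRR capability and default-type registers to the debugger. */
    VOID DumpMtrrDefaults(VOID)
    {
        ULONGLONG cap     = __readmsr(IA32_MTRRCAP);
        ULONGLONG defType = __readmsr(IA32_MTRR_DEF_TYPE);

        DbgPrint("MTRRCAP=%I64x: %u variable ranges, WC %ssupported\n",
                 cap, (ULONG)(cap & 0xFF), ((cap >> 10) & 1) ? "" : "not ");
        DbgPrint("MTRR_DEF_TYPE=%I64x: default type %u, MTRRs %s\n",
                 defType, (ULONG)(defType & 0xFF),
                 ((defType >> 11) & 1) ? "enabled" : "disabled");
    }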

Cacheable is of course the complete opposite of non-cacheable: the processor
is allowed to write in ANY order it likes, writes may well happen long after a
read for some other region, and the processor may even read memory on a
speculative basis (i.e. you read address 2000, and the processor decides
“I haven’t got any better idea of what to do, so I’ll load up 2010 too”).

The actual comprehensive list of memory types is:
NC - Non-cacheable. No caching is allowed for this memory. No speculative
reads.
CD - Cache disable. This mode prevents data from being loaded into the cache;
data already in the cache is still available for code caching, but not for
data caching.
WC - Write combine. Allows minimal re-ordering of writes so that the number
of writes is reduced. Speculative reads allowed.
WP - Write protected. This means that the cache is write protected, not
that the memory is write protected. So cache lines are allocated on reads,
but writes go directly to memory and invalidate the cache line.
Speculative reads allowed.
WT - Write through. Writes go to memory at the same time as the cache is
updated. Speculative reads allowed.
WB - Write back. Writes go to memory only when modified data in the cache has
to be evicted. Of course, speculative reads are allowed here too.

If you call ExAllocatePoolWithTag, you will get WB memory.
MmAllocateNonCachedMemory should, as far as I understand, return “NC”
memory. You can also allocate “WC” memory by using
MmAllocateContiguousMemorySpecifyCache, obviously with the MmWriteCombined
cache type.
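To make the mapping between routines and memory types concrete, here is a
minimal sketch of requesting each kind of buffer from a driver. The size, pool
tag and physical-address bounds are arbitrary placeholders, and error handling
is omitted:

    #include <ntddk.h>

    #define BUF_BYTES 4096
    #define BUF_TAG   'fuBt'     /* arbitrary example tag */

    VOID AllocationExamples(VOID)
    {
        PHYSICAL_ADDRESS low, high, boundary;
        PVOID wb, nc, wc;

        low.QuadPart      = 0;
        high.QuadPart     = -1;      /* accept any physical address */
        boundary.QuadPart = 0;       /* no boundary restriction     */

        /* Ordinary non-paged pool: mapped cacheable (WB). */
        wb = ExAllocatePoolWithTag(NonPagedPool, BUF_BYTES, BUF_TAG);

        /* Non-cached ("NC") memory: every access goes all the way to RAM. */
        nc = MmAllocateNonCachedMemory(BUF_BYTES);

        /* Write-combined (WC) memory: stores may be merged into larger writes. */
        wc = MmAllocateContiguousMemorySpecifyCache(BUF_BYTES, low, high, boundary,
                                                    MmWriteCombined);

        /* ... use the buffers, then release each with its matching routine. */
        if (wc != NULL) MmFreeContiguousMemorySpecifyCache(wc, BUF_BYTES, MmWriteCombined);
        if (nc != NULL) MmFreeNonCachedMemory(nc, BUF_BYTES);
        if (wb != NULL) ExFreePoolWithTag(wb, BUF_TAG);
    }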


Mats


Surely noncached memory is slower than cached memory.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com


Considering the amount of silicon spent on modern processors for cache
space, one would think that it did make some difference, at least in SOME
applications… ;-)


Mats

Super! These are the “complete” answers that I like :-)

Big thank you, Mats,

Christiaan


Note there is a huge difference between non-paged and cached: cached means something closer to the speed of light (i.e. the speed of registers?). Non-paged only saves the paging trouble…

-pro

This is a very interesting topic, and I have a related question:
I have a PCI device without DMA, where I could improve the write speed
by 15% by changing the MmMapIoSpace cache type from MmNonCached to
MmFrameBufferCached. That’s good.
But I still only get 26 MB/s!

Would it make a difference to change the driver from an IOCTL using
METHOD_NEITHER to WriteFile with DO_DIRECT_IO?

What is the best way to get long PCI bursts without DMA?

By the way, I’m still using NT4.

/Martin Green
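For reference, a minimal sketch of how such a mapping is set up; the BAR base
address and length here are hypothetical placeholders that would normally come
from the translated resource list (or HalAssignSlotResources on NT4):

    #include <ntddk.h>

    static PUCHAR g_Window;                  /* kernel VA of the device window */
    static SIZE_T g_WindowLength = 0x10000;  /* hypothetical 64 KB BAR         */

    NTSTATUS MapDeviceWindow(PHYSICAL_ADDRESS BarBase)
    {
        /* MmFrameBufferCached typically gives a write-combined mapping, so the
         * processor may merge adjacent stores into larger transactions. */
        g_Window = (PUCHAR)MmMapIoSpace(BarBase, g_WindowLength, MmFrameBufferCached);
        return (g_Window != NULL) ? STATUS_SUCCESS : STATUS_INSUFFICIENT_RESOURCES;
    }

    VOID UnmapDeviceWindow(VOID)
    {
        if (g_Window != NULL) {
            MmUnmapIoSpace(g_Window, g_WindowLength);
            g_Window = NULL;
        }
    }

Copying into such a window with plain RtlCopyMemory lets the write-combining
hardware merge the stores; whether those merged writes actually become long
PCI bursts is up to the host bridge.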


Martin,

In my experience there is no way to guarantee any bursting activity using
CPU-controlled (PIO) transfers. Anything you might be able to come up with
would be chipset/PCI-bridge specific. The only way to control this with any
degree of certainty is to incorporate bus-master DMA on your card.

The difference is significant: on a 66 MHz PCI bus, it can take 120 to 130 ns
per transaction for PIO transfers, whereas burst cycles only incur about 15 ns
per transaction. Of course these numbers would be different on a 33 MHz PCI
bus, but the same principle applies.
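A quick back-of-the-envelope calculation with the figures above (my numbers, a
user-mode throwaway, not measured data) shows why ~26 MB/s is roughly the
ceiling for single-transaction PIO and what full-speed bursts would buy:

    #include <stdio.h>

    int main(void)
    {
        const double bytes_per_phase = 4.0;    /* one 32-bit PCI data phase        */
        const double pio_ns          = 125.0;  /* ~120-130 ns per single PIO write */
        const double burst_ns        = 15.0;   /* one clock per data phase, 66 MHz */

        /* MB/s = bytes / (ns * 1e-9) / 1e6  ==  bytes / ns * 1000 */
        printf("single-transaction PIO: ~%.0f MB/s\n", bytes_per_phase / pio_ns * 1000.0);
        printf("back-to-back bursts:    ~%.0f MB/s\n", bytes_per_phase / burst_ns * 1000.0);
        return 0;
    }

That works out to roughly 32 MB/s for PIO versus roughly 267 MB/s for sustained
bursts, which is why bus-master DMA is the only reliable way to get anywhere
near the bus limit.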


Russ Poffenberger
Credence Systems Corp.
xxxxx@credence.com