Slow memory copying

Hi,

I have a problem with low performance when copying data from my driver into a user-space buffer. The user-space application allocates memory with VirtualAlloc and calls ReadFile on the driver. The driver then calls MmGetSystemAddressForMdlSafe and copies the test data with RtlCopyMemory (effectively memcpy). I measure the execution time of RtlCopyMemory with KeQueryPerformanceCounter and get about 160 MB/s (tested on a 4 MB buffer). That seems too slow, and I don't know why.
The driver uses DO_DIRECT_IO, and the memory for the test data is allocated with MmAllocateContiguousMemorySpecifyCache.

As cached or non-cached memory?

Peter
OSR
@OSRDrivers

Non-cached. I understand that creates a performance penalty, but is it really that drastic?
I doubt caching would help, because the hardware device writes directly into the kernel buffer
and I read from it only once. It's basically streaming: always new data.

Each read (8 or 16 bytes, depending on the instruction type) generates an access to the DRAM module instead of using cached data brought in bulk across the whole data-bus width. The time required to read 8 bytes from DRAM into a register is the same as the time to bring 64 bytes from DRAM into the cache, but in the latter case the following seven reads fetch their data from the cache. You also get hardware prefetching, which brings the next 64 bytes into the cache while the CPU is busy copying the current cache line (eight read instructions).

xxxxx@gmail.com wrote:

Non-cached. I understand that creates performance penalties but that drastic?

Yes, the difference is huge. I did some experiments for a client on a
laptop in the Vista timeframe by forcing all memory to non-cached. The
Windows boot time went from 2 minutes to 8 minutes.

I doubt that cache would help because hardware device writes directly to kernel buffer
and I read only once from it. Streaming basically, always new data.

In an Intel x86 environment, bus-mastered access is cache-coherent. The
only reason to have a memory region be non-cached is if the memory lives
on a device where the contents are dynamic.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Non-cached memory is SLOW, very slow. The CPU cache is critical for
performance.


It was the cache indeed. Now I'm not sure whether to use MmCached or MmHardwareCoherentCached, but I guess I should stick with MmCached, since the other one is, as MSDN says, reserved for system use.
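For anyone finding this thread later, a minimal sketch of the corrected allocation (kernel-mode WDM fragment, not standalone-buildable; the physical-address bounds are placeholders for whatever your hardware actually supports):

```c
/* Hypothetical fragment: allocate the DMA test buffer as cached memory.
 * Bus-master DMA on x86 is cache-coherent, per Tim's reply above, so
 * MmCached is safe here. */
PHYSICAL_ADDRESS low  = { .QuadPart = 0 };
PHYSICAL_ADDRESS high = { .QuadPart = ~0ULL };  /* placeholder bound */
PHYSICAL_ADDRESS skip = { .QuadPart = 0 };      /* no boundary multiple */

PVOID buffer = MmAllocateContiguousMemorySpecifyCache(
    4 * 1024 * 1024,   /* 4 MB test buffer */
    low, high, skip,
    MmCached);         /* MmCached instead of MmNonCached */

if (buffer == NULL) {
    /* handle allocation failure */
}
```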

Thanks, everyone!