DMA Memory - MmAllocateContiguousNodeMemory

xxxxx@gmail.com wrote:

It’s intended for Intel’s architecture streaming (SIMD) load/store
instructions. To be precise, a store instruction requires SSE2 support,
and a load, correspondingly, requires the SSE4.1 extension.

Write combining started way before that.  It was originally designed as
a way to speed up graphics operations, by allowing bitmap writes to the
frame buffer to exploit bus bursting, but without turning on full
caching.  Without WC, each write becomes a complete bus cycle.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I think the concept of write combining is a few decades older than video hardware.

Sent from Mail for Windows 10

________________________________
From: xxxxx@lists.osr.com on behalf of xxxxx@probo.com
Sent: Monday, July 16, 2018 11:39:01 AM
To: Windows System Software Devs Interest List
Subject: Re: [ntdev] DMA Memory - MmAllocateContiguousNodeMemory

xxxxx@gmail.com wrote:
>
> It’s intended for Intel’s architecture streaming (SIMD) load/store
> instructions. To be precise, a store instruction requires SSE2 support,
> and a load, correspondingly, requires the SSE4.1 extension.

Write combining started way before that. It was originally designed as
a way to speed up graphics operations, by allowing bitmap writes to the
frame buffer to exploit bus bursting, but without turning on full
caching. Without WC, each write becomes a complete bus cycle.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.


NTDEV is sponsored by OSR

Visit the list online at: http:

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers!
Details at http:

To unsubscribe, visit the List Server section of OSR Online at http:

The hardware does have something similar to write combining; it’s called gathering: https://developer.arm.com/products/architecture/a-profile/docs/100941/latest/memory-types.

I could not say for sure whether the proper bits get set to enable this if you use the PAGE_WRITECOMBINE flag on
MmMapIoSpaceEx. If not, it seems like a bug. I know write combining makes a significant difference in performance on a simple video frame buffer.

Jan

-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of xxxxx@fastmail.fm
Sent: Sunday, July 15, 2018 3:56 PM
To: Windows System Software Devs Interest List
Subject: RE: [ntdev] DMA Memory - MmAllocateContiguousNodeMemory

Jan Bottorff wrote:
> I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)

Great. Could you please check whether these machines have PAGE_WRITECOMBINE at all, or is it specific to the Intel architecture?

Regards,
– pa



I have come across the Write Combining Usage Guidelines, which is quite old: http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.

“WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor’s bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses”

Hence, I can’t use PAGE_WRITECOMBINE for system memory used for DMA. Better, I would use PAGE_NOCACHE.

I had an interesting observation as below.

If I do memcpy(1024 bytes) to DMA memory, it takes 33us. But if I do memcpy(16 bytes) in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

So, I am planning to go with 16-byte chunk copies for now, with the PAGE_NOCACHE flag.

xxxxx@gmail.com wrote:

I have come across the Write Combining Usage Guidelines, which is quite old: http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.

“WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor’s bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses”

Hence, I can’t use PAGE_WRITECOMBINE for system memory used for DMA. Better, I would use PAGE_NOCACHE.

How on earth did you come to that conclusion?  The original Pentium had
three caching options for each region, from least to most performant:
uncached, write-combined, fully cached.  This is what the MTRR tables
specify.  In virtually every case of DMA in an x86 or x64 architecture,
you want fully cached.

I had an interesting observation as below.

If I do memcpy(1024 bytes) to DMA memory, it takes 33us. But if I do memcpy(16 bytes) in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

I’d like to see your code, because I don’t believe your first two numbers.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Contiguous memory is allocated using the following code snippet. The target system is x64 (Asus Z170 Deluxe).

PHYSICAL_ADDRESS LowestAcceptableAddress = { 0 };
PHYSICAL_ADDRESS HighestAcceptableAddress = { 0 };
PHYSICAL_ADDRESS BoundaryAddressMultiple = { 0 };

HighestAcceptableAddress.QuadPart = 0xFFFFFFFFFFFFFFFF;  // no upper limit
Protect = PAGE_READWRITE | PAGE_NOCACHE;
ChunkSize = 32 * 1024 * 1024;                            // 32 MB
NumaNode = 0;

// Allocate physically contiguous, non-cached memory on NUMA node 0.
// (The return value should be checked for NULL before use.)
SystemVA = MmAllocateContiguousNodeMemory(ChunkSize,
                                          LowestAcceptableAddress,
                                          HighestAcceptableAddress,
                                          BoundaryAddressMultiple,
                                          Protect,
                                          NumaNode);

// Describe the buffer with an MDL. (IoAllocateMdl can also return NULL.)
pMdl = IoAllocateMdl(SystemVA, ChunkSize, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(pMdl);

// Map the buffer into user space, non-cached; page-align the returned
// pointer and add back the MDL byte offset.
UserVA = (((ULONG_PTR)PAGE_ALIGN(MmMapLockedPagesSpecifyCache(pMdl,
              UserMode, MmNonCached, NULL, FALSE, HighPagePriority)))
          + MmGetMdlByteOffset(pMdl));

MappedPhyAddr = MmGetPhysicalAddress(SystemVA);

The 64 bit application pseudocode is as follows.

  1. Get the UserVA from the driver through an IOCTL call.
  2. Allocate a 32MB temporary buffer and initialize it with zeros.
  3. Do memcpy in the mentioned chunks (16 byte or 1024 byte) for the full 32MB.

Profiling is done using QueryPerformanceCounter() wrapped around memcpy().

As you mentioned, since the system is x64, I would definitely want to take advantage of the cache if the system is cache-coherent for DMA operations. I will have to test the system for this.

The contiguous buffer which I allocated is used by both the user-space application and the device for reads and writes simultaneously. But at any point in time, only one of them accesses a given memory range in the buffer.