DMA Memory - MmAllocateContiguousNodeMemory

Because Windows provides an architecture that specifically provides this feature, and a set of standardized interfaces. Read:

http:

There is, or was, a long Microsoft white paper on this topic (that I wrote) that describes all the function calls used in both WDM and WDF as well…

Peter
OSR
@OSRDrivers

Yes, software and abstraction are always nice to have, but you should have at least a minimal clue about what you are doing and what the hardware is all about.

2018-07-13 12:58 GMT+02:00 xxxxx@osr.com :

> Because Windows provides an architecture that specifically provides this
> feature, and a set of standardized interfaces. Read:
>
> http:
>
> There is, or was, a long Microsoft white paper on this topic (that I
> wrote) that describes all the function calls used in both WDM and WDF as
> well…
>
> Peter
> OSR
> @OSRDrivers

Thanks for the information and the DMA article. I had a look at it and am summarizing the findings below. Please correct me if I am wrong.

  1. If the device supports Scatter/Gather DMA and Bus Mastering, the Device Bus Logical Address = Physical Address. In this case, using the physical address of the physically contiguous buffer, obtained from MmGetPhysicalAddress(), for DMA operations should work.

  2. x64-based systems are cache coherent, and hence there is no need to flush for DMA (KeFlushIoBuffers is an empty macro). My environment is x64.

If the above two hold good, should I be worried about transferring data between the application and the device using the approach I followed with PAGE_WRITECOMBINE?

Yes. CLEARLY this is required.

Yes… IF the Bus Master is natively capable of DMA operations that reach all of memory. So, for a conventional PC architecture, the Bus Master is natively capable of 64-bit addressing.

But, of course, this does not relieve you of the architectural requirement to utilize the Windows APIs for your DMA operations… which is still required. You may *not* ever make code assumptions that Physical Address == Device Bus Logical Address. You may *never* call MmGetPhysicalAddress and stuff that address into your DMA adapter, as doing so violates Windows OS architecture rules (and the contract that your driver has with the kernel and the HAL).
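
As a rough sketch of what using the Windows DMA APIs looks like in KMDF (device, request, EvtProgramDma, and MAX_TRANSFER_LENGTH are placeholder names, not from the thread) - the framework, not MmGetPhysicalAddress, is what produces the device bus logical addresses:

// Sketch only: "device" and "request" are a WDFDEVICE/WDFREQUEST assumed to
// exist elsewhere in the driver.
EVT_WDF_PROGRAM_DMA EvtProgramDma;      // receives the SCATTER_GATHER_LIST

WDF_DMA_ENABLER_CONFIG dmaConfig;
WDFDMAENABLER dmaEnabler;
WDFDMATRANSACTION dmaTransaction;
NTSTATUS status;

// Describe the device as a 64-bit scatter/gather bus master.
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig,
                            WdfDmaProfileScatterGather64,
                            MAX_TRANSFER_LENGTH);
status = WdfDmaEnablerCreate(device, &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES, &dmaEnabler);

// Per transfer: wrap the request's buffer in a transaction and let the
// framework deliver device bus *logical* addresses to EvtProgramDma.
status = WdfDmaTransactionCreate(dmaEnabler, WDF_NO_OBJECT_ATTRIBUTES,
                                 &dmaTransaction);
status = WdfDmaTransactionInitializeUsingRequest(dmaTransaction, request,
                                                 EvtProgramDma,
                                                 WdfDmaDirectionWriteToDevice);
status = WdfDmaTransactionExecute(dmaTransaction, NULL);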

Yes… but this does not relieve you of the architectural requirement to call KeFlushIoBuffers… which is still required. The fact that this does or does not generate an actual flush is a matter of implementation and not of architecture.
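
A minimal sketch of that call, assuming pMdl describes the buffer and the transfer moves data from memory to the device:

// Architecturally required before starting the hardware, even if it expands
// to nothing on cache-coherent x86/x64 platforms.
KeFlushIoBuffers(pMdl,
                 FALSE,   // ReadOperation: FALSE = data flows memory -> device
                 TRUE);   // DmaOperation: this flush is for a DMA transfer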

I wouldn’t use PAGE_WRITECOMBINE. The only place I’ve ever heard of using WRITECOMBINE is on a video frame buffer. So I, personally, wouldn’t be comfortable using WRITECOMBINE for another sort of DMA mapping… and CERTAINLY not one that was shared with user-mode. But, that’s just me… Perhaps you’re smarter than I am and understand all the ramifications of using WRITECOMBINE.

Peter
OSR
@OSRDrivers

…which does not necessarily hold true for a system with IOMMU

In general, these days you have to stop making any assumptions about the target architecture - NT once again supports more than just x86 and x86_64, and even the above-mentioned “conventional” architectures may still have “not-so-standard” features like an IOMMU…

Anton Bassov

>…which does not necessarily hold true for a system with IOMMU

Outstanding point. And absolutely correct.

(Holy SHIT! Mr. Bassov just made a post that “added value” to a thread! Mr. Bassov… you may take the weekend off, in celebration of this great event!)

Peter
OSR
@OSRDrivers

xxxxx@gmail.com wrote:

So, with MmNonCached in both MmAllocateContiguousNodeMemory & MmMapLockedPagesSpecifyCache, is the huge latency (34us for 1024 bytes) I am seeing expected?

Yes, absolutely.  Without caching, every dword you write to that region
blocks until the write gets out to RAM.  With caching, your writes go to
the cache, which takes one cycle, and then your app can move on.  The
cache can take its time flushing to RAM later in big block writes,
instead of blocking one dword at a time.

It’s easy to underestimate the enormous effect caching has on system
performance.  About 10 years ago, I did some work for a patent attorney
who wanted to know if it was practical to run Windows without caching. 
When I turned off caching on all of RAM, it took 9 minutes for Vista to
boot.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Are you still desperate to port my account to the new (and scary, at least for me) platform in Troll Mode???

Anton Bassov

> Are you still desperate to port my account…

One post is not quite enough to earn you redemption, Mr Bassov.

Peter
OSR
@OSRDrivers

> One post is not quite enough to earn you redemption, Mr Bassov.

Oh, come on - I have mainly behaved recently, although still with some VERY infrequent trolling attempts (like the one on the C++ thread)…

Anton Bassov

> In general, these days you have to stop making any assumptions about the target architecture - NT once again supports more than just x86 and x86_64…

+1

I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2). It’s taking a while to discover and unlearn the sometimes subtle x64 assumptions I unconsciously make.

The typical server I work on now has 2 sockets, 64 cores, and 256 logical processors (4 way hyperthreading).

Jan

> It’s taking a while to discover and unlearn the sometimes subtle x64 assumptions I unconsciously make.

The very first example that comes to my mind is an atomic operation like increment, add, exchange, or any other atomic operation with functionality that goes beyond bit-test-and-set…

On x86, performing such an operation may be used as an optimisation in some cases, because it is implemented in hardware. However, some other architectures may not necessarily provide hardware support for anything atomic apart from a simple bit-test-and-set.

What do you think using a function like, say, atomic_increment() may be like on such an arch?
Although such an atomic function may be available on an architecture that does not support atomic addition in hardware, it is going to be implemented in the arch-specific OS layer as a simple addition guarded by a spinlock built around the atomic bit-test-and-set. Although its semantics are going to be exactly the same as on x86, using it may in actuality result in a performance penalty rather than the optimisation that it offers on x86…

Anton Bassov
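
A rough sketch of what such an emulated atomic_increment() might look like on an architecture whose only hardware atomic is test-and-set (hw_test_and_set and the guard variable are hypothetical names, not any particular OS's implementation; memory barriers omitted for brevity):

// Hypothetical arch-support layer: the only hardware atomic is test-and-set.
extern long hw_test_and_set(volatile long *flag);  // sets flag, returns old value

static volatile long g_atomic_guard;   // 0 = free, 1 = held

static void guard_acquire(void)
{
    // Spin until we win the test-and-set; this IS the "spinlock built
    // around bit-test-and-set" mentioned above.
    while (hw_test_and_set(&g_atomic_guard) != 0) {
        /* spin */
    }
}

static void guard_release(void)
{
    g_atomic_guard = 0;
}

// atomic_increment() with the same semantics as the x86 version, but
// noticeably more expensive: an acquire, an add, and a release per call.
long atomic_increment(volatile long *value)
{
    long result;

    guard_acquire();
    result = ++(*value);
    guard_release();

    return result;
}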

Jan Bottorff wrote

I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)

Great. Could you please check whether these creatures have PAGE_WRITECOMBINE at all, or is it specific to the Intel arch?

Regards,
– pa

It’s intended for Intel’s architecture streaming (SIMD) load/store
instructions. To be precise, a store instruction requires SSE2 support and a
load, consequently, requires the SSE4.1 extension.
In the beginning, the main goal of the write-combine cache type for MMIO
was speeding up the execution of store instructions. With the introduction of
SSE4.1 (“using a non-temporal memory hint”) it became possible for load
instructions as well.
I’m using those instructions for the so-called non-DMA approach to data
transmission. Certainly, it is slower than normal HW DMA, but it is very useful
in case your HW has no DMA controller.
Nowadays, this technology is able to support the transmission of up to 512 bits
of data per instruction (128/256/512), provided the CPU supports the AVX and
AVX-512 extensions respectively.

Regards,
Dzmitry

On Mon, Jul 16, 2018 at 12:56 AM xxxxx@fastmail.fm wrote:

> Jan Bottorff wrote
> > I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)
>
> Great. Could you please check whether these creatures have PAGE_WRITECOMBINE
> at all, or is it specific to the Intel arch?
>
> Regards,
> – pa
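
A rough sketch of the non-temporal-store copy described above, assuming x64 with SSE2 and a write-combined mapping obtained elsewhere (wc_copy, dst, src, and bytes are illustrative names):

#include <stddef.h>
#include <emmintrin.h>   // SSE2 intrinsics

// Copy 'bytes' (a multiple of 16; 16-byte-aligned buffers assumed) to a
// write-combined mapping using non-temporal 128-bit stores.
void wc_copy(void *dst, const void *src, size_t bytes)
{
    __m128i       *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;

    for (size_t i = 0; i < bytes / 16; i++) {
        // MOVNTDQ: bypasses the cache and lets the WC buffers combine writes
        _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
    }

    // Make the combined writes globally visible before telling the device
    _mm_sfence();
}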

xxxxx@gmail.com wrote:

It’s intended for Intel’s architecture streaming (SIMD) load/store
instructions. To be precise, a store instruction requires SSE2 support and a
load, consequently, requires the SSE4.1 extension.

Write combining started way before that.  It was originally designed as
a way to speed up graphics operations, by allowing bitmap writes to the
frame buffer to exploit bus bursting, but without turning on full
caching.  Without WC, each write becomes a complete bus cycle.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I think the concept of write combining is a few decades older than video hardware.


The ARM64 hardware does have something similar to write combining; it’s called gathering: https://developer.arm.com/products/architecture/a-profile/docs/100941/latest/memory-types.

I could not say for sure whether the proper bits get set to enable this if you use the PAGE_WRITECOMBINE flag on MmMapIoSpaceEx. If not, it seems like a bug. Write combining, I know, makes a significant difference in performance on a simple video frame buffer.

Jan
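
For reference, the mapping in question would look something like the following sketch (BarPhysicalAddress and BarLength are placeholders for whatever BAR the device exposes); whether this actually enables gathering on ARM64 is exactly the open question:

// Sketch only: BarPhysicalAddress (PHYSICAL_ADDRESS) and BarLength (SIZE_T)
// are placeholders for the device's BAR. MmMapIoSpaceEx is Windows 10+.
PVOID regs = MmMapIoSpaceEx(BarPhysicalAddress,
                            BarLength,
                            PAGE_READWRITE | PAGE_WRITECOMBINE);
if (regs == NULL) {
    // mapping failed; fall back to PAGE_NOCACHE or fail device start
}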


I have come across the Write Combining Usage Guidelines, which is quite old: http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.

“WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor’s bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses”

Hence, I can’t use PAGE_WRITECOMBINE for system memory used for DMA. I would rather use PAGE_NOCACHE.

I had an interesting observation as below.

If I do a memcpy of 1024 bytes to the DMA memory, it takes 33us. But if I do memcpy of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

So, I am planning to go with the 16-byte chunk copy for now, with the PAGE_NOCACHE flag.
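
A sketch of the chunked copy being described (illustrative names; dmaVA is the user-mode mapping returned by the driver, and the loop reproduces the 64 × 16-byte variant of the 1024-byte copy):

#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE  16u
#define TOTAL_SIZE  1024u

// Copy TOTAL_SIZE bytes into the uncached mapping in CHUNK_SIZE pieces.
static void chunked_copy(void *dmaVA, const void *tempBuffer)
{
    unsigned char       *dst = (unsigned char *)dmaVA;
    const unsigned char *src = (const unsigned char *)tempBuffer;

    for (size_t offset = 0; offset < TOTAL_SIZE; offset += CHUNK_SIZE) {
        memcpy(dst + offset, src + offset, CHUNK_SIZE);
    }
}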

xxxxx@gmail.com wrote:

I have come across the Write Combining Usage Guidelines, which is quite old: http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.

“WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor’s bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses”

Hence, I can’t use PAGE_WRITECOMBINE for system memory used for DMA. I would rather use PAGE_NOCACHE.

How on earth did you come to that conclusion?  The original Pentium had
three caching options for each region, from least to most performant:
uncached, write-combined, fully cached.  This is what the MTRR tables
specify.  In virtually every case of DMA in an x86 or x64 architecture,
you want fully cached.

I had an interesting observation as below.

If I do a memcpy of 1024 bytes to the DMA memory, it takes 33us. But if I do memcpy of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

I’d like to see your code, because I don’t believe your first two numbers.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Contiguous memory is allocated using the following code snippet. The target system is x64 (Asus Z170 Deluxe).

PHYSICAL_ADDRESS LowestAcceptableAddress = { 0 };
PHYSICAL_ADDRESS HighestAcceptableAddress = { 0 };
PHYSICAL_ADDRESS BoundaryAddressMultiple = { 0 };

HighestAcceptableAddress.QuadPart = 0xFFFFFFFFFFFFFFFF;   // no upper limit
Protect = PAGE_READWRITE | PAGE_NOCACHE;
ChunkSize = 32 * 1024 * 1024;                             // 32 MB
NumaNode = 0;

// Allocate a physically contiguous, non-cached buffer on NUMA node 0
SystemVA = MmAllocateContiguousNodeMemory(ChunkSize,
                                          LowestAcceptableAddress,
                                          HighestAcceptableAddress,
                                          BoundaryAddressMultiple,
                                          Protect,
                                          NumaNode);

// Describe the buffer with an MDL, then map the same pages into the
// calling process, also non-cached
pMdl = IoAllocateMdl(SystemVA, ChunkSize, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(pMdl);

UserVA = (((ULONG_PTR)PAGE_ALIGN(MmMapLockedPagesSpecifyCache(pMdl,
              UserMode, MmNonCached, NULL, FALSE, HighPagePriority)))
          + MmGetMdlByteOffset(pMdl));

MappedPhyAddr = MmGetPhysicalAddress(SystemVA);

The 64-bit application pseudocode is as follows.

  1. Get the UserVA from the driver through an IOCTL call.
  2. Allocate a temporary 32 MB buffer and initialize it with zeros.
  3. memcpy the 32 MB to the mapped buffer in the mentioned chunk sizes (16 bytes or 1024 bytes).

Profiling is done using QueryPerformanceCounter() wrapped around memcpy().
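
A sketch of that measurement, assuming a user-mode test app where UserVA is the mapping obtained via the IOCTL and tempBuffer is the zero-initialized source (illustrative names):

#include <windows.h>
#include <stdio.h>
#include <string.h>

// Time a single 1024-byte memcpy into the mapped DMA buffer.
void time_copy(void *UserVA, const void *tempBuffer)
{
    LARGE_INTEGER freq, start, stop;

    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    memcpy(UserVA, tempBuffer, 1024);
    QueryPerformanceCounter(&stop);

    double us = (double)(stop.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
    printf("memcpy of 1024 bytes took %.2f us\n", us);
}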

As you mentioned, since the system is x64, I would definitely want to take advantage of the cache if the system is cache coherent for DMA operations. I would have to test the system for this.

The contiguous buffer I allocated is used by both the user-space application and the device for reads/writes simultaneously, but at any point in time only one of them accesses a given memory range in the buffer.