DMA Memory - MmAllocateContiguousNodeMemory

On Jul 12, 2018, at 11:11 PM, xxxxx@gmail.com wrote:
>
> The issue is, when the application copies data to this DMA memory, it takes a lot of time. For example, copying 1024 bytes from a local malloc() buffer to this DMA memory takes 34us.
>
> Behaviour was almost the same across different test systems. Then I reviewed my implementation again and changed the flag “PAGE_NOCACHE” to “PAGE_WRITECOMBINE” in Step 1. Now, copying 1024 bytes takes just 0.03us.
> …
> Is using “PAGE_WRITECOMBINE” the right way for DMA memory (from app to device)?
> Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead of MmWriteCombined. Is this OK?
>
> Basically, I don’t want any data that my user application copies to this DMA memory to be cached; in that case, my device would see incorrect data.

No, it won’t. In non-exotic Intel-based architectures, DMA transfers are cache-coherent. However, you should look into KeFlushIoBuffers and FlushAdapterBuffers so your driver works in other architectures.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

In my case, the kernel driver just does the memory allocation and hands it to the user application. Data will be transferred directly between the application and the device. Also, the PCIe device is the bus master here and no system DMA is used.

Doesn’t the MmNonCached setting in the MmMapLockedPagesSpecifyCache call disable caching for this reserved memory area?

> In my case, the kernel driver just does the memory allocation and hands it
> to the user application. Data will be transferred directly between the
> application and the device. Also, the PCIe device is the bus master here
> and no system DMA is used.
>
> Doesn’t the MmNonCached setting in the MmMapLockedPagesSpecifyCache call
> disable caching for this reserved memory area?

No, it doesn’t. I think you are digging in the wrong direction. Instead of
fighting with the system API, it is better to check how you configure your
HW DMA controller.
With respect to the caching type, in your case (bus mastering, HW DMA)
“MmNonCached” is the correct setting. “MmWriteCombined” is more relevant for
the streaming (non-DMA) approach to data transmission.

Regards,
Dzmitry Altukhou

So, with MmNonCached in both MmAllocateContiguousNodeMemory & MmMapLockedPagesSpecifyCache, is the huge latency (34us for 1024 bytes) I am seeing expected?

How can you program a DMA transfer without any clue about the architecture/busses?
Please read about PCI/QPI/UPI first, and see what these flags are all about and why.
mp

2018-07-13 8:11 GMT+02:00 xxxxx@gmail.com :

> Hi
>
> I need to allocate shared contiguous memory (for DMA) on a particular NUMA
> node to transfer data directly between user space application and PCIe
> device. I have implemented the following in the kernel driver for buffer
> allocation; it is called in the user application’s context (a consolidated
> sketch of these steps appears after this quoted message).
>
> 1) MmAllocateContiguousNodeMemory with “PAGE_READWRITE | PAGE_NOCACHE”
> flags.
> 2) Allocated MDL using IoAllocateMdl with address returned from Step 1 and
> built the MDL using MmBuildMdlForNonPagedPool.
> 3) MmMapLockedPagesSpecifyCache with UserMode, MmNonCached flags to get
> User Space Virtual Address of the allocated memory.
> 4) MmGetPhysicalAddress to get the starting physical address of the
> allocated memory.
>
> With the above, I am able to transfer data between application and device.
>
> The issue is, when the application copies data to this DMA memory, it takes
> a lot of time. For example, copying 1024 bytes from a local malloc() buffer
> to this DMA memory takes 34us.
>
> Behaviour was almost the same across different test systems. Then I reviewed
> my implementation again and changed the flag “PAGE_NOCACHE” to
> “PAGE_WRITECOMBINE” in Step 1. Now, copying 1024 bytes takes just 0.03us.
>
> In https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ne-wdm-_memory_caching_type
> it is mentioned that:
> MmNonCached - The requested memory should not be cached by the processor.
> MmWriteCombined - The requested memory should not be cached by the
> processor, “but writes to the memory can be combined by the processor”.
>
> Is using “PAGE_WRITECOMBINE” the right way for DMA memory (from app to
> device)?
> Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead
> of MmWriteCombined. Is this OK?
>
> Basically, I don’t want any data that my user application copies to this
> DMA memory to be cached; in that case, my device would see incorrect
> data.
>
> Regards,
> MK
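
For reference, a hedged reconstruction of the four steps quoted above (the buffer size and NUMA node are placeholders and most error handling is omitted; this mirrors the poster's description and is not a recommended pattern - see the replies that follow):

    #include <wdm.h>

    NTSTATUS AllocateAndMapDmaBuffer(SIZE_T bufferSize,
                                     PVOID *userVa,
                                     PHYSICAL_ADDRESS *physAddr)
    {
        PHYSICAL_ADDRESS lowest, highest, boundary;
        PVOID kernelVa;
        PMDL mdl;

        lowest.QuadPart   = 0;
        highest.QuadPart  = (LONGLONG)-1;    /* no upper address limit */
        boundary.QuadPart = 0;

        /* Step 1: contiguous, non-cached allocation on a preferred NUMA node. */
        kernelVa = MmAllocateContiguousNodeMemory(bufferSize,
                                                  lowest, highest, boundary,
                                                  PAGE_READWRITE | PAGE_NOCACHE,
                                                  0 /* preferred node */);
        if (kernelVa == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        /* Step 2: describe the buffer with an MDL. */
        mdl = IoAllocateMdl(kernelVa, (ULONG)bufferSize, FALSE, FALSE, NULL);
        MmBuildMdlForNonPagedPool(mdl);

        /* Step 3: map into the calling application's address space
           (a real driver must wrap this call in __try/__except). */
        *userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmNonCached,
                                               NULL, FALSE, NormalPagePriority);

        /* Step 4: physical address handed to the bus-master device; the
           replies below explain why this last step is architecturally unsafe. */
        *physAddr = MmGetPhysicalAddress(kernelVa);
        return STATUS_SUCCESS;
    }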

Because Windows defines an architecture that specifically provides this feature, and a set of standardized interfaces. Read:

http:

There is, or was, a long Microsoft white paper on this topic (that I wrote) that describes all the function calls used in both WDM and WDF as well…

Peter
OSR
@OSRDrivers

Yes, software and abstraction are always nice to have, but you should have at
least a minimal clue about what you are doing and what the hardware is all
about.

Thanks for the information and the DMA article. I had a look at it and am summarizing my findings below. Please correct me if I am wrong.

  1. If the device supports scatter/gather DMA and is a bus master, the device bus logical address = physical address. In this case, using the physical address of the physically contiguous buffer, obtained from MmGetPhysicalAddress(), for DMA operations should work.

  2. x64-based systems are cache-coherent, and hence there is no need to flush for DMA (KeFlushIoBuffers is an empty macro). My environment is x64.

If the above two hold good, should I be worried about transferring data between the application and the device using the approach I followed with PAGE_WRITECOMBINE?

Yes. CLEARLY this is required.

Yes (your point 1)… IF the Bus Master is natively capable of DMA operations that reach all of memory - which, for a conventional PC architecture, means the Bus Master must be natively capable of 64-bit addressing.

But, of course, this does not relieve you of the architectural requirement to utilize the Windows APIs for your DMA operations… which is still required. You may *not* ever make code assumptions that Physical Address == Device Bus Logical Address. You may *never* call MmGetPhysicalAddress and stuff that address into your DMA adapter, as doing so violates Windows OS architecture rules (and the contract that your driver has with the kernel and the HAL).
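
To make that concrete, a hedged sketch of the sanctioned approach for a shared, device-visible buffer: ask the HAL for the DMA adapter and let AllocateCommonBuffer return the device bus logical address, instead of calling MmGetPhysicalAddress yourself. Pdo and bufferSize are placeholders and error handling is omitted:

    #include <wdm.h>

    PVOID AllocateSharedDeviceBuffer(PDEVICE_OBJECT Pdo,
                                     ULONG bufferSize,
                                     PPHYSICAL_ADDRESS logical)
    {
        DEVICE_DESCRIPTION dd;
        PDMA_ADAPTER adapter;
        ULONG mapRegisters = 0;

        RtlZeroMemory(&dd, sizeof(dd));
        dd.Version           = DEVICE_DESCRIPTION_VERSION;
        dd.Master            = TRUE;          /* bus-master PCIe device */
        dd.ScatterGather     = TRUE;
        dd.Dma64BitAddresses = TRUE;
        dd.InterfaceType     = PCIBus;
        dd.MaximumLength     = bufferSize;

        adapter = IoGetDmaAdapter(Pdo, &dd, &mapRegisters);

        /* AllocateCommonBuffer returns both a kernel VA and the logical
           address the device should actually be programmed with. */
        return adapter->DmaOperations->AllocateCommonBuffer(adapter,
                                                            bufferSize,
                                                            logical,
                                                            TRUE /* CacheEnabled */);
    }

The kernel VA returned here can then be described by an MDL and mapped to user mode just as in the original post, while the returned logical address (never MmGetPhysicalAddress) is what gets written into the device's registers.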

Yes (your point 2)… but this does not relieve you of the architectural requirement to call KeFlushIoBuffers, which is still required. The fact that this does or does not generate an actual flush is a matter of implementation and not of architecture.

I wouldn’t use PAGE_WRITECOMBINE. The only place I’ve ever heard of using WRITECOMBINE is on a video frame buffer. So I, personally, wouldn’t be comfortable using WRITECOMBINE for another sort of DMA mapping… and CERTAINLY not one that was shared with user-mode. But, that’s just me… Perhaps you’re smarter than I am and understand all the ramifications of using WRITECOMBINE.

Peter
OSR
@OSRDrivers

…which does not necessarily hold true for a system with an IOMMU.

In general, these days you have to stop making any assumptions about the target
architecture - NT once again supports more than just x86 and x86_64, and even the above-mentioned “conventional” architectures may still have “not-so-standard” features like an IOMMU…

Anton Bassov

>…which does not necessarily hold true for a system with an IOMMU

Outstanding point. And absolutely correct.

(Holy SHIT! Mr. Bassov just made a post that “added value” to a thread! Mr. Bassov… you may take the weekend off, in celebration of this great event!)

Peter
OSR
@OSRDrivers

xxxxx@gmail.com wrote:

> So, with MmNonCached in both MmAllocateContiguousNodeMemory & MmMapLockedPagesSpecifyCache, is the huge latency (34us for 1024 bytes) I am seeing expected?

Yes, absolutely.  Without caching, every dword you write to that region
blocks until the write gets out to RAM.  With caching, your writes go to
the cache, which takes one cycle, and then your app can move on.  The
cache can take its time flushing to RAM later in big block writes,
instead of blocking one dword at a time.
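
For a rough illustration, a user-mode measurement of the kind described in this thread might look like the following (a hedged sketch: dmaVa stands for the mapped user-space address returned by the driver, and this is not the poster's actual test code):

    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    /* Time one 1024-byte copy from an ordinary local buffer into the mapped
       DMA buffer. Cached vs. non-cached vs. write-combined mappings show up
       directly in this number. */
    void TimeCopy(void *dmaVa)
    {
        char src[1024];
        LARGE_INTEGER freq, t0, t1;

        memset(src, 0xAB, sizeof(src));

        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&t0);
        memcpy(dmaVa, src, sizeof(src));
        QueryPerformanceCounter(&t1);

        printf("copied %zu bytes in %.2f us\n", sizeof(src),
               (double)(t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart);
    }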

It’s easy to underestimate the enormous effect caching has on system
performance.  About 10 years ago, I did some work for a patent attorney
who wanted to know if it was practical to run Windows without caching. 
When I turned off caching on all of RAM, it took 9 minutes for Vista to
boot.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Are you still desperate to port my account to the new (and, scary, at least for me) platform
in a Troll Mode???

Anton Bassov

> Are you still desperate to port my account…

One post is not quite enough to earn you redemption, Mr Bassov.

Peter
OSR
@OSRDrivers

> One post is not quite enough to earn you redemption, Mr Bassov.

Oh, come on - I have mainly behaved recently, although still with some VERY infrequent trolling attempts (like the one on the C++ thread)…

Anton Bassov

> In general, these days you have to stop making any assumptions about the
> target architecture - NT once again supports more than just x86 and x86_64…

+1

I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2). It’s taking a while to discover and unlearn the sometimes subtle x64 assumptions I unconsciously make.

The typical server I work on now has 2 sockets, 64 cores, and 256 logical processors (4-way hyperthreading).

Jan

> It’s taking a while to discover and unlearn the sometimes subtle x64
> assumptions I unconsciously make.

The very first example that comes to my mind is an atomic operation like increment, add, exchange, or any other atomic operation whose functionality goes beyond bit-test-and-set…

On x86, performing such an operation may be used as an optimisation in some cases, because it is implemented in hardware. However, other architectures may not necessarily provide hardware support for anything atomic apart from a simple bit-test-and-set.

What do you think using a function like, say, atomic_increment() looks like on such an arch?
Although such an atomic function may be available on an architecture that does not support atomic addition in hardware, it is going to be implemented in the arch-specific OS layer as a plain addition guarded by a spinlock that is itself built around the atomic bit-test-and-set (a sketch of this emulation follows below). Although its semantics are exactly the same as on x86, using it may, in actuality, result in a performance penalty rather than the optimisation it offers on x86…

Anton Bassov
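
For what it’s worth, a minimal sketch of the emulation described above: an "atomic" increment built from nothing but a test-and-set primitive (generic C11; the names are hypothetical and this is not any particular kernel's implementation):

    #include <stdatomic.h>

    /* The only hardware primitive assumed here is test-and-set (atomic_flag).
       The "atomic" increment is a plain addition guarded by a spinlock that
       is built on top of it. */
    static atomic_flag g_lock = ATOMIC_FLAG_INIT;

    long emulated_atomic_increment(long *value)
    {
        long result;

        /* Spin until we own the lock; test-and-set returns the old value. */
        while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
            /* busy-wait */
        }

        result = ++*value;              /* plain, non-atomic increment */

        atomic_flag_clear_explicit(&g_lock, memory_order_release);
        return result;
    }

Semantically this behaves exactly like a hardware atomic increment, but every caller now pays for a lock acquisition, which is the performance penalty being pointed at.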

Jan Bottorff wrote:

> I’m working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)

Great. Could you please check whether these creatures have PAGE_WRITECOMBINE at all, or is it specific to the Intel arch?

Regards,
– pa

It’s intended for Intel’s streaming (SIMD) load/store instructions. To be
precise, a non-temporal store requires SSE2 support, and a non-temporal load
requires the SSE4.1 extension.
In the beginning, the main goal of the write-combine cache type for MMIO was
to speed up the execution of store instructions. With the introduction of
SSE4.1 (loads “using a non-temporal memory hint”) it became possible for load
instructions as well.
I’m using those instructions for the so-called non-DMA (streaming) approach
to data transmission (a minimal streaming-copy sketch follows below).
Certainly, it is slower than normal HW DMA, but it is very useful when your
HW has no DMA controller.
Nowadays, this technique can move up to 512 bits of data per instruction
(128/256/512 bits), provided the CPU supports the AVX and AVX-512 extensions
respectively.

Regards,
Dzmitry
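
A minimal sketch of such a streaming copy into a write-combined mapping, assuming SSE2 support and 16-byte-aligned buffers (the function and variable names are placeholders, not from any particular driver):

    #include <emmintrin.h>   /* SSE2: _mm_stream_si128 */
    #include <stddef.h>

    /* Non-temporal (streaming) copy: the stores bypass the cache and are
       merged in the CPU's write-combining buffers. 'dst' and 'src' must be
       16-byte aligned and 'bytes' must be a multiple of 16. */
    static void StreamCopy(void *dst, const void *src, size_t bytes)
    {
        __m128i       *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;
        size_t i;

        for (i = 0; i < bytes / 16; i++) {
            _mm_stream_si128(&d[i], _mm_load_si128(&s[i]));
        }

        _mm_sfence();   /* make the combined writes globally visible */
    }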

xxxxx@gmail.com wrote:

> It’s intended for Intel’s streaming (SIMD) load/store instructions. To be
> precise, a non-temporal store requires SSE2 support, and a non-temporal
> load requires the SSE4.1 extension.

Write combining started way before that.  It was originally designed as
a way to speed up graphics operations, by allowing bitmap writes to the
frame buffer to exploit bus bursting, but without turning on full
caching.  Without WC, each write becomes a complete bus cycle.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.