I need to allocate shared contiguous memory (for DMA) on a particular NUMA node to transfer data directly between a user-space application and a PCIe device. I have implemented the following in my kernel driver for buffer allocation; it is called in the user application's context:
1) Called MmAllocateContiguousNodeMemory with the "PAGE_READWRITE | PAGE_NOCACHE" protection flags.
2) Allocated an MDL with IoAllocateMdl for the address returned from step 1, and built it with MmBuildMdlForNonPagedPool.
3) Called MmMapLockedPagesSpecifyCache with UserMode and MmNonCached to get a user-space virtual address for the allocated memory.
4) Called MmGetPhysicalAddress to get the starting physical address of the allocated memory.
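For reference, the allocation path above looks roughly like this (a sketch only: error handling is omitted, and BUFFER_SIZE, PreferredNode, and the address bounds are placeholders, not values from my actual driver):

```c
PHYSICAL_ADDRESS low  = { .QuadPart = 0 };
PHYSICAL_ADDRESS high = { .QuadPart = ~0ULL };
PHYSICAL_ADDRESS skip = { .QuadPart = 0 };

/* Step 1: contiguous memory on the desired NUMA node */
PVOID kva = MmAllocateContiguousNodeMemory(
    BUFFER_SIZE, low, high, skip,
    PAGE_READWRITE | PAGE_NOCACHE,   /* later changed to PAGE_WRITECOMBINE */
    PreferredNode);

/* Step 2: describe the buffer with an MDL */
PMDL mdl = IoAllocateMdl(kva, BUFFER_SIZE, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(mdl);

/* Step 3: map into the calling process (we are in the app's context) */
PVOID uva = MmMapLockedPagesSpecifyCache(
    mdl, UserMode, MmNonCached, NULL, FALSE, NormalPagePriority);

/* Step 4: physical address to program into the device's DMA engine */
PHYSICAL_ADDRESS phys = MmGetPhysicalAddress(kva);
```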
With the above, I am able to transfer data between the application and the device.
The issue is that when the application copies data into this DMA memory, it takes a long time. For example, copying 1024 bytes from a local malloc() buffer into this DMA memory takes 34 µs.
The behaviour was similar across different test systems. I then reviewed my implementation and changed "PAGE_NOCACHE" to "PAGE_WRITECOMBINE" in step 1. Now copying 1024 bytes takes just 0.03 µs.
In the documentation for the MEMORY_CACHING_TYPE enumeration it is mentioned that:
MmNonCached - The requested memory should not be cached by the processor.
MmWriteCombined - The requested memory should not be cached by the processor, "but writes to the memory can be combined by the processor".
Is using "PAGE_WRITECOMBINE" the right way to allocate DMA memory for app-to-device transfers?
Also, with MmMapLockedPagesSpecifyCache I am passing MmNonCached rather than MmWriteCombined. Is this OK?
Basically, I don't want any data my user application copies into this DMA memory to linger in the processor cache; in that case, my device would see incorrect (stale) data.