DMA Memory - MmAllocateContiguousNodeMemory

Hi

I need to allocate shared contiguous memory (for DMA) on a particular NUMA node to transfer data directly between a user-space application and a PCIe device. I have implemented the following in the kernel driver for buffer allocation; it is called in the user application's context.

1) MmAllocateContiguousNodeMemory with the "PAGE_READWRITE | PAGE_NOCACHE" flags.
2) Allocated an MDL using IoAllocateMdl with the address returned from step 1, and built the MDL using MmBuildMdlForNonPagedPool.
3) MmMapLockedPagesSpecifyCache with the UserMode and MmNonCached flags to get the user-space virtual address of the allocated memory.
4) MmGetPhysicalAddress to get the starting physical address of the allocated memory.

With the above, I am able to transfer data between the application and the device.

The issue is that when the application copies data to this DMA memory, it takes a lot of time. For example, copying 1024 bytes from local malloc() memory to this DMA memory takes 34us.

The behaviour was much the same on different test systems. Then I reviewed my implementation again and changed the flag "PAGE_NOCACHE" to "PAGE_WRITECOMBINE" in step 1. Now copying 1024 bytes takes just 0.03us.

In https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ne-wdm-_memory_caching_type it is mentioned that,
MmNonCached - The requested memory should not be cached by the processor.
MmWriteCombined - The requested memory should not be cached by the processor, "but writes to the memory can be combined by the processor".

Is using "PAGE_WRITECOMBINE" is right way for DMA memory (from app to device)?
Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead of MmWriteCombined. Is this ok?

Basically, I don't want any data that my user application copies to this DMA memory to be cached; in that case, my device would see incorrect data.

Regards,
MK

Comments

  • Tim_Roberts Posts: 12,567
    On Jul 12, 2018, at 11:11 PM, xxxxx@gmail.com <xxxxx@lists.osr.com> wrote:
    > ...
    > Is using "PAGE_WRITECOMBINE" the right way for DMA memory (from app to device)?
    > Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead of MmWriteCombined. Is this OK?
    >
    > Basically, I don't want any data that my user application copies to this DMA memory to be cached; in that case, my device would see incorrect data.

    No, it won't. In non-exotic Intel-based architectures, DMA transfers are cache-coherent. However, you should look into KeFlushIoBuffers and FlushAdapterBuffers so your driver works in other architectures.
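    For reference, here is a minimal sketch of where the KeFlushIoBuffers call Tim mentions would sit in the transfer path; the MDL pointer and direction flags below are placeholders, not MK's actual variables:

    // Before handing the buffer to the bus master for a memory-to-device transfer:
    KeFlushIoBuffers(pMdl,
                     FALSE,   // ReadOperation: FALSE = data flows from memory to the device
                     TRUE);   // DmaOperation: TRUE = DMA transfer, not programmed I/O

    // ... program the device, wait for the transfer to complete ...

    // For a device-to-memory transfer the same call is made with
    // ReadOperation = TRUE before the CPU touches the received data.
    // On x86/x64 the macro expands to nothing, but calling it keeps the
    // driver correct on architectures that are not cache-coherent for DMA.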

    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.


  • In my case, the kernel driver only does the memory allocation and hands the buffer to the user application. Data will be transferred directly between the application and the device. Also, the PCIe device is the bus master here and no system DMA is used.

    Doesn't the MmNonCached setting in the MmMapLockedPagesSpecifyCache call disable caching for this reserved memory area?
  • > Doesn't the MmNonCached setting in the MmMapLockedPagesSpecifyCache call
    disable caching for this reserved memory area?

    No, it doesn't. I suspect you are digging in the wrong direction. Instead
    of fighting with the system API, it is better to check how you configure
    your HW DMA controller.
    With respect to the caching type, in your case (bus mastering, HW DMA)
    "MmNonCached" is the correct setting. "MmWriteCombined" is more relevant for
    a streaming (non-DMA) approach to data transmission.

    Regards,
    Dzmitry Altukhou

  • So, with MmNonCached in both MmAllocateContiguousNodeMemory & MmMapLockedPagesSpecifyCache, is the huge latency (34us for 1024 bytes) I am seeing expected?
  • How can you program DMA without any clue about the architecture/busses?
    Please read about PCI/QPI/UPI first to see what these flags are all about and why.
    mp

  • Peter_Viscarola Posts: 6,649
    Because Windows provides an architecture specifically designed to supply this abstraction, along with a set of standardized interfaces. Read:

    http://online.osr.com/article.cfm?article=539

    There is, or was, a long Microsoft white paper on this topic (that I wrote) that describes all the function calls used in both WDM and WDF as well.

    Peter
    OSR
    @OSRDrivers


  • Yes, software and abstraction are always nice to have, but you should have
    at least a minimal clue about what you are doing
    and what the hardware is all about.

  • Thanks for the information and the DMA article. I had a look at it and am summarizing my findings below. Please correct me if I am wrong.

    1) If the device supports scatter/gather DMA and is a bus master, the device bus logical address = physical address. In this case, using the physical address of the physically contiguous buffer obtained from MmGetPhysicalAddress() for DMA operations should work.

    2) x64-based systems are cache coherent, hence there is no need to flush for DMA (KeFlushIoBuffers is an empty macro). My environment is x64.

    If the above two hold good, should I be worried about transferring data between the application and the device using the approach I followed with PAGE_WRITECOMBINE?
  • Peter_Viscarola Posts: 6,649
    <quote>
    but you should have at least a minimal clue about what you are doing
    and what the hardware is all about
    </quote>

    Yes. CLEARLY this is required.

    <quote>
    If the device supports scatter/gather DMA and is a bus master, the device bus
    logical address = physical address
    </quote>

    Yes... IF the Bus Master is natively capable of DMA operations that reach all of memory. So, for a conventional PC architecture, the Bus Master is natively capable of 64-bit addressing.

    But, of course, this does not relieve you of the architectural requirement to utilize the Windows APIs for your DMA operations... which is still required. You may *not* ever make code assumptions that Physical Address == Device Bus Logical Address. You may *never* call MmGetPhysicalAddress and stuff that address into your DMA adapter, as doing so violates Windows OS architecture rules (and the contract that your driver has with the kernel and the HAL).
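    For illustration only, here is a hedged WDM-style sketch of the sanctioned path: ask the HAL for a DMA adapter and let it return both the common buffer and the device bus logical address. PhysicalDeviceObject, the 32 MB length, and the DEVICE_DESCRIPTION settings are placeholders/assumptions, not MK's actual configuration:

    DEVICE_DESCRIPTION devDesc;
    ULONG nMapRegisters;
    PDMA_ADAPTER dmaAdapter;
    PHYSICAL_ADDRESS deviceLogicalAddress;   // this is what gets programmed into the device
    PVOID commonBufferVa;

    RtlZeroMemory(&devDesc, sizeof(devDesc));
    devDesc.Version = DEVICE_DESCRIPTION_VERSION;
    devDesc.Master = TRUE;                   // bus-master device
    devDesc.ScatterGather = TRUE;
    devDesc.Dma64BitAddresses = TRUE;        // assumption: the device can address 64 bits
    devDesc.InterfaceType = PCIBus;
    devDesc.MaximumLength = 32 * 1024 * 1024;

    dmaAdapter = IoGetDmaAdapter(PhysicalDeviceObject, &devDesc, &nMapRegisters);

    commonBufferVa = dmaAdapter->DmaOperations->AllocateCommonBuffer(
                         dmaAdapter,
                         32 * 1024 * 1024,
                         &deviceLogicalAddress,  // use this, not MmGetPhysicalAddress
                         TRUE);                  // cached; x64 DMA is cache-coherent

    In KMDF the same contract is expressed through WdfDmaEnablerCreate and WdfCommonBufferCreate.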

    <quote>
    x64 based systems are cache coherent and hence no need to flush for DMA
    </quote>

    Yes... but this does not relieve you of the architectural requirement to call KeFlushIoBuffers... which is still required. The fact that this does or does not generate an actual flush is a matter of implementation and not of architecture.

    <quote>
    should I be worried about transferring data between
    application and device using the approach I followed with PAGE_WRITECOMBINE
    </quote>

    I wouldn't use PAGE_WRITECOMBINE. The only place I've ever heard of using WRITECOMBINE is on a video frame buffer. So I, personally, wouldn't be comfortable using WRITECOMBINE for another sort of DMA mapping... and CERTAINLY not one that was shared with user-mode. But, that's just me... Perhaps you're smarter than I am and understand all the ramifications of using WRITECOMBINE.

    Peter
    OSR
    @OSRDrivers


  • anton_bassov Posts: 4,777
    <quote>

    If the device supports scatter/gather DMA and is a bus master, the device bus
    logical address = physical address

    </quote>


    ...which does not necessarily hold true for a system with IOMMU

    In general, these days you have, once again, to stop making any assumptions about the target
    architecture - NT once again supports more than just x86 and x86_64, and even the above-mentioned "conventional" architectures may still have "not-so-standard" features like an IOMMU....



    Anton Bassov
  • Peter_Viscarola Posts: 6,649
    >...which does not necessarily hold true for a system with IOMMU

    Outstanding point. And absolutely correct.

    (Holy SHIT! Mr. Bassov just made a post that "added value" to a thread! Mr. Bassov... you may take the weekend off, in celebration of this great event!)

    Peter
    OSR
    @OSRDrivers


  • Tim_Roberts Posts: 12,567
    xxxxx@gmail.com wrote:
    > So, with MmNonCached in both MmAllocateContiguousNodeMemory & MmMapLockedPagesSpecifyCache, is the huge latency (34us for 1024 bytes) I am seeing expected?

    Yes, absolutely.  Without caching, every dword you write to that region
    blocks until the write gets out to RAM.  With caching, your writes go to
    the cache, which takes one cycle, and then your app can move on.  The
    cache can take its time flushing to RAM later in big block writes,
    instead of blocking one dword at a time.

    It's easy to underestimate the enormous effect caching has on system
    performance.  About 10 years ago, I did some work for a patent attorney
    who wanted to know if it was practical to run Windows without caching. 
    When I turned off caching on all of RAM, it took 9 minutes for Vista to
    boot.

    --
    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.


  • anton_bassov Posts: 4,777
    <quote>

    (Holy SHIT! Mr. Bassov just made a post that "added value" to a thread! Mr. Bassov... you may take the weekend off, in celebration of this great event!)


    </quote>


    Are you still desperate to port my account to the new (and, at least for me, scary) platform
    in Troll Mode???


    Anton Bassov
  • Peter_Viscarola Posts: 6,649
    > Are you still desperate to port my account...

    One post is not quite enough to earn you redemption, Mr Bassov.

    Peter
    OSR
    @OSRDrivers


  • anton_bassov Posts: 4,777
    > One post is not quite enough to earn you redemption, Mr Bassov.

    Oh, come on - I have mainly behaved recently, albeit still with some VERY infrequent trolling attempts (like the one on the C++ thread).....

    Anton Bassov
  • Jan_Bottorff Posts: 464
    > In general, these days you have, once again, to stop making any assumptions about the target
    > architecture - NT once again supports more than just x86 and x86_64

    +1

    I'm working on Windows ARM64 servers now (Cavium/Marvell ThunderX2). It's taking a while to discover and unlearn the sometimes subtle x64 assumptions I unconsciously make.

    The typical server I work on now has 2 sockets, 64 cores, and 256 logical processors (4 way hyperthreading).

    Jan
  • anton_bassov Posts: 4,777
    > It's taking a while to discover and unlearn the sometimes subtle x64 assumptions
    > I unconsciously make.



    The very first example that comes into my head is an atomic operation like an increment, add, exchange, or any other atomic operation whose functionality goes beyond test-and-set.....


    On x86, performing such an operation may be used as an optimisation in some cases, because it is implemented in hardware. However, some other architectures may not provide hardware support for anything atomic apart from a simple test-and-set.



    What do you think using a function like, say, atomic_increment() may be like on such an arch?
    Although such an atomic function may be available on an architecture that does not support atomic addition in hardware, it is going to be implemented in the arch-specific OS layer as a plain addition, guarded by a spinlock built around the atomic test-and-set. Although its semantics are exactly the same as on x86, using it may in actuality result in a performance penalty rather than the optimisation it offers on x86.....
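    A minimal user-mode C11 sketch of the fallback Anton describes; the function and variable names here are hypothetical, not any OS's actual implementation:

    #include <stdatomic.h>

    /* "Atomic" increment emulated with a spinlock built on a test-and-set
     * primitive, for an architecture whose only atomic instruction is
     * test-and-set. */
    static atomic_flag g_lock = ATOMIC_FLAG_INIT;

    long emulated_atomic_increment(volatile long *value)
    {
        long result;

        /* Spin on the test-and-set bit until we own the lock. */
        while (atomic_flag_test_and_set_explicit(&g_lock, memory_order_acquire)) {
            /* busy-wait */
        }

        result = ++(*value);   /* plain add, now protected by the lock */

        atomic_flag_clear_explicit(&g_lock, memory_order_release);
        return result;
    }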


    Anton Bassov
  • Pavel_A Posts: 2,640
    Jan Bottorff wrote
    > I'm working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)

    Great. Could you check, please, whether these creatures have PAGE_WRITECOMBINE at all, or is it specific to the Intel arch?

    Regards,
    -- pa
  • It's intended for Intel's streaming (SIMD) load/store instructions. To be
    precise, a store requires SSE2 support and a load, in turn, requires the
    SSE4.1 extension.
    In the beginning, the main goal of the write-combine cache type for MMIO
    was speeding up the execution of store instructions. With the introduction of
    SSE4.1 ("using a non-temporal memory hint") it became possible for load
    instructions as well.
    I'm using those instructions for the so-called non-DMA approach to data
    transmission. Certainly it is slower than normal HW DMA, but it is very useful in
    case your HW has no DMA controller.
    Nowadays this technique is able to transfer up to 512 bits
    of data per instruction (128/256/512), provided the CPU supports
    the AVX and AVX-512 extensions respectively.
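    A minimal user-mode sketch of the streaming copy described above (buffer names are placeholders); MOVNTDQ via _mm_stream_si128 needs SSE2, MOVNTDQA via _mm_stream_load_si128 needs SSE4.1:

    #include <stddef.h>
    #include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

    void wc_copy_16byte_chunks(void *dst, const void *src, size_t bytes)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;

        for (size_t i = 0; i < bytes / 16; i++) {
            /* Non-temporal 16-byte load from WC-mapped memory (SSE4.1). */
            __m128i v = _mm_stream_load_si128((__m128i *)&s[i]);
            /* Non-temporal store that the write-combining buffer can merge. */
            _mm_stream_si128(&d[i], v);
        }
        _mm_sfence();   /* make the combined writes globally visible */
    }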

    Regards,
    Dzmitry

  • Tim_Roberts Posts: 12,567
    xxxxx@gmail.com wrote:
    >
    > It's intended for Intel's streaming (SIMD) load/store
    > instructions. To be precise, a store requires SSE2 support and a load,
    > in turn, requires the SSE4.1 extension.

    Write combining started way before that.  It was originally designed as
    a way to speed up graphics operations, by allowing bitmap writes to the
    frame buffer to exploit bus bursting, but without turning on full
    caching.  Without WC, each write becomes a complete bus cycle.

    --
    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.


  • M_M-2 Posts: 833
    I think the concept of write combining is a few decades older than video hardware.



  • Jan_Bottorff Posts: 464
    The hardware does have something similar to write combining; it's called gathering: https://developer.arm.com/products/architecture/a-profile/docs/100941/latest/memory-types.

    I could not say for sure whether the proper bits get set to enable this if you use the PAGE_WRITECOMBINE flag on
    MmMapIoSpaceEx. If not, it seems like a bug. Write combining, I know, makes a significant difference in performance on a simple video frame buffer.
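    For concreteness, a hedged sketch of the mapping call in question (Windows 10 and later); the BAR address and length are placeholders:

    // Map device (BAR) memory with the write-combined attribute. Whether this
    // also sets the equivalent Gathering attribute on ARM64 is exactly the
    // open question above.
    PVOID wcMapping = MmMapIoSpaceEx(BarPhysicalAddress,
                                     BarLength,
                                     PAGE_READWRITE | PAGE_WRITECOMBINE);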

    Jan


  • I have come across the (quite old) Write Combining Usage Guidelines at http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.

    "WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor?s bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses"

    Hence, I can't use PAGE_WRITECOMBINE for system memory used for DMA. I would rather use PAGE_NOCACHE.

    I had an interesting observation, as below.

    If I do a memcpy of 1024 bytes to the DMA memory, it takes 33us. But if I do a memcpy of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

    So, for now I am planning to go with the 16-byte chunk copy and the PAGE_NOCACHE flag.
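    For reference, a minimal user-mode sketch of the chunked copy described above; the buffer pointers are placeholders:

    #include <stddef.h>
    #include <string.h>

    /* Copy 1024 bytes into the uncached mapping 16 bytes at a time. */
    void chunked_copy_to_dma(unsigned char *dmaVa, const unsigned char *src)
    {
        for (size_t offset = 0; offset < 1024; offset += 16) {
            memcpy(dmaVa + offset, src + offset, 16);
        }
    }

    (A fixed-size 16-byte memcpy is typically compiled into a single 16-byte store, which is presumably why it behaves so differently from one large memcpy into uncached memory.)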
  • Tim_Roberts Posts: 12,567
    xxxxx@gmail.com wrote:
    > I have come across the (quite old) Write Combining Usage Guidelines at http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following.
    >
    > "WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor's bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses"
    >
    > Hence, I can't use PAGE_WRITECOMBINE for system memory used for DMA. I would rather use PAGE_NOCACHE.

    How on earth did you come to that conclusion?  The original Pentium had
    three caching options for each region, from least to most performant:
    uncached, write-combined, fully cached.  This is what the MTRR tables
    specify.  In virtually every case of DMA in an x86 or x64 architecture,
    you want fully cached.


    > I had an interesting observation, as below.
    >
    > If I do a memcpy of 1024 bytes to the DMA memory, it takes 33us. But if I do a memcpy of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined memory, it takes 0.04us and 0.03us respectively.

    I'd like to see your code, because I don't believe your first two numbers.

    --
    Tim Roberts, xxxxx@probo.com
    Providenza & Boekelheide, Inc.


  • Contiguous memory is allocated using the following code snippet. The target system is x64 (ASUS Z170 Deluxe).

    PHYSICAL_ADDRESS LowestAcceptableAddress = { 0 };
    PHYSICAL_ADDRESS HighestAcceptableAddress = { 0 };
    PHYSICAL_ADDRESS BoundaryAddressMultiple = { 0 };

    HighestAcceptableAddress.QuadPart = 0xFFFFFFFFFFFFFFFF;  // accept any physical address
    Protect   = PAGE_READWRITE | PAGE_NOCACHE;               // PAGE_WRITECOMBINE in the later experiment
    ChunkSize = 32 * 1024 * 1024;                            // 32 MB
    NumaNode  = 0;

    // Physically contiguous, node-local allocation (can return NULL).
    SystemVA = MmAllocateContiguousNodeMemory(ChunkSize,
                                              LowestAcceptableAddress,
                                              HighestAcceptableAddress,
                                              BoundaryAddressMultiple,
                                              Protect,
                                              NumaNode);

    // Describe the buffer with an MDL so it can be mapped into the caller's process.
    pMdl = IoAllocateMdl(SystemVA, ChunkSize, FALSE, FALSE, NULL);
    MmBuildMdlForNonPagedPool(pMdl);

    // Map into user mode; a UserMode mapping can raise an exception, so this
    // should be wrapped in __try/__except in production code.
    UserVA = (((ULONG_PTR)PAGE_ALIGN(MmMapLockedPagesSpecifyCache(pMdl,
                                                                  UserMode,
                                                                  MmNonCached,
                                                                  NULL,
                                                                  FALSE,
                                                                  HighPagePriority)))
              + MmGetMdlByteOffset(pMdl));

    // Starting physical address handed to the device as the DMA target.
    MappedPhyAddr = MmGetPhysicalAddress(SystemVA);

    The 64-bit application pseudocode is as follows.
    1) Get the UserVA from the driver through an IOCTL call.
    2) Allocate a 32 MB temporary buffer and initialize it with zeros.
    3) memcpy to the mapped buffer in the mentioned chunk sizes (16-byte or 1024-byte) for the whole 32 MB.

    Profiling is done using QueryPerformanceCounter() wrapped around memcpy().

    As you mentioned, since the system is x64, I would definitely want to take advantage of the cache if the system is cache coherent for DMA operations. I will have to test the system for this.

    The contiguous buffer I allocated is used by both the user-space application and the device for reads/writes simultaneously, but at any point in time only one of them accesses a given memory range in the buffer.
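    For completeness, a minimal user-mode sketch of the profiling described above (variable names are assumed), with QueryPerformanceCounter wrapped around the memcpy into the mapped buffer:

    #include <windows.h>
    #include <string.h>

    // Returns the elapsed time of one copy into the mapped DMA buffer, in microseconds.
    double TimeCopyMicroseconds(void *userVa, const void *src, size_t bytes)
    {
        LARGE_INTEGER freq, start, end;

        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        memcpy(userVa, src, bytes);   // copy into the user-mode mapping of the DMA buffer

        QueryPerformanceCounter(&end);
        return (double)(end.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
    }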