Kernel DMA buffer copy to user buffer too slow

Hi, MSR:

MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS

Thank you very much!
I will try this approach, since the customer uses Windows 7 or a later OS.
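For the archives, a minimal sketch of what that suggestion might look like (assuming Windows 7 or later, where the flag exists; note that per the MmAllocatePagesForMdlEx documentation, SkipBytes is reinterpreted as the contiguous chunk size when MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS is set, and the sizes here are illustrative, not from this thread):

// Ask for a 20 MB buffer built out of physically contiguous chunks.
PHYSICAL_ADDRESS low, high, chunkSize;
PMDL mdl;

low.QuadPart       = 0;
high.QuadPart      = -1;                 // no upper physical-address bound
chunkSize.QuadPart = 2 * 1024 * 1024;    // each contiguous chunk: 2 MB

mdl = MmAllocatePagesForMdlEx(low, high, chunkSize,
                              20 * 1024 * 1024,   // total size requested
                              MmCached,
                              MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS);
if (mdl == NULL) {
    // retry with smaller chunks, or fail the request
}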

Hi, All:
Thank you all for your kindly help and advice.
Firstly, I know this is the fault of the hardware design. I have emphasized many times that the hardware engineers should support SGL, but obviously I failed. Anyway, I have to work with the hardware design as it is and try to meet the customer's performance requirement.
Secondly, the security problem caused by mapping physical memory into user space matters to me. I will not implement this in the official version; it is just for test purposes, for performance comparison.
Finally, sincerely speaking, I am still a newbie in Windows kernel programming even though I have spent about 10 years on it. Every time I encounter a problem with Windows driver development, I post a thread asking for advice on OSR, and every time you all help me so much. I remember you all: Tim, Don, Peter, Pavel…
Thank you!!!

Assuming that sharing driver-allocated memory with user mode (rather than the other way around) is needed, why is the approach below an issue?

<<
http://www.osronline.com/article.cfm?article=39

… Allocating pages from main memory [1] is inherently more secure than using paged or non-paged pool [2], which is never a good idea.
>>

<<
Don’t allocate the space from non-paged pool. Please. Allocate the space with
AllocatePagesForMdl or something similar.
>>

Sorry if this was explained already or is obvious, but why is that approach (ExAllocatePool(NonPaged) + MmBuildMdlForNonPagedPool()) not recommended? Is it because it:
- uses the scarce non-paged blocks? But the alternatives consume the same non-paged blocks. OR
- consumes extra map buffers? I am not sure that is the case either, since MmGetSystemAddressForMdlSafe() is a no-op in this case (and the eventual MmMapLockedPagesSpecifyCache() will have AccessMode = UserMode).

I am not sure what security issue the above refers to; won't both ways have the same security issues?

And from MSDN, on the Mdl parameter of MmMapLockedPagesSpecifyCache():
"A pointer to the MDL that is to be mapped. This MDL must describe physical pages that are locked down. A locked-down MDL can be built by the MmProbeAndLockPages or MmAllocatePagesForMdlEx routine. *** For mappings to user space, MDLs that are built by the MmBuildMdlForNonPagedPool routine can be used. ***"

The only avenue that you can pursue from here is concurrency. This will be highly dependent on the design of the UM application, but if you need the CPU to copy, then try to have many CPUs copy smaller chunks in parallel.
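A user-mode sketch of that suggestion (purely illustrative; ParallelCopy and CopyWorker are names of my choosing, and Bob's bandwidth caveat below applies):

#include <windows.h>
#include <string.h>

typedef struct { const BYTE *src; BYTE *dst; SIZE_T len; } COPY_CHUNK;

static DWORD WINAPI CopyWorker(LPVOID param)
{
    COPY_CHUNK *c = (COPY_CHUNK *)param;
    memcpy(c->dst, c->src, c->len);
    return 0;
}

// Split one large copy into nThreads chunks copied in parallel.
// nThreads must be <= MAXIMUM_WAIT_OBJECTS (64); error handling omitted.
void ParallelCopy(BYTE *dst, const BYTE *src, SIZE_T len, DWORD nThreads)
{
    HANDLE     threads[MAXIMUM_WAIT_OBJECTS];
    COPY_CHUNK chunks[MAXIMUM_WAIT_OBJECTS];
    SIZE_T     chunk = len / nThreads;
    DWORD      i;

    for (i = 0; i < nThreads; i++) {
        chunks[i].src = src + i * chunk;
        chunks[i].dst = dst + i * chunk;
        chunks[i].len = (i == nThreads - 1) ? (len - i * chunk) : chunk;
        threads[i] = CreateThread(NULL, 0, CopyWorker, &chunks[i], 0, NULL);
    }
    WaitForMultipleObjects(nThreads, threads, TRUE, INFINITE);
    for (i = 0; i < nThreads; i++) {
        CloseHandle(threads[i]);
    }
}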


Marion Bond said:

> The only avenue that you can pursue from here is concurrency. This will be highly dependent on the design of the UM application, but if you need the CPU to copy, then try to have many CPUs copy smaller chunks in parallel.

I would expect less than a stellar improvement by doing this. Wouldn’t main memory bandwidth be the limiting factor? If so, multiple CPUs would not help that much.

* Bob

The OP doesn’t have a lot of choices. It really depends on what his application needs to do with the data, and the parallelism needs to be driven from UM.


Because non-paged pool is commonly used for storage of lots of secure kernel “stuff” and the memory isn’t cleared before allocation, the risk of an information disclosure vulnerability is greater, and needless. Also, if you get the whole cleanup process wrong (or just don’t do it), you wind up with a user-mode process that has a mapping into blocks of non-paged pool that have been freed and are subject to subsequent use for secure kernel “stuff”… again risking an information disclosure vulnerability.

Peter
OSR
@OSRDrivers

The kernel has a virtual address space, which is divided up in a number of ways. One of those ways is the pool (paged & non-paged), which is intended for dynamic allocation by the kernel and by drivers. This is very similar to how your process has an address space, some of which is allocated to Heaps (e.g. the Win32 Heap & the CRT heap) for dynamic allocations.

Kernel virtual address space is a shared resource and it can run low. Less so with 64-bit machines, but still you have a bunch of kernel components fighting over it along with drivers. It also has a cost, because it requires MM to find free page-table entries, which might require allocating page tables, and then assign them to your memory. It’s better to avoid taking it up if you can.

Fortunately, in the kernel you have the option to allocate physical pages without having them mapped into kernel virtual address space. That’s what MmAllocatePagesForMdl() does: it finds pages on the free list that meet your criteria, locks them, and then gives you back a list of them (in the MDL), but doesn’t map them into KVA. That leaves you free to decide how to use them - you can DMA into them, or map just a portion into the kernel, or the whole thing into the kernel, or map them into user mode, etc.

-p
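A minimal sketch of the sequence Peter describes, assuming the goal is to hand the pages to user mode (the function and variable names are mine; error handling abbreviated):

#include <ntddk.h>

NTSTATUS
MyAllocAndMapToUser(
    _In_ SIZE_T Length,
    _Out_ PMDL *OutMdl,
    _Out_ PVOID *OutUserVa
    )
{
    PHYSICAL_ADDRESS low, high, skip;
    PMDL mdl;
    PVOID userVa = NULL;

    low.QuadPart  = 0;
    high.QuadPart = -1;   // no upper bound on physical address
    skip.QuadPart = 0;

    // Pages come back zeroed, locked, and NOT mapped into kernel VA.
    mdl = MmAllocatePagesForMdlEx(low, high, skip, Length, MmCached,
                                  MM_ALLOCATE_FULLY_REQUIRED);
    if (mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // Map into the current (calling) process. The call can raise an
    // exception, and BugCheckOnFailure must be FALSE for user-mode mappings.
    __try {
        userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                              NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        userVa = NULL;
    }

    if (userVa == NULL) {
        MmFreePagesFromMdl(mdl);
        ExFreePool(mdl);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    *OutMdl = mdl;
    *OutUserVa = userVa;
    return STATUS_SUCCESS;
}

The mapping is only valid in the context of the process that requested it, and it must be undone with MmUnmapLockedPages from that same context before the pages are freed.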


OK, so MmAllocatePagesForMdlEx() will by default not consume KVA, and is sufficient for the case when only DMA is needed.

I need to check, though, what the values of the MDL fields below will be, particularly the second one:
PVOID MappedSystemVa;
PVOID StartVa;
Isn’t a KVA required by ntoskrnl.exe (if not by the driver.sys) in some form or other when it is time to free the memory? Of course, KVA (and PTEs) are wasted if a faulty driver.sys repeatedly calls MmMapLockedPagesSpecifyCache(AccessMode = KernelMode), in which case we end up with multiple KVAs pointing to the same PTE/PFN. Not sure when/why any driver would do that.

But it looks like either method will have the same (and only) info disclosure issue (ignoring the unnecessary KVA consumed by ExAllocatePool() if not really needed by the driver) unless the items below are taken care of explicitly:

  • the driver maps to user mode without MdlMappingNoWrite, in which case the user can do the exact same damage in both cases (see the read-only mapping sketch after this list);
  • the user of ExAllocatePool/MmBuildMdlForNonPagedPool() doesn’t zero-init the memory (and the cost of an explicit zero-init is the same/negligible as it was for the zero-filled blocks returned by MmAllocatePagesForMdlEx(); surely somebody, somewhere, zero-initialized those pages beforehand).
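For reference, the read-only variant of the user mapping would look something like this (a sketch; MdlMappingNoWrite is ORed into the Priority argument per the MmMapLockedPagesSpecifyCache documentation; check the flag's availability on your target OS):

// Map the locked pages read-only into the caller's address space so
// user mode cannot scribble on the shared buffer.
__try {
    userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached, NULL,
                                          FALSE,
                                          NormalPagePriority | MdlMappingNoWrite);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    userVa = NULL;    // a user-mode mapping attempt can raise
}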

> Wouldn’t main memory bandwidth be the limiting factor? If so, multiple CPUs would not help that much.

It depends on your definition of what “main memory” (as well as FSB) is…

Although it is quite easy to define it on a “classical” Intel-based (i.e., UMA) system with an FSB and Northbridge, things are not necessarily that easy on AMD-based (as well as “newer” higher-end Intel) NUMA systems, with every CPU core potentially having its own memory controller, and with different bus agents relying upon point-to-point links between one another. On such a system, the operation that MM mentioned may be more efficient if performed by CPU core X rather than Y…

Anton Bassov

The OP is doing a transfer from a physically contiguous kernel buffer to a likely non-contiguous user buffer.

In a NUMA world, it is a fair bet that the contiguous physical buffer is all on the same node. Thus, only CPUs on that node would be able to do a good job with it, leaving us with a situation no better than the UMA case. It can get even worse if the user mode buffer isn’t all on the same node as the kernel mode buffer.

* Bob

Bob Ammerman
xxxxx@ramsystems.biz
716.864.8337

138 Liston St
Buffalo, NY 14223
www.ramsystems.biz


Nope. Well, except for the MDL. The MDL describes the allocated physical pages.

While it’s true that there’s a potential information disclosure vulnerability in all cases if the code is not written correctly, the fact that non-paged pool is THE scratch storage region used by drivers makes it a more likely area for storage of system-wide sensitive information. What are the chances of finding something sensitive in the non-paged pool versus in the (much larger, holding everything) random pages of memory? That’s the main point I’m trying to make.

Peter
OSR
@OSRDrivers

> It really depends on what his application needs to do with the data, and the parallelism needs to be driven from UM

In the general case, a driver is not designed to service a single app. For example, if the device is a PCIe SCSI or SATA controller and the system partition is located on a drive connected to that controller, then virtually all running applications on the system will directly or indirectly send data to or receive data from the driver.

> Secondly, the security problem caused by mapping physical memory to user space is important for me. I will not implement this in the official version; it is just for test purposes, for performance comparison.

Why should your driver do this? When an app requests data from the device, a buffer is provided by the app and the driver fills this buffer with the data transferred by the device. When an app sends data to a device, again a buffer is provided by the app and the driver transfers the data from the buffer to the device.

You can use direct I/O with DMA. When a device object is configured to do direct I/O, the I/O manager prepares an MDL that represents the user buffer. This MDL can be used for DMA, as explained in the following page.

https://msdn.microsoft.com/en-us/library/windows/hardware/ff565374(v=vs.85).aspx

Note that this MDL could be used with MmMapLockedPagesSpecifyCache to obtain a system mapping of the user buffer. You would then be able to use the buffer in an arbitrary context (ISR or DPC routine).
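A sketch of that direct I/O pattern (assuming a WDM read dispatch routine on a device object created with DO_DIRECT_IO; MyDispatchRead is a name of my choosing):

// With direct I/O, the user buffer arrives already probed, locked, and
// described by Irp->MdlAddress.
NTSTATUS MyDispatchRead(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PMDL  mdl = Irp->MdlAddress;
    PVOID systemVa;

    UNREFERENCED_PARAMETER(DeviceObject);

    // Optional system-space alias, usable at raised IRQL (ISR/DPC paths).
    systemVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (systemVa == NULL) {
        Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    // ... hand the MDL to the DMA machinery and complete the IRP when the
    // hardware finishes (queuing and completion omitted from this sketch) ...
    IoMarkIrpPending(Irp);
    return STATUS_PENDING;
}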

When you allocate memory from pool, you use the returned virtual address to refer to the memory (for example, to free it).

When you allocate pages into an MDL, you use the MDL to refer to the memory. MmFreePagesFromMdl will free them.
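Continuing with hypothetical names from the earlier sketches, the tear-down would look something like this (the unmap must run in the context of the process that owns the user mapping):

// Tear-down order matters: unmap first, then free the pages, then the MDL.
if (userVa != NULL) {
    MmUnmapLockedPages(userVa, mdl);   // in the owning process context
}
MmFreePagesFromMdl(mdl);               // releases the physical pages
ExFreePool(mdl);                       // the MDL structure itself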

By default MmAllocatePagesForMdl (and the Ex version) will allocate you zeroed pages that you own. Nothing else in the kernel will write to them (unless you give that other thing the address of your pages), and they won’t contain stale passwords or other secrets. So no disclosure up to user-mode.

The big question, if you’re going to preallocate large physically contiguous data buffers on behalf of your application, is whether your hardware and your app can safely share the buffers. For example, if the buffer contains physical addresses that the hardware will read from or write to, you should not share that into user-mode (since that would allow user-mode to read or write any physical page it wanted). As long as it’s just where the device fetches or dumps its data, it’s reasonable to share it into user mode. Not a best practice, but it might be the only option for your non-SG hardware.

-p


You are missing the point: the OP has a bad HW design he cannot change. If he has a general-purpose device, he is totally sunk anyway.

If he has any chance of doing anything, he must have control over the UM design. If he does, then even all of these problems may yet be mitigated. If he does not have even that, then he is sunk no matter what we might suggest.


> Thus, only CPUs on that node would be able to do a good job with it, leaving us with a situation no better than the UMA case.

Actually, as long as you are able to enforce a strict task - CPU core/memory controller relationship, using a pre-defined CPU core for this task may be much more efficient compared to a UMA system, due to the proximity of the CPU core and its corresponding memory controller. OTOH, this is not exactly what MM was speaking about…

Anton Bassov

Most of this entire thread, my own posts included, is off into the weeds with respect to the OP’s issue.

I guess, going back to first principles, I would ask: “How much faster than 20MB in 16ms does this have to be?” As Mr. Roberts observed eons ago, that’s already pretty fast.

Peter
OSR
@OSRDrivers

Pool also allows you to allocate less than a page at a time. Say, for
example, you allocate a 1K non-paged pool buffer and map it to user mode.
The user now has access to 3K of privileged memory because it has a mapping
to the entire page, not just the logically valid portion.

This is solvable of course (just always allocate in page-size chunks; see the sketch below), but it is another unnecessary thing to worry about when dealing with mapping the memory to user mode.
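If pool memory must be shared anyway, the rounding would look something like this (a sketch; ROUND_TO_PAGES is the WDK macro, and the tag is arbitrary):

// Round the request up to whole pages so a user-mode mapping cannot see
// unrelated pool data sharing the final page.
SIZE_T requested = 1024;                         // what the driver needs
SIZE_T mapLength = ROUND_TO_PAGES(requested);    // what actually gets mapped
PVOID  buffer    = ExAllocatePoolWithTag(NonPagedPoolNx, mapLength, 'QsmD');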

-scott
OSR
@OSRDrivers


I think we can safely assume two things here about the OP’s problem:

  1. There is no answer to the question “how fast does it have to be?” other than “as fast as possible”; and
  2. The problem has nothing to do with how fast a CPU can copy a 20 MB block of memory, but rather with how to achieve application throughput.

OP: you may not know anything about the threading and I/O model used by the UM application. If you do, please tell us. If not, you will need to find out before you can make any improvements to your driver. The most important thing is what this application will do with the data. Does it process a series of blocks of independent data where loss/reordering is irrelevant (like a DNS server)? Does it process a stream of coordinated data where loss/reordering is very important (like a database transaction log)? Does it save this data to disk? Or process it in memory and generate some analytics?

I am making the assumption that you don’t have a general-purpose device here and have a particular UM application in mind. If that is wrong, then please let us know that too.


Hi, All:
Thank you very much for all the kind responses.

@Peter Viscarola:
Yes, the problem in this thread is basically how I can get data out of my driver faster. There are two ways to look at this problem.
One way is to improve the memory copy speed from the driver to user space. I passed on that one because I have no clue how to improve it.
The other way is to eliminate the copy operation entirely, which improves the data transport speed. For that, I want to learn how to map kernel pages into user space, the way the Linux driver samples do. I have implemented this second way in my driver: it can allocate some pages in the driver and map them into user space, and I can use these pages to share data between the driver and the user application. But a new problem is the number of pages. When I try to allocate 20 MB, Windows bluescreens. When I try to allocate 1 MB, the driver always returns failure. Even 512 KB, 256 KB, and 128 KB allocations all fail. I can always allocate one page successfully (I have not tried other small sizes), but 4 KB is not enough for my DMA transfer.