Avoid buffer locking + MDL extraction overhead

Hi all,
I currently have a driver that supports the ReadFile() and WriteFile() APIs to read from and write to PCIe device memory.
Upon a ReadFile() request, in some cases, I perform a WdfRequestRetrieveOutputWdmMdl() call: it locks the user buffer in physical memory and retrieves the associated MDL.
Everything works fine.
But, for optimisation purposes, I would like to avoid the extra overhead of locking the buffer and retrieving the MDL just before starting the DMA.
As the application always reuses the same buffers, I would like a way to lock a buffer and retrieve the associated MDL during an initialization phase.
And, of course, I would like to be backward compatible with an application that would not provide an already known buffer to my driver (in this case, the performance penalty could not be avoided).
Does anybody know a simple way (or some kind of design pattern) to do that?
Thanks a lot.
Vincent

No. There’s no way to do this safely AND be “backward compatible” as you say above. The buffer needs to stay locked during the entirety of the DMA operation (obviously… because if it’s not locked, it could be paged out and that physical memory used for another process’ pages).

If you want to do this in a NON-backward compatible way, just keep one read or write operation (with the buffer you want to DMA) pending. Then, during CLEANUP, complete that pending operation.

Peter
OSR

****
Define “optimization” in this case. Do you know what the “overhead” is?
Have you measured it? Does it have significant impact on the throughput
of your device?

If you don’t have measurements, then you are “optimizing” something that
may not need optimization. Without hard data, optimization is a waste of
time.

A friend once came to me and said “I just spent a week making this
critical subroutine twice as fast, but my program doesn’t run any faster.
I hear you have a performance measurement tool.” I did, and I ran his
program. The subroutine in question used 0.25% of the total time. Before
he started, it probably used 0.5% of the total time. Whoopee. We found
another part of his program that was taking 30% of the time, doing
something actually completely useless.

Optimization in the absence of data is a silly exercise.

What you are asking for can be construed as a Bad Idea, and I don’t know of any
way to make it backward compatible. It is considered a programming error
to use the contents of a buffer while the buffer is in transition (being
actively read or written).

Note that it is impossible to do this as you describe. If you send a
DeviceIoControl down, you can hold it pending for an indefinite time, and
its buffers will still be locked down, but this actually has a serious
negative effect on overall system performance, because those pages are not
available for reuse. So a local optimization might be a global
pessimization. Unless you have hard data to refute me, I suggest you
abandon this idea and just build a regular driver.
joe
****


NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Thank you for your reply Peter and Joseph.
Just for the context, I work on a closed system that is dedicated to an industrial application (soft real time imaging).
In fact, it really makes a difference when I use big buffers for my transfers.
Here are some figures for the Direct I/O call duration (time between user ReadFile() call and kernel EvtIoRead callback):

  1. 64-MiB user buffer created with malloc(): 84 ms
  2. 64-MiB user buffer created with malloc(), but with Peter’s trick (i.e. the buffer is already locked by a pending ReadFile() call): 18 ms
  3. 64-MiB user buffer created with AllocateUserPhysicalPages(): 8 ms

I will use solution 3, even though I don’t understand why it is faster than solution 2: perhaps it involves a smaller number of pages.
Vincent
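
To put the three figures above on a per-page basis, here is a minimal sketch. It assumes 4 KiB pages and a page-aligned buffer; `ASSUMED_PAGE_BYTES`, `page_count` and `us_per_page` are illustrative names, not part of any API.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, the usual small page size on x86. */
#define ASSUMED_PAGE_BYTES 4096ULL

/* Number of pages spanned by a page-aligned buffer of `bytes` bytes. */
unsigned long long page_count(unsigned long long bytes)
{
    return (bytes + ASSUMED_PAGE_BYTES - 1) / ASSUMED_PAGE_BYTES;
}

/* Average call overhead per page, in microseconds. */
double us_per_page(double call_ms, unsigned long long bytes)
{
    unsigned long long pages = page_count(bytes);
    assert(pages > 0);
    return call_ms * 1000.0 / (double)pages;
}
```

Under that assumption a 64-MiB buffer spans 16384 pages, so the measured call durations work out to roughly 5.1 µs per page for case 1, 1.1 µs for case 2 and 0.49 µs for case 3.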

When you use AllocateUserPhysicalPages how do you get those pages to your driver? And then how do you do subsequent ReadFile calls into your driver?

PLEASE don’t tell me you’re passing the UserPfnArray directly to your driver, and your driver is using that for DMA… please?? Cuz if that’s what you’re doing, that’s a VERY dangerous and insecure design.

Peter
OSR

Hi Peter,
Here is the user-side code (without details or error handling):
ULONG_PTR NumberOfPages = Size / PageSize();
if ((Size % PageSize()) != 0) { NumberOfPages++; }
PULONG_PTR UserPfnArray = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, NumberOfPages * sizeof (ULONG_PTR));
AllocateUserPhysicalPages(GetCurrentProcess(), &NumberOfPages, UserPfnArray);
PVOID Address = VirtualAlloc(NULL, Size, MEM_RESERVE | MEM_PHYSICAL, PAGE_READWRITE);
MapUserPhysicalPages(Address, NumberOfPages, UserPfnArray);
Then, I simply perform the ReadFile() using the Address pointer.
No exotic stuff :wink:
Vincent
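
For reference, the same sequence with the error handling filled in might look like the sketch below. The `awe_buffer` struct and the `pages_for` and `awe_alloc` helpers are hypothetical names introduced here. It assumes the calling account holds the "Lock pages in memory" privilege (SeLockMemoryPrivilege), which AllocateUserPhysicalPages() requires. The Windows-specific part is guarded so the page arithmetic can be compiled anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole pages needed to hold `size` bytes (rounds up). */
size_t pages_for(size_t size, size_t page_size)
{
    assert(page_size != 0);
    return size / page_size + (size % page_size != 0 ? 1 : 0);
}

#ifdef _WIN32
#include <windows.h>

/* Hypothetical bookkeeping struct, not part of any Windows API. */
typedef struct {
    PVOID      address;  /* user-mode view passed to ReadFile() */
    PULONG_PTR pfns;     /* page frame numbers owned by this process */
    ULONG_PTR  pages;    /* physical pages actually allocated */
} awe_buffer;

/* Returns TRUE on success; on failure everything acquired is released. */
BOOL awe_alloc(SIZE_T size, awe_buffer *out)
{
    SYSTEM_INFO si;
    ULONG_PTR wanted;

    if (size == 0)
        return FALSE;
    GetSystemInfo(&si);
    wanted = (ULONG_PTR)pages_for(size, si.dwPageSize);
    out->address = NULL;
    out->pages = wanted;
    out->pfns = HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY,
                          wanted * sizeof(ULONG_PTR));
    if (out->pfns == NULL)
        return FALSE;

    /* The call can succeed while granting fewer pages than requested. */
    if (!AllocateUserPhysicalPages(GetCurrentProcess(),
                                   &out->pages, out->pfns)) {
        HeapFree(GetProcessHeap(), 0, out->pfns);
        return FALSE;
    }
    if (out->pages == wanted)
        out->address = VirtualAlloc(NULL, wanted * si.dwPageSize,
                                    MEM_RESERVE | MEM_PHYSICAL,
                                    PAGE_READWRITE);
    if (out->address != NULL &&
        MapUserPhysicalPages(out->address, out->pages, out->pfns))
        return TRUE;

    /* Undo in reverse order on any failure. */
    if (out->address != NULL)
        VirtualFree(out->address, 0, MEM_RELEASE);
    FreeUserPhysicalPages(GetCurrentProcess(), &out->pages, out->pfns);
    HeapFree(GetProcessHeap(), 0, out->pfns);
    return FALSE;
}
#endif /* _WIN32 */
```

The check on `out->pages` matters because a short allocation is reported as success; without it the later mapping would cover less memory than the ReadFile() length.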

Then you have a very strange definition of “exotic”, because the code
you show above certainly qualifies under my definition.

Further, I don’t think your performance analysis is fair. You’re
comparing the cost of “ReadFile” of a plain malloc-ed buffer to the cost
of “ReadFile” with the exotic sequence you have above, but in the case
above, YOU are doing what the I/O system has to do in the first case.
The same work has to get done, you’re just moving it.

For a fair comparison, you would compare your bare ReadFile against the
time in this sequence from just before HeapAlloc to just after the
ReadFile. That would tell you the REAL cost of the I/O operation.
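
One way to set up that comparison is a small harness that times the setup step and the I/O step of each scenario and reports their sum. This is only a sketch: `scenario`, `time_ms`, `scenario_total_ms` and the placeholder `count_call` are hypothetical names, and `timespec_get` with `TIME_UTC` reads the wall clock, so a monotonic source such as QueryPerformanceCounter would be a better choice for real measurements.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef void (*step_fn)(void *ctx);

/* Wall-clock duration of one call to fn(ctx), in milliseconds. */
double time_ms(step_fn fn, void *ctx)
{
    struct timespec t0, t1;

    assert(fn != NULL);
    timespec_get(&t0, TIME_UTC);
    fn(ctx);
    timespec_get(&t1, TIME_UTC);
    return (double)(t1.tv_sec - t0.tv_sec) * 1e3
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}

/* One measured case: an optional setup step followed by the I/O step. */
typedef struct {
    step_fn setup;  /* e.g. allocate and map the buffer; may be NULL */
    step_fn io;     /* e.g. the ReadFile() call */
    void   *ctx;    /* passed unchanged to both steps */
} scenario;

/* Total cost of a scenario: setup (if any) plus the I/O itself. */
double scenario_total_ms(const scenario *s)
{
    double total = 0.0;

    if (s->setup != NULL)
        total += time_ms(s->setup, s->ctx);
    total += time_ms(s->io, s->ctx);
    return total;
}

/* Placeholder step used only to exercise the harness. */
void count_call(void *ctx)
{
    ++*(int *)ctx;
}
```

The plain malloc case would be a scenario with no setup step, and the pre-mapped case would put the allocation and mapping calls in `setup`, so both totals cover the same work.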


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

We could argue about each other’s definition of “exotic”, but this code follows Microsoft guidelines: see http://msdn.microsoft.com/en-us/library/windows/desktop/aa366531(v=vs.85).aspx
Moving the work that has to get done is the trick.
The difference is that the allocation is made during an initialization phase; thus, during the real time phase (acquisition), no time is lost locking the pages in physical memory.
And it’s certainly an advantage for my application.

It seems to be an incredibly complex solution to what is basically a
trivial problem. I had this same problem over ten years ago, and I
handled it trivially by opening the device in overlapped mode and sending
down a pile of ReadFile requests. Do you have the slightest HINT that
page lockdown is your bottleneck, or is it round-trip time to the
application, because of the scheduler? One of the key ideas to pursue in
“optimizing” anything is to first understand where the real problem is,
and fix that first.

By the way, it took me only five hours to rewrite the app to run with
async I/O, and fully test it, but to rewrite the driver would have taken a
couple weeks.

You are asserting that page locking and MDL overhead are the causes of
your problem. Do you have actual performance measurement data that
supports this? My data said that I reduced the delays to consistently
under 100us. And that’s solid data, because the device would not work
correctly if the delays exceeded 100us. We measured the delay. Its mean
was 80us (plus 10, minus 20, and yes, the distribution was not symmetric).
So why are you trying to come up with a convoluted solution that may not
actually solve the problem?

Note that MEM_PHYSICAL appears from the documentation to require that your
system support AWE, whereas the simpler asynch approach will run on nearly
any system.
joe


“Do you have the slightest HINT that page lockdown is your bottleneck, or is it round-trip time to the application, because of the scheduler?”
Yes, I’m sure about the measurements I posted previously (they were taken using WPP traces in both the application and the driver). They show without ambiguity that the difference in Direct I/O call duration is due to the page lockdown (I was doing nothing else on the computer and I repeated the measurements several times).

“You are asserting that page locking and MDL overhead are the causes of your problem. Do you have actual performance measurement data that supports this?”
My current application + driver (without any optimization) are completely regular and work perfectly (as far as I know) today with current use cases.
In fact, I’m in the process of evaluating new use cases where data transfers will be much larger than today.
As I write the PCIe device firmware, the driver and part of the application, I’m completely free to do what I want, but of course I want to stick to well-designed pieces of code (that’s why I care about all your remarks).
The entire system is under our control, so I don’t care about being compatible with anybody’s computer, like it would be the case with consumer hardware/software.
As it’s an embedded application with very little processing power in comparison with what we want to do, we are trying to explore any possibility of optimization.
The root cause of my thread posting is that I was trying to perform large transfers (64 MiB) and I observed that they were much slower than allowed by my PCIe bandwidth; after instrumenting, I noticed that the Direct I/O call duration depended on the type of buffer passed as a parameter (already locked or not).
This call duration can be masked if I perform the ReadFile() before any data is ready on the device, but there are some cases where it’s not possible, and then the latency between the call and the DMA start will increase because the system has to lock the buffer in physical memory.
Perhaps I can work around the issue by making the application more asynchronous, but this is not my part, so I cannot answer yet.
In all cases, what is the issue with using the AllocateUserPhysicalPages() API if the number of locked pages is small enough and under control?
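
To illustrate why an unmasked lock-down cost shows up as lost bandwidth, here is a minimal sketch of the arithmetic. It assumes the setup time and the DMA transfer run strictly one after the other. The function name is hypothetical, and the 1024 MiB/s link rate used in the note after the code is only an example figure, since the thread does not give the real one.

```c
#include <assert.h>

/* Effective throughput when a fixed setup delay precedes each transfer.
   size_mib        transfer size in MiB
   setup_ms        time spent before the DMA can start (e.g. page lock-down)
   link_mib_per_s  raw link throughput in MiB/s (assumed, not measured) */
double effective_mib_per_s(double size_mib, double setup_ms,
                           double link_mib_per_s)
{
    double transfer_s;
    double total_s;

    assert(size_mib > 0.0 && setup_ms >= 0.0 && link_mib_per_s > 0.0);
    transfer_s = size_mib / link_mib_per_s;
    total_s = setup_ms / 1000.0 + transfer_s;
    return size_mib / total_s;
}
```

With that example link rate, a 64-MiB transfer takes 62.5 ms on the wire. Adding the 84 ms measured for the malloc() buffer gives about 437 MiB/s effective, while the 8 ms measured for the AllocateUserPhysicalPages() buffer gives about 908 MiB/s.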

> The root cause of my thread posting is that I was trying to perform large
> transfers (64 MiB) and I observed that they were much slower than allowed by my
> PCIe bandwidth; after instrumenting, I noticed that the Direct I/O call duration
> was dependent on the type of buffer passed in parameter (already locked or not).

Which parameter of DeviceIoControl is your big buffer, and how do you define your IOCTL code?

It isn’t an IOCTL, it’s a ReadFile.
Configured with WdfDeviceInitSetIoType(DeviceInit, WdfDeviceIoDirect)

Are you testing it under an x64 OS?

No, it’s WES7 SP1 x86

For your case, an industrial application in a closed system, any of these
and more are ‘ok’ as you don’t really have to worry about what happens when
the hardware changes or some 3rd party software does bad things. Using
AllocateUserPhysicalPages or VirtualAlloc with large pages is probably a
good compromise between safety and performance for you and is far better
than most of the half-baked shared memory schemes that we see here.

On a recent project, the use of large pages for a data storage area, while
less effective than the corresponding data structure reorganization to be
cache line friendly, gave a 20-30% performance improvement. The improvement
in memory access performance was due solely to the difference in the way
that the page tables were organized and I suspect that the same is true for
the use of physical pages - although I’ll defer to one of the Windows Memory
Manager authors / experts who sometimes troll this list.
