Zero-copy from PCI MMIO to screen

Hello guys,

I’m fairly new to writing a full-scale device driver, so please don’t get annoyed too soon by my choosing the wrong sort of glue.

I’m writing a KMDF driver for a very simple PCI device. All the device does is provide a read-only MMIO region with a continuously changing picture in RGB24 (no interrupts are generated, though). I’d like to display this picture. The absolute priority in this task is to minimize the number of times the bitmap is copied between different system buffers. Ideally there should be no copying at all, so that the corresponding display driver just configures DMA from the MMIO region. But that’s more up to the display driver, and I’d first prefer to finish the PCI device handling.

Now comes my choice of glue. The simplest and seemingly optimal way is to map the read-only region directly into user space and just call SetDIBitsToDevice on that region. The whole thing then boils down to the following sequence of calls: MmMapIoSpace -> IoAllocateMdl -> MmBuildMdlForNonPagedPool -> MmMapLockedPagesSpecifyCache(UserMode). I assume it’s most appropriate to do that in EvtDeviceFileCreate, so that I’m guaranteed to be in the caller’s context.
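For reference, the call sequence described above looks roughly like this (a minimal WDK kernel-mode sketch of mine, not from the thread; DEVICE_CONTEXT, BarPa, and BarLen are placeholder names, and error handling is abbreviated):

```c
/* Sketch: map a PCI BAR (MMIO) into the calling process.
 * Placeholder names; must run in the target process's context,
 * e.g. from EvtDeviceFileCreate. */
NTSTATUS
MapBarToCaller(DEVICE_CONTEXT *Ctx, PHYSICAL_ADDRESS BarPa, SIZE_T BarLen)
{
    /* 1. Map the BAR into system space, uncached (device memory). */
    Ctx->KernelVa = MmMapIoSpace(BarPa, BarLen, MmNonCached);
    if (Ctx->KernelVa == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    /* 2. Build an MDL describing that system-space range. */
    Ctx->Mdl = IoAllocateMdl(Ctx->KernelVa, (ULONG)BarLen,
                             FALSE, FALSE, NULL);
    if (Ctx->Mdl == NULL) {
        MmUnmapIoSpace(Ctx->KernelVa, BarLen);
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    MmBuildMdlForNonPagedPool(Ctx->Mdl);

    /* 3. Map the MDL into the current process.
     * MmMapLockedPagesSpecifyCache(UserMode) can raise an exception. */
    __try {
        Ctx->UserVa = MmMapLockedPagesSpecifyCache(
            Ctx->Mdl, UserMode, MmNonCached,
            NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(Ctx->Mdl);
        MmUnmapIoSpace(Ctx->KernelVa, BarLen);
        return GetExceptionCode();
    }
    return STATUS_SUCCESS;
}
```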

The problem with that glue is that I can’t keep track of when to unmap the pages from user space and free the MDL. That is, the device handle can be inherited or duplicated into a different process, and I won’t get notified until that process closes the handle, which could also happen after the process with the mapped region terminates.

What I also dislike about the above approach (even though not critically) is the inability to set up protection for the user-space mapped pages. What’s the use case of MmMapLockedPagesSpecifyCache(UserMode) if it always results in unalterable RWX?

Is the handle-duplication concern real or theoretical? If just theoretical, clean up the mapping in EvtFileCleanup. Outside of that, there is no clear callback to unmap from. You could also have the app pend a request that, when canceled, will unmap, but that requires app cooperation, and if you have that, just use the file handle cleanup callback.

d

Bent by my phone


From: xxxxx@mailbox.tu-berlin.de
Sent: 10/11/2013 5:16 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] Zero-copy from PCI MMIO to screen



NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Thank you for such a quick reply, Doron. My concern is more theoretical, but it seems to be sort of a potential local DoS vulnerability, like when a misbehaving application duplicates the handle from my application’s process. Wouldn’t it cause a system crash to free the MDL outside of the corresponding process context?

> You could also have the app pend a request that when canceled will unmap, but that requires app cooperation

EvtRequestCancel doesn’t seem to be called in the closer’s context, does it? I mean, I see no documented guarantee for that. If I understand the idea correctly, a solution could be to reference the EPROCESS in EvtDeviceFileCreate and to KeStackAttachProcess in EvtRequestCancel in order to unmap and free the MDL. But then again, I’d need a pending request, and this introduces a race: a misbehaving application could try to duplicate the handle and send the request before the legitimate one does.

On 12-Oct-2013 04:03, xxxxx@mailbox.tu-berlin.de wrote:

> If I understand the idea correctly, it could be a solution to reference EPROCESS within
> EvtDeviceFileCreate and to KeStackAttachProcess in EvtRequestCancel
> in order to unmap and to free the MDL.

You do.

> But then again I’d need a pending request and this way a race is introduced: a misbehaving application
> could try to duplicate the handle and to send the request before the legitimate one does.

Have your original application open the handle correctly
(bInheritHandle in the lpSecurityAttributes arg must be false)
then child apps won’t inherit the handle.

Protect the app against remote DuplicateHandle:
http://msdn.microsoft.com/en-us/library/windows/apps/ms684880(v=vs.85).aspx

Applications that talk directly to drivers (by holding
device handles, mapping memory, etc) assume certain
responsibilities. This is why custom devices often are
protected so that only privileged accounts can access
them. Random ignorant apps indeed can do weird things
to the driver and jeopardize the whole system.
– pa

Generally, any design that starts, “I could map this kernel buffer to user
space…” is going to lead you into a minefield of weird corner conditions
that would frighten most experienced driver writers. The guys at OSR
probably wouldn’t blink at the problem, but most of us would run the other
way as fast as possible.

Part of the question is, how expensive is the copy? One of the issues
about modern architectures is how screamingly fast they are, particularly
with L1/L2 or L1/L2/L3 cache effects. While historically, large block
copies were something to be avoided, today they are often not the real
bottleneck.

Also note that you would not free up resources on the IRP_MJ_CLEANUP
operation, except those which are handle-specific (if any). A duplicated
or inherited handle should pose absolutely no problem, in that you would
not free up the resources until the IRP_MJ_CLOSE operation, meaning all
handles are closed.

DoS attacks by creating handles and holding files open are
next-to-impossible to prevent anyway, so why do you think this poses a
particular problem for your driver?
joe


Pavel A wrote:

> You do.

Thank you. It’s important to me to get confirmation that a possible solution is acceptable.

> Have your original application open the handle correctly (bInheritHandle in the lpSecurityAttributes arg must be false) then child apps won’t inherit the handle.
> Protect the app against remote DuplicateHandle

Thanks again. I didn’t have a notion of protected processes. But this still does not protect against multiple malicious processes playing around with handles to my device, assuming at least one of them could open it directly. But… maybe I found a solution. If I mark an incoming ioctl request with the requester’s PEPROCESS in EvtIoInCallerContext and map the MMIO region in the corresponding EvtIoDeviceControl (not in EvtDeviceFileCreate) only if I previously processed EvtDeviceFileCreate for that PEPROCESS, then I could peacefully do the unmap and forget that PEPROCESS in EvtRequestCancel. This seems to cover all possible attack scenarios.
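The cancel-time cleanup half of this scheme might be sketched as follows (my own illustration, not code from the thread; DEVICE_CONTEXT, GetDeviceContext, and the context fields are placeholders, and it assumes the EPROCESS was ObReferenceObject’d when the mapping was created):

```c
/* Sketch: unmap in the context of the process recorded at create time.
 * EvtRequestCancel may run in an arbitrary thread context, so attach
 * to the address space the region was mapped into. */
VOID
EvtRequestCancel(WDFREQUEST Request)
{
    DEVICE_CONTEXT *ctx = GetDeviceContext(
        WdfIoQueueGetDevice(WdfRequestGetIoQueue(Request)));
    KAPC_STATE apc;

    KeStackAttachProcess(ctx->MappingProcess, &apc);
    MmUnmapLockedPages(ctx->UserVa, ctx->Mdl);
    KeUnstackDetachProcess(&apc);

    IoFreeMdl(ctx->Mdl);
    ObDereferenceObject(ctx->MappingProcess); /* taken at create time */
    ctx->UserVa = NULL;
    ctx->MappingProcess = NULL;

    WdfRequestComplete(Request, STATUS_CANCELLED);
}
```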

> This is why custom devices often are protected so that only privileged accounts can access them.

Do you think it’s enough to protect the device by allowing only privileged processes to access it? I mean, it could still be a valid way to crash the kernel from user mode. Even though that’s sometimes called a feature, as in the case of NtSetInformationProcess(ProcessBreakOnTermination).

Joseph M. Newcomer wrote:

> Generally, any design that starts, “I could map this kernel buffer to user space…” is going to lead you into a minefield of weird corner conditions that would frighten most experienced driver writers

Maybe, but there’s a potential performance gain. I’d be happy to avoid it if I knew of any other supported way to achieve at least potential zero-copy. E.g., it would be perfect to be able to do the whole display thing completely in the kernel, but I’m not aware of any supported graphics output interfaces in the kernel.

> Part of the question is, how expensive is the copy?

I feel like I’m about to see all the stale tomatoes thrown in my direction, but I didn’t do the measurements. My common sense just tells me I’m very likely to see significant performance degradation while outputting Full HD at 50 Hz (which could be a bit of an overestimate, but may also become a real scenario). In fact, the purpose is not just to ensure smooth output, but also to offload the CPU as much as possible from tasks related to the PCI device, because other soft real-time tasks like 3D gaming are likely to be running in parallel.

> Also note that you would not free up resources on the IRP_MJ_CLEANUP operation, except those which are handle-specific (if any).

I assume you’re not expressing yourself precisely. The resources are not handle-specific, but FILE_OBJECT-specific. And this is what actually creates the problem.

> A duplicated or inherited handle should pose absolutely no problem, in that you would not free up the resources until the IRP_MJ_CLOSE operation, meaning all handles are closed.

Yes. That’s exactly what creates the problem. If I allocate process-specific resources such as a user-mode memory mapping, then I also need to free those resources in the context of that process. MSDN suggests using PsSetCreateProcessNotifyRoutine for such cases, but that doesn’t seem like a clean solution to me, at least because of the following remark: “but keep in mind that such callbacks are a limited system resource”. 64 callbacks system-wide is really not that many.

> DoS attacks by creating handles and holding files open are next-to-impossible to prevent anyway, so why do you think this poses a particular problem for your driver?

That’s not the DoS I’m talking about. I’m talking about being able to force the driver to free process-context-specific resources in a different process context. I’m not aware of the consequences of that, but it doesn’t seem good anyway.

>> Part of the question is, how expensive is the copy?
> I feel like I’m about to see all the stale tomatoes thrown in my direction, but I didn’t do the measurements. My common sense just tells me I’m very likely to see significant performance degradation while outputting Full HD at 50 Hz (which could be a bit of an overestimate, but may also become a real scenario). In fact, the purpose is not just to ensure smooth output, but also to offload the CPU as much as possible from tasks related to the PCI device, because other soft real-time tasks like 3D gaming are likely to be running in parallel.

OK, so this is a consumer-oriented product, which is an important
parameter in the design space. However, “common sense” is not a good
gauge of performance, because common sense tends to think of things
scaling linearly. When we talk about such algorithms formally, we
categorize them as O(n) algorithms. However, this “big-O” notation is a
shorthand notation for cost = setup + C* f(n) + teardown, where C is the
“constant of proportionality”. My point is that “common sense” will not
tell you that C is nonlinear with respect to caching, for example, and in
fact the point of multilevel caching is to reduce C to a number smaller
than 1.0. Note also that large block copies are typically done interruptibly, that is, at PASSIVE_LEVEL. If you are dealing with 50 Hz frames, and the frames are only displayed visually, you can easily afford to lose half of them, reducing your display rate to 25 Hz, very close to the standard movie frame rate of 24 Hz.

>> Also note that you would not free up resources on the IRP_MJ_CLEANUP
>> operation, except those which are handle-specific (if any).
> I assume you’re not expressing yourself precisely. The resources are not
> handle-specific, but FILE_OBJECT-specific. And this is what actually
> creates the problem.

If they are FILE_OBJECT specific, and you have multiple handles on a
FILE_OBJECT, the IRP_MJ_CLEANUP is not called until the last handle on the
FILE_OBJECT is closed. So I’m not sure why this presents a problem.

>> A duplicated or inherited handle should pose absolutely no problem, in
>> that you would not free up the resources until the IRP_MJ_CLOSE
>> operation, meaning all handles are closed.
> Yes. That’s exactly what creates the problem. If I allocate process-specific
> resources such as a user-mode memory mapping, then I also need to free those
> resources in the context of that process. MSDN suggests using
> PsSetCreateProcessNotifyRoutine for such cases, but that doesn’t seem like a
> clean solution to me, at least because of the remark: “but keep in mind that
> such callbacks are a limited system resource”. 64 callbacks system-wide is
> really not that many.

Why do you need process-specific resources if a handle is duplicated to
multiple processes?

And the whole problem of unmapping memory shared with an app is one of the
many cans of worms that you create as soon as you consider doing
memory-maps of kernel space to user space. It gets rapidly messy.

>> DoS attacks by creating handles and holding files open are
>> next-to-impossible to prevent anyway, so why do you think this poses a
>> particular problem for your driver?
> That’s not the DoS I’m talking about. I’m talking about being able to
> force the driver to free process-context-specific resources in a
> different process context. I’m not aware of the consequences of that, but
> it doesn’t seem good anyway.

If your driver can crash the kernel, you have to make sure that does not
happen. No matter what insanely stupid or insanely malicious software
runs, you have to be utterly and completely impervious to it. Hence the
can of worms. Trusting a user process to “do the right thing”? Well, if
you believe that, did I tell you about my uncle in the Nigerian army that
needs some help sneaking his money out of the country?

Note also that, independent of the zero-copy issue, there is still going
to be a copy, as the bits are transferred from the input device to the
display card’s frame buffer. SetDIBitsToDevice isn’t going to make that
copy operation go away.

You should probably look at DirectX-types of support here, rather than GDI
support.

Expecting any user app to keep up with 50Hz full-HD is asking a lot. How
do you plan to deal with all those times when the threads are being
preempted by higher-priority threads? You have only 20ms per frame, and
if you are preempted for one scheduler cycle, you lose 30ms unless your
preempting thread finishes sooner than a “timeslice”. Expecting to
maintain this data rate, of one frame every 20ms while also running a game
does not seem to be reasonable. Even with the multimedia scheduling
service of Vista+, it does not seem to be sustainable. Perhaps a
DirectXpert has something to say about this.
joe



Your problem will be that the display driver may use DMA for BitBlt, and not all PCIe platforms allow node-to-node (peer-to-peer) DMA. So you’re better off having the driver copy the data into an app memory buffer and having the application call BitBlt on that buffer. Everything will be simpler and more reliable anyway.

Alex Grig wrote:

> Your problem will be that the display driver may use DMA for BitBlt

This is exactly what I’m hoping it will do, because that’s the point of zero-copy.

> not all PCIe platforms allow node to node DMA

In my case it’s ensured that DMA will work, if the display driver is kind enough to use it.

> In my case it’s ensured that DMA will work, if the display driver is kind enough to use it.

Have you tested that on a pre-Vista OS (without DWM)? The Desktop Window Manager will actually BitBlt into an offscreen composition buffer and then do DMA.

Anyway, you’re not saving much memory bandwidth. Modern platforms have enormous bandwidth, and a simple display update doesn’t stress it at all.

I’ve been thinking about how I might do this, and have come up with some
design ideas. If they are too far off the mark, I hope someone will jump
in and say what’s wrong.

My current thought: ignore the copy problem.

The more significant cost would be in bringing in a page of the user
buffer when the I/O Manager does the MmProbeAndLockPages. This would take
several frame times. Therefore, you want to keep the user buffer locked
down.

One way is to specify that the buffer must be allocated using
VirtualAlloc/VirtualLock. Then the “Get Data” requests must reference
this buffer. No paging will be required.
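As a sketch of that first method (user-mode Win32, my own illustration; buffer size and flags are only indicative):

```c
/* Allocate a page-aligned frame buffer and pin it, so that
 * MmProbeAndLockPages in the driver never has to fault pages in. */
#include <windows.h>

void *alloc_pinned(SIZE_T size)
{
    void *buf = VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE,
                             PAGE_READWRITE);
    if (buf == NULL)
        return NULL;

    /* VirtualLock may require raising the process working-set size
     * first (SetProcessWorkingSetSize) for large buffers. */
    if (!VirtualLock(buf, size)) {
        VirtualFree(buf, 0, MEM_RELEASE);
        return NULL;
    }
    return buf;   /* reference this buffer in every "Get Data" request */
}
```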

Another method is to have a DeviceIoControl “here-there-be-buffer” request
that hands down a buffer of some size to the device. This IRP is held
pending until the app closes.

If a frame comes in and there is no place to put it, drop it. Keep a
count of such dropped frames (for later).

If you need to do multiple buffering (multiple: > 1), you could consider
passing in an address to the “here-there-be-buffer” request which
specifies a location containing n * framesize bytes. The “Get Data”
request can return the offset in this larger buffer where the
frame of interest is found. You track which buffers are in use in
your driver, and a “Get Data” request might include handing back the
offset previously used (with some designated bogus value, like -1 (32-bit
or 64-bit), to indicate first-time use). If you get a frame and there are
no available slots to put it in, drop it and increment the dropped counter.

This allows you to use async I/O to manage your input. Because async I/O
doesn’t guarantee delivery order is the same as arrival order, you might
consider adding a sequence number to the frame information you would
return. For performance, use I/O Completion Ports, not callback or
WaitFor… models.

My memory of DirectX was that one of its abilities was to allow a video
stream to be sent directly to the client area of the window which was
displaying the image. I would suggest looking into this rumor to see if I
have remembered it correctly. It might save you months of needless
effort.

My current thought is that if you have to impose a user app in the
critical path, the cost of the copy is going to be the least of the
problems you will have to solve to get a good presentation, and therefore
you should ignore it.

If you depend upon the video card being able to do DMA, you have quite
likely tied your success to the behavior of one vendor’s one model of
card, with one particular driver of theirs that allows this. This seems
to be a shaky premise for a general product. But all vendors are supposed
to support DirectX, and if my memory has any correlation to reality, if
you use DirectX this would already be supported, and therefore you are not
tied to a particular card, vendor, or driver instance.

You are using some very restrictive models for the app, such as GDI
(which, I believe, is thought to be largely “deprecated” for any
sophisticated graphics, and hangs around only for the simple stuff and
backward compatibility). Look into multimedia support for streaming video
in DirectX as a starting point.
joe


Alex Grig wrote:

> Have you tested that on a pre-Vista OS (without DWM)? The Desktop Window Manager will
> actually BitBlt into an offscreen composition buffer and then do DMA.

Target system is currently Win 7, and Win 8 will become important soon. Not sure about pre-Vista, but today’s Win 7 tests didn’t make me confident.

> Anyway, you’re not saving much memory bandwidth. Modern platforms have enormous bandwidth, and a simple display update doesn’t stress it at all.

The idea is about saving CPU time, not memory bandwidth. Btw, today’s tests resulted in 25% of CPU time eaten up by just a memcmp over 900 KB (640x480x24) of the device’s memory at 5 Hz. SetDIBitsToDevice ate up 40% with the same parameters. Thus I think it could really be worth trying to enforce zero-copy. However, the people responsible for the device promised to improve its performance significantly.

Joseph M. Newcomer wrote:

> My current thought is that if you have to impose a user app in the critical path, the cost of the copy is going to be the least of the problems you will have to solve to get a good presentation, and therefore you should ignore it.

According to today’s tests, you’re very wrong here.

Currently I have a seemingly secure implementation of mapping the device MMIO into user space, based on Doron’s suggestion to keep a pending request.

xxxxx@mailbox.tu-berlin.de wrote:

> The idea is about saving CPU time, not memory bandwidth. Btw, today’s tests resulted in 25% of CPU time eaten up by just a memcmp over 900 KB (640x480x24) of the device’s memory at 5 Hz. SetDIBitsToDevice ate up 40% with the same parameters.

That’s almost impossible to believe. You must have bugs in your
hardware that are delaying the PCI cycles. In that case, I’m not sure
bus mastering is going to help you, either. Even a stupid PCI device
ought to be able to do 60 MB/s over MMIO. Remember that the CPU itself
is capable of handling about 5,000x the throughput you’re talking about.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Btw, today’s tests resulted in 25% of CPU time eaten up by just a memcmp over 900 KB (640x480x24) of the device’s memory at 5 Hz.

That’s 900 KB × 5 Hz ≈ 4.5 MB/s at 25% CPU, i.e. roughly 19 MB/s at full CPU: about 50 ns per byte, or 200 ns per 4-byte word. Which is in line with a typical PCIe round trip for a single transaction.

To measure speed, you SHOULD NOT BE USING MEMCMP. memcmp will read the region byte by byte, so the code will issue single-byte PCIe transactions. Use memcpy; it will use larger PCIe transactions.