Request for clarification: ownership of buffer/MDL memory contents

It appears from another thread that I’ve had some incorrect understandings of buffer and MDL ownership, so I’d like more erudite folks to help me really get this dialed in … :slight_smile:

I’ve read up on the various threads here (quite a few of them, actually!) as well as some of the more recent MSDN documentation [https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/buffer-descriptions-for-i-o-control-codes]. Although some threads touch on this, as we all know Windows changes daily and what was true then may no longer be true now. This isn’t a contrived example either; it is from live driver code that I’ve used for quite some time [although to be fair, these days I use the WdfIo functions, which hide a bunch of this stuff] …

Suppose that I have a usermode application A that makes calls into a driver using METHOD_BUFFERED and METHOD_DIRECT_X [i.e. METHOD_IN_DIRECT or METHOD_OUT_DIRECT] with overlapped IO, allocating a buffer in thread (1) and cleaning it up in the callback context. The application uses a file handle to connect to the driver and standard system calls to access it, and there are no filter drivers or intermediate drivers between the caller and the driver.

For METHOD_BUFFERED, A[1] will allocate a buffer [O] with VirtualAlloc() and pass that as the return buffer, and build a structure [I] on the stack which is then passed by pointer in the IOCTL call. In the driver, both the input [I] and the output [O] buffers supplied have been allocated by the OS and the caller’s buffers copied into them, so when the function in A[1] that called into the driver goes out of scope there’s no issue. The driver fills in the return buffer [O] contents and completes the call, which eventually percolates into the overlapped callback function A[2], which then calls VirtualFree() to clean up buffer [O]. The return buffer [O] is owned by the process, so if process [A] terminates before the IOCTL call can complete, that buffer is cleaned up; in the driver the output [O] buffer still points to valid memory (since it was allocated by the OS) and the driver simply cleans things up [Correct?]

My (mis)understanding had been that for METHOD_BUFFERED, even though the input and output buffers are allocated by the OS on the kernel side, there is still a copy of the return buffer made by the OS into the output buffer … so if I did something like allocate both in and out buffers on the stack, make the call and go out of scope, there would be a BSOD in the kernel when the OS attempted to copy the return buffer into the now freed stack frame memory (remember, I’m using overlapped IO in this example) [Yes?]

For METHOD_DIRECT_X, A[1] will allocate a buffer [O] with VirtualAlloc() and pass that as a return/input buffer, and build a structure [I] on the stack which is then passed by pointer in the IOCTL call, and then the calling function will go out of scope. In the driver the input buffer [I] has been allocated by the OS and the calling buffer copied into it, and the driver is given an MDL for the input/output buffer [O], which has been pinned and locked, so that when the function in A[1] that called the driver goes out of scope there’s no issue with that, and the input/output buffer [O] was allocated with VirtualAlloc() so it’s also safe. The driver converts the MDL into a buffer, fills in the buffer [O] contents and completes the call, which eventually percolates into the overlapped callback function A[2], which then calls VirtualFree() to clean up buffer [O] [Correct?]

My (mis)understanding had been that although the OS owned the MDL list and pinned the memory associated with that list, the calling program [A] still had ownership of the underlying memory associated with that buffer … and that “pinning” simply told the OS “don’t page this or move it around” so while the IOCTL was outstanding with the driver the calling process [A] could:

  • Change the contents of the buffer, say memset() [O] to 0xFF [Yes?] This was actually the topic of an OSR Insider article from 2009 …
  • In the case of an abnormal termination of the process, the OS would free the buffer [Yes?]
  • Call VirtualFree() itself on that buffer, freeing the memory [Yes?]
  • If the buffer were allocated on the stack and passed by pointer to the call as soon as the calling function went out of scope that buffer would be invalid [Yes?]
  • If the IRP exceeded the 5 minute clock then the IRP would be completed but the buffer would still be valid, so I could still write to that buffer even though the IRP had been cancelled [Yes?]

It was recently stated that actually once memory from a usermode process has been pinned and locked by the OS prior to delivery to the driver it now “belongs to the OS”, which implies:

  • In the case of an abnormal termination of the process, the associated VirtualAlloc()'ed memory would not be freed as part of the OS teardown; rather, it would be the responsibility of the OS or driver. This implies first that I could potentially exhaust the system PTEs by simply having a pathological service allocate, call, then crash and restart again and again, letting the driver accumulate IRPs in its queue. It also implies that at some point once I do complete that IRP the OS is going to need to clean up that underlying buffer … but it doesn’t know if that memory was allocated from virtual memory or some kind of process heap
  • In the case of the 5 minute clock when the IRP is cancelled the buffer also needs to be cleaned up (again, by whom and how do they call the right cleanup function) which means if my driver is in the middle of writing to that buffer I’ve got a problem
  • In the case of an allocation made on the stack from a function that goes out of scope, does that mean a portion of the stack space from that thread has now been carved off and made “special”? As we all know, the stack is simply a chunk of memory allocated to a thread from which local allocations are made and released … which means it depends on the entire range being available with no sandbars … and again, once the thread has terminated and its stack space has been freed, what would happen to that “special” section I used for the in/out buffer [O]? This would appear to be the case for both METHOD_DIRECT and METHOD_BUFFERED calls

This idea of “once the buffer has been pinned/ locked it’s owned by the OS” seems like it would introduce a lot more headaches than how I understood things (memory is owned by the allocating process) … what am I missing?

Thx!

That’s a very long post. I hope I’ve managed to understand exactly what you’re asking.

When you use METHOD_BUFFERED, the I/O Manager allocates a System Buffer in non-paged pool, the size of which is the greater of the size of the InBuffer and the size of the OutBuffer. If there’s an InBuffer specified by the app, and its length is non-zero, the I/O Manager copies the contents of the InBuffer to the System Buffer.

The driver retrieves the data from the System Buffer (which is the contents of the app’s InBuffer). The driver does some processing, and puts output data in the System Buffer. The driver completes the Request, specifying a status and a length. When the app is next ready to run (since we’re talking overlapped I/O in your example, this would be the result of The Special Kernel-Mode APC For I/O Completion), the I/O Manager copies the contents of the System Buffer to the app’s address space. If the specified buffer was “out of scope” in the app… the memory wouldn’t be non-existent (the stack pages are still mapped), so the copy-back would land on whatever now occupies that memory, and the app would fail. If the user has actually deallocated the memory (and the pages no longer appear within its address space), the copy-back would raise an exception… but because it is done in a try/except and the I/O Manager is referencing user-mode memory, nothing particularly bad would happen… except the Request gets completed with an error that is basically the exception code that was encountered in copying back the data.

So… METHOD_BUFFERED… all good.
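To make the copy-in/copy-back sequence described above concrete, here is a rough user-mode model in plain C. It is only a sketch of the data flow, not I/O Manager code: `SystemBuffer`, `model_begin_buffered` and `model_complete_buffered` are hypothetical names, `calloc` stands in for the non-paged pool allocation, and clamping the copy-back to the output length is a choice made for this model.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the System Buffer described above. */
typedef struct {
    unsigned char *data;  /* models the non-paged pool allocation */
    size_t size;          /* greater of the input and output lengths */
} SystemBuffer;

/* Request setup: size the buffer to max(in_len, out_len), copy the input in.
 * Zeroing via calloc is a convenience of this model only. */
static int model_begin_buffered(SystemBuffer *sb, const void *in_buf,
                                size_t in_len, size_t out_len)
{
    sb->size = in_len > out_len ? in_len : out_len;
    sb->data = sb->size ? calloc(1, sb->size) : NULL;
    if (sb->size != 0 && sb->data == NULL)
        return -1;                       /* allocation failed */
    if (in_buf != NULL && in_len != 0)
        memcpy(sb->data, in_buf, in_len);
    return 0;
}

/* Completion: copy the length reported by the driver back to the app's
 * output buffer, then release the system buffer. Returns bytes copied. */
static size_t model_complete_buffered(SystemBuffer *sb, void *out_buf,
                                      size_t out_len, size_t information)
{
    size_t n;
    assert(information <= sb->size);     /* driver must not over-report */
    n = information < out_len ? information : out_len;
    if (out_buf != NULL && n != 0)
        memcpy(out_buf, sb->data, n);
    free(sb->data);
    sb->data = NULL;
    sb->size = 0;
    return n;
}
```

The point of the model: the driver only ever touches `sb->data`, and the app's output buffer is written exactly once, at completion. That is why the app's output buffer must still be valid when the completion runs, even though the driver never sees it.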

In METHOD_xxx_DIRECT, the OutBuffer is described by an MDL, and the I/O Manager locks the pages by calling MmProbeAndLockPages. The reference count on each page is incremented in the PFN, and the PFN entry is updated to reflect the context in which a mapping to these pages occurs. At some point, the driver gets the request and maps the pages into kernel virtual address space (the high half of the address space), using either MmGetSystemAddressForMdlSafe or WdfRequestRetrieveOutputBuffer.

As you note, there’s a security vulnerability implicit in this: The pages are now actively available in the app and the kernel virtual address space at the same time. If the contents of the buffer aren’t “just data” but rather have some structure, the driver needs to make a copy of this data before validating and trusting it.
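A minimal sketch of the "copy first, then validate" rule, written as portable C rather than real driver code. The `Command` layout, `MAX_PAYLOAD` and `capture_and_validate` are hypothetical and only illustrate the pattern; `shared` stands for the mapped buffer that the app can still modify at any time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAYLOAD 64  /* hypothetical limit for this example */

/* Hypothetical structured request that the app places in the shared buffer. */
typedef struct {
    unsigned int length;                 /* valid bytes in payload */
    unsigned char payload[MAX_PAYLOAD];
} Command;

/* Take one snapshot of the shared buffer, then validate only the snapshot.
 * Returns 0 on success, -1 if the input is unusable. On failure the
 * contents of *captured must not be used. */
static int capture_and_validate(const void *shared, size_t shared_len,
                                Command *captured)
{
    assert(captured != NULL);
    if (shared == NULL || shared_len < sizeof(Command))
        return -1;

    /* The only read of the shared memory: everything after this line
     * works on the private copy, so later changes by the app cannot
     * invalidate the checks below. */
    memcpy(captured, shared, sizeof(Command));

    if (captured->length > MAX_PAYLOAD)
        return -1;
    return 0;
}
```

If the driver validated `length` in the shared buffer and then read it again later, the app could change the value between the check and the use. Working from the private copy closes that window.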

In the normal course of things, the driver completes the Request. Again, speaking about METHOD_DIRECT here, the pages comprising the buffer are unlocked, and (again… we’re talking overlapped I/O here) The Special Kernel-Mode APC For I/O Completion is queued. And the MDL is torn down. And… all good.

In the case of abnormal termination of the app, the process is not allowed to exit until all the IRPs that are owned by each thread in the process have been completed (or the Famous Five Minute Timer has expired, which we’ll discuss in a minute). On thread exit, the I/O Manager runs the list of IRPs owned by the thread, and calls the cancel routine registered for each one. When the I/O Manager gets to the end of the list, if there are still IRPs in progress, the I/O Manager will wait for up to five minutes for the pending IRPs to complete. Because IoCompleteRequest hasn’t been called, the pages remain locked. Because there’s still an IRP in progress, the app (Process) can’t exit. If all the outstanding IRPs get completed (in the normal way or otherwise) during the five minute interval, then the pages are unlocked (cuz I/O completion) and the app exits as normal. If one or more IRPs do NOT get completed, the pages remain pinned “forever” (and become “lost”), the File Object (which has a reference from the IRP) stays instantiated, and the process stays only partially torn down until the system is rebooted. This is called an I/O Rundown Failure.
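The thread-exit sequence above can be boiled down to a toy model, shown here as portable C. This is a deliberate simplification: `ModelIrp`, its `has_cancel` flag and `model_thread_exit` are hypothetical, the five minute wait is collapsed to nothing, and the model assumes that a request without a working cancel routine never completes on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified view of a request owned by a thread. */
typedef struct {
    bool pending;     /* not yet completed */
    bool has_cancel;  /* driver registered a cancel routine that completes it */
} ModelIrp;

typedef enum { RUNDOWN_OK, RUNDOWN_FAILURE } RundownResult;

/* Walk the thread's requests, "call" each cancel routine, then report
 * whether anything is still outstanding once the wait is over. */
static RundownResult model_thread_exit(ModelIrp *irps, size_t count)
{
    size_t still_pending = 0;
    size_t i;

    assert(irps != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (irps[i].pending && irps[i].has_cancel)
            irps[i].pending = false;   /* cancel routine completes the request */
        if (irps[i].pending)
            still_pending++;           /* pages for this request stay locked */
    }
    return still_pending == 0 ? RUNDOWN_OK : RUNDOWN_FAILURE;
}
```

In this model a single request that never completes is enough to produce `RUNDOWN_FAILURE`, which mirrors the point above: the pages stay locked for exactly as long as the request stays outstanding.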

IF I remember correctly, the app can indeed unmap the pages comprising its buffer while the Direct I/O is in progress. But the PFN reference count remains incremented (accounting for the locked pages). So, I suspect the page could be mapped again in kernel mode, but it can’t be put into the virtual address space of another user-mode process (consider what a security problem THAT would cause). MmUnlockPages handles the case of the pages not being in the app’s space anymore, but I can’t remember (if I ever knew) how it knows this has occurred (I’d guess it’s one of the counts in the PFN). Consider how very, very bad it would be if the OS did NOT handle this case! There’d be a very obvious system vulnerability that every disk I/O in the system would be subject to.
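As a way of picturing the behavior described above (and with the same "if I remember correctly" caveat), here is a small portable C model. `ModelPage` and its functions are hypothetical and say nothing about how the real PFN database is laid out; they only capture the rule that a locked page cannot be handed to anyone else, even after the app has unmapped it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one physical page backing the app's buffer. */
typedef struct {
    unsigned lock_count;  /* references taken when the page was locked */
    bool mapped_in_app;   /* still present in the app's address space */
} ModelPage;

/* Models locking the page for the duration of the I/O. */
static void model_lock(ModelPage *p)
{
    p->lock_count++;
}

/* Models unlocking at I/O completion. */
static void model_unlock(ModelPage *p)
{
    assert(p->lock_count > 0);
    p->lock_count--;
}

/* Models the app freeing or unmapping its buffer. */
static void model_app_unmap(ModelPage *p)
{
    p->mapped_in_app = false;
}

/* The page may be given to someone else only when nobody maps it
 * and nobody holds a lock reference on it. */
static bool model_page_reusable(const ModelPage *p)
{
    return !p->mapped_in_app && p->lock_count == 0;
}
```

In this model, calling `model_app_unmap` while the I/O is in flight changes what the app can see, but the page only becomes reusable after the matching `model_unlock`.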

once memory from a usermode process has been pinned and locked by the OS prior to delivery to the driver it now “belongs to the OS”

I don’t know who said that, or where they said it, but that’s kinda squirrely. Between the time the app sends the request, and the time the pages are locked, the app can free the pages. In this case, the I/O will fail. If the pages are pinned, and then the app unmaps the pages from its VAD, the pages remain pinned until they are unlocked by MmUnlockPages. I guess it’d be correct to say “they’re owned by the OS”, but… that’s not very helpful, I don’t think.

at some point once I do complete that IRP the OS is going to need to clean up that underlying buffer

Correct.

but it doesn’t know if that memory was allocated from virtual memory or some kind of process heap

It doesn’t care. If the pages aren’t mapped anywhere, it JUST cares about the pages, as described by the PFN.

In the case of the 5 minute clock when the IRP is cancelled the buffer also needs to be cleaned up (again, by whom and how do they call the right cleanup function) which means if my driver is in the middle of writing to that buffer I’ve got a problem

No. After the Famous Five Minute timer expires, if the IRP is still outstanding… the pages are LOST (until the system is rebooted). So, Windows will never let a proper write to a buffer that you’ve been handed in an IRP corrupt… anything. This is where a lot of good devs get “off track” in terms of handling cancel.

Remember: As a general rule, the OS doesn’t care about user-mode stacks, or heaps, or anything else. These are purely user-mode constructs that apply to how user-mode allocates, manages, and uses its memory. The OS cares about whether pages are mapped, whether the page is owned exclusively or shared, and whether they’re locked into physical memory (thereby preventing those pages from paging).

Peter

Got it, much appreciated! I can imagine it took quite a while to read my wall of text and to write the response. Again, much appreciated, and things are clearer for me now! :slight_smile: