Why doesn't Windows provide a simple way to do direct I/O for input in DeviceIoControl?

As you all know, when we create an IOCTL using METHOD_IN_DIRECT, the input buffer is still copied from user mode to kernel space, whereas in the case of read and write requests the input is transferred directly.

My question is: why? Why is the input direct for read and write requests, meaning the I/O manager doesn't copy the buffer from user to kernel space, but for custom IOCTLs created with METHOD_x_DIRECT the input buffer is still copied?

Will I face any problems if I use METHOD_NEITHER, lock the user memory into physical memory, and use MDLs to read directly from the user buffer?

I'm asking this because I am trying to read large buffers sent from user mode frequently, and I need to avoid the wasteful copies from the user buffer to kernel space done in METHOD_x_DIRECT and METHOD_BUFFERED.
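For concreteness, the kind of definition I'm experimenting with looks roughly like this (the device type and function code are just placeholders):

```c
#include <winioctl.h>

// Placeholder control code -- the transfer method is encoded in the
// bottom two bits of the code itself (METHOD_IN_DIRECT here):
#define IOCTL_MYDEV_READ_BIG_BUFFER \
    CTL_CODE(FILE_DEVICE_UNKNOWN,  /* DeviceType (custom)              */ \
             0x800,                /* Function (0x800+ = user-defined) */ \
             METHOD_IN_DIRECT,     /* TransferType                     */ \
             FILE_ANY_ACCESS)      /* RequiredAccess                   */
```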

You're missing a crucial point: METHOD_OUT_DIRECT allows you to use the Output Buffer for both input and output. This is by design.

The Input Buffer is, again by design, specifically provided for short data transfers or control structures.
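A minimal user-mode sketch, assuming a hypothetical control code and an already-open device handle:

```c
#include <windows.h>
#include <winioctl.h>

// Hypothetical control code using METHOD_OUT_DIRECT:
#define IOCTL_MYDEV_SEND_DATA \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x801, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

// The payload travels in the *output* buffer; the input buffer is left
// empty. The I/O manager builds an MDL for the output buffer -- no copy.
BOOL SendLargeBuffer(HANDLE hDevice, void *data, DWORD dataLength)
{
    DWORD bytesReturned = 0;
    return DeviceIoControl(hDevice,
                           IOCTL_MYDEV_SEND_DATA,
                           NULL, 0,            // InBuffer: unused
                           data, dataLength,   // OutBuffer: carries the data
                           &bytesReturned,
                           NULL);              // synchronous
}
```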

Do not, I repeat, do not use METHOD_NEITHER for IOCTLs. Danger…

Peter

@brad_H said:
As you all know, when we create an IOCTL using METHOD_IN_DIRECT, the input buffer is still copied from user mode to kernel space, whereas in the case of read and write requests the input is transferred directly.

My question is: why? Why is the input direct for read and write requests, meaning the I/O manager doesn't copy the buffer from user to kernel space, but for custom IOCTLs created with METHOD_x_DIRECT the input buffer is still copied?

Will I face any problems if I use METHOD_NEITHER, lock the user memory into physical memory, and use MDLs to read directly from the user buffer?

I'm asking this because I am trying to read large buffers sent from user mode frequently, and I need to avoid the wasteful copies from the user buffer to kernel space done in METHOD_x_DIRECT and METHOD_BUFFERED.

What you are talking about is the methodology used for HPC RDMA transfers; condensing it onto a napkin, the usermode app either wants to send a block of data to the InfiniBand network or get a block of data back … for Microsoft this is Network Direct [https://docs.microsoft.com/en-us/previous-versions/windows/desktop/cc904397(v=vs.85)], which is Microsoft's attempt to use this technology for Windows-based HPC clusters …

An HPC RDMA transaction essentially (again, condensing a lot) replicates memory paging for a compute cluster rather than for a single machine. In single-machine memory paging there is a very large virtual address space (gigabytes) and a very small actual physical mapping (megabytes), with a bunch of memory paged out (multiple megabytes). The application asks for X bytes at address Y, and the OS first checks whether that range is in physically mapped space; if not, it calls the paging engine to grab that specific block from the paging media, copy it into physical memory backing address Y for those X bytes, and complete the call … the application sees none of that, it's just "there". Paging 101.

For HPC compute clusters you are working with datasets in the petabyte range (oil and gas survey results, or Walmart's daily sales logs), so a single machine won't hold that much memory but several hundred put together will. RDMA, in conjunction with an InfiniBand fabric and the appropriate InfiniBand network card, duplicates this virtual/physical paging method not on a single machine but across dozens of machines. If you're curious I would google for "fabric", "network direct", "infiniband" and "hpc" …

In an HPC application the usermode app wants to read Y bytes from address X, which doesn't reside on the machine but rather on a machine somewhere on the fabric … so it allocates a block of that size (which is made up of lots of little allocations in usermode). The usermode app then passes this block, with the address X information, to the HPC library (like Network Direct), and that library passes the buffer using METHOD_NEITHER to the driver.

To "receive" that data the driver takes the supplied block, pins and locks it, builds an MDL chain, and makes a request to the fabric for Y bytes at virtual address X … some computer out there on the fabric that just happens to have that block of data residing in its physical memory will then send that block to the requesting chip … the data arrives on the chip, it's DMAed into the pinned and locked buffer, the interrupt tells the driver to unpin/unlock the buffer and complete the IRP (that's why it's called zero-copy: data goes directly from the chip into the usermode app's buffer) … the application sees none of that, it's just "there". Paging 201.
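Condensed even further, the pin-and-lock step looks roughly like this sketch (function name hypothetical, error handling abbreviated):

```c
#include <wdm.h>

// Sketch: pin and lock a raw user buffer received via METHOD_NEITHER.
// Must run at PASSIVE_LEVEL in the context of the requesting process;
// userBuffer/length come straight from the IRP, so trust nothing.
NTSTATUS LockUserBuffer(PVOID userBuffer, ULONG length, PMDL *lockedMdl)
{
    PMDL mdl = IoAllocateMdl(userBuffer, length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        // Probes the range for accessibility and pins the pages in RAM;
        // raises an exception on a bad buffer, hence the __try.
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        IoFreeMdl(mdl);
        return GetExceptionCode();
    }

    *lockedMdl = mdl;   // later: MmUnlockPages(mdl); IoFreeMdl(mdl);
    return STATUS_SUCCESS;
}
```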

So METHOD_NEITHER works, it works well, a very large number of people use it … but there are some very important considerations for this:

  • The usermode block cannot be touched by anyone (the issuing app or the OS) while it's in play, else you will get a BSOD. In an HPC setting that's not an issue; there are call gates that the usermode app uses while it's waiting for the IRP to return. The HPC application is also the only thing the machine is running, and it's in a sealed room/building, so there's no reason for the OS to terminate the process; again, no danger of the buffer being touched while it's in play
  • The block of memory needs to be self contained, everything needed is there going in and everything expected comes back out
  • There are OS limitations on how much memory you can pin and lock at one time that you need to respect. On an HPC machine the ecosystem is very tightly controlled, so there aren't very many drivers doing DMA and there's very little contention for those resources, but you still have to be careful about minding your resources. RDMA transactions ultimately depend upon the transfer size of the fabric and the speeds of the other machines in the cluster, so generally you do 64K at a time (although typically you will have as many transactions in flight as there are DMA engines on the chip; Mellanox and QLogic cards have 64 engines)

In your application you need to be mindful of all of these; crash the usermode app and you crash the machine, run out of resources and you crash the machine or choke it to uselessness … and so Peter is quite correct, METHOD_NEITHER is really not the way to proceed for what you're trying to do. The amount of time the OS will use in pinning and locking your buffer is probably close to what it would spend copying your usermode buffer into kernelmode (memcpy speeds are actually astonishingly fast these days), so using METHOD_IN or METHOD_OUT will work fine for you …

@craig_howard … I think you’re over-complicating things for the OP. If I understand correctly, he doesn’t want to do RDMA or anything remotely like it. He just wants “Direct I/O” with an IOCTL… and he needs to understand that the Input Buffer is entirely optional, and the Output Buffer can be used in both directions. This isn’t obvious to most folks.

So METHOD_NEITHER works, it works well, a very large number of people use it … but there are some very important considerations

You are conflating the use of METHOD_NEITHER with a shared memory scheme; they are not the same, and the use of one does not necessarily imply the use of the other. There are shared memory schemes that never use METHOD_NEITHER. There are uses of METHOD_NEITHER that do not involve any shared memory.

Using METHOD_NEITHER is pretty extreme for most uses, and it's particularly nasty for IOCTLs. This is because for IOCTLs Windows does not do any validation, none at all, on the values passed as the InBuffer, InLength, OutBuffer, and OutLength arguments. This has a (horrible, annoying) historical precedent, in which people abused these 4 arguments to allow them to pass arbitrary PVOIDs (unrelated to buffer arguments) into their drivers.

Most driver devs don't understand how to properly validate buffers passed by METHOD_NEITHER, and they don't understand the risks of the driver and the app having potential shared/simultaneous access … thus leading to TOCTOU vulnerabilities. Once they "capture" the buffer described by METHOD_NEITHER, they have effectively replicated Buffered I/O. So, you know, they might as well just use METHOD_BUFFERED in that case.
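For the record, the "capture" step is just this sort of thing (sketch only; the pool tag and function name are made up), which is precisely the probe-and-copy that METHOD_BUFFERED would have done for you:

```c
#include <wdm.h>

// Sketch: capture a METHOD_NEITHER input buffer into nonpaged pool so
// the app can no longer modify it underneath us (avoiding TOCTOU races).
NTSTATUS CaptureInputBuffer(PVOID userBuffer, ULONG length, PVOID *captured)
{
    // ExAllocatePool2 on newer WDKs; ExAllocatePoolWithTag on older ones.
    PVOID copy = ExAllocatePool2(POOL_FLAG_NON_PAGED, length, 'xmpl');
    if (copy == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    __try {
        ProbeForRead(userBuffer, length, sizeof(UCHAR)); // validate the VA range
        RtlCopyMemory(copy, userBuffer, length);         // snapshot the contents
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        ExFreePoolWithTag(copy, 'xmpl');
        return GetExceptionCode();
    }

    *captured = copy;   // validate the *copy* from here on, never the original
    return STATUS_SUCCESS;
}
```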

So, yeah… on it goes.

But, I think the OP just wanted to know how to use Direct I/O on the OutBuffer to provide for both Input and Output. If I've misunderstood, and he's looking for a much more complex answer such as the one you provided, I apologize.

Peter

ETA: I also wanted to add that your list of problems with the use of shared memory approaches is a good one… but doesn’t include things like dealing with the issue of “what happens when somebody duplicates a handle to a device when a shared memory buffer is in use” and what happens when that handle gets closed. Things get… interesting.

@“Peter_Viscarola_(OSR)” said:
You're missing a crucial point: METHOD_OUT_DIRECT allows you to use the Output Buffer for both input and output. This is by design.

The Input Buffer is, again by design, specifically provided for short data transfers or control structures.

Do not, I repeat, do not use METHOD_NEITHER for IOCTLs. Danger…

Peter

So what you're saying is this:

If a user mode service wants to send a lot of data to the driver very frequently, it can use METHOD_OUT_DIRECT, provide NULL for the input, and just give the actual buffer/input as the output pointer argument to DeviceIoControl, right?

Because what got me confused was that I read in a book that even in the case of METHOD_OUT_DIRECT, the I/O manager still does a copy of the user input buffer to system space (AssociatedIrp.SystemBuffer), which is very redundant in my case; I need to avoid any copy, I just need to access the user buffer and read it in a safe way.

And thank you, Craig, for the detailed response. Now that I have rephrased my problem above: should I just go for the METHOD_OUT_DIRECT approach and provide the user input buffer as the output buffer argument, and therefore avoid any copy? Because the user needs to send these buffers to the kernel a large number of times per second.

@brad … METHOD_OUT_DIRECT should work for you … and in fact on this very list (the 'search' function of which should be any poster's first, second and third stop) there was a discussion of this very topic [https://community.osr.com/discussion/113720/is-it-efficient] which is chock full of good information about IOCTL methods …

But as @Peter has pointed out, and which is something you really can't gloss over, with any of the "direct" methods there is no memory checking of any kind, and the consequences of that are severe … if you pass in a bad buffer, you'll BSOD. If something changes the underlying buffer (like the OS tearing down your usermode process because something happened) you'll BSOD. In some cases if the memory alignment of the buffer is off, you will get a BSOD. There are a lot of corner cases that can goof you up; it would be well worth browsing this list for 'IOCTL' and seeing what knots people get themselves tied into before you go that route …

If this is a graduate project, where it just has to run on the instructor's machine once, then DIRECT will work. If this is an "I'm bored" project, then once you get it running you can declare victory, and again DIRECT will work. If it's production code that will be installed on a customer's machine, however, then you really, really want to avoid BSODs … so you need a really, really good reason (as in: write a sample driver that uses METHOD_BUFFERED, another one that uses METHOD_OUT_DIRECT, run some timings on both of them with the data sizes you want at the frequency you want, and compare the two numbers) to justify it …
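A throwaway harness along these lines (error handling elided; you supply the device handle and the two control codes) is enough to get those numbers:

```c
#include <windows.h>
#include <stdio.h>

// Time 'iters' round trips of the same payload through one IOCTL; call it
// once with a METHOD_BUFFERED code and once with a METHOD_OUT_DIRECT code.
static double TimeIoctl(HANDLE hDev, DWORD code, void *buf, DWORD len, int iters)
{
    LARGE_INTEGER freq, start, stop;
    DWORD bytesReturned;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (int i = 0; i < iters; i++) {
        DeviceIoControl(hDev, code, NULL, 0, buf, len, &bytesReturned, NULL);
    }
    QueryPerformanceCounter(&stop);

    // Average seconds per round trip
    return (double)(stop.QuadPart - start.QuadPart) / freq.QuadPart / iters;
}
```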

@craig_howard said:
@brad … METHOD_OUT_DIRECT should work for you … and in fact on this very list (the 'search' function of which should be any poster's first, second and third stop) there was a discussion of this very topic [https://community.osr.com/discussion/113720/is-it-efficient] which is chock full of good information about IOCTL methods …

But as @Peter has pointed out, and which is something you really can't gloss over, with any of the "direct" methods there is no memory checking of any kind, and the consequences of that are severe … if you pass in a bad buffer, you'll BSOD. If something changes the underlying buffer (like the OS tearing down your usermode process because something happened) you'll BSOD. In some cases if the memory alignment of the buffer is off, you will get a BSOD. There are a lot of corner cases that can goof you up; it would be well worth browsing this list for 'IOCTL' and seeing what knots people get themselves tied into before you go that route …

If this is a graduate project, where it just has to run on the instructor's machine once, then DIRECT will work. If this is an "I'm bored" project, then once you get it running you can declare victory, and again DIRECT will work. If it's production code that will be installed on a customer's machine, however, then you really, really want to avoid BSODs … so you need a really, really good reason (as in: write a sample driver that uses METHOD_BUFFERED, another one that uses METHOD_OUT_DIRECT, run some timings on both of them with the data sizes you want at the frequency you want, and compare the two numbers) to justify it …

Why would I have a risk of BSOD when using METHOD_OUT_DIRECT?!
If I recall correctly, the I/O manager locks the pages into physical memory so there is no risk of a paged-out user buffer, and then maps it to kernel space (instead of copying it). So … why would this have a risk of BSOD?!

@brad_H said:
… snip …
Why would I have a risk of BSOD when using METHOD_OUT_DIRECT?!
If I recall correctly, the I/O manager locks the pages into physical memory so there is no risk of a paged-out user buffer, and then maps it to kernel space (instead of copying it). So … why would this have a risk of BSOD?!

The I/O manager will give you a pinned and locked MDL of the buffer AT THAT INSTANT, but a femtosecond later something could change the underlying buffer the MDL points to and invalidate it (such as the process being torn down because the OS or the user terminated it), and if you access it after that: BSOD. There are also some memory operations (like the MDLs, I'm sure) that are sensitive to alignment; the MDL the OS gives you is most certainly not aligned the way you might want, and when you use it, that's a BSOD that's going to be tough to track down as it will happen seemingly at random …

Any time you're working with memory that a usermode program allocated, the access needs to be done inside a __try/__except block, and you need business logic for how to handle an exception … by using a system-allocated buffer, such as with METHOD_BUFFERED, you don't have any of those worries (including alignment) and can focus on the actual business logic of your driver. I remember well a Peter adage [paraphrased] that "unwritten code will have no bugs", and every line of exception handling or buffer alignment is another line that needs to be tested and debugged, and that's a lot of __try/__excepts you will be writing …
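For comparison, a METHOD_BUFFERED dispatch path needs none of that; a bare WDM sketch (names hypothetical):

```c
#include <wdm.h>

// Sketch: with METHOD_BUFFERED the I/O manager has already copied the
// user data into a system-space buffer, so no __try/__except is needed.
NTSTATUS HandleBufferedIoctl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);
    ULONG inLength = stack->Parameters.DeviceIoControl.InputBufferLength;
    PVOID systemBuffer = Irp->AssociatedIrp.SystemBuffer; // kernel copy

    UNREFERENCED_PARAMETER(DeviceObject);

    // ... validate inLength, then consume systemBuffer directly ...

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}
```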

Again, do some timings with METHOD_BUFFERED and METHOD_x_DIRECT and see how much of a performance hit there is … I think you will find it's much less than you might think, and the increased safety and decreased complexity are well worth it …

Where did you get the idea that the MDL or its memory can go away in the way you are describing? The MDL and its memory will be intact until the request is completed; if it were otherwise, many drivers would break.

Also, on your earlier comments on HPC RDMA: I can state that several of the open source demo drivers that use METHOD_NEITHER had serious bugs in them. I consulted for a couple of firms trying to do work in that area, and spent most of my time fixing problems in "working drivers"!

If a user mode service wants to send a lot of data to the driver very frequently, it can use METHOD_OUT_DIRECT, provide NULL for the input, and just give the actual buffer/input as the output pointer argument to DeviceIoControl, right?

Yes. Correct. You pass data TO the driver using the OutBuffer in the IOCTL. You can then optionally even return output data in this same buffer. This is the way it's typically done in Windows. It's very standard.

I read in a book that even in the case of METHOD_OUT_DIRECT, the I/O manager still does a copy of the user input buffer to system space

When you use METHOD_OUT_DIRECT, if you specify an InBuffer then the contents of that buffer are copied to nonpaged pool. Just leave the InBuffer unused (input pointer nullptr, length = 0).

should I just go for the METHOD_OUT_DIRECT approach and provide the user input buffer as the output buffer argument, and therefore avoid any copy

Yes, yes… a thousand times yes. This is how Windows IOCTLs were designed to be used. See the description of METHOD_IN_DIRECT (which means the OutBuffer is used to send data TO the driver). METHOD_OUT_DIRECT means that you're sending data FROM the driver, and optionally also sending data TO the driver.

with any of the “direct” methods there is no memory checking of any kind

@craig_howard … This is not correct. The whole “no parameter checking” thing applies ONLY TO METHOD_NEITHER IOCTLs.

If I recall correctly, the I/O manager locks the pages into physical memory so there is no risk of a paged-out user buffer

You are correct. And in so doing, the I/O Manager calls MmProbeAndLockPages (with the AccessMode argument set to UserMode), which ensures the pages are in fact accessible to the caller.
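For completeness, on the driver side those locked pages arrive as Irp->MdlAddress; a minimal sketch (helper name hypothetical) of getting a usable system-space address:

```c
#include <wdm.h>

// Sketch: obtain a system VA for a METHOD_OUT_DIRECT buffer. The I/O
// manager has already called MmProbeAndLockPages on our behalf.
PVOID GetDirectBuffer(PIRP Irp, ULONG *length)
{
    PIO_STACK_LOCATION stack = IoGetCurrentIrpStackLocation(Irp);

    *length = stack->Parameters.DeviceIoControl.OutputBufferLength;
    if (Irp->MdlAddress == NULL) {
        return NULL;
    }

    // Returns NULL on failure instead of bugchecking -- always check.
    // (MdlMappingNoExecute requires Win8+.)
    return MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                        NormalPagePriority | MdlMappingNoExecute);
}
```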

Gosh… it would just be so much easier if MSFT decided to make a few selected modules from the I/O Manager available for reference.

Peter


I have long thought those buffers should be called “first buffer” and “second buffer”, instead of “input” and “output”. That would eliminate the connotations that cause so much confusion.

I have long thought those buffers should be called

Exactly.

Standard line from my teaching our WDF seminar: “Ignore the names of these buffers. They should have been called BufferA and BufferB… but they didn’t ask me.”

What’s worse is that this hideous terminology forced the adoption of the obscure naming convention into WDF (to avoid terminal confusion). So we have WdfRequestRetrieveInputBuffer (for the buffer accompanying a write) and WdfRequestRetrieveOutputBuffer (for a buffer accompanying a read). You haven’t lived until you’ve seen the faces on experienced devs when they first hear this.
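To make the inversion concrete, here's a rough EvtIoDeviceControl sketch for a METHOD_OUT_DIRECT IOCTL; note which retrieval function fetches the data the app sent TO the driver:

```c
#include <wdf.h>

// Sketch: for METHOD_OUT_DIRECT, the payload the app sent us lives in the
// "output" buffer -- hence WdfRequestRetrieveOutputBuffer, counter-intuitively.
VOID MyEvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                          size_t OutputBufferLength, size_t InputBufferLength,
                          ULONG IoControlCode)
{
    PVOID    buffer;
    size_t   length;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(Queue);
    UNREFERENCED_PARAMETER(OutputBufferLength);
    UNREFERENCED_PARAMETER(InputBufferLength);
    UNREFERENCED_PARAMETER(IoControlCode);

    // "Output" buffer == the pages the app passed as DeviceIoControl's
    // OutBuffer argument, already probed, locked, and mapped by WDF.
    status = WdfRequestRetrieveOutputBuffer(Request, 1, &buffer, &length);
    if (NT_SUCCESS(status)) {
        // ... read the app's payload from 'buffer' ('length' bytes) ...
    }

    WdfRequestComplete(Request, status);
}
```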

The level of ridiculousness is rivaled only by (a) the fact that METHOD_BUFFERED uses the same buffer for both buffers, and (b) the fact that the InBuffer pointer is in Irp->AssociatedIrp.SystemBuffer… but this has nothing to do with Associated IRPs. Nothing at all. But, you know, "it had to go somewhere" – which is what the guy who decided where to put it told me when I asked him why in the world he chose that location.

Sorry… tangent.

Peter

The I/O manager will give you a pinned and locked MDL of the buffer AT THAT INSTANT, but a femtosecond later something could change the underlying buffer the MDL points to and invalidate it (such as the process being torn down because the OS or the user terminated it), and if you access it after that: BSOD.

That, sir, is also grossly incorrect.

Consider, if you will, that even if the process gets torn down, it can't exit cleanly until the IRP has completed, and the pages, once locked, will have their reference counts incremented in the PFN database. In short, once locked, those pages aren't going anywhere. Direct I/O… is entirely reliable. As it had better be, considering it is the method used throughout the storage stack.

Mr. @craig_howard …. While I applaud your desire to help your colleagues here, you might consider taking a step back to reinforce your Windows architectural knowledge before offering too much more assistance on topics where you don’t have definitive knowledge and personal experience.

Peter

So we have WdfRequestRetrieveInputBuffer (for the buffer accompanying a write) and WdfRequestRetrieveOutputBuffer (for a buffer accompanying a read).

Well, for me it makes sense. It is a matter of perspective. From your point of view, the data you're going to write for the caller is in your input buffer, and the data you read for the caller you're going to put into your output buffer. Am I missing something?

@“Peter_Viscarola_(OSR)” said:

The I/O manager will give you a pinned and locked MDL of the buffer AT THAT INSTANT, but a femtosecond later something could change the underlying buffer the MDL points to and invalidate it (such as the process being torn down because the OS or the user terminated it), and if you access it after that: BSOD.

That, sir, is also grossly incorrect.

Consider, if you will, that even if the process gets torn down, it can't exit cleanly until the IRP has completed, and the pages, once locked, will have their reference counts incremented in the PFN database. In short, once locked, those pages aren't going anywhere. Direct I/O… is entirely reliable. As it had better be, considering it is the method used throughout the storage stack.

Umm … not quite …

To put this into context: I was explaining to the OP why there was a danger of a BSOD occurring with a DIRECT_OUT call pinning and locking the usermode memory, as if that would somehow freeze the memory in place, nice and safe and secure. I was giving examples of several instances where that assumption would be incorrect (although they happen rarely), to hopefully encourage him to choose a safer way to move data from usermode to kernelmode.

You are correct, once those pages are locked they aren't going anywhere, but note the loophole here: until the IRP has completed. If that IRP gets marked as completed by something other than the driver that received the buffer (a scenario which indeed you yourself wrote about some time ago [https://www.osronline.com/article.cfm?article=72.htm]), or if the calling thread is unloaded by the OS as part of a teardown, or if it is just taking too darn long to complete [https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/canceling-irps], then a DIRECT_X buffer is left flying in the wind. Many things can complete an IRP: a filter driver in the stack, an exception handler in a library someplace, the OS tearing down a thread/process, an IRP timeout. You just don't know if that IRP is going to get completed behind the driver's back, and therein lies the danger I was attempting to communicate to the OP.

The OS itself will attempt to cancel outstanding IRPs in a teardown or a timeout (hence the whole idea behind a cancel callback) to prevent a denial of service of the OS … consider if I had a service that did three things: called into a driver with a METHOD_x_DIRECT IOCTL using overlapped I/O, had another thread that dereferenced a null function pointer, and was set up to auto-restart. That's all the service does all day … call the driver, pin and lock a buffer, crash, restart. If the OS did not have a way to cleanse the outstanding IRPs it would eventually exhaust the PTE table with all of those pinned and locked buffers, create all kinds of memory sandbars with what remained, and essentially become unusable over a period of time … which isn't quite the user experience that MS wants, so they will cancel outstanding IRPs when possible in that situation. The called driver, of course, knows nothing of this drama; all it knows is that it's working with a block of pinned and locked memory that it would like to write into, and that it has an IRP that needs to be completed. Hilarity will ensue when the driver attempts to access that memory, of course, as it will when it attempts to complete the IRP …

That was the point: memory from userland is always dangerous to use, no matter how many locks and pins and such you have on it. @Burns mentioned that it's very hard to get RDMA right because of that, and he's quite correct: only QLogic and Mellanox were able to pull that trick off, and if you go back a few decades (oy!) to messages I posted here you know exactly why the QLogic driver was one of the few to get it right …

Mr. @craig_howard …. While I applaud your desire to help your colleagues here, you might consider taking a step back to reinforce your Windows architectural knowledge before offering too much more assistance on topics where you don’t have definitive knowledge and personal experience.

People generally respond to questions on an online forum for one of three reasons: a) they have some personal experience with a topic that they would like to share, b) they like to make themselves heard as the all-knowing, omniscient expert looking down from on high, or c) they are looking for that answer themselves. It's generally really easy to tell which is which from the signal-to-noise ratio: folks who reply back with links or examples are in camp a), folks who reply back with statements like "you insufferable idiot, you have no idea what you're talking about, go crawl back to where you came from and bother us no more with this twaddle!" fall into camp b), and you will sometimes have folks reply back with c).

I am by no means an omniscient expert who goes golfing with God on Sundays, but I have had a fair amount of time in the saddle over my 40+ years of working with OS internals, 35+ of them with Windows (all the way back to Windows 1.0), so I do have some general knowledge and personal experience in some of the topics mentioned here … that said, I definitely fall into camp a) on topics I've run into, and usually camp c) otherwise, as I am always looking to learn a bit more, even if I have to wade through all of the camp b) replies to find those camp a) nuggets (although this list is much nicer now than it was in the mid 2000s, when the signal-to-noise ratio was nearly zero) …

When I post something incorrect or misleading, by all means I would like that information corrected (but with links and code samples, not invectives and diatribe), so that two people can learn: the OP and me …

Peter


It is a matter of perspective.

True. In fact, everything is, Mr. @Michal_Vodicka … everything is a matter of perspective.

:wink:

Peter

once those pages are locked they aren't going anywhere, but note the loophole here: until the IRP has completed. (emphasis mine)

Hmmmm… Well, (a) you left that important detail (bold above) out of your explanation to the OP, and (b) this applies to ANY method… including METHOD_BUFFERED, right? But you said

by using a system-allocated buffer, such as with METHOD_BUFFERED, you don't have any of those worries

But, according to you, you DO! There’s the constant danger of IRP cancellation.

But, really there isn’t. Not in Direct I/O NOR in Buffered I/O. It is a basic tenet of Windows I/O Subsystem architecture that the buffer associated with an I/O Request is only valid while you own the I/O Request. There’s not much to say about that.

The OS itself will attempt to cancel outstanding IRPs in a teardown or a timeout

or if it is just taking too darn long to complete

Well… there’s something called I/O Rundown, yes. But there is no I/O Timeout in Windows. There is no such thing as an IRP taking “too darn long to complete.”

The only "timeout" involved is the one that happens during I/O Rundown… when a thread is exiting… AFTER an IRP has been canceled. That's the infamous "five minute timer" that starts after the I/O Manager runs the list of IRPs outstanding on a given thread, calling the cancel routine for each. If, after canceling the IRPs and waiting for 5 minutes, there remain any IRPs which have not been completed by their owning drivers, the buffers (and File Object) involved STILL remain valid. So the driver won't "do the wrong thing" even in that case, and is in no danger of crashing the system.

you just don’t know if that IRP is going to get completed behind the driver’s back

Well… yes and no. Again, you cannot – absolutely cannot – play with buffers or IRPs that you do not "own." That includes Direct I/O and Buffered I/O buffers that you receive. Once you send the IRP to another driver, the IRP is gone – you don't "own" it, and you can't access the buffers associated with it, for the reasons you mention.

This is why you can’t have a cancel routine for an IRP that you don’t own.

While you own the IRP, it will not get completed behind your back. Ever.

What you’re saying does apply – sort of – to the way folks often use Neither I/O: They get a buffer from an app, they map that buffer into the high half of kernel virtual address space, and then complete the Request that provided the buffer. Now they have a buffer that has no IRP associated with it. This is fine… the driver allocated the memory and created the mapping… but the driver needs to remember to unmap and deallocate the pages. This is why you need to take a reference on the Process, and ensure that you unmap (and free) the buffer when the process that caused you to create that mapping closes his handle to your device (or, exits… same thing) as part of Cleanup processing (and deref the Process). You can’t just create a buffer, map it to the process, and then allow the process to close its handle… while you leave the buffer extant. But… even in this scheme… you are not in danger of having the buffer go away “behind your back.” In this case, it’s your buffer.
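A skeletal version of that bookkeeping might look like the following (all names hypothetical); the essential points are that the MDL outlives the IRP that delivered it, and that Cleanup undoes everything:

```c
#include <wdm.h>

// Hypothetical per-handle context tracking a long-lived user-buffer mapping.
typedef struct _SHARED_MAPPING {
    PMDL      Mdl;        // pins the user pages
    PVOID     KernelVa;   // system-space mapping of those pages
    PEPROCESS Process;    // referenced so it can't vanish underneath us
} SHARED_MAPPING, *PSHARED_MAPPING;

// Sketch: tear the mapping down during IRP_MJ_CLEANUP (handle close /
// process exit) -- the mirror image of the map-at-IOCTL-time setup.
VOID ReleaseSharedMapping(PSHARED_MAPPING Mapping)
{
    if (Mapping->Mdl != NULL) {
        // The system VA obtained via MmGetSystemAddressForMdlSafe is
        // released by MmUnlockPages/IoFreeMdl; no separate unmap call.
        MmUnlockPages(Mapping->Mdl);
        IoFreeMdl(Mapping->Mdl);
        Mapping->Mdl = NULL;
        Mapping->KernelVa = NULL;
    }
    if (Mapping->Process != NULL) {
        ObDereferenceObject(Mapping->Process);
        Mapping->Process = NULL;
    }
}
```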

to hopefully encourage him to choose a safer way to move data from usermode to kernelmode.

The rules for using Direct I/O and Buffered I/O are identical in terms of ownership of buffers. In this respect, Buffered I/O and Direct I/O are equally “safe”. Yes, you have to write your driver according to the I/O Subsystem architecture, but that’s the same as for Buffered I/O or for anything, right?

Peter

Good info, thanks! There were some assumptions about IRP cancellation and the ownership of associated buffers/MDLs on my part that I'm glad to have had corrected … thanks for taking the time to fix those up for me!

Thank you, Mr @craig_howard, for being a good sport.

And I apologize if I offended you earlier with my comment. That was absolutely not my intention.

Peter


No offense intended, no offense taken, no offense implied … it’s all good! :slight_smile:
