Using shared memory at Dispatch-Level

How can I create shared memory that can be accessed at Dispatch-Level in my driver and accessed from user mode? I tried the following, but I am fairly sure it creates a pageable section of memory.

        sectionSize.QuadPart = sizeof(INPUT_DATA);
        RtlInitUnicodeString(&sectionUniName, L"\\BaseNamedObjects\\SharedSection");
        InitializeObjectAttributes(&objAttributes, &sectionUniName, OBJ_KERNEL_HANDLE, NULL, NULL);

        // Note: the second parameter is an ACCESS_MASK (e.g. SECTION_ALL_ACCESS), not a page protection.
        if (!NT_SUCCESS(status = ZwCreateSection(&g_hSection, SECTION_ALL_ACCESS, &objAttributes, &sectionSize, PAGE_READWRITE, SEC_COMMIT, NULL))) {
            DebugMessage("Failed mouse ZwCreateSection \n");
        }
        else if (!NT_SUCCESS(status = ZwMapViewOfSection(g_hSection, NtCurrentProcess(), &g_pSharedBuffer, 0, sizeof(INPUT_DATA), NULL, &viewSize, ViewUnmap, 0, PAGE_READWRITE))) {
            DebugMessage("Failed mouse ZwMapViewOfSection \n");
        }

I also tried to create the memory in my C# application and then send a pointer to that address, along with other info like the memory's size, but I am fairly sure this is also pageable and may not be an ideal way of doing it.

        HANDLE usermodeHandle = verifyData->SharedMemoryHandle;

        // Convert the user-mode handle to a kernel object.
        PVOID sharedSection;
        status = ObReferenceObjectByHandle(
            usermodeHandle,
            SECTION_MAP_WRITE | SECTION_MAP_READ, 
            NULL, // Use NULL for type to bypass strict type checking.
            UserMode,
            (PVOID*)&sharedSection,
            NULL
        );

        if (!NT_SUCCESS(status)) {
            // Handle error.
            DebugMessage("Failed ObReferenceObjectByHandle \n");
            break;
        }

        // Map the section into the system address space.
        SIZE_T viewSize = sizeof(INPUT_DATA);
        status = MmMapViewInSystemSpace(sharedSection, &g_SharedBuffer, &viewSize);

        if (!NT_SUCCESS(status)) {
            ObDereferenceObject(sharedSection);
            break;
        }

Anyway, I am just trying to figure out the best way to go about this, which I think might be some variation of that first solution, but I am not totally sure what the standard way of accomplishing it would be. I would rather not create a KDPC and go through all the work of dropping the IRQL, especially since I need these operations done quickly. I also plan to call RtlCopyMemory at Dispatch-Level to copy data into this shared memory, which can then hopefully be read from my client application after a KeSetEvent denotes a change in memory. My driver won't need to read anything from this shared memory; it only needs to write to it.

Mouse input is not a low latency stack. Why do you need shared memory? Instead just pend a bunch of ioctls with the driver and complete one with the data you want to share.

@Doron_Holan said:
Mouse input is not a low latency stack. Why do you need shared memory? Instead just pend a bunch of ioctls with the driver and complete one with the data you want to share.
What do you mean mouse input is not low latency? An upper mouse filter would be very low latency compared to other methods of receiving input. That said, speed is not the only factor here. I am translating mouse data into virtual plane joystick movement, and from my research, sharing this data between the kernel and user space is best done with shared memory instead of IOCTLs, because IOCTL does not seem to be made for the kernel sending a user program tons of requests per second. On the other hand, just writing the new input to shared memory and raising an event seems quick and easy, assuming I get it to work. Although it is probably more complex than that, since a buffer sent by a user-mode program uses virtual memory, which is a different address space from the kernel's. So I assume I would need a handle on that process, or something similar, in order to translate it, which might create more latency issues. Alternatively, I could use an IOCTL polling system where the user-mode process polls my driver at maybe 100 times a second. That would mean all of the work is done at Passive-Level and removes the need to translate between physical and virtual memory. What do you think is more efficient?

Mouse inputs simply don’t happen that fast. Mice are usually USB low-speed devices, so AT BEST you can get one update per millisecond, and the typical mouse driver bundles up 4 packets, so you get 4ms response. You can make hundreds of ioctl calls in 4 milliseconds.

@Tim_Roberts said:
Mouse inputs simply don’t happen that fast. Mice are usually USB low-speed devices, so AT BEST you can get one update per millisecond, and the typical mouse driver bundles up 4 packets, so you get 4ms response. You can make hundreds of ioctl calls in 4 milliseconds.

Some mice can have a polling rate of up to 8 kHz, although I have not tested the frequency at which the inputs actually come in or how many get bundled at a time. I have mostly tested in a VM, which does not bundle packets, but at some point I should test those mice and see whether the extra hertz actually improves latency or whether it just bundles more packets together. Anyway, my takeaway from your response is that I should just use the IOCTL polling technique rather than shared memory. Is that correct?

Why do you need polling? You can send a number of ioctls at once and have them all pending. When the kernel driver has something, it just completes the next ioctl in line. That combines data transfer with immediate notification.

@Tim_Roberts said:
Why do you need polling? You can send a number of ioctls at once and have them all pending. When the kernel driver has something, it just completes the next ioctl in line. That combines data transfer with immediate notification.

I assume you mean something like event handling? I have not used that before, but I will look into it. Basically, you are saying that when new data is received via the callback, I raise an event that user space is listening for, and then the user program sends an IOCTL request to the driver for that data? Is that the quickest, most efficient way you can think of to do this task?

No, not event handling. Research IOCP and overlapped IO. Using this model, you can easily achieve 100k+ IOPS. The completions are often faster than SetEvent.

The key is to make sure that you pend many OVERLAPPED requests so that your KM code always has a queue of pending IRPs. When new data arrives, it simply completes the next one. The UM code does something with the new data, and then pends the buffer again with another ReadFile/DeviceIoControl call.

If the order of data in these requests matters (i.e. you must process block 1 before block 2, etc.) then you either do single-threaded completion, or extend the OVERLAPPED structure with a sequence number.
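The sequence-number idea can be sketched in portable C. This is an illustrative reorder buffer, not Win32 code; all names are made up. Each completion carries a sequence number assigned on the producer side, and arrivals are released to the consumer strictly in order:

```c
#include <string.h>

#define WINDOW 8  /* must cover the maximum reordering distance */

typedef struct {
    unsigned seq;
    int valid;
    int payload;
} Slot;

typedef struct {
    Slot window[WINDOW];
    unsigned next_seq;      /* next sequence number to release */
} Reorderer;

static void reorder_init(Reorderer *r) {
    memset(r, 0, sizeof *r);
}

/* Accept one completion; calls sink() zero or more times, in order. */
static void reorder_accept(Reorderer *r, unsigned seq, int payload,
                           void (*sink)(unsigned seq, int payload)) {
    Slot *s = &r->window[seq % WINDOW];
    s->seq = seq;
    s->payload = payload;
    s->valid = 1;
    /* Drain everything that is now contiguous. */
    for (;;) {
        Slot *n = &r->window[r->next_seq % WINDOW];
        if (!n->valid || n->seq != r->next_seq)
            break;
        sink(n->seq, n->payload);
        n->valid = 0;
        r->next_seq++;
    }
}
```

If completion 1 arrives before completion 0, the sink sees nothing until 0 shows up, and then receives 0 and 1 back to back. The window size just has to exceed the worst-case reordering distance, which for a pended-IOCTL design is bounded by the queue depth.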

Basically, you are saying that when new data is received via the callback, I raise an event that user space is listening for, and then the user program sends an IOCTL request to the driver for that data?

No. I’m saying, when the program starts up and opens the driver handle, you immediately submit (for example) 5 ioctls. The driver queues them all, so they all remain outstanding. When the driver has something to say, it pops the next empty ioctl from its queue, fills it in, and completes it.

Meanwhile, the application has a thread blocked on WaitForMultipleObjects with all 5 handles. As soon as the driver completes an ioctl, WFMO returns, telling the app which one was completed. It handles the data, and resubmits the ioctl to be queued up again.

Technically, yes, there is an event involved; the OVERLAPPED structure includes an event and the I/O manager fires the event when the driver completes the ioctl, but it’s all built-in. When the event fires, the data is already there. This is THE high-performance way to do I/O on Windows.
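The flow described above (pend a pool of requests up front, driver completes the next one, app consumes and re-pends) can be modeled in portable C. This toy uses a mutex/condvar pair in place of the I/O manager, and every name is illustrative; the point is the ownership hand-off, where only one side ever touches a buffer at a time:

```c
#include <pthread.h>

#define POOL 5     /* requests pended up front, like the 5 ioctls */
#define EVENTS 20  /* total completions the "driver" will deliver */

typedef enum { PENDED, COMPLETED } State;

typedef struct {
    State state;
    int data;       /* payload filled in by the "driver" */
} Request;

static Request pool[POOL];  /* zero-initialized: all start PENDED */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int delivered;   /* completions produced so far */
static int consumed;    /* completions handled by the app */

/* "Driver": whenever it has data, complete the next pended request. */
static void *driver(void *arg) {
    (void)arg;
    for (int n = 0; n < EVENTS; n++) {
        pthread_mutex_lock(&lock);
        int i;
        do {  /* wait until at least one request is pended */
            for (i = 0; i < POOL && pool[i].state != PENDED; i++)
                ;
            if (i == POOL)
                pthread_cond_wait(&cv, &lock);
        } while (i == POOL);
        pool[i].data = n;          /* fill the buffer we now own */
        pool[i].state = COMPLETED; /* hand ownership back to the app */
        delivered++;
        pthread_cond_broadcast(&cv);
        pthread_mutex_unlock(&lock);
    }
    return 0;
}

/* "App": consume completed requests and immediately re-pend them. */
static int run_app(void) {
    int sum = 0;
    pthread_mutex_lock(&lock);
    while (consumed < EVENTS) {
        int i;
        for (i = 0; i < POOL && pool[i].state != COMPLETED; i++)
            ;
        if (i == POOL) {            /* nothing completed yet: block */
            pthread_cond_wait(&cv, &lock);
            continue;
        }
        sum += pool[i].data;    /* "process" the payload */
        pool[i].state = PENDED; /* resubmit: driver owns it again */
        consumed++;
        pthread_cond_broadcast(&cv);
    }
    pthread_mutex_unlock(&lock);
    return sum;
}
```

In the real Win32 version, the condvar wait is WaitForMultipleObjects on the OVERLAPPED events, and "re-pend" is the next DeviceIoControl call; the state machine is the same.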

For a single HANDLE and a queue depth of less than 64 (MAXIMUM_WAIT_OBJECTS), Tim's suggestion is identical to mine. The IOCP design scales further, but is conceptually identical.

The key idea is pre-allocating so that when the hardware is ready, it does not wait on the software

@MBond2 said:
For a single HANDLE and a queue depth of less than 64 (MAXIMUM_WAIT_OBJECTS), Tim's suggestion is identical to mine. The IOCP design scales further, but is conceptually identical.

The key idea is pre-allocating so that when the hardware is ready, it does not wait on the software

@Tim_Roberts said:

Basically, you are saying that when new data is received via the callback, I raise an event that user space is listening for, and then the user program sends an IOCTL request to the driver for that data?

No. I’m saying, when the program starts up and opens the driver handle, you immediately submit (for example) 5 ioctls. The driver queues them all, so they all remain outstanding. When the driver has something to say, it pops the next empty ioctl from its queue, fills it in, and completes it.

Meanwhile, the application has a thread blocked on WaitForMultipleObjects with all 5 handles. As soon as the driver completes an ioctl, WFMO returns, telling the app which one was completed. It handles the data, and resubmits the ioctl to be queued up again.

Technically, yes, there is an event involved; the OVERLAPPED structure includes an event and the I/O manager fires the event when the driver completes the ioctl, but it’s all built-in. When the event fires, the data is already there. This is THE high-performance way to do I/O on Windows.

Alright, thank you for the great suggestions! How many queued requests would you suggest I set as the max? I managed to figure out how to do it (I am using WdfIoQueueDispatchSequential for this) and I can now read the data from my C# application. However, I am wondering if anyone has advice for synchronizing the requests between kernel and user space. I am using spinlocks and InterlockedExchange to ensure data can't be read and written at the same time. Spinlocks ensure everything runs at Dispatch-Level so tasks won't preempt each other. I also write my keyboard data to a buffer (max of 20, just in case), then take an entry from the list of queued requests, if there are any, and fulfill requests until the buffer runs out of data. Both of these paths use spinlocks. I use InterlockedExchange for the global fields that toggle the filtering on and off, which is controlled through a separate queue.

I would, of course, also need a dedicated thread or worker thread in my C# application for each IO call, since they are blocking. I use a semaphore to ensure we don't exceed a maximum number of pending requests (set at 30 right now). After each thread gets unblocked by having its IO request answered (they idle rather than spin in a while loop), it adds the data to a BlockingCollection, which is concurrent and linked, and a single separate thread processes that data. My hypothetical concern is that, due to scheduling, Thread A could get its request completed first but not get to processing the data until after Thread B, which would scramble the order of the data. This would probably be rare, and it is a larger concern in my user application than in the kernel (I am not sure whether other issues could modify the order when completing these requests, but I assume my driver will complete them in order). Although rare, I would like to mitigate these sorts of issues, so what is your suggestion? Should I use something like packet numbers to ensure order? Any advice on a possible implementation of this, or something else?

There are a couple of things here

First, you do not synchronize access to these buffers in the traditional sense. You pass ownership of the buffer from one place to another. Only the owner should access the buffer (fill it in KM, read it in UM).

Second, you do not want to use blocking IO. OVERLAPPED IO is inherently non-blocking, and that's what you want to use. It is harder to do this in C# than in native code, but it can still be done. Because this is your custom driver, you should probably use P/Invoke to make a ReadFile call instead of relying on what the framework does in ReadAsync. It is of course possible to spin up a thread for each IOP, but the whole point of OVERLAPPED IO is to separate the thread context (stack etc.) from the IO context (data to be processed) and thereby both reduce the thread count and reduce the overhead of thread scheduling and other factors.

Re the number of requests to pend: the right choice is almost impossible to determine. There are so many factors that affect this that it is hard to even make a suggestion. One guideline is the number of CPUs in the system * 4. You can also do a calculation based on the data rate and expected UM processing time. Most people just pick a number like 100 or 1000 and don't attempt to be any smarter than that. The consequence of overestimating and pending too many IRPs is almost nothing (increased memory consumption) versus the consequence of pending too few (lost data / performance degradation), so aim high.

As an aside, someone needs to help the folks working on the .NET framework understand IO performance. I was reading the release notes for .NET 8 and there are some serious problems with the conclusions, especially for network IO performance. I think they all learned on Linux and don't know anything else; but I digress.

@MBond2 said:
There are a couple of things here

First, you do not synchronize access to these buffers in the traditional sense. You pass ownership of the buffer from one place to another. Only the owner should access the buffer (fill it in KM, read it in UM).

Second, you do not want to use blocking IO. OVERLAPPED IO is inherently non-blocking, and that's what you want to use. It is harder to do this in C# than in native code, but it can still be done. Because this is your custom driver, you should probably use P/Invoke to make a ReadFile call instead of relying on what the framework does in ReadAsync. It is of course possible to spin up a thread for each IOP, but the whole point of OVERLAPPED IO is to separate the thread context (stack etc.) from the IO context (data to be processed) and thereby both reduce the thread count and reduce the overhead of thread scheduling and other factors.

Re the number of requests to pend: the right choice is almost impossible to determine. There are so many factors that affect this that it is hard to even make a suggestion. One guideline is the number of CPUs in the system * 4. You can also do a calculation based on the data rate and expected UM processing time. Most people just pick a number like 100 or 1000 and don't attempt to be any smarter than that. The consequence of overestimating and pending too many IRPs is almost nothing (increased memory consumption) versus the consequence of pending too few (lost data / performance degradation), so aim high.

As an aside, someone needs to help the folks working on the .NET framework understand IO performance. I was reading the release notes for .NET 8 and there are some serious problems with the conclusions, especially for network IO performance. I think they all learned on Linux and don't know anything else; but I digress.

Yes, that ownership passing is what I am doing. I have a different thread for each IOCTL request, up to the max requests (which right now is 30). The semaphore ensures I don't exceed the max pending requests, and each of the threads dedicated to IOCTL requests abides by the limit. When each thread in my C# application sends out a request, it waits until it gets a response without modifying the buffer, meaning the kernel has ownership until the request is returned. I use spinlocks in my kernel driver when adding the IOCTL requests to a list and when completing a request.

Based on the rest of your response, it does not seem to matter much how many threads (or, more accurately, how many requests) I dedicate to sending IOCTL requests to the driver. For keyboards, I feel like 30 is fine, assuming these requests get re-pended quickly enough, but I do understand your point about running out of requests. I guess I would have to test whether two keyboards, each with a button held down, are enough to reduce the available requests to zero.

Right now I do use DeviceIoControl, and I am not exactly sure how to repurpose ReadFile to send requests and receive the output buffer. The idea makes sense, though: since it's non-blocking, I won't need 30 threads dedicated to sending requests, but I am not exactly sure what the implementation would look like. What seems to be the IO performance issue with C#, based on your analysis? If I were to stick with dedicating a thread to each request, would you recommend I create a new Thread or use one of the worker threads via Task.Run? Also, based on the data I provided above and my previous response, does the kernel side of things seem like it would complete requests in the correct order? That is what it looks like to me. My C# code may be another story; from testing it seems to work, but I would rather cover edge cases before they become a problem. Really, the only solution I can think of is to use some sort of packet ID or thread ID system so I can ensure data is read in order.

ReadFile, WriteFile and DeviceIoControl are fundamentally the same. Use the one that makes the most sense for your application. In every case, open your HANDLE with FILE_FLAG_OVERLAPPED and use the OVERLAPPED parameter. DeviceIoControl is expected to "fail", and the last error is expected to be ERROR_IO_PENDING (997). This indicates that there is now a pending IRP that will complete some time later. The thread that called DeviceIoControl is now free to do something else, including calling DeviceIoControl again to pend another IRP. Do it in a loop until you have enough pending IRPs.
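To make the ERROR_IO_PENDING point concrete, here is a mocked-up version of the submit loop in plain C. submit_request stands in for DeviceIoControl on an overlapped handle; nothing here is a real Win32 API, and the names are illustrative only:

```c
/* Mock of the "pend until you reach your queue depth" loop. The key
 * point: with overlapped IO, a PENDING result is the expected,
 * successful outcome, not an error to bail out on. */

#define QUEUE_DEPTH 5

typedef enum { IO_COMPLETED_SYNC, IO_PENDING, IO_FAILED } IoStatus;

static int pending;  /* requests currently outstanding in the "driver" */

/* Stand-in for DeviceIoControl with an OVERLAPPED parameter. */
static IoStatus submit_request(int slot) {
    (void)slot;
    pending++;        /* the "driver" queues the request */
    return IO_PENDING;
}

/* Pend requests in a loop until the desired queue depth is reached. */
static int pend_all(void) {
    int submitted = 0;
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        IoStatus s = submit_request(i);
        if (s == IO_PENDING || s == IO_COMPLETED_SYNC)
            submitted++;  /* PENDING counts as success here */
        else
            break;        /* a real failure ends the loop */
    }
    return submitted;
}
```

In the real version, the check is `!DeviceIoControl(...) && GetLastError() == ERROR_IO_PENDING`, and each iteration supplies a distinct OVERLAPPED structure and buffer, since the driver holds both until it completes that IRP.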

Assuming that you follow Tim's advice and allocate an event HANDLE for each pending call, your thread then calls WaitForMultipleObjects, which blocks until at least one operation has completed. If you use manual-reset events, you can rely on the return code to tell you which one, and then you can access your data, calculate a million digits of pi or whatever you need to do, and then call DeviceIoControl again to pend a new IRP. There is no material difference between using a C# Thread object and Task.Run. The only difference is in how the object will be disposed; I don't expect this to be an issue, since there should be only one thread and it should run for the lifetime of your program.

Multi-threaded completion and IOCP can dramatically improve the scalability, but the concept is the same. The biggest difference is that completion order becomes undefined because of thread pre-emption


@MBond2 said:
ReadFile, WriteFile and DeviceIoControl are fundamentally the same. Use the one that makes the most sense for your application. In every case, open your HANDLE with FILE_FLAG_OVERLAPPED and use the OVERLAPPED parameter. DeviceIoControl is expected to "fail", and the last error is expected to be ERROR_IO_PENDING (997). This indicates that there is now a pending IRP that will complete some time later. The thread that called DeviceIoControl is now free to do something else, including calling DeviceIoControl again to pend another IRP. Do it in a loop until you have enough pending IRPs.

Assuming that you follow Tim's advice and allocate an event HANDLE for each pending call, your thread then calls WaitForMultipleObjects, which blocks until at least one operation has completed. If you use manual-reset events, you can rely on the return code to tell you which one, and then you can access your data, calculate a million digits of pi or whatever you need to do, and then call DeviceIoControl again to pend a new IRP. There is no material difference between using a C# Thread object and Task.Run. The only difference is in how the object will be disposed; I don't expect this to be an issue, since there should be only one thread and it should run for the lifetime of your program.

Multi-threaded completion and IOCP can dramatically improve the scalability, but the concept is the same. The biggest difference is that completion order becomes undefined because of thread pre-emption

Thank you for this reply! I finally managed to get it all working from a single thread. Now it should be a lot easier and more efficient to increase the number of pending IOCTL requests I can keep outstanding with the driver.

I'm glad I could help.