[POSSIBLE DUPLICATE] User-mode Kernel-mode real-time communication

This could be a duplicate, since this question has been asked many times before, but I will share my experience and what I want to achieve. Here are the methods I tried and the results:

  1. IOCTL Polling: A dedicated thread in the user-mode application constantly sends an IOCTL with a pre-defined buffer. The driver has a queue (KeInitializeQueue) that it uses to queue the events the user-mode application should be notified of; when the IOCTL arrives, the driver fills the buffer with the array of queued events. The problems with this were: (1) the constant polling meant it wasn't real-time, and (2) the buffer might not be big enough, which meant the user-mode app had to send 2 IOCTLs (one for the buffer size, one for the actual data) to get the events (see the sketch after this list).
  2. WSK: The user-mode application and the kernel-mode driver handshake a TCP connection on a predefined localhost port. It seems great in theory, but the problems are: (1) there are no "on-connect" or "on-receive" events in WSK, so I had to use PsCreateSystemThread with a while(TRUE) loop to accept connections and handle clients, which I suspect might end in a THREAD_STUCK_IN_DEVICE_DRIVER bugcheck; (2) WSK is somewhat buggy: I had to use KsGetAddrInfo to initialize the IP and port the kernel should listen on, and most of the functions resulted in a LIST_ENTRY-corruption BSOD.
  3. Sharing Events: I read the NT Insider article about sharing events between kernel and user mode. Like the other methods I tried, it looks great on paper, but the problem is synchronization between user and kernel mode. What I tried: the user-mode application would allocate a buffer and send its address to the kernel driver to tell it where to write. Whenever something happened, the driver would write to the buffer and KeSetEvent to notify the user-mode application. The problems were: (1) it was hard to synchronize access to the buffer; another event could catch the driver's interest and it could change the buffer while the user-mode application was still working on it; and (2) again, the predefined-buffer-size issue.
  4. Shared Memory: Didn't even consider it, since it is almost the same as sharing events, only with worse notification.
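
For reference, the two-IOCTL dance from item 1 usually reduces to a dispatch helper along these lines; the DEVICE_EXTENSION type and the queue helpers (QueuedEventsByteCount, DrainEventQueue) are hypothetical, a sketch rather than working driver code:

    // Hypothetical METHOD_BUFFERED handler for the "fetch events" IOCTL.
    NTSTATUS HandleFetchEvents(DEVICE_EXTENSION *DevExt, PIRP Irp,
                               PIO_STACK_LOCATION IrpSp)
    {
        PVOID outBuf = Irp->AssociatedIrp.SystemBuffer;   // METHOD_BUFFERED
        ULONG outLen = IrpSp->Parameters.DeviceIoControl.OutputBufferLength;
        ULONG needed = QueuedEventsByteCount(DevExt);     // hypothetical helper

        if (outLen < needed) {
            // Too small: hand back just the required size so user mode can
            // retry with a large enough buffer (the "two IOCTLs" dance).
            if (outLen < sizeof(ULONG))
                return STATUS_BUFFER_TOO_SMALL;
            *(ULONG *)outBuf = needed;
            Irp->IoStatus.Information = sizeof(ULONG);
            return STATUS_BUFFER_OVERFLOW;                // warning: size only
        }

        // Big enough: copy out and dequeue as many events as fit.
        Irp->IoStatus.Information = DrainEventQueue(DevExt, outBuf, outLen);
        return STATUS_SUCCESS;
    }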

I want to achieve real-time communication between my driver and application. Implementation difficulty really doesn't matter, and a latency of 500 ms is OK, but it should be easily bi-directional. I am currently thinking about registering an interrupt handler and issuing an INT from the user-mode application, which I haven't really researched and which probably won't work; I doubt that ISRs in Windows work for software interrupts.
Another thing to note: I will have multiple "clients", each of which will receive the events it is interested in.

If a real-time design is a requirement, then you must consider other operating systems that are designed to provide real-time support, like recent Linux kernels with the PREEMPT_RT config option.

There is no magic trick that makes Windows a real time system.

Probably a shared circular buffer with shared events is the best option for minimal delays. You can implement a batch mode in which the kernel thread continues to poll the buffer for some time after fetching some entries, and only then enters a wait on the event. This improves latency by reducing calls into the scheduler when processing an irregular stream of data.
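
A minimal sketch of that batch mode, assuming a hypothetical RingTryPop() on the shared buffer and an auto-reset event the producer signals on enqueue. It is shown here as the user-mode consumer, but the same shape works in the driver with KeWaitForSingleObject:

    #include <windows.h>

    BOOL RingTryPop(void *ring, void *entry);  // hypothetical SPSC pop
    void Process(void *entry);                 // application-specific

    // Consumer loop: after finding work, keep polling for a bounded
    // budget before falling back to a blocking wait. This avoids a
    // scheduler round-trip per event when the stream is bursty.
    void ConsumerLoop(void *ring, HANDLE dataEvent)
    {
        BYTE entry[64];
        for (;;) {
            int budget = 10000;                    // tune: poll iterations
            while (budget-- > 0) {
                if (RingTryPop(ring, entry)) {
                    Process(entry);
                    budget = 10000;                // found work: reset budget
                } else {
                    YieldProcessor();              // cpu pause, stay runnable
                }
            }
            WaitForSingleObject(dataEvent, INFINITE); // sleep until signaled
        }
    }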

Look at the io_uring implementation in the Linux kernel. The io_uring developers designed it to minimise the use of kernel-provided synchronisation primitives by relying on direct user-to-kernel communication through a shared memory buffer.

The Windows kernel is not a real-time operating system kernel and cannot guarantee latency. The Windows scheduler and the kernel's synchronisation mechanisms (locks, mutexes, etc.) don't provide real-time guarantees.

That means you cannot claim or promise real-time support to customers, but you can provide a probabilistic latency estimate, e.g. processing 95% of events in less than 500 ms.

To achieve this you might want to (see the kernel-side sketch after this list):

  • Prevent critical threads from migrating between CPUs; keep a pool of dedicated CPUs that are not used for tasks unrelated to the "real-time" subsystem
  • Prefetch and lock the user-mode pages holding critical code and data, to prevent them from being paged out
  • Use only NonPaged kernel-mode allocations
  • Prevent the kernel-mode stack from being paged out while waiting by setting the wait mode to KernelMode
  • Prevent page reclaim by tuning the system to have enough free physical pages
  • Prevent non-essential interrupts from being dispatched by the APIC to the CPUs reserved for "real-time" work
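
A few of these map directly onto APIs. A minimal kernel-side sketch, with the CPU number, pool tag, and event all assumed for illustration:

    #include <ntddk.h>

    // Worker setup for a system thread dedicated to the "real-time" path.
    VOID SetupRealtimeWorker(PKEVENT WorkEvent)
    {
        // Pin this system thread to CPU 2 (assumed to be a reserved CPU).
        KeSetSystemAffinityThreadEx((KAFFINITY)1 << 2);

        // Hot-path data lives in non-paged pool so it is never paged out.
        PVOID buf = ExAllocatePool2(POOL_FLAG_NON_PAGED, 4096, 'trbf');
        UNREFERENCED_PARAMETER(buf);

        // Waiting in KernelMode keeps the kernel stack resident while blocked.
        KeWaitForSingleObject(WorkEvent, Executive, KernelMode, FALSE, NULL);
    }

The user-mode counterparts are SetThreadAffinityMask and VirtualLock (usually after growing the working set with SetProcessWorkingSetSize).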

I think I might have been a bit misleading with the "real-time" thing there. What I meant was a minimum-latency solution: a few milliseconds, up to half a second. A true real-time solution would be overkill, especially on Windows.

Polling the buffer in kernel mode seems unusual to me. How many times? How would the kernel know?

Well, I can make two events: a kernel "here you go" event, which signals the application that something happened and it should look into the buffer, and an "OK" event, which signals the kernel that the user-mode application is done with it and that its (potential) response is in the buffer.
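
For concreteness, the user-mode side of that handshake would look roughly like this, with hEvtDataReady and hEvtAck being the two shared events and sharedBuf the buffer whose address was handed to the driver:

    #include <windows.h>
    #include <string.h>

    // User-mode loop for the two-event handshake described above.
    void HandshakeLoop(HANDLE hEvtDataReady, HANDLE hEvtAck,
                       const void *sharedBuf, size_t sharedLen)
    {
        BYTE local[4096];
        for (;;) {
            WaitForSingleObject(hEvtDataReady, INFINITE); // "here you go"
            memcpy(local, sharedBuf,
                   min(sharedLen, sizeof(local)));        // snapshot buffer
            SetEvent(hEvtAck);                            // kernel may reuse it
            // ... process the local copy here ...
        }
    }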

But that introduces delays, because there may be two events that need attention at the same time, and if I were to use one event per "event type" that can happen in kernel mode, the code would become a big mess of allocations and WaitForSingleObject calls.

I suggest you look at the io_uring implementation in the Linux kernel, which supports polling in kernel mode before calling into the thread scheduler. Search for IORING_SETUP_SQPOLL and IORING_SQ_NEED_WAKEUP in the code; these flags are used for kernel-side polling.
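
To illustrate, here is that wakeup protocol reduced to its core and adapted to a Windows-style shared ring. The RING layout and flag are my assumptions, not io_uring's actual structures, and the two fragments belong to the kernel and user sides respectively:

    // Shared ring header assumed for this sketch (mapped into both sides).
    typedef struct _RING {
        volatile LONG64 Head;   // consumer cursor (user mode advances)
        volatile LONG64 Tail;   // producer cursor (kernel advances)
        volatile LONG   Flags;  // wakeup handshake bit lives here
    } RING;

    #define RING_NEED_WAKEUP 0x1

    // Kernel side: publish an entry; signal the (auto-reset) event only
    // if the consumer announced it is about to sleep. Steady-state
    // publishing is therefore signal-free.
    void RingPublish(RING *Ring, PKEVENT Wake)
    {
        InterlockedIncrement64(&Ring->Tail); // entry bytes were copied first
        if (InterlockedAnd(&Ring->Flags, ~RING_NEED_WAKEUP) & RING_NEED_WAKEUP)
            KeSetEvent(Wake, IO_NO_INCREMENT, FALSE);
    }

    // User side: announce intent to sleep, re-check the ring to close the
    // race with a concurrent publish, and only then block.
    void RingWaitForData(RING *Ring, HANDLE WakeEvent)
    {
        InterlockedOr(&Ring->Flags, RING_NEED_WAKEUP);
        if (Ring->Head == Ring->Tail)        // still nothing to read?
            WaitForSingleObject(WakeEvent, INFINITE);
        InterlockedAnd(&Ring->Flags, ~RING_NEED_WAKEUP);
    }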

You can allocate a shared ring buffer per CPU / thread / application / channel / data stream, depending on the design requirements. Just remember event ordering if that is important: if ordering does matter, there should be some way to dispatch dependent events into the same buffer to keep them ordered, or some other way to order them on the receiving side.
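
If ordering across multiple rings does matter, one common trick is stamping every entry from a single interlocked counter so the receiving side can merge by sequence number; the names here are made up:

    #include <windows.h>

    // In practice this counter would live in the shared mapping so the
    // kernel producer can run the same InterlockedIncrement64 on it.
    static volatile LONG64 g_GlobalSeq;

    typedef struct _EVENT_ENTRY {
        ULONG64 Seq;         // global order stamp
        ULONG   Type;        // application-defined event type
        UCHAR   Payload[48];
    } EVENT_ENTRY;

    // Stamp an entry just before publishing it to whichever ring it
    // belongs to; consumers merge rings by ascending Seq.
    void StampEntry(EVENT_ENTRY *e)
    {
        e->Seq = (ULONG64)InterlockedIncrement64(&g_GlobalSeq);
    }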

Allocate two buffers for the TX and RX streams, i.e. one buffer for user-to-kernel and another for kernel-to-user, or use a hybrid model, e.g. a shared buffer for kernel-to-user and IOCTLs for user-to-kernel. This simplifies the design.
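
As a sketch, the header of the shared mapping for the hybrid model might look like this (an assumption, not a standard structure):

    // Header at the start of the section mapped into both address spaces.
    // Kernel writes Tail, user writes Head; both are monotonic counters
    // reduced modulo EntryCount when indexing into the slot array.
    typedef struct _SHARED_RING_HEADER {
        volatile LONG64 Head;        // consumer (user-mode) cursor
        volatile LONG64 Tail;        // producer (kernel) cursor
        volatile LONG   Flags;       // e.g. a NEED_WAKEUP bit, as above
        ULONG           EntrySize;   // fixed-size slots keep indexing trivial
        ULONG           EntryCount;  // power of two
        // EntryCount slots of EntrySize bytes follow the header
    } SHARED_RING_HEADER;

User-to-kernel traffic then stays on plain IOCTLs, which also gives you a natural per-client handle for access checks.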

There is another example of a shared circular buffer implementation in the Linux kernel, the BPF ring buffer; it is a unidirectional kernel-to-user implementation.

Ordering doesn't matter, but immediate attention does. Processing, sending, or receiving one event shouldn't block another's (e.g. waiting for the user-mode application to be done with the buffer before signalling that the next event happened; a minimal delay like named pipes have is acceptable). A single shared buffer would destroy this, and two buffers would create a sync issue between two events. Like:

1. Kernel sends event A.
2. Kernel waits on the "out" event for event A to get signaled.
3. App acknowledges, copies the buffer, and begins processing.
4. Kernel sends event B.
5. Kernel waits on the "out" event for event B to get signaled.
6. App acknowledges, copies the buffer, and begins processing.
7. App is done with event B, writes its response to the buffer, and signals the out event.
8. Kernel begins processing event B's response.
9. App is done with event A, writes its response to the buffer, and signals the out event.
10. Now we are getting to the interesting part.

If the out and in buffers were different, this would require an allocation per event.
If the out and in buffers were the same, this would literally destroy everything.
And how would the kernel know which response belongs to which event without an "id" per event, which would require me to maintain a new list structure in kernel mode?

I have never synced events between kernel and user mode before, but given how quickly this issue pops up, I don't think trying to sync this is worth the time. One thing to note: if you suggested shared buffers because of transfer speed, speed (as long as it stays under 500 ms) is not a problem; I can tolerate latency up to that level, which the user won't notice.

Having multiple threads in while(TRUE) loops calling WaitForSingleObject doesn't seem like a great idea either. I don't know; maybe I am too much of a perfectionist about this.

Also, are you sure that I can implement that io_uring approach on Windows? Can you tell me a bit about the implementation and its difficulty?

I do not suggest implementing io_uring for Windows. I advise researching the existing implementation(s) and borrowing ideas. The problem you are trying to solve is not unique; it has been around for 60+ years.

Probably what you want to do is have your UM application open the HANDLE as OVERLAPPED and make a number of ReadFile, DeviceIoControl, etc. calls, expecting that your driver will mark them all as pending. Then, when your driver has data to communicate, it completes one of the IRPs. The application processes the result and re-issues the call. In the meantime, any number of the other pending IRPs can be completed.
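
A minimal user-mode sketch of that pattern, using an I/O completion port to collect completions; the device name and IOCTL code are placeholders, not anything your driver actually defines:

    #include <windows.h>

    // Hypothetical IOCTL; must match whatever the driver defines.
    #define IOCTL_GET_EVENT CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, \
                                     METHOD_BUFFERED, FILE_ANY_ACCESS)
    #define PENDING_CALLS 8

    int main(void)
    {
        // Device name is an assumption; FILE_FLAG_OVERLAPPED is the key part.
        HANDLE dev = CreateFileW(L"\\\\.\\MyDriver",
                                 GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                 OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 0);

        static BYTE bufs[PENDING_CALLS][4096];
        static OVERLAPPED ovs[PENDING_CALLS];

        // Prime the pump: each call returns FALSE/ERROR_IO_PENDING and
        // the driver parks the IRP until it has an event to deliver.
        for (int i = 0; i < PENDING_CALLS; i++)
            DeviceIoControl(dev, IOCTL_GET_EVENT, NULL, 0,
                            bufs[i], sizeof(bufs[i]), NULL, &ovs[i]);

        for (;;) {
            DWORD bytes; ULONG_PTR key; OVERLAPPED *ov;
            GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);
            int i = (int)(ov - ovs);      // which slot completed
            // ... process 'bytes' bytes of bufs[i] here ...
            DeviceIoControl(dev, IOCTL_GET_EVENT, NULL, 0,   // re-issue
                            bufs[i], sizeof(bufs[i]), NULL, &ovs[i]);
        }
    }

On the driver side the natural match is a manual WDF queue (or a cancel-safe IRP queue via the IoCsq routines) from which you complete one parked request per event.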

This is the standard way of doing what you call "minimum latency" IO. It would be wrong to call it real time, but a crappy laptop can process 100,000 IOPS this way with low CPU usage, so unless you really mean real time, I doubt you will have a problem.

Note that all of the other methods you propose will have much worse performance, especially sockets and shared events. Shared memory is the only option that could be better, but it has important issues that should not be ignored. Only the polling method would be comparable.

Looks like this is my best bet, even though I'm not really OK with multiple threads in while(TRUE) loops waiting on OVERLAPPED events. What you describe is the inverted call pattern, but I have also seen other members say it is only for "asking the user-mode application to do something" rather than continuously notifying the application. What about that?