Windows System Software -- Consulting, Training, Development -- Unique Expertise, Guaranteed Results
The free OSR Learning Library has more than 50 articles on a wide variety of topics about writing and debugging device drivers and Minifilters. From introductory level to advanced. All the articles have been recently reviewed and updated, and are written using the clear and definitive style you've come to expect from OSR over the years.
Check out The OSR Learning Library at: https://www.osr.com/osr-learning-library/
My driver needs to communicate with an app very frequently (it sends a notification to the app every time a keypress happens). I have tried pending IOCTLs, using a dedicated thread that submits a new IOCTL every time it gets a reply from the driver. However, when this happens too fast, the driver no longer has a pending request on which it can send a reply. Currently, I'm using multiple threads inside the app, each of which creates its own pending request, so the driver should always have one present, even while the app is processing the response from a previous one. I'm planning to use a thread-safe queue inside the app (something like https://github.com/cameron314/concurrentqueue ) to send the responses from the worker threads that submitted the requests to the main thread of the app. Would an event created with CreateEvent (whose handle I pass to the driver so the driver can signal it) be queued if an event is already being processed by the app? Say the app got a response from the event and, while it's processing that response and doing something based on it, the driver signals the event again: would the app process that signal after it finishes the current work, or would it just be lost, like it happens when there is no pending request available with inverted IOCTL calls? Can a driver send data to the app using an event, like it can using the OutputBuffer of a pending IOCTL? I would rather recode the inverted-call part to use events than have to deal with thread safety inside the app.
Upcoming OSR Seminars

OSR has suspended in-person seminars due to the Covid-19 outbreak. But don't miss your training! Attend via the internet instead!

| Seminar | Dates | Format |
|---|---|---|
| Kernel Debugging | 13-17 May 2024 | Live, Online |
| Developing Minifilters | 1-5 Apr 2024 | Live, Online |
| Internals & Software Drivers | 11-15 Mar 2024 | Live, Online |
| Writing WDF Drivers | 26 Feb - 1 Mar 2024 | Live, Online |
Comments
This has to be your app taking too long. I've developed KMDF drivers that were able to process 100,000 IOCTLs per second without any special processing, and as many as 600,000 with some special techniques, on a dual-processor system.
Events are not a great idea here. Remember that they have only two states: set or not set. If you signal from the driver that there is a keystroke, how do you handle the case where the app hasn't processed it yet and N more keystrokes come in? All the driver can do is set or clear the event; there is no data with it.
You need to change your app to either process data faster, or make the IOCTL able to report N events at once.
I know it's not the driver that can't process the IOCTLs; it's the app that can't process that many IOCTLs in a single thread. I can't make the IOCTL report N events instead of one, as the app needs to get each event and process it as fast as possible (latency is an issue here). As I said, by creating more threads in the app (even 2 threads instead of one fixes it most of the time), each of which submits IOCTL requests that the driver uses for notification, it works just fine; but now I have to implement a way to send the data from all the threads that got a response from the driver to the main thread, while still keeping latency as low as possible. My driver and app would never need to process 100k IOCTLs per second; the load will probably never go over 1k IOCTLs per second. Since you say an event can't carry data, I have no option other than the method I'm currently using, plus a semaphore/concurrent-queue system to pass the data to the main thread.
This sentence makes me think you're not using overlapped I/O, so you're sending one request at a time per file handle. Is that possibly correct? Because there's no way to implement what you're talking about without using overlapped I/O. Your app can submit 15 IOCTLs, if it wants, thereby ensuring the driver's queue never runs dry.
Also remember that completing an IOCTL does set an event. The event communication is already done for you, PLUS you get to return data.
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
You are correct, I'm opening a file handle per thread without using overlapped I/O. Wouldn't overlapped I/O behave similarly, though, except that it would use just one file handle? If there are too many calls to be processed in one thread, would the system create more threads for the app so it can process all the data? If so, I would still need a semaphore/concurrent queue/message loop to send the data to the main thread when that happens; or maybe I could rethink the app a little so there is no need to process the data in the main thread, and each thread can process its data without handing it off.
Edit: By main thread, I do not mean the main thread of the app, but ONE thread in which the data is going to be processed (it doesn't need to be the app's main thread, but the data shouldn't be processed in many threads).
Sigh. We’ve seen several posts like this over the past two weeks. Odd.
Have the app open one handle for overlapped I/O. Have the app queue many (10, 20, 175) IOCTLs in advance. They will sit on the driver's queue... and there will always be plenty of them. Have the app handle the completions via a completion port. As soon as the app gets one completion, have it send another IOCTL to the driver to replace the one that was just completed.
Problem solved.
“One IOCTL per keypress” isn’t fast, or even a lot of I/O. Do not use an event for this. Use an IOCTL like you’re doing. Your problem seems to ME to be entirely with your user-mode handling of the operations.
Peter
Peter Viscarola
OSR
@OSRDrivers
Switched to overlapped I/O, currently keeping 15 IOCTLs in the queue (using a single file handle), and it works fine now. The speed is great and it doesn't seem to miss any "event"; the driver always has a pending IOCTL on which it can reply. What would be the safest/recommended way to submit another IOCTL after receiving a response from a previous one? The completion-port callback runs on another thread (which makes sense; otherwise it would block the main thread), but the file handle was created on the main thread. Isn't a handle thread-specific and unsafe to use from another thread? (If that's not the case, I can call DeviceIoControl directly inside the completion-port callback and I don't need any other solution or a mutex.) Otherwise, my idea is to keep a global integer with the number of currently pending IOCTLs: whenever an IOCTL completes, decrease the number; whenever it drops below a threshold (checked, say, by a loop in the thread that opened the file handle), submit another IOCTL, using a mutex to safely access and edit that count. Would that be a good implementation?
There are good books on this, such as 'Windows System Programming' by Johnson Hart and 'Windows Internals' by Mark Russinovich.
No. Handles are PROCESS specific (they live in the process handle table) not thread specific.
That's exactly what most people do.
Peter
Peter Viscarola
OSR
@OSRDrivers
ROFLMAO.....
Just to give you an idea: it takes you approximately 150 ms to blink an eye. Although it may seem instantaneous to you, this time is, in actuality, sufficient for 10 threads to use up their quantum. I hope this example is sufficient to convince you that, if you look at the whole thing from the OS's perspective, any human-generated events happen to be, in actuality, EXTREMELY infrequent ones.
Just look at the numbers that Don provided, and recall that a typical human would hardly generate even a dozen keystrokes per second. Therefore, you most certainly don't need to use events here, let alone share a buffer between an app and a driver.
Assuming that we are speaking about a properly written app, this scenario may occur only if your target app is starved of CPU time. Although it may happen under some special circumstances (for example, when network traffic is really high, or if some high-priority threads take up all available CPU time), this is very unlikely to occur on a more or less regular basis. Therefore, the only conclusion here is that your app is just poorly designed. Fix this part, and I can assure you that the whole problem will be gone.....
Anton Bassov
Anton beat me to it: keyboard input might have been high speed in 1980, but it is certainly not in 2020. But it is also clear that the threading model needs work too. With or without IOCP, the stock thread pool, or anything custom, you should be able to handle this level of load with a single thread and less than 1% CPU usage. You should also re-evaluate your expectations if you think that 15 pending IRPs is a lot. Think 1,500 before you start to worry on commodity hardware, and many thousands on specialized systems.
Now, before the OP runs away, understand that we contribute here with the hope of helping, not hindering, your progress. The big question of how this should work was posed around 1990 and was solved in favor of this pattern. Although many think that the solutions other systems have might be better, nothing has been proven or demonstrated as better. Possibly pmem will prove to be, and NUMA remains a challenge to support correctly; but, from a certain point of view, that's like saying it is better to teach that 2 + 2 makes 5 because it will be harder to explain later the exceptional cases where that might result versus the normal ones where it can't.
is less frequent to more frequent.
So Anton has a point.
Pro
On what operating system?
Not on Windows, certainly.
Peter
Peter Viscarola
OSR
@OSRDrivers
the presence of high-frequency interrupts?
Pro
Seriously?
On modern processors, and with ISRs of reasonable length, everything gets serviced just fine.
The whole idea of “make the most urgent device the most important IRQ” ceased being important around the time that we stopped using wire cutters to set the IRQ of plug-in boards.
Windows establishes IRQL by round robin assignment.
P
Peter Viscarola
OSR
@OSRDrivers
You mean 40+-year-old UNIX versions written for the PDP-11, right? These systems, indeed, made heavy use of spl(), because hardware interrupts had to be prioritized relative to one another on the PDP-11. In fact, the very concept of spl() (as well as its IRQL cousin) is based solely upon PDP-11 specifics. Therefore, it lost any practical meaning when UNIX got ported to other architectures. In practical terms, it hung on for quite a while because, as long as we are speaking about UP systems, it works as a synchronization method just fine, so no one really bothered to clean it up. However, it does not work this way on MP systems, so it had to be replaced with mutexes when MP systems became common. Therefore, no major modern UNIX derivative (at least no open-source one) really uses it any more. For example, FreeBSD abandoned it and replaced it with mutexes ages ago, and Solaris/Illumos uses it only for disabling interrupts. It is still used by both NetBSD and OpenBSD, but those UNIX derivatives would hardly qualify as major ones, right?
Anton Bassov
pro
Right after my assertion, I realised I was talking about the old approach. Not having been in the Windows kernel for two years or more, what came to mind was IRQL and round-robin assignment, along with coalescing...
Pro
Have you so quickly forgotten the lessons I taught you?
You continue to misunderstand and repeat incorrect info about the PDP-11 and SPL. Do not talk about things you know nothing about, at least when there are folks who know better within earshot.
Peter
Peter Viscarola
OSR
@OSRDrivers
Oh, come on.....
I really hope that you are not going to argue against my assertion that the PDP-11 prioritized interrupts relative to one another at the hardware level, right? Although I haven't had a chance to get any personal experience with the PDP-11 (ironically, we are of the same age, i.e. "born in 1969"), I am still in a position to read the publicly available documentation.
http://gordonbell.azurewebsites.net/digital/pdp 11 handbook 1969.pdf
This doc is pretty long, but here is a Wiki article that references it, and the article happens to be much shorter:
https://en.wikipedia.org/wiki/PDP-11
Here is the relevant excerpt from it
[begin quote]
The PDP-11 operated at a priority level from 0 through 7, declared by three bits in the Processor Status Word (PSW), and high-end models could operate in a choice of modes, Kernel (privileged), User (application), and sometimes Supervisor, according to two bits in the PSW.
To request an interrupt, a bus device would assert one of four common bus lines, BR4 through BR7, until the processor responded. Higher numbers indicated greater urgency, perhaps that data might be lost or a desired sector might rotate out of contact with the read/write heads unless the processor responded quickly. The printer's readiness for another character was the lowest priority (BR4), as it would remain ready indefinitely. If the processor were operating at level 5, then BR6 and BR7 would be in order. If the processor were operating at 3 or lower, it would grant any interrupt; if at 7, it would grant none. Bus requests that were not granted were not lost but merely deferred. The device needing service would continue to assert its bus request.
Whenever an interrupt exceeded the processor's priority level, the processor asserted the corresponding bus grant, BG4 through BG7. The bus-grant lines were not common lines but were a daisy chain: The input of each gate was the output of the previous gate in the chain. A gate was on each bus device, and a device physically closer to the processor was earlier in the daisy chain. If the device had made a request, then on sensing its bus-grant input, it could conclude it was in control of the bus, and did not pass the grant signal to the next device on the bus. If the device had not made a request, it propagated its bus-grant input to its bus-grant output, giving the next closest device the chance to reply. (If devices did not occupy adjacent slots to the processor board, "grant continuity cards" inserted into the empty slots propagated the bus-grant line.)
Once in control of the bus, the device dropped its bus request and placed on the bus the memory address of its two-word vector. The processor saved the program counter (PC) and PSW, entered Kernel mode, and loaded new values from the specified vector. For a device at BR6, the new PSW in its vector would typically specify 6 as the new processor priority, so the processor would honor more urgent requests (BR7) during the service routine, but defer requests of the same or lower priority. With the new PC, the processor jumped to the service routine for the interrupting device. That routine operated the device, at least removing the condition that caused the interrupt. The routine ended with the RTI (ReTurn from Interrupt) instruction, which restored PC and PSW as of just before the processor granted the interrupt.
If a bus request were made in error and no device responded to the bus grant, the processor timed out and performed a trap that would suggest bad hardware.
[end quote]
Anton Bassov
OMG - it looks like, in actuality, I was not THAT wrong on that particular occasion.
Look at the following lines taken from the excerpt that I quoted in my previous post.
[begin quote]
Whenever an interrupt exceeded the processor's priority level, the processor asserted the corresponding bus grant, BG4 through BG7. The bus-grant lines were not common lines but were a daisy chain: The input of each gate was the output of the previous gate in the chain. A gate was on each bus device, and a device physically closer to the processor was earlier in the daisy chain. If the device had made a request, then on sensing its bus-grant input, it could conclude it was in control of the bus, and did not pass the grant signal to the next device on the bus. If the device had not made a request, it propagated its bus-grant input to its bus-grant output, giving the next closest device the chance to reply. (If devices did not occupy adjacent slots to the processor board, "grant continuity cards" inserted into the empty slots propagated the bus-grant line.)
[end quote]
Anton Bassov
tl;dr (your posts above.)
Anton, stop it. You asserted in the previous thread (a) that the PDP-11 has a non-uniform cost for accessing hardware... which is false, and (b) that "the OS prioritising hardware interrupts to one another" was unique to the PDP-11, or done in software, or something... I'm honestly not sure WTF you're saying. Interrupt priorities on the PDP-11 are a hardware concept, like they are on the IBM PC. They are reflected in the PSW, where the priority is set by either the SPL or MTPS instruction. Hardware interrupt priorities were not a new, or unique, concept on the PDP-11.
So, be quiet Anton. Contribute something useful to the questioners or go back to being quiet... as you have been for the past several months.
Your algorithm for being here should be:
Peter
Peter Viscarola
OSR
@OSRDrivers
I have to laugh when Anton starts talking about PDP-11s. Until Windows, I had written drivers for more PDP-11 OSes (DOS, RT-11, RSX-11M, and Unix) than for any other system, and I never thought about "interrupt priority". I even worked on a few research OSes on 11s and never thought about the subject. I didn't look at Anton's references; I just grabbed my hardcopy of the PDP-11 manual from 1969.
Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
It's interesting that I missed the whole PDP-11 revolution. I went straight from mainframes to Windows. Now, when the conversation drifts over to the peripheral processors on Control Data Cyber machines, I'll be ready.
Tim Roberts, [email protected]
Providenza & Boekelheide, Inc.
Ah, yes... The Minicomputer Revolution.
It was a pretty terrific learning-ground for OS development. One person could understand (and, er, write) the entire OS... and the OS source code (in assembly language) probably didn’t run to 500 printed pages.
And the assembly language was MUCH easier than IBM BAL. Hmmmm.... Now, there’s a topic. JCL, anyone?
Peter
Peter Viscarola
OSR
@OSRDrivers