IOCTL as an alternative to WINSOCK for the most direct data path to user space?

OSR_Community_User Member Posts: 110,217
Hi everyone!

I am working on a virtual audio driver that sends audio data from the kernel
to a user-space application, which streams the received audio data over a
network.

I am currently using a kernel socket to send the audio data from kernel to
user space.

Because some of you told me I should use IOCTLs instead of WINSOCK, I would
like to have your opinion about IOCTLs and make sure they would be a better
solution than WINSOCK.

So, here is what I do:
* I am developing a virtual audio driver, playback only and WaveCyclic based
* I want to send audio data from kernel space to user space as soon as possible
* I currently copy the audio data and schedule a DPC each time
  IDMAChannel::CopyTo is called
* I send the audio data over a kernel socket each time my custom DPC executes
* the communication is essentially from kernel to user space

What is the best strategy, in my case, to send audio data as soon as possible
using IOCTLs? Do I still need to use a DPC? Is sending data from kernel to
user space synchronous?

Is there any other alternative?

Thanks in advance.

Matt

Comments

  • Slava_Imameev Member Posts: 480
    If you want to continue with IOCTLs, then the simplest solution is to issue an IOCTL from the user application and either block the IRP in the driver or return STATUS_PENDING (do not forget IoMarkIrpPending). In the DPC you fill the IRP's buffer and call IoCompleteRequest for the pending/blocked IRP.

    An evolution of this approach, to reduce latency and increase throughput, is to issue multiple asynchronous IOCTLs and wait for their completion (overlapped I/O). The driver puts the IRPs in a list and returns STATUS_PENDING. The DPC removes an IRP from the list, fills in the data, and completes the IRP.
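
    Here is a minimal WDM-style sketch of that pended-IOCTL queue. The device-extension layout, the routine names, and the audio-copy step are placeholders invented for illustration; cancellation handling (IoSetCancelRoutine) and output-buffer size validation are omitted, so treat this as the shape of the approach rather than a finished implementation.

    #include <ntddk.h>

    /* Hypothetical device extension: only the fields this sketch needs. */
    typedef struct _AUDIO_DEVICE_EXTENSION {
        KSPIN_LOCK PendingLock;
        LIST_ENTRY PendingIrps;      /* IOCTLs waiting for audio data */
    } AUDIO_DEVICE_EXTENSION, *PAUDIO_DEVICE_EXTENSION;

    /* IRP_MJ_DEVICE_CONTROL: park the request instead of completing it now. */
    NTSTATUS DispatchDeviceControl(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;
        KIRQL irql;

        KeAcquireSpinLock(&ext->PendingLock, &irql);
        IoMarkIrpPending(Irp);                    /* must precede STATUS_PENDING */
        InsertTailList(&ext->PendingIrps, &Irp->Tail.Overlay.ListEntry);
        KeReleaseSpinLock(&ext->PendingLock, irql);

        return STATUS_PENDING;
    }

    /* Custom DPC: pop one parked IRP, copy the captured audio, complete it. */
    VOID AudioDpc(PKDPC Dpc, PVOID Context, PVOID SysArg1, PVOID SysArg2)
    {
        PAUDIO_DEVICE_EXTENSION ext = Context;
        PLIST_ENTRY entry;
        PIRP irp;
        KIRQL irql;

        UNREFERENCED_PARAMETER(Dpc);
        UNREFERENCED_PARAMETER(SysArg1);
        UNREFERENCED_PARAMETER(SysArg2);

        KeAcquireSpinLock(&ext->PendingLock, &irql);
        if (IsListEmpty(&ext->PendingIrps)) {
            KeReleaseSpinLock(&ext->PendingLock, irql);
            return;                               /* nobody is waiting right now */
        }
        entry = RemoveHeadList(&ext->PendingIrps);
        KeReleaseSpinLock(&ext->PendingLock, irql);

        irp = CONTAINING_RECORD(entry, IRP, Tail.Overlay.ListEntry);

        /* For METHOD_BUFFERED, copy the new audio block into the system buffer,
           e.g. RtlCopyMemory(irp->AssociatedIrp.SystemBuffer, data, bytes). */
        irp->IoStatus.Status = STATUS_SUCCESS;
        irp->IoStatus.Information = 0;            /* set to the number of bytes copied */
        IoCompleteRequest(irp, IO_NO_INCREMENT);
    }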

    There is another solution which probably has the lowest latency and highest throughput: a circular buffer. The user application allocates a buffer and an event and sends them to the driver via an IOCTL. The driver locks the buffer (MmProbeAndLockPages) so it can be used in the DPC, and gets a pointer to the event object (ObReferenceObjectByHandle). Both the driver and the user application implement a circular buffer; the driver writes into it and the application reads from it. A single-reader/single-writer circular buffer can be implemented without any lock as long as the pointer arithmetic is atomic, which is the case on IA-32 and AMD-64 for aligned pointers, i.e. when (address mod sizeof(void*)) == 0. The DPC routine writes into the circular buffer and sets the event to the signaled state. The user application waits on the event; when WaitForSingleObject returns, it reads data from the buffer until the buffer is empty and then goes back to waiting on the event.
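
    The registration step could look roughly like the following sketch, assuming a METHOD_BUFFERED IOCTL whose input structure (AUDIO_SHARED_BUFFER_IN) is invented for illustration, and assuming the device extension also carries Mdl, SharedBase, SharedSize and DataEvent fields. Input-size validation and synchronization with the DPC are left out.

    /* Additional device-extension fields assumed by this sketch:
     *     PMDL    Mdl;         locked MDL for the shared buffer
     *     PVOID   SharedBase;  system-space mapping of the buffer
     *     ULONG   SharedSize;  buffer size in bytes
     *     PKEVENT DataEvent;   referenced "data available" event
     */
    typedef struct _AUDIO_SHARED_BUFFER_IN {
        PVOID  UserBuffer;   /* ring buffer allocated by the application */
        ULONG  Size;         /* its size in bytes */
        HANDLE DataEvent;    /* event the application waits on */
    } AUDIO_SHARED_BUFFER_IN, *PAUDIO_SHARED_BUFFER_IN;

    NTSTATUS RegisterSharedBuffer(PAUDIO_DEVICE_EXTENSION ext, PIRP Irp)
    {
        PAUDIO_SHARED_BUFFER_IN in = Irp->AssociatedIrp.SystemBuffer;
        NTSTATUS status;

        ext->Mdl = IoAllocateMdl(in->UserBuffer, in->Size, FALSE, FALSE, NULL);
        if (ext->Mdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        __try {
            /* Pin the user pages so the DPC can touch them at DISPATCH_LEVEL. */
            MmProbeAndLockPages(ext->Mdl, UserMode, IoWriteAccess);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            return GetExceptionCode();
        }

        /* System-space mapping that stays valid in any process context. */
        ext->SharedBase = MmGetSystemAddressForMdlSafe(ext->Mdl, NormalPagePriority);

        /* Turn the user-mode event handle into a referenced KEVENT pointer. */
        status = ObReferenceObjectByHandle(in->DataEvent, EVENT_MODIFY_STATE,
                                           *ExEventObjectType, UserMode,
                                           (PVOID *)&ext->DataEvent, NULL);

        if (ext->SharedBase == NULL || !NT_SUCCESS(status)) {
            if (NT_SUCCESS(status)) {
                ObDereferenceObject(ext->DataEvent);
                ext->DataEvent = NULL;
            }
            MmUnlockPages(ext->Mdl);
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            return NT_SUCCESS(status) ? STATUS_INSUFFICIENT_RESOURCES : status;
        }

        ext->SharedSize = in->Size;
        return STATUS_SUCCESS;
    }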
  • Slava_Imameev Member Posts: 480
    If you decide to continue with a circular buffer, you need to unlock the buffer in IRP_MJ_CLEANUP in case the application crashes without gracefully closing the connection via an IOCTL (or in case you decide not to implement explicit connection termination). The application should hold a handle to the driver's communication object until it has finished processing the data stream. If you fail to do this, the system will BSOD with Bug Check 0x76: PROCESS_HAS_LOCKED_PAGES.
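
    A sketch of that cleanup, using the same hypothetical extension fields as above. Note that real code must first make sure the DPC can no longer touch the buffer (for example by clearing a flag under the lock and calling KeFlushQueuedDpcs) before unlocking; that synchronization is omitted here.

    NTSTATUS DispatchCleanup(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        if (ext->Mdl != NULL) {
            MmUnlockPages(ext->Mdl);          /* release the pinned user pages */
            IoFreeMdl(ext->Mdl);
            ext->Mdl = NULL;
            ext->SharedBase = NULL;
        }
        if (ext->DataEvent != NULL) {
            ObDereferenceObject(ext->DataEvent);
            ext->DataEvent = NULL;
        }

        Irp->IoStatus.Status = STATUS_SUCCESS;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_SUCCESS;
    }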
  • OSR_Community_User Member Posts: 110,217
    I found an interesting article about IOCTLs and the Inverted Call Model.
  • Don_Burn Member - All Emails Posts: 1,715
    Premature optimization, such as the circular buffer suggested below, is the
    biggest cause of unreliable drivers. Start with the IOCTL approach; only
    after you have measured the performance, found it unacceptable, and profiled
    the driver to determine that the IOCTL mechanism is the problem should you
    consider other mechanisms.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 14:58:51 +0000, [email protected] said:

    > If you want to continue with IOCTLs, then the simplest solution is to issue
    > an IOCTL from the user application and either block the IRP in the driver
    > or return STATUS_PENDING [...]

    Hi!

    Several solutions to try. I'll probably start with the first two, which seem
    simpler.

    Thanks for your answer!
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:12:55 +0000, Don Burn said:

    > Premature optimization, such as the circular buffer suggested below, is the
    > biggest cause of unreliable drivers. Start with the IOCTL approach [...]

    Hi!

    By the IOCTL approach, do you mean the first two solutions suggested by slavaim?

    I have not yet read the article about the Inverted Call Model; what's your
    opinion of it?

    Thanks for your help!

  • Slava_Imameev Member Posts: 480
    A few words on the circular buffer implementation: the IRP_MJ_CLEANUP solution is not perfect because of handle duplication (if somebody decides to attack the system). The safer solution is holding an IRP (i.e. an IOCTL) with a cancel routine and detecting abnormal process termination there. But it is more elaborate.
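
    For illustration, a stripped-down sketch of that idea: the application parks one long-lived "control" IOCTL, and the cancel routine (which the I/O manager invokes when the handle is closed or the process dies) performs the teardown. TeardownSharedBuffer and the ControlIrp field are hypothetical, and the classic set-cancel-routine races (Irp->Cancel already set, completion racing cancellation) are not handled here.

    VOID TeardownSharedBuffer(PAUDIO_DEVICE_EXTENSION ext);  /* hypothetical helper:
                                 MmUnlockPages, IoFreeMdl, ObDereferenceObject */

    VOID ControlIrpCancel(PDEVICE_OBJECT DeviceObject, PIRP Irp)
    {
        PAUDIO_DEVICE_EXTENSION ext = DeviceObject->DeviceExtension;

        IoReleaseCancelSpinLock(Irp->CancelIrql);   /* entered with the lock held */

        ext->ControlIrp = NULL;
        TeardownSharedBuffer(ext);                  /* owning process is going away */

        Irp->IoStatus.Status = STATUS_CANCELLED;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }

    NTSTATUS ParkControlIrp(PAUDIO_DEVICE_EXTENSION ext, PIRP Irp)
    {
        IoMarkIrpPending(Irp);
        ext->ControlIrp = Irp;
        IoSetCancelRoutine(Irp, ControlIrpCancel);
        return STATUS_PENDING;
    }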
  • Don_Burn Member - All Emails Posts: 1,715
    Yes, the IOCTL approach is what should be used, and it is basically the
    inverted call model if you use multiple IOCTLs and STATUS_PENDING.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:24:31 +0000, Don Burn said:

    > Yes, the IOCTL approach is what should be used, and is basically the
    > inverted call if you use multiple IOCTL and STATUS_PENDING.

    Ok

  • OSR_Community_User Member Posts: 110,217
    On 2016-08-30 15:19:51 +0000, [email protected] said:

    > Some words on circular buffer implementation. The solution with
    > IRP_MJ_CLEANUP is not perfect because of handle duplication ( if
    > somebody decides to attack the system ). The safer solution is holding
    > an IRP( i.e. IOCTL) with a cancel routine and detecting abnormal
    > process termination in it. But it looks more elaborate.

    Ok
  • OSR_Community_User Member Posts: 110,217
    Isn't a device's driver supposed to transfer data from its ... EvtIoRead callback routine?
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > Isn't a device's driver supposed to transfer data from it's ... EvtIoRead callback routine ?

    No. A device driver transfers its data at whatever point it actually
    has data. If you happen to have data already queued up, then you can
    certainly transfer it in EvtIoRead or EvtIoDeviceControl, but most
    drivers aren't that lucky. They have to tuck those requests into a
    queue somewhere. Later on, when the driver actually receives data,
    whether from a device, or a bus, or another driver, it can pop the next
    waiting request and complete it.
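
    As a rough KMDF illustration of that pattern (assuming a KMDF driver rather than the OP's PortCls/WaveCyclic miniport; the context type, the PendingQueue field and AudioDataArrived are names invented for the sketch): park arriving requests on a manual queue, and complete them from wherever the data actually shows up.

    #include <ntddk.h>
    #include <wdf.h>

    /* Hypothetical per-device context holding a manual queue for parked requests. */
    typedef struct _DEVICE_CONTEXT {
        WDFQUEUE PendingQueue;   /* created with WdfIoQueueDispatchManual */
    } DEVICE_CONTEXT, *PDEVICE_CONTEXT;
    WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

    /* EvtIoRead: no data on hand yet, so park the request on the manual queue. */
    VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
    {
        PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
        NTSTATUS status;

        UNREFERENCED_PARAMETER(Length);

        status = WdfRequestForwardToIoQueue(Request, ctx->PendingQueue);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
        }
    }

    /* Called from wherever the data actually arrives (DPC, lower driver, ...). */
    VOID AudioDataArrived(PDEVICE_CONTEXT ctx, PVOID Data, size_t Bytes)
    {
        WDFREQUEST request;
        WDFMEMORY outMem;
        NTSTATUS status;

        if (!NT_SUCCESS(WdfIoQueueRetrieveNextRequest(ctx->PendingQueue, &request))) {
            return;                       /* nobody is waiting right now */
        }

        status = WdfRequestRetrieveOutputMemory(request, &outMem);
        if (NT_SUCCESS(status)) {
            status = WdfMemoryCopyFromBuffer(outMem, 0, Data, Bytes);
        }

        WdfRequestCompleteWithInformation(request, status,
                                          NT_SUCCESS(status) ? Bytes : 0);
    }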

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • MBond Member - All Emails Posts: 846
    More elaborate? From what point of view?



    The system does all of the work for you. You don't need to worry about tracking when it is safe to use the buffer, or when it should be mapped / unmapped, or about handling all of those corner cases and race conditions. The engineers at Microsoft have done this for you, and what use would it be to re-implement their work, even assuming you could without spending years working on it?

    In general, IMHO, shared buffer schemes that are implemented correctly are generally not any more efficient than simply using IRPs, as the overhead of doing it right is exactly what the Microsoft engineers have coded; do you think they purposely made these calls slow? Having said that, the long-lived IRP can be an exception to that rule and is used safely in specific cases, but as others have said, this should not be your first design regardless of what performance you think you need.



  • anton_bassov Member MODERATED Posts: 5,190
    > The engineers at Microsoft have done this for you and what use would it be to re-implement
    > their work, even assuming you could without spending years working on it.


    As usually happens, anything that gets blown out of proportion starts looking and sounding ridiculous, and the above statement is no exception. Surely sharing a buffer gives you a few more things to worry about, and, as other posters have already pointed out, in most cases this extra pain is simply unnecessary. However, "spending years working on it" is just a gross exaggeration that sounds more like propaganda from our "Professor Joe Flounder".



    > In general, IMHO, shared buffer schemes that are implemented correctly are generally
    > not any more efficient than simply using IRPs as the overhead of doing it right is exactly
    > what the Microsoft engineers have coded

    Yet another piece of nonsense. It is a well-known fact that sharing memory between an app and a driver may offer a dramatic performance improvement compared to a file-I/O-style interface (including ioctl()). This is what mmap() was designed for.


    > do you think they purposely make these calls slow?

    Well, they just had no option other than to follow the instructions of someone who is known to regard UNIX (as well as the GUI) as a life-long foe. Do you really think they would have been allowed to implement a single, uniform interface for both disk-file and driver I/O operations under these circumstances? As a result, you now have to invent various mechanisms (like sharing events) to work around the shortcomings and limitations arising from the lack of an mmap() system call. As someone said, "Those who don't understand UNIX are doomed to re-invent it. Poorly"......



    Anton Bassov
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    In general, IMHO, shared buffer schemes that are implemented correctly are
    generally not any more efficient than simply using IRPs as the overhead of doing
    it right is exactly what the Microsoft engineers have coded
    </QUOTE>

    A shared buffer has nearly zero overhead compared to an IOCTL implementation; the only calls are a KeSetEvent from the DPC, which results in the scheduler being invoked, and a WaitForSingleObject in the application. The shared buffer is mapped in both system and process space and its physical pages are locked, so no page faults are generated. Because the buffer pages are locked, the data stays in the CPU cache (at least L2) when the user process is woken up on the same CPU. There is no cache thrashing, as there is with buffer copying, where the CPU cache has to evict twice as much data to accommodate the copy. The IOCTL implementation, in addition to the same KeSetEvent on IRP completion, requires entering the kernel for file I/O, which is not cheaper than entering the kernel to wait on an event (WaitForSingleObject), since it encompasses it. An IOCTL requires multiple memory allocations (IRP + buffer), memory releases, and either copying buffers between kernel and user space OR Memory Manager involvement to lock and map a user buffer and then unlock and unmap it. The IOCTL implementation executes a couple of orders of magnitude more code to move data from a DPC to a user application, on top of everything a shared-buffer implementation does. IOCTLs also put a burden on the CPU caches (both code and data), and on PTE and TLB management.
  • Don_Burn Member - All Emails Posts: 1,715
    >Yet another piece of nonsense. It is a well-known fact that sharing memory
    >between an app and a driver may offer a dramatic performance improvement
    >compared to a file-I/O-style interface (including ioctl()).

    Note the "MAY OFFER" in this statement; unfortunately, plenty of
    implementations don't offer any performance improvement. In fact, Anton,
    please give some references for your "well known fact". The few I know of
    have a lot of qualifiers, which makes that statement closer to urban legend
    than fact.

    >Well, they just had no option other than to follow the instructions of
    >someone who is known to regard UNIX (as well as the GUI) as a life-long foe

    A number of the senior folks who worked on the I/O subsystem over the years
    came from a UNIX background. For example, Nar Ganapathy, who was one of the
    architects of KMDF, was previously an architect at Sun. Anton, your
    conspiracy theories are becoming ridiculous.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Don_Burn Member - All Emails Posts: 1,715
    <QUOTE>
    A shared buffer has nearly zero overhead compared to an IOCTL implementation;
    the only calls are a KeSetEvent from the DPC, which results in the scheduler
    being invoked, and a WaitForSingleObject in the application.
    </QUOTE>

    Have you measured the cost of the KeSetEvent/WaitForSingleObject pair versus
    a DeviceIoControl for all situations and approaches? You may be surprised to
    find that the overhead is similar, and that with some of the optimizations
    provided to user-space applications for I/O calls, DeviceIoControl can
    actually be made faster.

    Of course this whole discussion is based on the assumption that the OP needs
    an extremely fast data path; in almost all cases, when you really pin people
    down, what they think is fast is actually pretty much average for a modern
    OS.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    Have you measured the cost of KeSetEvent/WaitForSingleObject pair versus a DeviceIoControl for all situations and approaches.
    </QUOTE>

    It is amazing that I should have to remind you that the DeviceIoControl solution includes exactly the same calls to KeSetEvent and WaitForSingleObject/KeWaitForSingleObject.

    <QUOTE>
    Of course this whole discussion is based on the assumption that the OP needs an extremely fast data path
    </QUOTE>

    Actually, the OP needs a real-time data path for audio processing. Contrary to common belief, RT is not about being fast but about being predictable, and nothing beats a shared buffer in that respect.
  • Don_Burn Member - All Emails Posts: 1,715
    <QUOTE>
    It is amazing that I should have to remind you that the DeviceIoControl
    solution includes exactly the same calls to KeSetEvent and
    WaitForSingleObject/KeWaitForSingleObject.
    </QUOTE>

    So you have read the Windows source and seen this? And of course calls like
    SetFileCompletionNotificationModes or the use of completion ports do not
    impact this mechanism at all? You obviously have limited knowledge of the
    capabilities of Windows in the I/O path.

    <QUOTE>
    Actually the OP needs a real time data path for audio processing. Contrary
    to common belief RT is not about fast but about predictable and nothing
    beats shared buffer in this.
    </QUOTE>

    Actually, shared buffers have little or nothing to do with predictability.
    This is a scheduling issue, which depends on a number of things, but the way
    that data is copied from kernel to user space is not one of them. Don't
    apply the biases of another OS to Windows; each OS is different, and
    assuming they will react the same without testing the hypothesis just shows
    ignorance.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    So you have read the Windows source and seen this?
    </QUOTE>

    Yes, I did.

    <QUOTE>
    You obviously have limited knowledge of the capabilities of Windows in the I/O path.
    </QUOTE>

    You are obviously trying to start personal attacks (not for the first time) to cover your limited knowledge of OS design in general. When you are going to wait for data, you have to wait. Waiting means either polling or releasing the CPU; the latter means a thread is removed from the run queue and inserted into some event's wait queue.

    <QUOTE>
    Actually, share buffers have little or nothing to do with predictability.
    </QUOTE>

    Actually it does. I am not going to educate you on RT design here.
  • Don_Burn Member - All Emails Posts: 1,715
    Sorry, you did not read the source well, or you would have seen that there
    are mechanisms for getting the event out of the I/O path, and other
    optimizations. In fact I have implemented a number of high-speed I/O models
    on Windows for clients, including shared memory models, and none of these
    uses a user-space event directly, since it was not efficient.

    As far as my OS knowledge goes, I've worked on OSes and system software for
    over 40 years. In that time I have been involved with 5 commercial OSes,
    four of which I was part of the original team for.


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    Don, I am sorry for you if you believe that waiting for data that has not yet arrived is possible without some sort of blocking event mechanism at the core level, which always involves the scheduler. I do not consider polling with a blocked scheduler, as that is an extreme case.

    In Windows, the blocking event mechanism is implemented by objects containing a DISPATCHER_HEADER, so it doesn't matter whether you are talking about an event or something else like a process, thread, semaphore, mutex or timer. You are really talking about DISPATCHER_HEADER, and all blocking synchronization ends up manipulating a DISPATCHER_HEADER. I believe it is not a secret to you that KEVENT is just a wrapper for DISPATCHER_HEADER.

    <QUOTE>
    Sorry you did not read the source well,
    </QUOTE>

    You will always correct me if I am wrong.

    <QUOTE>
    In fact I have implemented a number of high speed I/O
    models on Windows for clients including shared memory models, and none of
    these use a user space event directly
    </QUOTE>

    A good man knows his limitations.
  • Daniel_Terhell Member Posts: 1,355
    <QUOTE>
    Actually, share buffers have little or nothing to do with predictability.
    </QUOTE>

    They do. Because the memory is locked down, it prevents hard page faults.
    DeviceIoControl calls incur user/kernel switches, which in turn might lead
    to the scheduler running another thread on the CPU. They may be good for
    general throughput but not when latency is a concern.

    //Daniel
  • Don_Burn Member - All Emails Posts: 1,715
    Shared buffers do not equal locked-down memory; they are independent things.
    I have seen implementations of shared buffers that used pageable memory, and
    I have seen IOCTL mechanisms where the user-space program, through one of
    several mechanisms, handed the kernel non-paged buffers.

    There are ways to limit the scheduler impact in the IOCTL path (FastIO,
    using SetFileCompletionNotificationModes and completion ports, etc.).

    Overall there are a lot of tools; assuming shared buffers are going to be
    the fastest is a poor design decision, and that is what led to this long
    thread.
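
    For reference, a user-mode sketch of the completion-port variant of the IOCTL path. The device name and the control code are placeholders (the real driver would define its own), and error handling is minimal; the point is only to show SetFileCompletionNotificationModes and overlapped DeviceIoControl used together.

    #define _WIN32_WINNT 0x0600
    #include <windows.h>
    #include <winioctl.h>
    #include <stdio.h>

    /* Hypothetical control code; the real driver defines its own. */
    #define IOCTL_AUDIO_GET_DATA \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_BUFFERED, FILE_ANY_ACCESS)

    int main(void)
    {
        HANDLE dev = CreateFileW(L"\\\\.\\MyVirtualAudio",       /* placeholder name */
                                 GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                 OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (dev == INVALID_HANDLE_VALUE) return 1;

        HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 0);

        /* Skip the per-handle event signalling and skip queueing a completion
           packet when the IOCTL happens to complete synchronously. */
        SetFileCompletionNotificationModes(dev,
            FILE_SKIP_SET_EVENT_ON_HANDLE | FILE_SKIP_COMPLETION_PORT_ON_SUCCESS);

        BYTE buffer[4096];
        OVERLAPPED ov = {0};
        DWORD bytes = 0;

        if (!DeviceIoControl(dev, IOCTL_AUDIO_GET_DATA, NULL, 0,
                             buffer, sizeof(buffer), &bytes, &ov)) {
            if (GetLastError() != ERROR_IO_PENDING) return 1;
            ULONG_PTR key;
            LPOVERLAPPED pov;
            GetQueuedCompletionStatus(iocp, &bytes, &key, &pov, INFINITE);
        }

        printf("received %lu bytes\n", (unsigned long)bytes);
        CloseHandle(iocp);
        CloseHandle(dev);
        return 0;
    }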


    Don Burn
    Windows Driver Consulting
    Website: http://www.windrvr.com
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    > Shared buffers do not equal locked down memory, they are independent items.
    </QUOTE>

    We are talking here about locked buffers in case you didn't notice the word DPC.

    <QUOTE>
    > I have seen implementations of share buffers that used pageable memory
    </QUOTE>

    We all have seen this. This is called the file system cache. Or do you want to tell us that you have seen a driver that mapped a user-space address range backed by a pagefile/file into system space without locking the pages? Think twice before answering.

    <QUOTE>
    > I have seen IOCTL mechanisms where the user space program through one of
    several mechanisms handed the kernel non-paged buffers.
    </QUOTE>

    We all have seen this. This is called buffered I/O. Or do you want to tell us that a driver used a locked user-space address range in an arbitrary context (we are talking about DPCs here)?

    <QUOTE>
    > and that what lead to this long thread.
    </QUOTE>

    Your personal attacks led to this long thread.
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > Shared buffer has nearly zero overhead compared to IOCTL implementation, the only calls are a DPC call to KeSetEvent that results in a scheduler being called and WaitForSingleObject by an application.

    Yes, but how is that any different from a long-term IRP? You have the exact
    same transactions: set an event and wake the user-mode process.

    The shared buffer FEELS like it ought to be more efficient, but the overhead
    is essentially identical.

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • anton_bassov Member MODERATED Posts: 5,190
    <quote>

    As far as my OS knowledge goes, I've worked on OSes and system software for over 40 years. In that time I have been involved with 5 commercial OSes, four of which I was part of the original team for.


    </quote>


    Whenever you see a statement like that from Mr. Burn, you can be at least 500% sure that his previous statements on the topic in question are, softly speaking, "not-so-sound" from the technical standpoint, which has been pointed out to him. Whenever he puts his foot in his mouth he starts referring to his 50+ years of experience (as well as throwing out personal attacks and ad hominem arguments). The best option in such a situation is simply to ignore that part and to ask him to back up his statements technically; at that point it becomes plainly obvious that he is just full of shit. We've been here on quite a few occasions, Don, don't you think? (If my statements need backing up, links to at least a dozen such threads on NTDEV are available upon request.) As we are going to see shortly, this particular thread is not an exception to the above-mentioned general rule.



    <quote>

    Shared buffers do not equal locked down memory, they are independent items. I have seen implementations of share buffers that used pageable memory, I have seen IOCTL mechanisms where the user space program through one of several mechanisms handed the kernel non-paged buffers.

    </quote>


    You must have implemented them yourself, then....

    Look - we are speaking about writing to a memory buffer in the context of a DPC. If a driver does not lock the MDL that describes the shared memory buffer in advance, its very first page fault is going to result in a BSOD. What are you arguing about here?

    In general, "I have seen XYZ" is a pretty weak argument in itself. For example, if you speak to any ambulance or emergency-department doctor you will hear plenty of stories about things
    that were _intentionally_ used in "very non-conventional" ways by patients - from peas in the nose to light bulbs up the arse. However, that does not necessarily imply that peas and bulbs are inherently dangerous items per se, don't you think? The situation here is exactly the same - if XYZ has been misused by someone, it does not necessarily mean that XYZ is bad.

    <quote>

    ... the "MAY OFFER" in this statement, unfortunately plenty of implementations don't offer performance improvements. In fact Anton, please give some references to your "well known fact", the few I know of have a lot of qualifiers, that make that statement closer to urban legend than fact.

    </quote>


    Well, your statement may make sense only if we assume that it was you who provided these implementations, effectively washing away all the potential performance benefits of shared memory. If you need a more precise explanation or proof of the above-mentioned benefits, I would suggest carefully reading post #16 in this thread - I simply don't want to replicate work that has already been done here.



    <quote>

    A number of the senior folks who worked on the I/O subsystem over the years came from a UNIX background. For example, Nar Ganapathy, who was one of the architects of KMDF, was previously an architect at Sun.


    </quote>


    The "only" problem here is that the whole NT IO subsystem was _architecturally_ defined
    in the very first version of NT, i.e. more than a decade before these folks had turned up, and had not changed since. The only thing that these folks could do was improving certain implementation details, which, however, cannot change the original design choices. For example, AFAIK, FastIO was introduced exactly for the purpose of fixing the deficiencies and shortcoming of IRP-based IO model that became obvious at the very first stages of NT's life (IIRC, it was mentioned in Rajeev Nagar's book).


    > Anton your conspiracy theories are becoming ridiculous.

    Sorry, Don, but the only ridiculous thing in this thread so far is your participation in it - unfortunately, practically all the statements you have made here are nonsensical. You seem to be arguing for the very sake of arguing about something....



    Anton Bassov
  • Slava_Imameev Member Posts: 480
    <QUOTE>
    Yes, but how is that any different than a long-term IRP? You have the
    exact same transactions: set an event and wake the user-mode process.

    The shared buffer FEELS like it ought to be more efficient, but the
    overhead is essentially identical
    </QUOTE>

    The overhead is much lower if you consider the CPU caches and the PTE and TLB management done by the CPU and the system. I already outlined this above.

    A shared buffer is much better when a burst happens: consecutive data packets can be processed without entering kernel mode at all.

    If you consider the latency of a relatively slow communication channel, it can be the same simply because most of the time is spent waiting, so the cost is amortized by the long wait for data. But the overhead is never the same - the number of CPU instructions executed is orders of magnitude larger in the IRP case. This has its toll on the CPU caches, the TLB, and the branch prediction tables in the CPU.
  • Tim_Roberts Member - All Emails Posts: 13,694
    [email protected] wrote:
    > If you consider a latency for a relatively slow communication it can be the same just because most of the time is spent in waiting. So it is just being amortized by long waiting for data. But the overhead is never the same - the number of CPU instructions being executed is orders of magnitude more for IRP case.

    All of which are MANY orders of magnitude less than any I/O operation
    that was involved. The difference is insignificant.


    > This has its toll on CPU caches, TLB and branch prediction table in CPU.

    Insignificant. CPU cycles are not the only metric of efficiency in the
    world today. If it costs me additional design, programming, debugging,
    and maintenance time to get a result that micro-optimizes the branch
    prediction tables, then my work was a fool's errand.

    --
    Tim Roberts, [email protected]
    Providenza & Boekelheide, Inc.


  • Slava_Imameev Member Posts: 480
    <QUOTE>
    All of which are MANY orders of magnitude less than any I/O operation that was involved.
    </QUOTE>

    Not so. If this were considered a generally acceptable idea, Windows NT would have been implemented as the microkernel it was originally conceived as.


    <QUOTE>
    Insignificant. CPU cycles are not the only metric of efficiency in the world today.
    </QUOTE>

    When we are talking about code-execution overhead, this is one of the most important metrics. If it were not significant, we would not have superscalar CPUs with more than 100 instructions in flight, huge register files, huge reorder buffers, out-of-order execution, branch prediction tables, and huge caches.

    You can devise other metrics depending on the goal, for example man-hours spent on design and development, or the number of pizzas eaten by the team.

    <QUOTE>
    If it costs me additional design, programming, debugging,
    and maintenance time to get a result that micro-optimizes the branch
    prediction tables, then my work was a fool's errand.
    </QUOTE>

    Actually, it doesn't cost much more than an IOCTL implementation. All you need is to implement a single-reader/single-writer lock-free circular buffer over an allocated and locked user buffer. Such buffer implementations are available free of charge from many sources.
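
    For the sake of completeness, a minimal sketch of such a ring, written from the user-mode side (the driver would use KeMemoryBarrier() instead of MemoryBarrier()). The structure layout, the field names, and the idea of overlaying the header on the start of the locked shared buffer are illustrative choices, not taken from any particular implementation; wrap-around is handled bytewise for clarity rather than speed.

    #include <windows.h>

    /* Overlaid on the start of the shared, locked buffer; Data[] is the rest. */
    typedef struct _AUDIO_RING {
        volatile ULONG Head;      /* next byte the consumer will read  */
        volatile ULONG Tail;      /* next byte the producer will write */
        ULONG          Capacity;  /* number of bytes in Data[]         */
        UCHAR          Data[1];   /* audio payload follows the header  */
    } AUDIO_RING;

    /* Producer side (the driver's DPC): returns the number of bytes stored. */
    ULONG RingWrite(AUDIO_RING *r, const UCHAR *src, ULONG len)
    {
        ULONG head = r->Head, tail = r->Tail;
        ULONG used = (tail - head + r->Capacity) % r->Capacity;
        ULONG space = r->Capacity - 1 - used;        /* one slot kept empty */
        ULONG todo = (len < space) ? len : space;
        ULONG i;

        for (i = 0; i < todo; i++)
            r->Data[(tail + i) % r->Capacity] = src[i];

        MemoryBarrier();                             /* publish data before Tail */
        r->Tail = (tail + todo) % r->Capacity;
        return todo;
    }

    /* Consumer side (the application): drain whatever is available. */
    ULONG RingRead(AUDIO_RING *r, UCHAR *dst, ULONG max)
    {
        ULONG head = r->Head, tail = r->Tail;
        ULONG avail = (tail - head + r->Capacity) % r->Capacity;
        ULONG todo = (avail < max) ? avail : max;
        ULONG i;

        for (i = 0; i < todo; i++)
            dst[i] = r->Data[(head + i) % r->Capacity];

        MemoryBarrier();                             /* consume before moving Head */
        r->Head = (head + todo) % r->Capacity;
        return todo;
    }

    /* Typical consumer loop: wait on the driver's event, then drain the ring:
     *     WaitForSingleObject(hDataEvent, INFINITE);
     *     while (RingRead(ring, chunk, sizeof(chunk)) != 0)
     *         ... hand the chunk to the network sender ...
     */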