IOCTL as an alternative to WINSOCK for the most direct data path to user space ?

OSR_Community_User · August 30, 2016, 11:46am

On 2016-08-30 15:24:31 +0000, Don Burn said:

Yes, the IOCTL approach is what should be used, and is basically the
inverted call if you use multiple IOCTL and STATUS_PENDING.

Ok

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of Matthieu Collette
Sent: Tuesday, August 30, 2016 11:20 AM
To: Windows System Software Devs Interest List
> Subject: Re:[ntdev] IOCTL as an alternative to WINSOCK for the most direct
> data path to user space ?
>
> On 2016-08-30 15:12:55 +0000, Don Burn said:
>
>> Premature optimization, such as the circular buffer suggested below, is the
>> biggest cause of unreliable drivers. Start with the IOCTL approach. Only
>> after you have measured performance, found it unacceptable, and profiled
>> the driver to confirm that the IOCTL mechanism is the problem should you
>> consider other mechanisms.
>>
>>
>> Don Burn
>> Windows Driver Consulting
>> Website: http://www.windrvr.com
>
> Hi!
>
> By the IOCTL approach, do you mean the first two solutions suggested by
> slavaim?
>
> I have not yet read the article about the Inverted Call Model. What is your
> opinion of it?
>
> Thanks for your help!
>
>>
>>
>>
>>
>> -----Original Message-----
>> From: xxxxx@lists.osr.com
>> [mailto:xxxxx@lists.osr.com] On Behalf Of
>> xxxxx@hotmail.com
>> Sent: Tuesday, August 30, 2016 10:59 AM
>> To: Windows System Software Devs Interest List
>> Subject: RE:[ntdev] IOCTL as an alternative to WINSOCK for the most
>> direct data path to user space ?
>>
>> If you want to continue with IOCTLs, the simplest solution is to
>> issue an IOCTL from a user application and either block the IRP in the
>> driver or return STATUS_PENDING (do not forget IoMarkIrpPending). In the
>> DPC you fill the IRP’s buffer and call IoCompleteRequest for the
>> pending/blocked IRP.
>>
>> An evolution of this approach that reduces latency and increases
>> throughput is to issue multiple asynchronous IOCTLs and wait for
>> their completion (overlapped I/O). The driver puts the IRPs in a list and
>> returns STATUS_PENDING. The DPC removes an IRP from the list, fills in
>> the data, and completes the IRP.
>>
>> There is another solution which probably has the lowest latency and
>> highest throughput: a circular buffer. A user application allocates a
>> buffer and an event and sends them to the driver via an IOCTL. The driver
>> locks the buffer (MmProbeAndLockPages) so it can be used in a DPC and
>> gets a pointer to the event object (ObReferenceObjectByHandle). Both the
>> driver and the user application implement a circular buffer. The
>> driver writes to it and the application reads from it. A
>> single-reader/single-writer circular buffer can be implemented without
>> any lock as long as pointer updates are atomic, which is the case on
>> IA-32 and AMD64 architectures for aligned pointers, i.e. when
>> address % sizeof(void*) == 0. The DPC routine writes into the circular
>> buffer and sets the event to the signaled state. The user application
>> waits on the event. When WaitForSingleObject returns, it reads data
>> from the buffer until the buffer is empty and then goes back to waiting
>> on the event.
>>
>>
>> —
>> NTDEV is sponsored by OSR
>>
>> Visit the list online at:
>> http:
>>
>> MONTHLY seminars on crash dump analysis, WDF, Windows internals and
>> software drivers!
>> Details at http:
>>
>> To unsubscribe, visit the List Server section of OSR Online at
>> http:

OSR_Community_User · August 30, 2016, 11:47am

On 2016-08-30 15:19:51 +0000, xxxxx@hotmail.com said:

Some words on the circular buffer implementation. The solution with
IRP_MJ_CLEANUP is not perfect because of handle duplication (if
somebody decides to attack the system). A safer solution is to hold
an IRP (i.e. an IOCTL) with a cancel routine and detect abnormal
process termination in it. But it looks more elaborate.

Ok

OSR_Community_User · August 30, 2016, 12:41pm

Isn’t a device’s driver supposed to transfer data from its … EvtIoRead callback routine?

Tim_Roberts · August 30, 2016, 2:03pm

xxxxx@gmail.com wrote:

Isn’t a device’s driver supposed to transfer data from its … EvtIoRead callback routine?

No. A device driver transfers its data at whatever point it actually
has data. If you happen to have data already queued up, then you can
certainly transfer it in EvtIoRead or EvtIoDeviceControl, but most
drivers aren’t that lucky. They have to tuck those requests into a
queue somewhere. Later on, when the driver actually receives data,
whether from a device, or a bus, or another driver, it can pop the next
waiting request and complete it.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
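
The pattern described in the post above (park the request, complete it later when data shows up) can be sketched as a small model in portable C. This is not WDK code and it is not taken from anyone's driver: `request_t`, `pending_queue_t`, `pend_request`, `complete_next` and `MAX_PENDING` are hypothetical names standing in for a pended IOCTL, a manual-dispatch queue, the dispatch path and the data-arrival (DPC) path. A real driver would also need locking and cancellation handling, which the model leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 4   /* hypothetical queue depth */

/* Hypothetical stand-in for a pended IOCTL request. */
typedef struct {
    char  *buffer;      /* caller-supplied output buffer */
    size_t capacity;    /* size of that buffer in bytes */
    size_t written;     /* bytes filled in at completion time */
    int    completed;   /* 0 = still pending, 1 = completed */
} request_t;

/* FIFO of requests waiting for data (models a manual-dispatch queue). */
typedef struct {
    request_t *slots[MAX_PENDING];
    size_t head;        /* index of the oldest waiting request */
    size_t count;       /* number of waiting requests */
} pending_queue_t;

/* "Dispatch" path: park the request instead of completing it.
   Returns 1 if the request was pended, 0 if the queue is full. */
static int pend_request(pending_queue_t *q, request_t *r)
{
    if (q->count == MAX_PENDING)
        return 0;
    r->written = 0;
    r->completed = 0;
    q->slots[(q->head + q->count) % MAX_PENDING] = r;
    q->count++;
    return 1;
}

/* "Data arrived" path (a DPC in a real driver): pop the oldest waiting
   request, copy the data into its buffer, and mark it complete.
   Returns the completed request, or NULL if nobody is waiting. */
static request_t *complete_next(pending_queue_t *q, const void *data, size_t len)
{
    if (q->count == 0)
        return NULL;                 /* no waiter: caller must buffer or drop */
    request_t *r = q->slots[q->head];
    q->head = (q->head + 1) % MAX_PENDING;
    q->count--;
    if (len > r->capacity)
        len = r->capacity;           /* never overrun the caller's buffer */
    memcpy(r->buffer, data, len);
    r->written = len;
    r->completed = 1;
    return r;
}
```

Keeping several requests pended at once (the "multiple asynchronous IOCTLs" variant from earlier in the thread) only means calling `pend_request` more than once before data arrives; the completion side does not change.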

MBond · August 30, 2016, 6:15pm

More elaborate? From what point of view?

The system does all of the work for you. You don’t need to worry about tracking when it is safe to use the buffer or when it should be mapped / unmapped and handling all of those corner cases and race conditions. The engineers at Microsoft have done this for you and what use would it be to re-implement their work, even assuming you could without spending years working on it?

In general, IMHO, shared buffer schemes that are implemented correctly are generally not any more efficient than simply using IRPs, as the overhead of doing it right is exactly what the Microsoft engineers have coded. Do you think they purposely make these calls slow? Having said that, the long-lived IRP can be an exception to that rule and is used safely in specific cases, but as others have said, this should not be your first design regardless of what performance you think you need.


anton_bassov · August 30, 2016, 10:46pm

> The engineers at Microsoft have done this for you and what use would it be to re-implement their work, even assuming you could without spending years working on it.

As usually happens, anything that gets blown out of proportion starts looking and sounding ridiculous, and the above statement is no exception. Surely sharing a buffer gives you a few more things to worry about and, as other posters have already pointed out, in most cases this extra pain is simply unnecessary. However, “spending years working on it” is just a gross exaggeration that sounds more like propaganda from our “Professor Joe Flounder”.

> In general, IMHO, shared buffer schemes that are implemented correctly are generally
> not any more efficient than simply using IRPs as the overhead of doing it right is exactly
> what the Microsoft engineers have coded

Yet another piece of nonsense. It is a well-known fact that sharing memory between an app and a driver may offer a dramatic performance enhancement compared to the file IO operations interface (including ioctl()). This is what mmap() was designed for.

> do you think they purposely make these calls slow?

Well, they just had no option other than following the instructions of someone who is known to think of UNIX (as well as of the GUI) as a life-long foe. Do you really think they would be allowed to implement a single, uniform interface for both disk file and driver IO operations under these circumstances? As a result, you now have to invent various mechanisms (like sharing events) to work around the shortcomings and limitations arising from the lack of an mmap() system call. As someone said, “Those who don’t understand UNIX are doomed to re-invent it. Poorly”…

Anton Bassov

Slava_Imameev · August 31, 2016, 12:42am

A shared buffer has nearly zero overhead compared to an IOCTL implementation: the only calls are a KeSetEvent call in the DPC, which results in the scheduler being invoked, and a WaitForSingleObject call by the application. The shared buffer is mapped in both system and process space and its physical pages are locked, so no page faults are generated. Because the buffer pages are locked, the data is held in the CPU cache (at least L2) when the user process is woken up on the same CPU. There is no cache thrashing as there is when buffers are copied and the CPU cache has to evict twice as much data to accommodate the copy.

An IOCTL implementation, in addition to the same KeSetEvent in IRP completion, requires entering the kernel for file I/O, which is no cheaper than entering the kernel to wait for an event (WaitForSingleObject) because it encompasses it. An IOCTL requires multiple memory allocations (IRP + buffer), memory releases, and copying between kernel and user buffers, OR Memory Manager involvement to lock and map a user buffer and then unlock and unmap it. An IOCTL implementation executes a couple of orders of magnitude more code to transfer data from the DPC to a user application, on top of the overhead of a shared buffer implementation. IOCTLs also put a burden on the CPU caches (both code and data) and on PTE and TLB management.

Don_Burn · August 31, 2016, 6:53am

> Yet another piece of nonsense. It is well known fact that sharing
> memory between an app and driver may offer dramatic performance
> enhancement, compared to file IO operations interface (including ioctl()).

Note the “MAY OFFER” in this statement. Unfortunately, plenty of
implementations don’t offer performance improvements. In fact, Anton,
please give some references for your “well known fact”. The few I know of
have a lot of qualifiers that make that statement closer to urban legend
than fact.

> Well, they just had no option other than following the instructions of
> someone who is known to think of UNIX (as well as of GUI) as of a
> life-long foe

A number of the senior folks who worked on the I/O subsystem over the years
came from a UNIX background. For example, Nar Ganapathy, who was one of the
architects of KMDF, was previously an architect at Sun. Anton, your
conspiracy theories are becoming ridiculous.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

Don_Burn · August 31, 2016, 7:00am

Have you measured the cost of a KeSetEvent/WaitForSingleObject pair versus a
DeviceIoControl for all situations and approaches? You may be surprised to
find that the overhead is similar, and that with some of the optimizations
provided to user space applications for I/O calls, DeviceIoControl can
actually be made faster.

Of course, this whole discussion is based on the assumption that the OP
needs an extremely fast data path. In almost all cases, when you really pin
people down, what they think is fast for a modern OS is actually pretty
much average.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
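
"Measure first" can be made concrete with a very small timing harness. The sketch below is portable C11 and only illustrates the idea of timing one candidate path over many iterations. It does not call DeviceIoControl or WaitForSingleObject, and `op_fn`, `now_ns`, `mean_ns_per_call` and `count_op` are hypothetical names. On Windows a high-resolution counter would be the usual choice instead of `timespec_get`, which reads the wall clock and can be stepped by the system.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Signature of the operation under test (hypothetical). */
typedef void (*op_fn)(void *ctx);

/* Current wall-clock time in nanoseconds, using C11 timespec_get. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Run `op` `iterations` times and return the mean cost per call in ns.
   Returns 0.0 for zero iterations or if the clock stepped backwards. */
static double mean_ns_per_call(op_fn op, void *ctx, uint32_t iterations)
{
    if (iterations == 0)
        return 0.0;
    uint64_t start = now_ns();
    for (uint32_t i = 0; i < iterations; i++)
        op(ctx);
    uint64_t end = now_ns();
    if (end < start)
        return 0.0;                  /* wall clock was adjusted mid-run */
    return (double)(end - start) / (double)iterations;
}

/* Trivial sample operation: counts how often it was invoked. */
static void count_op(void *ctx)
{
    (*(uint32_t *)ctx)++;
}
```

To compare two data paths, wrap each one in its own `op_fn`, run both under the same load, and look at the distribution across several runs and not only at a single mean.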

Slava_Imameev · August 31, 2016, 7:50am

It is amazing that I have to remind you that the DeviceIoControl solution includes exactly the same calls to KeSetEvent and WaitForSingleObject/KeWaitForSingleObject.

Actually, the OP needs a real-time data path for audio processing. Contrary to common belief, RT is not about being fast but about being predictable, and nothing beats a shared buffer at this.

Don_Burn · August 31, 2016, 8:10am

So you have read the Windows source and seen this? And of course calls like
SetFileCompletionNotificationModes or the
use of completion ports do not impact this mechanism at all? You obviously
have limited knowledge of the capabilities of
Windows in the I/O path.

Actually, shared buffers have little or nothing to do with predictability.
This is a scheduling issue which depends on a number of things, but the way
that data is copied from kernel to user space is not one of them. Don’t
apply the biases of another OS to Windows. Each OS is different, and
assuming that they will react the same without testing the hypothesis just
shows ignorance.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

Slava_Imameev · August 31, 2016, 8:21am

Yes, I did.

You are obviously trying to start personal attacks (not for the first time) to cover your limited knowledge of OS design in general. When you are going to wait for data, you have to wait. Waiting means either polling or releasing the CPU; the latter means a thread is removed from a run queue and inserted into some event’s queue.

Actually it does. I am not going to educate you on RT design here.

Don_Burn · August 31, 2016, 8:54am

Sorry you did not read the source well, or you would have seen that there
are mechanisms for getting the event out of the I/O path, and other
optimizations. In fact I have implemented a number of high speed I/O
models on Windows for clients including shared memory models, and none of
these use a user space event directly since it was not efficient.

As far as my OS knowledge goes, I’ve worked on OS’es and system software
for over 40 years. In that time I have been involved with 5 commercial
OS’es, on four of which I was part of the original team.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


Slava_Imameev · August 31, 2016, 9:17am

Don, I am sorry for you if you believe that waiting for data that has not arrived is possible without some sort of blocking event mechanism at the core level, which always involves the scheduler. I am not considering polling with a blocked scheduler, which is an extreme case.

In Windows, the blocking event mechanism is implemented by objects containing a DISPATCHER_HEADER, so it doesn’t matter whether you are talking about an event or something else like a process, thread, semaphore, mutex, or timer. You are actually talking about DISPATCHER_HEADER, and all blocking synchronization ends up manipulating a DISPATCHER_HEADER. I believe it is no secret to you that KEVENT is just a wrapper for DISPATCHER_HEADER.

You will always correct me if I am wrong.

A good man knows his limitations.

Daniel_Terhell · August 31, 2016, 9:26am

They do. Because the memory is locked down they prevent hard pagefaults.
DeviceIoControls incur user/kernel switches which in turn might lead to the
scheduler running another thread on the CPU. They may be good for general
throughput but not if latency is a concern.

//Daniel

Don_Burn · August 31, 2016, 9:44am

Shared buffers do not equal locked-down memory; they are independent items.
I have seen implementations of shared buffers that used pageable memory, and
I have seen IOCTL mechanisms where the user space program, through one of
several mechanisms, handed the kernel non-paged buffers.

There are ways to limit the scheduler impact in the IOCTL path (FastIO,
using SetFileCompletionNotificationModes and completion ports, etc).

Overall there are a lot of tools. Assuming shared buffers are going to be
the fastest is a poor design decision, and that is what led to this long
thread.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


Slava_Imameev · August 31, 2016, 10:10am

We are talking here about locked buffers in case you didn’t notice the word DPC.

We all have seen this. This is called the file system cache. Or do you want to tell us that you have seen a driver that mapped a user space address range backed by a pagefile/file into system space without the pages being locked? Think twice before answering.

We all have seen this. This is called buffered IO. Or do you want to tell us that a driver used a locked user space address range in an arbitrary context (we are talking about DPC here)?

Your personal attacks led to this long thread.

Tim_Roberts · August 31, 2016, 12:26pm

xxxxx@hotmail.com wrote:

Shared buffer has nearly zero overhead compared to IOCTL implementation, the only calls are a DPC call to KeSetEvent that results in a scheduler being called and WaitForSingleObject by an application.

Yes, but how is that any different than a long-term IRP? You have the
exact same transactions: set an event and wake the user-mode process.

The shared buffer FEELS like it ought to be more efficient, but the
overhead is essentially identical.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

anton_bassov · August 31, 2016, 12:41pm

Whenever you see a statement like that from Mr. Burn, you can be at least 500% sure that his previous statements on a given topic are, to put it mildly, “not-so-sound” from the technical standpoint, and that this has been pointed out to him. Whenever he puts his foot in his mouth he starts referring to his 50+ years of experience (as well as throwing personal attacks and ad hominem arguments). The best option in such a situation is simply to ignore this part and to ask him to back up his statements from the technical standpoint. At this point it becomes plainly obvious that he is just full of shit. We’ve been here on quite a few occasions, Don, don’t you think? (If backup for my statements is needed, links to at least a dozen threads on NTDEV are available upon request.) As we are going to see shortly, this particular thread is no exception to the above-mentioned general rule.

You must have implemented them yourself then…

Look - we are speaking about writing to a memory buffer in the context of a DPC. If a driver does not lock the MDL that describes the shared memory buffer in advance, its very first page fault is going to result in a BSOD. What are you arguing about here???

In general, “I have seen XYZ” is a pretty weak argument in itself. For example, if you speak to any ambulance or emergency department doctor you will hear multiple stories about things that were intentionally used in “very non-conventional” ways by the patients - from peas in the nose to light bulbs up the arses. However, it does not necessarily imply that peas and bulbs are inherently dangerous items per se, don’t you think? The situation is exactly the same - if XYZ has been misused by someone it does not necessarily mean that it is bad.

Well, your statement may make sense only if we assume that it was you who had provided these implementations, effectively washing away all the potential performance benefits of shared memory. If you need more precise explanations/proof of the above-mentioned benefits I would suggest reading post N 16 on this thread carefully - I simply don’t want to replicate the work that has already been done on this thread.

The “only” problem here is that the whole NT IO subsystem was architecturally defined in the very first version of NT, i.e. more than a decade before these folks turned up, and has not changed since. The only thing that these folks could do was improve certain implementation details, which, however, cannot change the original design choices. For example, AFAIK, FastIO was introduced exactly for the purpose of fixing the deficiencies and shortcomings of the IRP-based IO model that became obvious at the very first stages of NT’s life (IIRC, it was mentioned in Rajeev Nagar’s book).

Anton your conspiracy theories are becoming ridiculous.

Sorry, Don, but the only ridiculous thing on this thread so far is your participation in it - unfortunately, practically all the statements that you made here are nonsensical. You seem to be arguing for the very sake of arguing about something…

Anton Bassov

Slava_Imameev · August 31, 2016, 12:44pm

The overhead is much lower if you consider CPU caches and the PTE and TLB management done by the CPU and the system. I already outlined this above.

A shared buffer is much better when a burst occurs. Consecutive data packets can be processed without entering kernel mode at all.

If you consider latency for relatively slow communication, it can be the same simply because most of the time is spent waiting, so the overhead is amortized by the long wait for data. But the overhead is never the same - the number of CPU instructions executed is orders of magnitude higher in the IRP case. This takes its toll on the CPU caches, the TLB and the CPU’s branch prediction table.
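
The burst case, where the consumer drains several packets per wakeup, can be sketched with a single-producer/single-consumer ring in portable C. This is an illustration under assumptions, not the implementation anyone in this thread used: `ring_t`, `ring_init`, `ring_put`, `ring_drain` and `RING_SIZE` are hypothetical names, the ring carries single bytes to keep it short, and it uses C11 acquire/release atomics on the indices where the thread relied on aligned pointer accesses being atomic on x86. In a driver the `ring_t` memory would be the locked, shared buffer, the producer would run in the DPC, and the event would only need to be signaled to wake a waiting consumer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 8u   /* hypothetical capacity; must be a power of two */

typedef struct {
    _Atomic uint32_t head;      /* advanced only by the consumer */
    _Atomic uint32_t tail;      /* advanced only by the producer */
    uint8_t data[RING_SIZE];
} ring_t;

static void ring_init(ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side (the DPC in the thread's design).
   Returns 1 on success, 0 if the ring is full. */
static int ring_put(ring_t *r, uint8_t byte)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == RING_SIZE)
        return 0;                               /* full: caller decides what to drop */
    r->data[tail & (RING_SIZE - 1)] = byte;
    /* Release: the data write above becomes visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}

/* Consumer side (the application): drain up to `max` bytes without any
   system call. Returns the number of bytes copied to `out`. */
static size_t ring_drain(ring_t *r, uint8_t *out, size_t max)
{
    size_t n = 0;
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    while (n < max) {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            break;                              /* empty: go back to waiting */
        out[n++] = r->data[head & (RING_SIZE - 1)];
        head++;
        /* Release: the slot is handed back to the producer only after the read. */
        atomic_store_explicit(&r->head, head, memory_order_release);
    }
    return n;
}
```

The consumer only has to block when `ring_drain` returns 0. While data keeps arriving it stays in user mode, which is the property the post above is pointing at.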