IOCTL as an alternative to WINSOCK for the most direct data path to user space?

A shared buffer has nearly zero overhead compared to an IOCTL implementation: the only calls are KeSetEvent from the DPC, which invokes the scheduler, and WaitForSingleObject in the application. The shared buffer is mapped into both system and process address space, and its physical pages are locked, so no page faults are generated. Because the pages are locked, the data is still held in the CPU cache (at least L2) when the user process is woken up on the same CPU. There is no cache thrashing as in the buffer-copying case, where the CPU cache has to evict twice as much data to accommodate the copy.

An IOCTL implementation, in addition to the same KeSetEvent at IRP completion, requires entering the kernel for file I/O, which is no cheaper than entering the kernel to wait on an event (WaitForSingleObject), since it encompasses that wait. An IOCTL requires multiple memory allocations (the IRP plus a buffer), memory releases, and copying between kernel and user buffers, OR Memory Manager involvement to lock and map a user buffer and then unlock and unmap it. On top of the overhead a shared-buffer implementation has anyway, an IOCTL implementation executes a couple of orders of magnitude more code to move data from a DPC to a user application, and it also burdens the CPU caches (both code and data), PTE, and TLB management.
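To make the claim concrete, here is a minimal sketch of that path, under stated assumptions: DEVICE_EXTENSION, SharedRing, RingBufferWrite, DrainRing, and the event plumbing are hypothetical names, and the buffer is assumed to have been probed, locked (MmProbeAndLockPages), and mapped into both address spaces at setup time.

/* Kernel side (assumes <wdm.h>): the DPC publishes data into the shared,
   locked ring buffer and signals the waiting application. */
VOID EvtDpc(PKDPC Dpc, PVOID Context, PVOID SysArg1, PVOID SysArg2)
{
    PDEVICE_EXTENSION ext = (PDEVICE_EXTENSION)Context;

    /* Writing here is safe at DISPATCH_LEVEL only because the pages were
       locked when the buffer was set up - no page fault can occur. */
    RingBufferWrite(&ext->SharedRing, ext->DmaData, ext->DmaLength);

    /* The only scheduler interaction on the whole path. */
    KeSetEvent(&ext->DataReady, IO_NO_INCREMENT, FALSE);
}

/* User side (assumes <windows.h>): hEvent and pRing were obtained from the
   driver at setup time. */
for (;;) {
    WaitForSingleObject(hEvent, INFINITE); /* block until the DPC signals */
    DrainRing(pRing);                      /* consume without re-entering
                                              the kernel */
}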

>Yet another piece of nonsense. It is well known fact that sharing memory between an app and driver may offer dramatic performance enhancement, compared to file IO operations interface (including ioctl()).

Note the “MAY OFFER” in that statement; unfortunately, plenty of implementations don’t offer performance improvements. In fact, Anton, please give some references for your “well known fact” - the few I know of carry so many qualifiers that the statement is closer to urban legend than fact.

>Well, they just had no option other than following the instructions of someone who is known to think of UNIX (as well as of GUI) as of a life-long foe

A number of the senior folks who worked on the I/O subsystem over the years came from a UNIX background. For example, Nar Ganapathy, who was one of the architects of KMDF, was previously an architect at Sun. Anton, your conspiracy theories are becoming ridiculous.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

Have you measured the cost of the KeSetEvent/WaitForSingleObject pair versus a DeviceIoControl for all situations and approaches? You may be surprised to find that the overhead is similar, and that with some of the optimizations provided to user-space applications for I/O calls, DeviceIoControl can actually be made faster.

Of course, this whole discussion is based on the assumption that the OP needs an extremely fast data path; in almost all cases, when you really pin people down, what they think is fast is, for a modern OS, actually pretty much average.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

It is amazing that I should have to remind you that the DeviceIoControl solution includes exactly the same calls to KeSetEvent and WaitForSingleObject/KeWaitForSingleObject.

Actually, the OP needs a real-time data path for audio processing. Contrary to common belief, RT is not about being fast but about being predictable - and nothing beats a shared buffer in that.

So you have read the Windows source and seen this? And of course calls like SetFileCompletionNotificationModes or the use of completion ports do not impact this mechanism at all? You obviously have limited knowledge of the capabilities of Windows in the I/O path.

Actually, shared buffers have little or nothing to do with predictability. That is a scheduling issue, which depends on a number of things, but the way data is copied from kernel to user space is not one of them. Don’t apply the biases of another OS to Windows; each OS is different, and assuming they will react the same without testing the hypothesis just shows ignorance.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com

Yes, I did.

You are obviously trying to start personal attacks (not for the first time) to cover your limited knowledge of OS design in general. When you are going to wait for data, you have to wait. Waiting means either polling or releasing the CPU; the latter means the thread is removed from the running queue and inserted into some event’s wait queue.

Actually, it does. I am not going to educate you on RT design here.

Sorry, you did not read the source well, or you would have seen that there are mechanisms for getting the event out of the I/O path, along with other optimizations. In fact, I have implemented a number of high-speed I/O models on Windows for clients, including shared-memory models, and none of them used a user-space event directly, since that was not efficient.

As far as my OS knowledge goes, I’ve worked on OSes and system software for over 40 years. In that time I have been involved with five commercial OSes, and for four of them I was part of the original team.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


Don, I am sorry for you if you believe that waiting for data that has not yet arrived is possible without some sort of blocking event mechanism at the core level, which always involves the scheduler. I am not considering polling with a blocked scheduler, which is an extreme case.

In Windows, the blocking event mechanism is implemented by objects containing a DISPATCHER_HEADER, so it doesn’t matter whether you are talking about an event or something else such as a process, thread, semaphore, mutex, or timer: you are actually talking about DISPATCHER_HEADER, and all blocking synchronization ends up manipulating a DISPATCHER_HEADER. I believe it is not a secret to you that KEVENT is just a wrapper for DISPATCHER_HEADER.
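Indeed, this is the entire declaration of KEVENT in the WDK’s wdm.h - the event object is literally nothing but a dispatcher header:

/* From wdm.h: KEVENT is just a DISPATCHER_HEADER. */
typedef struct _KEVENT {
    DISPATCHER_HEADER Header;
} KEVENT, *PKEVENT, *PRKEVENT;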

You will always correct me if I am wrong.

A good man knows his limitations.

They do. Because the memory is locked down, they prevent hard page faults. DeviceIoControls incur user/kernel switches, which in turn might lead to the scheduler running another thread on the CPU. They may be good for general throughput, but not if latency is a concern.

//Daniel

Shared buffers do not equal locked-down memory; they are independent things. I have seen implementations of shared buffers that used pageable memory, and I have seen IOCTL mechanisms where the user-space program, through one of several mechanisms, handed the kernel non-paged buffers.

There are ways to limit the scheduler impact in the IOCTL path (FastIO, using SetFileCompletionNotificationModes and completion ports, etc.).
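To make that concrete, here is a minimal user-mode sketch of one such optimization; hDevice is assumed to be an overlapped handle already associated with an I/O completion port:

#include <windows.h>

BOOL EnableFastCompletions(HANDLE hDevice)
{
    return SetFileCompletionNotificationModes(
        hDevice,
        FILE_SKIP_COMPLETION_PORT_ON_SUCCESS | /* no IOCP packet when the
                                                  call succeeds synchronously */
        FILE_SKIP_SET_EVENT_ON_HANDLE);        /* don't signal the file
                                                  object's event on completion */
}

With these flags set, a DeviceIoControl that the driver completes synchronously never queues a completion packet at all, so the calling thread keeps running without a scheduler round trip.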

Overall, there are a lot of tools; assuming that shared buffers are going to be the fastest is a poor design decision, and that is what led to this long thread.

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


We are talking here about locked buffers, in case you didn’t notice the word DPC.

We all have seen this. It is called the file system cache - or do you want to tell us that you have seen a driver that mapped a user-space address range backed by a pagefile/file into system space without the pages being locked? Think twice before answering.

We all have seen this. It is called buffered IO - or do you want to tell us that a driver used a locked user-space address range in arbitrary context (we are talking about DPC here)?

Your personal attacks are what led to this long thread.

xxxxx@hotmail.com wrote:

A shared buffer has nearly zero overhead compared to an IOCTL implementation: the only calls are KeSetEvent from the DPC, which invokes the scheduler, and WaitForSingleObject in the application.

Yes, but how is that any different from a long-term IRP? You have the exact same transactions: set an event and wake the user-mode process.

The shared buffer FEELS like it ought to be more efficient, but the overhead is essentially identical.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Whenever you see a statement like that from Mr. Burn, you can be at least 500% sure his previous statements on the given topic are, mildly speaking, “not-so-sound” from the technical standpoint, which has been pointed out to him. Whenever he puts his foot in his mouth, he starts referring to his 50+ years of experience (as well as throwing personal attacks and ad hominem arguments). The best option in such a situation is simply to ignore that part and request that he back up his statements technically. At this point it becomes plainly obvious that he is just full of shit. We’ve been here on quite a few occasions, Don, don’t you think? (If backup for my statements is needed, links to at least a dozen threads on NTDEV are available upon request.) As we are going to see shortly, this particular thread is no exception to the above-mentioned general rule.

You must have implemented them yourself then…

Look - we are speaking about writing to a memory buffer in the context of a DPC. If a driver does not lock the MDL that describes the shared memory buffer in advance, the very first page fault is going to result in a BSOD. What are you arguing about here???

In general, " I have seen XYZ" is a pretty weak argument in itself. For example, if you speak to any ambulance or emergency departments doctor you will hear multiple stories about things
that were intentionally used in “very non-conventional” ways by the patients - from peas in the nose to electrical bulbs up the arses. However, it does not necessarily imply that peas and bulbs are inherently dangerous items per se, don’t you think. The situation is exactly the same - if XYZ has been misused by someone it doesnot necessarily mean that it is bad.

Well, your statement may make sense only if we assume that it was you who provided those implementations, effectively washing away all the potential performance benefits of shared memory. If you need a more precise explanation/proof of the above-mentioned benefits, I would suggest carefully reading post #16 on this thread - I simply don’t want to replicate work that has already been done here.

The “only” problem here is that the whole NT IO subsystem was architecturally defined in the very first version of NT, i.e. more than a decade before these folks turned up, and has not changed since. The only thing these folks could do was improve certain implementation details, which, however, cannot change the original design choices. For example, AFAIK, FastIO was introduced exactly for the purpose of fixing the deficiencies and shortcomings of the IRP-based IO model that became obvious at the very first stages of NT’s life (IIRC, it is mentioned in Rajeev Nagar’s book).

> Anton your conspiracy theories are becoming ridiculous.

Sorry, Don, but the only ridiculous thing on this thread so far is your participation in it - unfortunately, practically all the statements you have made here are nonsensical. You seem to be arguing for the very sake of arguing…

Anton Bassov

The overhead is much less if you consider the CPU caches and the PTE and TLB management done by the CPU and the system. I already outlined this above.

A shared buffer is much better when a burst happens: consecutive data packets can be processed without entering kernel mode at all.

If you consider latency for relatively slow communication, it can be the same just because most of the time is spent waiting; the overhead is amortized by the long wait for data. But the overhead is never the same: the number of CPU instructions executed is orders of magnitude greater in the IRP case. This takes its toll on the CPU caches, the TLB, and the branch prediction tables.

xxxxx@hotmail.com wrote:

If you consider latency for relatively slow communication, it can be the same just because most of the time is spent waiting; the overhead is amortized by the long wait for data. But the overhead is never the same: the number of CPU instructions executed is orders of magnitude greater in the IRP case.

All of which are MANY orders of magnitude less than any I/O operation that was involved. The difference is insignificant.

This takes its toll on the CPU caches, the TLB, and the branch prediction tables.

Insignificant. CPU cycles are not the only metric of efficiency in the world today. If it costs me additional design, programming, debugging, and maintenance time to get a result that micro-optimizes the branch prediction tables, then my work was a fool’s errand.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Not so. If this were considered a generally acceptable idea, Windows NT would have been implemented as a microkernel, as it was originally conceived.

When we are talking about code-execution overhead, this is one of the most important metrics. If it were not significant, we would not have superscalar CPUs with more than 100 instructions in flight, huge register files, huge reorder buffers, out-of-order execution, branch prediction tables, and huge caches.

You can define other metrics depending on the goal - for example, man-hours spent on design and development, or the number of pizzas eaten by the team.

Actually, it doesn’t cost much more than an IOCTL implementation. All you need is to implement a single-reader/single-writer lock-free circular buffer over an allocated and locked user buffer. Such buffer implementations are available free of charge from many sources.
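For illustration, a minimal single-producer/single-consumer ring of the kind being described - a sketch using C11 atomics, with RING_SIZE and the layout purely illustrative. The producer only ever advances head and the consumer only ever advances tail, which is what makes it lock-free:

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4096u /* must be a power of two */

typedef struct {
    _Atomic uint32_t head;  /* producer cursor (free-running counter) */
    _Atomic uint32_t tail;  /* consumer cursor (free-running counter) */
    uint8_t data[RING_SIZE];
} Ring;

/* Producer (e.g. the driver's DPC): returns 0 if there is not enough room. */
int ring_put(Ring *r, const uint8_t *src, uint32_t len)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (RING_SIZE - (head - tail) < len)
        return 0;

    for (uint32_t i = 0; i < len; i++)
        r->data[(head + i) & (RING_SIZE - 1)] = src[i];

    /* Release ordering publishes the bytes before the new head value. */
    atomic_store_explicit(&r->head, head + len, memory_order_release);
    return 1;
}

/* Consumer (the application): returns the number of bytes copied out. */
uint32_t ring_get(Ring *r, uint8_t *dst, uint32_t max)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    uint32_t n = head - tail;

    if (n > max)
        n = max;

    for (uint32_t i = 0; i < n; i++)
        dst[i] = r->data[(tail + i) & (RING_SIZE - 1)];

    /* Release ordering frees the space back to the producer. */
    atomic_store_explicit(&r->tail, tail + n, memory_order_release);
    return n;
}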

xxxxx@hotmail.com wrote:

Not so. If this were considered a generally acceptable idea, Windows NT would have been implemented as a microkernel, as it was originally conceived.

I can see we’re not going to agree, and that’s fine, but let me point out that when Windows NT was being developed, a 25 MHz 486 was a hot machine. Processors and I/O buses ran at nearly the same speed. The world is very different today. If you were starting over today, you’d make very different choices.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I believe the choice would be the same - a monolithic, optimized kernel. Read the story of L4 or any other modern microkernel.

During the last decade, the Windows, Linux, and Mac OS X/iOS teams have heavily optimized their kernels with lock-free algorithms, per-CPU variables, and fine-grained locks, despite all the progress in CPU and memory microarchitecture and semiconductor processes.

Uh, try again. None of the factors you list here will be any different between a long-lived IRP and a buffer passed from UM to KM and then locked down. If you are thinking about an application that pends multiple IRPs in advance so that the driver fills them as data arrives, then your comments about slow communications might make some sense, but in the case of a single IRP that contains the buffer to be shared and lives “forever”, they make none at all.

Some additional points to consider:

1. For applications that are worried about performance at all, the use of IOCP (or the newer thread pool APIs) is a must

2. For handling bursts of data (not relevant if predictable timing is the key issue, but often relevant for network services), the use of buffers large enough to contain several “packets” and/or GetQueuedCompletionStatusEx with multiple buffers can reduce the number of KM/UM transitions in exactly the same way as the shared buffer would (see the sketch after this list). Remember that the shared buffer has a maximum size too

3. No one is suggesting that a common-buffer approach is not the one with the least possible overhead for moving data between KM and UM, but rather that:

a. If you are going to implement this design, use a long-lived IRP to communicate that buffer. It is easier to implement, costs nothing in performance and is inherently safer

b. For the vast majority of applications, the difference in absolute performance between this design and the use of standard IRPs is at best a second-order effect and not worth the cost in terms of development and support time. Even when the performance of the application is important and UM/KM transitions have been measured to be a limiting factor, there exist APIs to reduce the number (scatter/gather IO and GQCSEx) as well as the cost (registered IO & SetFileIoOverlappedRange / SetFileCompletionNotificationModes) of each transition. The correct use of these functions can both work with any driver (as long as it does not complete IO synchronously) and greatly alleviate the most common performance concerns with standard IRP-based IO.

c. The consistency of the latency between KM and UM for any specific “packet” is not affected by the choice of this design over standard IRPs in any meaningful way. Remember that Windows, along with all other general-purpose pre-emptive multitasking operating systems, subjects all UM processes to theoretically unbounded pre-emption delays. No design for UM/KM data transfer can prevent or work around this, and this is a key reason why neither Windows nor *NIX operating systems are used for life-critical or safety-critical systems - consider what would happen if the shuttle took a page fault when executing the re-entry burn. The machine that is presently pumping cytotoxins into my body has the same design constraints, but for a different reason - malpractice might be expensive for doctors, but for suppliers of medical equipment, dispensing the wrong quantity or rate of medication in even a single case represents a likely bankruptcy.
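As a concrete illustration of point 2 above, a hedged sketch of batched dequeuing: hPort is assumed to be an existing completion port with reads already pended, and ProcessPacket is a hypothetical application handler.

#include <windows.h>

void ProcessPacket(OVERLAPPED *ov, DWORD bytes); /* app-specific, hypothetical */

void DrainCompletions(HANDLE hPort)
{
    OVERLAPPED_ENTRY entries[64];
    ULONG removed;

    /* One kernel transition can retrieve up to 64 completed I/Os. */
    if (GetQueuedCompletionStatusEx(hPort, entries, 64, &removed,
                                    INFINITE, FALSE)) {
        for (ULONG i = 0; i < removed; i++) {
            /* Each entry carries its OVERLAPPED and the byte count. */
            ProcessPacket(entries[i].lpOverlapped,
                          entries[i].dwNumberOfBytesTransferred);
        }
    }
}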

As others here know, at this point I am posting only in the interests of propagating technical excellence on to others - you can take that for what it may be worth, but don’t bash it without at least thinking about it. I am perfectly willing to be proved wrong with rational and well-considered arguments, but you should be too - that is one of the key points of a professional dialogue.

In the interests of having this thread locked quickly, I will now commence the personal attack you crave:

You suck


Oh, also have a look at the modern implementations of the UM critical section functions and the use of the WaitOnAddress function. Logic like this, or accepting that one or more cores will simply burn time spinning, can eliminate even the need for an event in a shared-buffer design - with obvious knock-on effects for the rest of the system.
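A minimal sketch of that idea, with head assumed to be a free-running counter in the shared mapping; note that WaitOnAddress only removes the event when the party publishing data can call WakeByAddressSingle from user mode.

#include <windows.h>
#pragma comment(lib, "Synchronization.lib")

void WaitForNewData(volatile LONG *head, LONG lastSeen)
{
    LONG observed = *head;
    while (observed == lastSeen) {
        /* Sleeps only while *head still equals 'observed'; the producer
           calls WakeByAddressSingle((PVOID)head) after advancing it. */
        WaitOnAddress((volatile VOID *)head, &observed, sizeof(LONG),
                      INFINITE);
        observed = *head;
    }
}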
