Inverted call - decide some particularities

Hi,

I am quite new to driver development, so I would like to clarify some aspects of the inverted call model and the types of IRPs to use with it.

In one of the nice samples from the newer article on inverted call, I see that user mode calls DeviceIoControl to send a device control IRP carrying a buffer that the driver fills with data at some later point. So basically that IRP stays pending until something interesting happens in the driver (in my case I wait for network packets so that I can fill the buffers passed from user mode).

Key to this technique seems to be that I have to keep a large enough number of control IRPs pending from user mode; the driver completes them as network traffic data arrives.

In a legacy (WDM) driver I think I saw the data being passed with ReadFile from user mode (overlapped, with completion ports).

  1. The sample used only GetQueuedCompletionStatus and was able to read, through the overlapped parameter, the storage that user mode initially passed in (the output buffer). Are there performance issues with this approach compared to using ReadFile?

  2. Can I keep only one IRP pending (of what type?) in the driver and still receive packets in user mode as they come in, using completion ports? The sample used multiple control IRPs that stayed pending until data arrived in the driver (by queuing them to an internal queue), and I wonder whether I need to worry about re-creating enough of them so that some are always pending, waiting for the driver to fill them.

I would skip the I/O completion port and use the thread pool APIs instead. They use an IOCP underneath, but without the complexity. You can pend reads or IOCTLs, it doesn’t matter. How many depends on the data rate. Do NOT write or use a WDM driver sample; start with KMDF even if it is not a PnP driver.

d

Bent from my phone


From: xxxxx@gfi.com
Sent: 5/19/2015 8:02 AM
To: Windows System Software Devs Interest List
Subject: [ntdev] Inverted call - decide some particularities

NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

@Doron:

I don’t think a thread pool will preserve the order of completions. This is not a case where a thread pool would be beneficial. You have to handle the packets in sequence.

xxxxx@gfi.com wrote:

> 1. the sample used only GetQueuedCompletionStatus and was able to read in overlapped parameter the storage that was initially passed by user mode (output buffer). Are there some performance issues related to using this or using ReadFile?

Your question doesn’t really make sense. When you submit an
asynchronous I/O request (ReadFile, WriteFile or DeviceIoControl),
somehow you have to wait for that request to complete. That’s what
GetOverlappedResult and GetQueuedCompletionStatus are for. They don’t
replace the ReadFile call.

GetQueuedCompletionStatus is used with I/O completion ports, which is
one of the better ways of handling asynchronous I/O in Windows.

> 2. Can I have only a pending IRP (of what type?) in driver and be able to receive in user mode packets as they came along using completion ports? The sample used multiple control IRP-s that were pending until data came in driver (by queuing them to an internal queue) and I wonder if I should worry on the need to recreate enough of them so they are pending, waiting for driver to fill them

Once your driver completes a request, it’s gone. You can’t touch it any
more. If more data comes in before another user-mode request arrives,
you either have to buffer the data or drop it on the floor. Whether you
need to queue multiple requests or not depends on whether you intend to
implement buffering, and how fast your data will arrive.
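
A minimal model of that buffer-or-drop decision, written in Python only to keep it short (it is not driver code). The class name `InvertedCallQueue`, the `backlog_limit` parameter and the `dropped` counter are illustrative assumptions, not anything taken from the sample or from KMDF.

```python
from collections import deque


class InvertedCallQueue:
    """Models a driver that parks user-mode requests until data shows up."""

    def __init__(self, backlog_limit):
        self.pending = deque()    # requests parked by the driver, oldest first
        self.backlog = deque()    # data that arrived with no request to carry it
        self.backlog_limit = backlog_limit
        self.dropped = 0          # packets discarded because the backlog was full

    def submit_request(self, request_id):
        """User mode sends a request; return (request_id, data) if it completes at once."""
        if self.backlog:
            return (request_id, self.backlog.popleft())
        self.pending.append(request_id)
        return None

    def data_arrived(self, data):
        """New data in the driver; return (request_id, data) if a request was completed."""
        if self.pending:
            return (self.pending.popleft(), data)
        if len(self.backlog) < self.backlog_limit:
            self.backlog.append(data)   # buffer it until the next request arrives
        else:
            self.dropped += 1           # no request and no room: drop it on the floor
        return None
```

With `backlog_limit=0` the model never buffers, so every packet that arrives while no request is pending is lost, which is why the number of requests kept pending matters.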


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I don’t think the IOCP on its own guarantees order either, otherwise the thread pool would inherit that behavior. If you want order you can do any of the following:
1. send one request at a time
2. send multiple requests, using an IOCTL whose input buffer has an index with which you can resequence
3. put the OVERLAPPED in a larger structure that holds the index, and resequence as in 2
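
The resequencing step in options 2 and 3 could look roughly like this. It is a sketch in Python rather than Win32 code; `Resequencer` and its methods are hypothetical names, and it assumes every request is tagged with a unique, consecutive index at submission time.

```python
import heapq


class Resequencer:
    """Releases completions in submission order, whatever order they arrive in."""

    def __init__(self, first_index=0):
        self._next = first_index   # index of the next completion to release
        self._held = []            # min-heap of (index, payload) that arrived early

    def push(self, index, payload):
        """Record one completion; return the payloads that are now deliverable, in order."""
        heapq.heappush(self._held, (index, payload))
        ready = []
        # Release everything that is contiguous with what was already delivered.
        while self._held and self._held[0][0] == self._next:
            ready.append(heapq.heappop(self._held)[1])
            self._next += 1
        return ready
```

A completion that arrives early is simply held until the gap before it is filled, so the consumer only ever sees payloads in index order.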

d

Bent from my phone

-----Original Message-----
From: xxxxx@broadcom.com
Sent: 5/19/2015 9:37 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Inverted call - decide some particularities


@Doron:

Consider that I have large amounts of data: think of all network traffic, including payload, that reaches a Windows server machine. All of that should go to user mode and be re-injected, until I figure out what it represents. You said something about correlating the number of buffers with the data rate. What did you have in mind there?

@Alex Grig

Even if I wanted to preserve order, packets already arrive out of order in my filter driver. And since processing is done in user mode, I would have to reorder them there.

This sounds more like WFP than anything else. You have to figure out the data rates and how many requests to pend.

d

Bent from my phone


From: xxxxx@gfi.com
Sent: 5/19/2015 10:33 AM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Inverted call - decide some particularities


xxxxx@gfi.com wrote:

> Considering I have large amounts of data (think all network traffic - including payload that gets to a windows server machine - all that should get to user mode and re-injected until I figure out what it represents).

Reinjected where? Are you trying to implement some kind of real-time
virus scanning? Have you done the math on this? You don’t have time
for all of the user/kernel transitions that are required to keep up with
a full modern network pipe. The Windows Filtering Platform is split
between user-mode and kernel-mode so that it can keep up.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

@Tim Roberts

> Reinjected where? Are you trying to implement some kind of real-time
> virus scanning? Have you done the math on this? You don’t have time
> for all of the user/kernel transitions that are required to keep up with
> a full modern network pipe.

To minimize user/kernel transitions I am considering getting the data (and afterwards reinjecting it) in quite large batches. Say user mode allocates buffers for 1000 packets, all belonging to a single IRP, and the driver fills them. We complete the IRP whenever the buffers are full, or at least 100 times a second, so that we have no more than one kernel/user transition per 10 ms. Obviously I have to figure out whether a few large IRPs like these are enough to handle all the traffic, and whether I can keep delays below x ms so as not to disrupt traffic.
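
The trade-off in this batching scheme can be put into rough numbers. The sketch below (Python, with illustrative names and made-up traffic figures) assumes a batch is completed when it is full or when `max_wait_ms` has elapsed since the previous completion, whichever comes first, and that packets arrive at a steady rate.

```python
def batch_profile(packets_per_second, batch_size, max_wait_ms):
    """Estimate completion rate, batch fill and added delay for fill-or-timeout batching."""
    # Time needed to fill one batch at the given packet rate.
    fill_time_ms = 1000.0 * batch_size / packets_per_second
    # A batch is completed by whichever limit is reached first.
    interval_ms = min(fill_time_ms, max_wait_ms)
    return {
        "completions_per_second": 1000.0 / interval_ms,
        "packets_per_completion": packets_per_second * interval_ms / 1000.0,
        "worst_case_delay_ms": interval_ms,
    }


# At 50,000 packets/s the 10 ms timer fires before 1000 packets accumulate.
print(batch_profile(50_000, 1000, 10))
# At 1,000,000 packets/s the buffer fills every 1 ms, so the timer never matters.
print(batch_profile(1_000_000, 1000, 10))
```

Under these assumptions the 10 ms cap only holds at lower rates; once the buffer fills faster than the timer, the transition rate scales with traffic again unless the batch is made larger.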

After user mode analyzes a packet, I plan to inject it back. That way I don’t have to buffer packets in the kernel; only the copy from user mode is needed. I can inject in fairly large batches of packets or packet by packet. I just wonder about performance :)

>worry on the need to recreate enough of them so they are pending, waiting for driver to fill them

You need to recreate enough of them so that the driver path is never frozen/failed for lack of a free buffer to fill.
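
One rough way to put a number on “enough” is to multiply the completion rate by the time user mode needs to hand a completed request back to the driver. This is a back-of-envelope heuristic and not a rule stated in this thread; `turnaround_ms` and the `headroom` factor are assumptions that would have to be measured or chosen for the real system.

```python
import math


def requests_to_pend(completions_per_second, turnaround_ms, headroom=2.0):
    """Estimate how many requests to keep pending so the driver rarely runs dry."""
    # Requests that are out of the driver's hands at any moment, on average.
    in_flight = completions_per_second * turnaround_ms / 1000.0
    # Pad for bursts and scheduling jitter; always keep at least one request pending.
    return max(1, math.ceil(in_flight * headroom))


# 10,000 completions/s with a 0.5 ms resubmit turnaround and 2x headroom.
print(requests_to_pend(10_000, 0.5))
```

If the driver still finds its queue empty under load, the turnaround estimate was too optimistic and the count needs to go up.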


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

> I don’t think thread pool will preserve the order of completions.

Nothing in Windows ever maintained the completion order. You need to number your OVERLAPPED structures.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

On 20-May-2015 11:22, Maxim S. Shatskih wrote:

>> I don’t think thread pool will preserve the order of completions.
>
> Nothing in Windows ever maintained the completion order. You need to number your OVERLAPPED structures.

+1. This has been discussed already in this list. At least twice IIRC.

– pa

> +1. This has been discussed already in this list. At least twice IIRC.

Well, I guess 95+% of topics have already been discussed countless times. Probably, instead of ruminating over the same stuff, it makes sense just to say “search the archives. EOD”…

I think it would be a good idea for the OSR guys to make a FAQ section with NUMEROUS links to the corresponding threads. Examples of topics that could be covered are “Inverted calls vs Events”; “Shared memory vs multiple IOCTL”; “Diskfilter and Bo Branten”; “KMDF vs WDM”; “Unsupported approaches to Windows security”; etc., etc., etc.

Anton Bassov

Thanks for all your remarks.

Regarding completion order, I know it cannot be guaranteed. The only thing I have to ensure is that, for every packet, the driver looks at its connection “history” and signals a corresponding IRP that gets processed in user mode on the same thread every time.

This way user mode has the guarantee that a connection’s packets (even if out of order) are processed by the same thread, and the state for that connection does not have to be needlessly synchronized.
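
A common way to get that “same connection, same thread” property is to hash the connection identity to a fixed worker index. The sketch below is Python for illustration only; `worker_for` and the `(address, port)` endpoint tuples are assumed names and shapes, not part of any driver or Win32 API.

```python
import zlib


def worker_for(protocol, endpoint_a, endpoint_b, worker_count):
    """Map a connection to a worker index that stays stable for its whole lifetime."""
    # Sort the endpoints so both directions of the connection produce the same key.
    low, high = sorted([endpoint_a, endpoint_b])
    key = f"{protocol}|{low[0]}:{low[1]}|{high[0]}:{high[1]}".encode("ascii")
    # crc32 is deterministic across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key) % worker_count


client = ("10.0.0.1", 50000)
server = ("10.0.0.2", 443)
# Packets in either direction land on the same worker.
assert worker_for("tcp", client, server, 8) == worker_for("tcp", server, client, 8)
```

The mapping is static, so a few heavy connections can still overload one worker; that is the load-balancing cost of avoiding per-connection locking.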

I wonder whether maintaining separate completion ports is a solution for that (it has to scale), with a single user-mode thread processing all completed IRPs that belong to a given completion port.

@ MM

> Again, the typical design here is to have a series of queues of work, and then to
> signal worker threads to process that work in order. The scope of each queue
> should be as granular as possible to ensure maximum parallelism. When you get
> an IO completion (via IOCP or any other method) you add work into the right
> queue and then decide if a worker thread is already processing that queue or
> another needs to be dispatched
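
A minimal sketch of one such queue, in Python for brevity. `SerialQueue` and the `dispatch` callback (whatever hands a callable to a worker thread, for example a thread pool's submit function) are hypothetical names. The point it illustrates is the "is a worker already processing this queue?" check.

```python
import threading
from collections import deque


class SerialQueue:
    """Work for one connection: items run in FIFO order, never two at once."""

    def __init__(self, dispatch):
        self._lock = threading.Lock()
        self._items = deque()
        self._draining = False      # True while some worker owns this queue
        self._dispatch = dispatch   # hands a callable to a worker thread

    def add(self, item, handler):
        """Called from any I/O completion thread."""
        with self._lock:
            self._items.append((item, handler))
            if self._draining:
                return              # the current worker will pick this item up
            self._draining = True
        self._dispatch(self._drain) # nobody is on this queue: dispatch a worker

    def _drain(self):
        """Runs on a worker thread until the queue is empty."""
        while True:
            with self._lock:
                if not self._items:
                    self._draining = False
                    return
                item, handler = self._items.popleft()
            handler(item)           # run outside the lock so add() never blocks on it
```

Different connections can be drained by different workers in parallel, while each connection's items are still handled one at a time and in arrival order.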

I am trying to understand exactly what you mean regarding threaded packet processing.
I know I cannot guarantee, from the driver’s perspective, the order in which packets for a connection are seen. Even if I queue them to different completion port queues in the driver, they end up being read in user mode from different I/O threads (that happens whether I use my own completion port threads or the special thread pool I/O; I guess that is its name).

In this case, as you said, the driver does not have to do anything special; it just queues pending IRPs to a single completion port queue.

  1. If performance ever requires multiple completion queues in the driver, then the driver again has to decide which of those competing pending IRPs to choose? (E.g. in user mode some packets from one connection take longer to process than packets from another connection/thread.)

  2. Do I need any kind of load-balancing algorithm on the driver side, or should everything be done in user mode? If I let user mode decide, by having the thread pool I/O threads queue the packets they read to additional queues (one more queuing level besides the completion port queue), then user mode needs another batch of processing threads to drain those intermediate queues. This also raises the question of how to feed the feedback loop, so that if processing takes too long we don’t issue new IRPs from the I/O thread pool, giving the driver the chance to drop some packets on the floor and, hopefully, causing the TCP rate to fall (I cannot say much about other protocols though).

This kind of load balancing, which guarantees that packets of the same connection stay on the same thread, seems to double the number of threads needed (we need threads exclusively for I/O polling/dequeuing) and adds another level of queues.

  3. I cannot see how different packets from the same connection could be seen at different times (with a full memory fence needed) if I allow packets to migrate between threads (which would be very good from a load-balancing perspective).

@ MM

Thanks for your elaborate answer. I was just looking more closely at the Vista I/O thread pool support and it seems quite good for exactly this purpose.

> If you are worried about memory fences, then don’t. Whatever synchronization
> you use to get the packets into a per connection queue will necessarily have
> some kind of fence that will protect you

  1. If I use manually created connection-processing threads (not the thread pool), I can indeed make sure that every queue (one queue contains multiple connections) is associated with a single processing thread. The part that queues packets to a particular “connection” queue must be synchronized, of course (a concurrent queue), also because queuing will be done from arbitrary I/O thread pool threads.

  2. Maintaining the connection queues does become a problem, though: how many of them should there be, and how do I throttle packets if the “connection” queues grow larger and larger? The feedback has to go back down to the driver, by pending fewer IRPs or even issuing no more IRPs.

… and just for clarification: “connection” has the normal meaning it has for TCP, or an equivalent definition that could be given for UDP.

> 1. If the need comes to use (performance) multiple completion queues

There will never be such a need.

What are the gains of this? None.

Even though received packets can arrive in the wrong order, this occurs rarely, so it is insignificant performance-wise.

Since you cannot (in a scalable system) have a per-connection thread, you must live with the fact that the incoming data traffic will be handled by different threads (usually from the pool). So, locking is a must.

About the initial submission of the async IRPs to sockets: note that the thread cannot exit while it has incomplete async IRPs, since that will cancel the IRPs. So this must be some never-exiting thread; you cannot exit it because of any load-balancing algorithm.

Spreading the CPU-intensive stuff across cores is done well by the usual Windows dispatcher.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

> 1. If I use manual created connection processing threads (not the thread pool) I can make sure
> indeed that I associate every queue(1 queue - contains multiple connections) with a single
> processing thread.

…and this thread will become the bottleneck, since N connections are waiting for it. You will narrow the scalability of your connections.

And what is the bonus? No locks on packet reassembly. Nothing else. Too tiny.

> The part that queues packets to a particular “connection” queue must be synchronized of course

Yes, the queue itself has the proper sync stuff in it.

> 2. Maintaining connection queues - how many of them, how do I throttle packets

Do not throttle. When your CPUs hit 100%, throttling will occur anyway :)


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com