Let me try to explain further.
Look at this from another point of view. TCP packets arrive out of order, fragmented, and so on; that is a fact of life. Someone in the system has to stitch them back into order at some level; that is also a fact of life. And it is a fact of life that this has to be done twice: once from network packets to TCP stream at the NIC / TCP.sys layer (depending on the offloads in effect), and again by the user-mode (UM) or kernel-mode (KM) socket client reading chunks of that TCP stream via multiple pending IRPs. Before anyone objects, I will expand on this below.
The problem I object to is this: in the NDIS stack, a great deal of care and attention has gone into holding fine-grained locks to promote parallelism, but to be effective in this situation the UM or KM socket client has to hold a big coarse-grained lock so that it knows which chunk of the TCP stream it has. The lower-level protocol already knows this: it knows the order in which each IRP was queued (block order), and it knows the TCP sequence (byte order). I am glossing over many details, but the point is that a big lock has to overarch the many small fine-grained or lock-free ones employed at the lower level, in order to guarantee something that could simply be output from the lowest level of locking / completion. Either kind of sequence information would be equally effective – block sequence or byte sequence – as long as it was included somehow in the data read from the TCP stream and handed back to the caller in UM or KM.
Addressing the question above: why is it necessary to pend multiple reads on the same TCP socket? The answer is simply performance – and not overall performance for a server that has many sockets to receive data from, but specifically for a server that has few (or one) TCP socket(s) to receive data from. The delay between completing a single pending IRP and posting a new one materially affects single-socket throughput. The only solution is to ensure that the TCP stack always has plenty of buffers immediately available to fill. But thread pre-emption at PASSIVE_LEVEL, or in UM, means that call or IOCP extraction order is unreliable, and so some other kind of sequencing is needed. Hence the need for that big ugly lock to force it, or for an output parameter of some kind that tells the higher-level callers in which order things were done.
I don't think there is an easy way this could be accommodated in the present Windows APIs, so I don't hold out hope that it will happen. The larger purpose here is to highlight that while the whole IRP + OVERLAPPED paradigm is a massive improvement over the earlier synchronous I/O or select-loop model, there are patterns of I/O that it does not serve perfectly.