First, completion order is never guaranteed to match transport order for recv or send, no matter what. Such a guarantee would not have held on Windows even with the legacy UP (uniprocessor) kernels.
Next, to be clear about scope:

- File IO does not matter: the higher-level initiator already knows which block it is writing to or reading from and does not need the FS driver to tell it.
- IRPs to arbitrary devices do not matter (though some might benefit).
- Within network IO, the send path does not matter regardless of protocol: concurrent writes to the same stream always need to be guarded with a higher-level synch object unless they can be guaranteed atomic, and the higher-level protocols do not carry their own sequence numbers. A minimal sketch of that guard follows below.
- The recv path of UDP, and of other protocols that return one packet per IRP, does not matter either: ordering is fundamentally not guaranteed by the network, so any extra reordering inside the OS is irrelevant.
- Within the scope of TCP-based protocols, the only ones that matter are those that can multiplex requests within a single TCP stream.
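To make the send-path point concrete, here is a minimal sketch of the kind of higher-level synch object I mean; CONN and SendMessageBlock are just illustrative names, not anything from a real codebase. Without something like it, two threads calling send on the same stream can interleave their bytes mid-message.

```c
/* Minimal sketch: serializing the send path on one stream with an
 * application-level lock (link with ws2_32.lib). */
#include <winsock2.h>
#include <windows.h>

typedef struct CONN {
    SOCKET           s;
    CRITICAL_SECTION sendLock;   /* the higher-level synch object for sends */
} CONN;

/* Send one complete protocol message; returns bytes sent or -1 on error. */
int SendMessageBlock(CONN *c, const char *buf, int len)
{
    int sent = 0;
    EnterCriticalSection(&c->sendLock);   /* keep the message contiguous in the stream */
    while (sent < len) {
        int n = send(c->s, buf + sent, len - sent, 0);
        if (n == SOCKET_ERROR) { sent = -1; break; }
        sent += n;
    }
    LeaveCriticalSection(&c->sendLock);
    return sent;
}
```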
Classic protocols such as SMTP, HTTP 1, and even TDS have IO patterns where the client sends a command to the server, waits for the response, and then sends the next one. The IRP times are dwarfed by the network round-trip times, and if the TCP server has to issue 2, 3, or 10 reads to pull in the whole command, it has no significant impact on the overall performance of the system. That performance ceiling is driven principally by the bandwidth-delay product. It can be raised by using multiple connections, but there are situations where multiple connections are problematic (guaranteed in-order delivery), and even when they can be used, total system performance drops, because the routers, switches, and firewalls along the way have to maintain larger NAT, XLATE, and other tables and process the extra packets for connection setup and teardown. Add any amount of packet loss / reordering and any amount of latency, and performance can suffer terribly. That will be true in any case; but when loss and latency are not the dominant problem, we can do a lot better.
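For a rough feel of that bandwidth-delay ceiling, a back-of-the-envelope calculation; the window and RTT numbers below are only examples, not measurements:

```c
/* Bandwidth-delay product illustration: a single connection cannot exceed
 * roughly (window / RTT) regardless of link speed. */
#include <stdio.h>

int main(void)
{
    double window_bytes = 64.0 * 1024.0;  /* classic 64 KB receive window */
    double rtt_seconds  = 0.050;          /* 50 ms round trip             */
    double max_bps      = (window_bytes / rtt_seconds) * 8.0;

    printf("single-connection cap: ~%.1f Mbit/s\n", max_bps / 1e6);  /* ~10.5 */
    return 0;
}
```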
More modern protocols like HTTP 2, SSH 2, and, most importantly to me, FIX (Financial Information eXchange) allow a pipeline of requests to be in progress on a single TCP connection. Most of the incoming requests encoded in the TCP stream at this higher protocol level are orthogonal to one another, and so can be processed effectively on multiple worker threads, but that cannot be determined until the TCP stream has been reassembled in TCP stream order, which is different from IRP completion order. And because the order within the TCP stream of each socket can only be known by the caller in one of two ways (keep exactly one recv pending per socket, or take a big lock so that a sequence number you assign yourself will be right), that choice has to be made; a sketch of the second option follows below. Typically it is one recv per socket. But even with large per-IRP buffers, that model will throttle per-socket performance through window-size stalls caused by preemption or starvation.
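To show why the "big lock plus a sequence number you assign yourself" option is so heavy-handed, a rough sketch; RECV_CTX, STREAM, and PostRecv are made-up names, and most error handling is omitted. The lock has to span the WSARecv post itself, and the pre-assigned positions are only right if every buffer completes full, which is exactly the kind of fragility I am complaining about.

```c
/* Sketch of option two: tag each posted recv with an application-assigned
 * stream position so completions can be reassembled later. */
#include <winsock2.h>
#include <windows.h>

typedef struct RECV_CTX {
    OVERLAPPED ov;
    WSABUF     wsabuf;
    char       data[16 * 1024];
    ULONGLONG  streamPos;        /* position this buffer was posted for */
} RECV_CTX;

typedef struct STREAM {
    SOCKET           s;
    CRITICAL_SECTION postLock;
    ULONGLONG        nextPos;    /* bytes posted so far */
} STREAM;

BOOL PostRecv(STREAM *st, RECV_CTX *ctx)
{
    DWORD flags = 0;
    int   rc;

    ctx->wsabuf.buf = ctx->data;
    ctx->wsabuf.len = sizeof(ctx->data);
    ZeroMemory(&ctx->ov, sizeof(ctx->ov));

    /* The lock is the only way to know where this buffer lands in the stream:
     * the tag is only meaningful if posting order equals fill order, and the
     * increment below is only correct if every buffer completes full.        */
    EnterCriticalSection(&st->postLock);
    ctx->streamPos = st->nextPos;
    st->nextPos   += ctx->wsabuf.len;
    rc = WSARecv(st->s, &ctx->wsabuf, 1, NULL, &flags, &ctx->ov, NULL);
    LeaveCriticalSection(&st->postLock);

    return (rc == 0) || (WSAGetLastError() == WSA_IO_PENDING);
}
```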
A much more effective strategy for dealing with large swings in the volume of data (as happen routinely with the stock market, etc.) is to keep a significant number of pre-allocated and pended buffers ready to take up the slack. There is a limit to everything, but if you plan to handle a peak-to-typical ratio of 1000:1, in my experience it will work well; a rough sketch of such a pool follows.
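The sketch below uses made-up names and sizes; the point is only that the buffers exist before the burst arrives, so a spike never waits on the allocator.

```c
/* Pre-allocated receive pool sized for a large peak-to-typical ratio. */
#include <winsock2.h>
#include <stdlib.h>

#define TYPICAL_PENDED_RECVS 4
#define PEAK_FACTOR          1000          /* plan for ~1000:1 peak to typical */
#define RECV_BUF_SIZE        (16 * 1024)

typedef struct POOL_ENTRY {
    OVERLAPPED ov;
    WSABUF     wsabuf;
    char       data[RECV_BUF_SIZE];
} POOL_ENTRY;

/* Allocate the whole pool up front; roughly 64 MB with the numbers above. */
POOL_ENTRY *AllocateRecvPool(size_t *countOut)
{
    size_t      count = (size_t)TYPICAL_PENDED_RECVS * PEAK_FACTOR;
    POOL_ENTRY *pool  = (POOL_ENTRY *)calloc(count, sizeof(POOL_ENTRY));

    if (pool != NULL)
        *countOut = count;
    return pool;
}
```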
So within all of that rambling diatribe, what's the question? The TCP stack uniquely has the knowledge of where each chunk of data it delivers back to the callers of recv etc. belongs within the stream. That is its chief function, and classically it assumes that applications are comparatively dumb in the way that stream gets read. But if the application (UM or KM) has to do better than that to meet its performance requirements, the stack provides no intrinsic way to avoid a heavy-handed approach. And that's where I see the opportunity to improve.
There are many levels of technology in play, but a possible solution would be for the stack to set the Offset and OffsetHigh values in the OVERLAPPED to the TCP sequence of the block returned by a recv call; a purely hypothetical sketch of what the consuming side could then look like follows below. I won't get my hopes up on this any more than I have for the support case over crashes in the ODBC client driver during SQL AOAC failover that has been open for the last two years. I think someone at MSFT forgot that there is a C API and just let the C++ exceptions float on through anyway.
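To be explicit that this is a thought experiment and nothing more: no current Winsock or TCP stack behavior does this. But if the stack did stamp the stream position of each returned block into the OVERLAPPED, the consuming side could look roughly like the sketch below; every name in it is hypothetical. The payoff is that any worker thread could place the bytes without coordinating with the threads handling the other completions on the same socket.

```c
/* Hypothetical only: assumes the stack wrote the 64-bit stream offset of this
 * block into Offset/OffsetHigh, the way a file system does for file reads.   */
#include <winsock2.h>
#include <windows.h>
#include <string.h>

typedef struct RECV_CTX {
    OVERLAPPED ov;
    WSABUF     wsabuf;
    char       data[16 * 1024];
} RECV_CTX;

void OnRecvCompletion(RECV_CTX *ctx, DWORD bytesTransferred, char *reassemblyBase)
{
    /* Recover the (hypothetical) stream position the stack stamped on the IRP. */
    ULONGLONG streamPos = ((ULONGLONG)ctx->ov.OffsetHigh << 32) | ctx->ov.Offset;

    /* Place the bytes directly; a real consumer would map the position into a
     * bounded reassembly window rather than a flat buffer.                     */
    memcpy(reassemblyBase + streamPos, ctx->data, bytesTransferred);
}
```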
I am sure I am leaving out critical information from my analysis. And I am equally sure that I am the worst one to identify it, since, of course, I read what I think I said and not what the paper actually says.