Let me try to explain further.
Look at this from another point of view. TCP packets arrive out of order, fragmented, and so on; that is a fact of life. Someone in the system has to stitch them back into order at some level; that is also a fact of life. And it is a fact of life that this has to be done twice: once from network packets to TCP stream at the NIC / TCP.sys layer (depending on the offloads in effect), and again by the user-mode (UM) or kernel-mode (KM) socket client reading chunks of that TCP stream via multiple pending IRPs. Before anyone objects, I will expand on this below.
The problem I object to is this: in the NDIS stack, a great deal of care and attention has gone into holding fine-grained locks to promote parallelism, but to be effective in this situation the UM or KM socket client has to hold a big coarse-grained lock so that it knows which chunk of the TCP stream it has. The lower-level protocol already knows this: it knows the order in which each IRP was queued (block order), and it knows the TCP sequence (byte order). I am glossing over many details, but the point is that a big lock has to overarch the many small fine-grained or lock-free ones employed at the lower level, in order to guarantee something that could simply be output from the lowest level of locking / completion. Either kind of sequence information would be equally effective – block sequence or byte sequence – as long as it was included somehow in the data read from the TCP stream and handed back to the caller in UM or KM.
Addressing the question above: why is it necessary to pend multiple reads on the same TCP socket? The answer is simply performance – and not overall performance for a server that has many sockets to receive data from, but specifically for a server that has few (or one) TCP socket(s) to receive data from. The delay between completing a single pending IRP and posting a new one materially affects single-socket throughput. The only solution is to ensure that the TCP stack always has plenty of buffers immediately available to fill. But thread pre-emption at PASSIVE_LEVEL, or in UM, means that call or IOCP extraction order is unreliable, and so some other kind of sequencing is needed. Hence the need for that big ugly lock to force it, or for an output parameter of some kind that tells the higher-level callers in which order things were done.
I don't think there is an easy way this could be accommodated in the present Windows APIs, so I don't hold out hope that it will happen. The larger purpose here is to highlight that while the whole IRP + OVERLAPPED paradigm is a massive improvement over the earlier synchronous I/O or select-loop model, there are patterns of I/O that it does not serve perfectly.