I have mentioned this here before, but IMHO this is one of the biggest issues with the Windows IO model. I don’t want to call it a design flaw per se; it is more of a fundamental performance limitation.
High-performance UM applications that achieve their IO throughput by using overlapped IO and pending multiple read and write operations on the same handle concurrently have to use UM synchronization to know the exact order in which those operations get pended, whenever order matters. An output parameter of some kind carrying the ‘sequence’ would remove the need for that redundant, coarse-grained synchronization.
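To make the idea concrete, here is a purely hypothetical shape such an output parameter could take; nothing like this exists in the current API, and the name and layout are my own invention:

    // Hypothetical only: an OVERLAPPED variant where the kernel reports the
    // order in which the IRP was queued. With this, completions could be
    // stitched together in UM without any handle-wide lock.
    typedef struct _SEQUENCED_OVERLAPPED {
        OVERLAPPED Overlapped;     // existing overlapped plumbing, unchanged
        ULONGLONG  QueueSequence;  // filled in at IRP queue / completion time
    } SEQUENCED_OVERLAPPED;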
For example, if an application such as SQL Server is performing file IO and issues multiple concurrent overlapped reads / writes at specific locations within a .mdf, then as long as it takes care never to issue multiple operations against the same location, the order in which they enter the disk stack and are completed back to UM (IOCP) is irrelevant. Thread pre-emption, context switches and interrupts can all happen at will, since there is no inherent order to these operations. And if SQL wants to both read a block of the file and write to it at the same time, well, that is exactly what the buffer pool is designed to avoid – if you already have the page in memory, why are you reading it from disk? And transactional consistency means you can never write to it concurrently either.
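A minimal sketch of that order-independent pattern, assuming a hypothetical data.mdf of at least four pages and an 8 KB page size: each read targets a distinct offset, so completions can be drained in whatever order the IOCP reports them, with no lock anywhere.

    #include <windows.h>
    #include <stdio.h>

    #define PAGE_SIZE 8192  // assumption: SQL-style 8 KB pages

    struct PageRead {
        OVERLAPPED ov;              // carries the file offset for this read
        char       buf[PAGE_SIZE];  // destination buffer for one page
    };

    int main() {
        HANDLE file = CreateFileW(L"data.mdf", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
        if (file == INVALID_HANDLE_VALUE) return 1;

        HANDLE iocp = CreateIoCompletionPort(file, NULL, 0, 0);
        if (iocp == NULL) return 1;

        // Pend several reads at distinct page offsets; no synchronization is
        // needed because no two operations touch the same range.
        PageRead reads[4] = {};
        for (int i = 0; i < 4; i++) {
            LARGE_INTEGER off; off.QuadPart = (LONGLONG)i * PAGE_SIZE;
            reads[i].ov.Offset     = off.LowPart;
            reads[i].ov.OffsetHigh = off.HighPart;
            if (!ReadFile(file, reads[i].buf, PAGE_SIZE, NULL, &reads[i].ov) &&
                GetLastError() != ERROR_IO_PENDING) return 1;
        }

        // Drain completions in whatever order they arrive.
        for (int i = 0; i < 4; i++) {
            DWORD bytes; ULONG_PTR key; OVERLAPPED* ov;
            if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE))
                return 1;
            PageRead* r = (PageRead*)ov;  // OVERLAPPED is the first member
            printf("page at offset %lu done, %lu bytes\n", r->ov.Offset, bytes);
        }
        CloseHandle(iocp);
        CloseHandle(file);
        return 0;
    }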
But take a network application, for example. Suppose you implement a UDP server like DNS. The same argument applies, since it does not really matter whether a certain UDP packet gets processed before any other UDP packet. But let’s say we switch to a TCP server like SMB. Now order does matter, and there is no way to control it on the UM side. Each read returns the next chunk of the TCP stream, and the application must process those chunks in the correct order to function properly. It is easy enough to say ‘limit yourself to one pending read per socket’, but in reality that only performs well if there are many sockets and each one sends little data. What happens when there are few sockets (or one) sending all the data? The only way the TCP stack in Windows (or any other OS) can cope with this is to either:
- Reduce the window size so that transmission speed slows
- Have the application pend enough buffers that the TCP stack immediately has space to copy the data into and can then complete (see the sketch after this list)
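Here is the second option as a minimal sketch, assuming Winsock is already initialized and sock is a connected TCP socket associated with an IOCP; the buffer count and size are arbitrary assumptions. Keeping several WSARecv buffers pended means the stack always has user memory to copy into, so it never has to shrink the receive window to throttle the sender.

    #include <winsock2.h>
    #include <windows.h>
    #pragma comment(lib, "ws2_32.lib")

    #define CHUNK_SIZE  65536  // assumption: one 64 KB buffer per pended read
    #define NUM_BUFFERS 8      // assumption: depth of the pended-read pipeline

    struct RecvOp {
        WSAOVERLAPPED ov;
        WSABUF        wsabuf;
        char          buf[CHUNK_SIZE];
    };

    // Post NUM_BUFFERS concurrent reads on one socket.
    bool PostReads(SOCKET sock, RecvOp ops[NUM_BUFFERS]) {
        for (int i = 0; i < NUM_BUFFERS; i++) {
            ZeroMemory(&ops[i].ov, sizeof(ops[i].ov));
            ops[i].wsabuf.buf = ops[i].buf;
            ops[i].wsabuf.len = CHUNK_SIZE;
            DWORD flags = 0;
            if (WSARecv(sock, &ops[i].wsabuf, 1, NULL, &flags,
                        &ops[i].ov, NULL) == SOCKET_ERROR &&
                WSAGetLastError() != WSA_IO_PENDING)
                return false;
        }
        return true;  // completions arrive on the IOCP – in some order
    }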
That means that to approach wire-speed TCP performance on a relatively lossless network, applications have to know in which order those chunks should be stitched back together. Because of thread pre-emption etc., the completion order seen via IOCP is not enough. The current standard solution is to ‘lock the handle’ while sending the read down the stack and record a sequence number in UM. But the UM code does not actually care whether the calls to WSARecv or ReadFile etc. hit the stack in the same order that they were made. It only cares about the order in which they were queued / completed by the driver in question – because that is the order in which they need to be processed to stitch the TCP data back together. So we end up using a big lock to guard an entire UM / KM transition plus all the work going down the stack until the IRP gets queued or completed, when all we really need is the smallest lock at the end, when the driver queues or completes the IRP.
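For comparison, a minimal sketch of that standard workaround, reusing the RecvOp shape from the previous sketch with an added sequence field (and assuming InitializeCriticalSection was called at startup); note how wide the critical section has to be:

    #include <winsock2.h>
    #include <windows.h>

    struct SeqRecvOp {
        WSAOVERLAPPED ov;
        WSABUF        wsabuf;
        LONGLONG      seq;        // stitch order, assigned under the lock
        char          buf[65536];
    };

    static CRITICAL_SECTION g_handleLock;   // guards the whole UM / KM transition
    static LONGLONG         g_nextSeq = 0;  // next stitch position in the stream

    bool PostSequencedRead(SOCKET sock, SeqRecvOp* op) {
        ZeroMemory(&op->ov, sizeof(op->ov));
        op->wsabuf.buf = op->buf;
        op->wsabuf.len = sizeof(op->buf);
        DWORD flags = 0;

        // The coarse-grained lock: nothing else may issue a read on this
        // socket until this call has travelled all the way down the stack.
        // Only then is 'seq' guaranteed to match TCP stream order.
        EnterCriticalSection(&g_handleLock);
        op->seq = g_nextSeq++;
        int rc = WSARecv(sock, &op->wsabuf, 1, NULL, &flags, &op->ov, NULL);
        LeaveCriticalSection(&g_handleLock);

        return rc == 0 || WSAGetLastError() == WSA_IO_PENDING;
    }

    // On completion, the handler sorts chunks by 'seq' before consuming them.
    // A sequence reported back by the stack itself would make the lock
    // unnecessary.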