Perhaps some additional context will help.
If the nature of your IO is that each operation is independent, then
carrying the state with the request is obviously the best choice and, as you
say, frees you from any additional synchronization beyond whatever the
underlying drivers have done. A good example of this would be a DNS server.
Each DNS request is completely contained in a UDP datagram and the response
is completely contained in a single datagram. There is no relationship
whatsoever between any requests and every IO operation can be handled
statelessly. The only synchronization needed in this server would be on the
in-memory DNS data (ideally a shared-reader, single-writer lock).
Another example, which is not completely stateless, but can effectively use
this model, is a batched log writer. Here some data to be logged is saved
in a memory buffer and periodically flushed to disk. Some synchronization is
needed when issuing the IO to ensure the correct file offset for each
operation and switch the incoming data stream to the next buffer, but on
completion, the buffer can be freed or returned to a standby list without
regard to what order it completed or whether there are several others in
progress.
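A rough sketch of that log writer, to make the split between "issue" and "completion" concrete. Everything here is hypothetical (the names log_writer, log_begin_flush, log_complete, and the tiny buffer sizes), and the actual disk write is left out; on Windows the reserved offset would go into the OVERLAPPED Offset/OffsetHigh fields. The functions marked LOCKED are assumed to run under the writer's lock.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOG_BUF_SIZE  64   /* hypothetical; tiny so the example is easy to follow */
#define LOG_BUF_COUNT 4    /* hypothetical number of buffers */

typedef struct {
    char               data[LOG_BUF_SIZE];
    size_t             used;         /* bytes filled so far */
    unsigned long long file_offset;  /* stamped when the flush is issued */
    int                in_flight;    /* 1 while a write is pending */
} log_buffer;

typedef struct {
    log_buffer         bufs[LOG_BUF_COUNT];
    int                active;       /* buffer receiving new data */
    unsigned long long next_offset;  /* next unreserved byte in the file */
} log_writer;

static void log_init(log_writer *w) { memset(w, 0, sizeof *w); }

/* LOCKED: append to the active buffer; returns 0 if the data does not fit. */
static int log_append(log_writer *w, const char *msg, size_t len)
{
    log_buffer *b = &w->bufs[w->active];
    if (len > LOG_BUF_SIZE - b->used)
        return 0;
    memcpy(b->data + b->used, msg, len);
    b->used += len;
    return 1;
}

/* LOCKED: reserve the file range for the active buffer and switch the
 * incoming data stream to a standby buffer. Returns the buffer to write,
 * or NULL if there is nothing to flush or no standby buffer is free. */
static log_buffer *log_begin_flush(log_writer *w)
{
    log_buffer *b = &w->bufs[w->active];
    int next = -1;

    if (b->used == 0)
        return NULL;
    for (int i = 0; i < LOG_BUF_COUNT; i++) {
        if (i != w->active && !w->bufs[i].in_flight) { next = i; break; }
    }
    if (next < 0)
        return NULL;

    b->file_offset = w->next_offset;   /* offset fixed at issue time */
    w->next_offset += b->used;
    b->in_flight = 1;
    w->active = next;
    return b;
}

/* Completion: may run in any order relative to other completions.
 * In a threaded server this update would need to be interlocked. */
static void log_complete(log_buffer *b)
{
    b->used = 0;
    b->in_flight = 0;
}
```

The ordering-sensitive work (offset reservation and buffer switch) happens only at issue time, which is why the completion side has nothing to order.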
If, however, your IO is of a nature that each operation is not independent,
then carrying the state from your application is not sufficient. Consider
the example of a TCP socket server: IO issued on independent connections is
not dependent in any way, but IOOPs on the same socket are. For writes, the
application must use synchronization to enforce the consistency of whatever
protocol it is using, so while there may be multiple pending IOOPs, the
order in which they entered the driver's write queue must be the same as the
order that the thread(s) intended, so an exclusive lock per socket is needed.
For example, thread A acquires the lock and sends three buffers, then thread
B acquires the lock and sends two. The lock guarantees the order in which
the writes are queued but the order in which the completions are handled is
irrelevant because the completion handler need only free or reuse the
resources for the write. This case is exactly like the log writer above.
It is for reads that the problem I was referring to exists. If the
application uses a single read operation per socket, then there is no
ambiguity with respect to order of operations because all of the IOOPs are
independent. This model works well if there are a large number of low
bandwidth connections, but does not achieve high performance if the
connections are high bandwidth or have bursts of traffic because additional
buffering and copying in KM is required when there is no read pending.
Keeping some number of reads pending on every socket at all times, so that
as soon as the TCP stack receives data it can copy it into a user buffer and
complete the read, greatly improves the performance for reading data on high
bandwidth or ‘bursty’ connections. But as these reads are not all
independent, ambiguity in completion order becomes a problem. Unlike
writes, where the completion routine is simply freeing resources, for reads
the completion routine must start or perform the protocol decode and
whatever action the server should take on receipt of this data. In order
for the application to process the TCP stream, it needs to do so by
gathering data from the buffers in the same order as they were filled in KM.
Because the thread execution order is indeterminate (context switches +
multiple processors), the obvious choice is to add a sequence to each read
(with a scope of the socket) and to use that plus some pointer swapping
magic to reassemble the stream in a way that one of the completion routines
can begin processing (the one that completed the read representing the next
block is usually a good choice, though this is not strictly required; a
forward progress guarantee is all that is needed).
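To illustrate the reassembly step, here is a minimal sketch under some loud assumptions: the names (conn_reader, read_done, on_read_complete) and MAX_PENDING are made up, the "decoder" is just a byte buffer, and the function is assumed to be called by one thread at a time per connection (a real server would need a per-connection lock or the pointer-swapping scheme mentioned above). Completions are parked by sequence number and drained whenever the next expected block is present.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PENDING 8   /* hypothetical: reads kept pending per socket (power of two) */

typedef struct {
    unsigned    seq;    /* sequence assigned when the read was issued */
    const char *data;   /* buffer filled by the completed read */
    size_t      len;    /* bytes actually received */
} read_done;

typedef struct {
    unsigned  next_seq;             /* next block the decoder needs */
    read_done parked[MAX_PENDING];  /* slot = seq % MAX_PENDING */
    int       present[MAX_PENDING]; /* 1 if the slot holds a parked block */
    char      stream[256];          /* stands in for the protocol decoder */
    size_t    stream_len;
} conn_reader;

/* Hand one block to the "decoder" (here: append to a byte buffer). */
static void consume(conn_reader *c, const read_done *r)
{
    assert(c->stream_len + r->len <= sizeof c->stream);
    memcpy(c->stream + c->stream_len, r->data, r->len);
    c->stream_len += r->len;
}

/* Called by whichever thread dequeued this completion. Parks the block,
 * then drains every block that is now contiguous with the stream.
 * Returns the number of blocks handed to the decoder by this call. */
static int on_read_complete(conn_reader *c, read_done r)
{
    int delivered = 0;
    unsigned slot = r.seq % MAX_PENDING;

    /* At most MAX_PENDING reads may be outstanding per connection. */
    assert(r.seq - c->next_seq < MAX_PENDING);
    c->parked[slot] = r;
    c->present[slot] = 1;

    while (c->present[c->next_seq % MAX_PENDING]) {
        slot = c->next_seq % MAX_PENDING;
        consume(c, &c->parked[slot]);
        c->present[slot] = 0;
        c->next_seq++;
        delivered++;
    }
    return delivered;
}
```

A completion that arrives early returns 0 and its thread goes back to the IOCP; the thread that completes the missing block drains everything behind it, which is enough for forward progress.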
All of this leads me to the limitation (or annoyance) that I mentioned at
the beginning. To ensure that these sequence numbers are assigned in the
same order as the underlying drivers queue the reads, an application must
use a lock so that each read is fully queued (ReadFile or WSARecv either
fails with ERROR_IO_PENDING / WSA_IO_PENDING or succeeds immediately) before
the next read is issued. If the API returned a per-connection, per-direction
sequence number, the lock could be avoided. It is the lack of elegance in
this one element of an otherwise unobjectionable design paradigm that has
irked me, albeit only moderately, and provoked my comments.
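For what it is worth, the locking pattern I mean looks roughly like the sketch below. It is portable C rather than Win32 so it can be compiled anywhere: the atomic_flag spin lock stands in for a CRITICAL_SECTION or similar, fake_issue_read is a hypothetical stand-in for ReadFile/WSARecv, and driver_queue only models the order in which the driver saw the reads. None of these names come from a real API.

```c
#include <assert.h>
#include <stdatomic.h>

#define QUEUE_DEPTH 16   /* hypothetical capacity of the modelled queue */

typedef struct {
    atomic_flag lock;                      /* stand-in for a per-socket lock */
    unsigned    next_seq;                  /* application-side sequence counter */
    unsigned    driver_queue[QUEUE_DEPTH]; /* seq of each read, in arrival order */
    unsigned    queued;
} conn;

static void conn_init(conn *c)
{
    atomic_flag_clear(&c->lock);
    c->next_seq = 0;
    c->queued = 0;
}

/* Hypothetical stand-in for ReadFile/WSARecv: when it returns, the read
 * has either been queued by the driver or has completed immediately. */
static void fake_issue_read(conn *c, unsigned seq)
{
    assert(c->queued < QUEUE_DEPTH);
    c->driver_queue[c->queued++] = seq;
}

/* Assign the sequence number and issue the read as one atomic step, so
 * no other thread can slip a read in between the two. */
static unsigned post_read(conn *c)
{
    unsigned seq;

    while (atomic_flag_test_and_set_explicit(&c->lock, memory_order_acquire))
        ;  /* spin; a real server would block on a proper lock */

    seq = c->next_seq++;
    fake_issue_read(c, seq);   /* must return before the lock is dropped */

    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return seq;
}
```

The lock has to cover the issuing call itself, not just the counter increment; an interlocked increment alone would still let two threads reach the driver in the opposite order.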
wrote in message news:xxxxx@ntdev…
To me, the great power of the IOCP is that I do NOT need to
synchronize anything; I just handle the completions in the order they are
delivered, with no concern at all for the order in which they were
submitted. Each transaction carries its own state along with it. If you
care about fifoness then you cannot have more than one thread waiting for
the IOCP, but even then, you only get them in the order they appear in the
IOCP, which may not be the order in which they were submitted, or even the
order in which they are completed (there was a thread about a year ago
when a number of experts told me that completion order and IOCP order are
explicitly not guaranteed to be the same order!)
To get the full benefit of the IOCP model, you have to assume I/O
completes in some opportunistic order, and therefore the disconnect
between completion order and IOCP order becomes irrelevant.
To carry the state along, embed the OVERLAPPED structure in a
state-carrying structure, e.g.,
typedef struct {
    OVERLAPPED ovl;
    BYTE buffer[somesize];  // or LPBYTE buffer and allocate it yourself
    DWORD count;            // actual buffer space used
    MYREQUESTORINF inf;     // whatever you want, either individual
                            // fields or some other struct that carries your state
} MYSTATE, * PMYSTATE;
You pass the address of ovl to your I/O operation, and in your IOCP
receiving thread you do appropriate casts to get it back to a PMYSTATE.
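A small sketch of that cast, written so it compiles off Windows: FAKE_OVERLAPPED is a made-up stand-in for the real OVERLAPPED, and MYSTATE_SKETCH and state_from_ovl are hypothetical names. The offsetof arithmetic is the same idea as the Windows CONTAINING_RECORD macro, and it keeps working even if the overlapped member is not the first field (when it is first, a plain cast is enough).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real OVERLAPPED so the sketch builds anywhere. */
typedef struct { void *internal[4]; } FAKE_OVERLAPPED;

typedef struct {
    int             tag;    /* placed first on purpose: ovl is NOT at offset 0 */
    FAKE_OVERLAPPED ovl;    /* address of this member is passed to the I/O call */
    unsigned long   count;  /* actual buffer space used */
} MYSTATE_SKETCH;

/* Recover the enclosing state structure from the overlapped pointer that
 * the completion port hands back. */
static MYSTATE_SKETCH *state_from_ovl(FAKE_OVERLAPPED *p)
{
    return (MYSTATE_SKETCH *)((char *)p - offsetof(MYSTATE_SKETCH, ovl));
}
```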
So there is never a need to do any kind of synchronization. If your app
truly requires FIFO responses based on submittal order, redesign your app
to remove this requirement, because we are already told that the
completions are not guaranteed to appear in submission, or even
completion, order.
It is important when considering concurrency to handle everything as
asynchronous opportunistic completion events. No good ever comes of
trying to preserve the dead single-thread sequential behavior model. You
have to shift paradigms, big time, or you are doomed.
joe
That sounds reasonable, except for the small matter that most applications
actually care about which order the ‘Hello world’ and ‘Goodbye world’ get
written to the device. Unless each IOOP is stateless, as in a UDP server
like DNS, the context of one IOOP with respect to another is important.
For applications using sync IO, typically only a single thread will be issuing
IO per handle, but if that is not the case, then the synchronization means
that the programmer can expect the IOOPs to complete in the same
order that the UM code issued them. This is largely unhelpful because of
context switches, but at least it is comprehensible to a UM programmer who
has no idea about IRPs or how the IO completes in the kernel.
IMO the one limitation of overlapped IO & IOCP is that it requires the app
to synchronize the calls to pend reads because there is no way for the app
to determine the sequence in which a particular IOOP was ‘seen’ by the
kernel. The IOCP will dequeue in order, but one instruction later, the
thread may be preempted by the thread that dequeues the next completion and
the sequence is lost. This overhead is still small compared with the
benefit of multiple pending reads for many applications, however; it is more
the lack of elegance that bothers me than anything else.
wrote in message news:xxxxx@ntdev…
Pavel,
Ok, we have two threads issuing WriteFile(“Hello world!\n”) and
WriteFile(“Goodbye world!\n”) simultaneously. In the end it is all about IRPs
(I don’t consider FastIO now). At some point we need serialization since we
have only one output device. But isn’t it better to do it in the driver to
reduce the lock’s granularity?
NTDEV is sponsored by OSR
For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars
To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer