Sockets, I/O Completion ports, TCP_NODELAY, SO_RCVBUF and SO_SNDBUF

I have a server application that uses I/O completion ports
to do socket I/O.

I notice that unless I set SO_SNDBUF to 0, the stack still
copies the data to an internal buffer. (I notice this because,
unless I set SO_SNDBUF to 0, I get the notification that my
final write has finished and close the socket, but the final write
hasn’t actually made it onto the wire and disappears.)

So, this got me thinking…

If I set SO_SNDBUF to 0, does the Nagle algorithm still
function? For example, does the TCP stack use a gather
write from multiple buffers to assemble the packet?
Or does setting SO_SNDBUF to 0 effectively disable
Nagle as well (effectively setting TCP_NODELAY)?

Also, is setting SO_RCVBUF and SO_SNDBUF to 0 a Microsoft
recommended practice for a high performance server using
completion ports?

Thanks,

Joseph

I hate it when I forget half the question :(

Also, with TCP_NODELAY set (Nagle disabled), does
one WriteFile() always generate at least one TCP
packet? Or, if I fill the TCP window so that
data is queued in the stack, will it coalesce
multiple writes into a single packet?

Thanks,

Joseph


> I notice that unless I set the SO_SNDBUF to 0, it is still
> copying the data to an internal buffer (I notice this because
> unless I set the SNDBUF to 0, I get the notification that my
> final write has finished, close the socket, and the final write
> hasn’t actually made it onto the wire and disappears.)

Yes. This is how zero-copy is activated in Windows.

Note: TCP may need to retransmit the data, and with zero-copy the only source
of the data is the app’s IRP+MDL. So TCP/AFD has no choice but to pend
the send request until all the ACKs have arrived. That is how zero-copy works.

Note: TDI_SEND is always pended until all the ACKs have arrived, because TCP
does no buffering (buffering is AFD’s job). So with SO_SNDBUF != 0, AFD
does the buffering. With SO_SNDBUF == 0, AFD does no buffering and simply
converts socket writes to TDI_SENDs.
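
For illustration only, here is a minimal sketch of requesting a zero-byte send
buffer from user mode. It uses Python's standard socket module rather than the
Winsock C API, and the helper name disable_send_buffering is made up for this
example. The thread describes Windows honoring a value of 0; other stacks may
clamp the request to their own minimum, so the sketch reads the value back
instead of assuming it took effect.

```python
import socket


def disable_send_buffering(sock: socket.socket) -> int:
    """Request SO_SNDBUF = 0 and return the size the stack reports afterwards.

    Hypothetical helper for illustration; not part of any library.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)
    # Some stacks round the request up to an internal minimum, so the
    # effective value is read back rather than assumed to be zero.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
```

With buffering off, a send completes only after the peer has acknowledged the
data, so the caller's buffer must stay valid and untouched until then.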

> If I set SO_SNDBUF to 0, does the Nagle algorithm still
> function?

Definitely not. If you issue lots of tiny sends with zero-copy, then each
send blocks until all its ACKs arrive, and the next send cannot
proceed. This gives Nagle send coalescing no chance, since TCP
has only one pending send at a time.

That is not bad anyway: Nagle is for tiny sends, while zero-copy is for huge
ones. You can switch SO_SNDBUF on and off on the fly. (Yes, this is fast,
faster than the data sends; the prejudice that option changes are more
expensive than the traffic is wrong.)

For example: send the tiny header with Nagle, then turn zero-copy on, send the
huge data, then turn zero-copy off again, and so on. This is very beneficial,
by the way; I have tried it in practice.
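
The header/bulk pattern described above might look roughly like the sketch
below. It is written in Python with blocking sendall() calls for brevity,
whereas the server in this thread uses overlapped I/O; the function name and
the 8192-byte default for the buffered mode are assumptions made for the
example, not values taken from the thread.

```python
import socket


def send_header_then_bulk(sock, header: bytes, payload: bytes,
                          buffered_size: int = 8192) -> None:
    """Send a small header with buffering on, then a large payload with it off.

    Hypothetical helper; buffered_size is an arbitrary illustrative value.
    """
    # Buffered mode: the small header is copied, so Nagle may coalesce it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffered_size)
    sock.sendall(header)
    try:
        # Unbuffered ("zero-copy") mode for the large payload.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 0)
        sock.sendall(payload)
    finally:
        # Restore buffering for the next small send, even if the send failed.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffered_size)
```

Any object exposing setsockopt() and sendall() can be passed in, which makes
the ordering of the option changes easy to check with a stub.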

> For example, does the TCP stack use a gather
> write from multiple buffers to assemble the packet?

With zero-copy, TCP has only one outstanding send at a time (usually, unless
you issue lots of overlapped sends to the socket).

> Or does setting SO_SNDBUF to 0 effectively disable
> Nagle as well (effectively set TCP_NODELAY.)

Not directly, only in effect. I believe that if you use lots of overlapped
sends with zero-copy, then Nagle will be active.

Nagle is TCP’s behaviour, while zero-copy is AFD’s (TCP is always zero-copy).

> Also, is setting SO_RCVBUF and SO_SNDBUF to 0 microsoft
> recommended practice for a high performance server using
> completion ports?

Setting both to 0 means zero-copy.

SO_RCVBUF = 0 will force TCP to close the receive window immediately on the
next incoming packet if there are no pending reads on the socket (a very
degenerate case). But if there are pending reads, then SO_RCVBUF = 0 will do
zero-copy (well, single-copy: from the NDIS_PACKET to the app’s MDL; I believe
this is done in AFD’s TdiClientEventReceive callback called by TCP). Otherwise,
there will be 2 copies: from the NDIS_PACKET to AFD’s buffer, and from AFD’s
buffer to the app.

Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> Also, with TCP_NODELAY set (Nagle disabled), does
> one WriteFile() always generate at least one TCP
> packet?

Yes.

Nagle is TCP’s (not AFD’s) option, which governs when TCP is allowed to send
the next portion of data down to NDIS.

With Nagle on, TCP can send only if a) the amount of pending data was zero
before this TDI_SEND arrived, b) an ACK has arrived from the network, or c) the
amount of pending data has reached the packet size. If TCP is not allowed to
send, it accumulates the data by coalescing several TDI_SENDs together (and
pends them until the ACKs arrive for the whole cumulative portion).
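
The three conditions above can be played through with a small toy model. This
is only a simplified sketch of the rules as stated in this message, not the
actual Windows TCP implementation: the class name NagleModel, the single
cumulative ack() call, and the byte counters are all assumptions made for
illustration.

```python
class NagleModel:
    """Toy model of the a/b/c send conditions described above (illustrative)."""

    def __init__(self, mss: int = 1460):
        self.mss = mss        # maximum packet payload, in bytes
        self.unacked = 0      # bytes on the wire, not yet acknowledged
        self.held = 0         # bytes accepted from the app but not yet sent
        self.segments = []    # sizes of the packets emitted so far

    def _emit(self, nbytes: int) -> None:
        self.segments.append(nbytes)
        self.unacked += nbytes
        self.held -= nbytes

    def _flush(self) -> None:
        idle = self.unacked == 0          # checked before anything new goes out
        while self.held >= self.mss:      # (c) a full packet's worth is pending
            self._emit(self.mss)
        if idle and self.held > 0:        # (a) nothing pending, or (b) ACK came
            self._emit(self.held)

    def send(self, nbytes: int) -> None:
        """The app hands nbytes to TCP."""
        self.held += nbytes
        self._flush()

    def ack(self) -> None:
        """The peer acknowledges everything sent so far."""
        self.unacked = 0
        self._flush()


model = NagleModel(mss=1460)
model.send(10)   # nothing pending: goes out at once
model.send(20)   # held, since 10 bytes are still unacknowledged
model.send(30)   # held and coalesced with the previous 20
model.ack()      # the ACK releases the 50 coalesced bytes as one packet
print(model.segments)  # [10, 50]
```

Note that nothing in the model is driven by a clock: packets leave only in
response to a send or an ACK.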

With Nagle off, TCP can send each time a TDI_SEND arrives. So it just breaks
the TDI_SEND into packets according to the max packet size and sends them down
to NDIS.

Disabling Nagle is a very good idea for transactional request/response traffic.
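
As a minimal sketch, turning Nagle off from user mode is a single TCP_NODELAY
option on the socket. The example uses Python's standard socket module; the
helper name disable_nagle is invented here, and reading the option back is
just a sanity check, not something the thread requires.

```python
import socket


def disable_nagle(sock: socket.socket) -> bool:
    """Set TCP_NODELAY and report whether the stack says it is now enabled.

    Hypothetical helper for illustration.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```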

Maxim Shatskih, Windows DDK MVP

Maxim,

Thank you for your excellent explanation. This has helped me
understand the ramifications of turning off Nagle (the product
is an SSH server which does both request/response traffic
in the SFTP protocol module and traffic proxying/forwarding
in the port forwarding module) and how this
interacts with zero-copy.

To make sure I understand all this:

  1. Nagle’s delay is not time based (as I had previously thought)
    but event based.

  2. In your a, b, c list above, if any one of those events
    occurs, TCP is allowed to send. (It’s an or list not an
    and list.)

So w/ Nagle on, if I do a send and there is no un-ack’d data
on the wire, TCP will send immediately?

I’m not sure I understand what you mean by b; the only
interpretation I can think of makes it the same as a.

Oh… maybe: Nagle has been holding my writes and accumulating
data. When an ACK arrives from the peer (causing case a to
now be true) the data is sent. (This event is asynchronous
to anything my application is doing.)

  3. With SO_SNDBUF == 0 and Nagle on, if I pend lots of writes
    TCP would probably send the first one immediately, and then
    coalesce the remaining writes into a single packet (taking
    into account the max packet size of course.)

    I would not see completion for any of the writes until
    the data was acked from the peer.

  4. With Nagle off, no coalescing is done even if I present
    the stack with enough data so that it can’t be sent
    immediately (and could conceivably be coalesced since
    it can’t be sent immediately.)

Thanks,

Joseph

> 1. Nagle’s delay is not time based (as I had previously thought)
>    but event based.

Yes.

> 2. In your a, b, c list above, if any one of those events
>    occurs, TCP is allowed to send. (It’s an or list not an
>    and list.)

OR.
Any one of the events will trigger a send.

> So w/ Nagle on, if I do a send and there is no un-ack’d data
> on the wire, TCP will send immediately?

Yes.

> I’m not sure I understand what you mean by b; the only
> interpretation I can think of makes it the same as a.

If Nagle is on and the app sends more while earlier data is still
unacknowledged, TCP pends the send and does not transmit (as long as the total
amount of pending data is < packet size).

TCP leaves this state on either a) ACK arrival, which triggers a send, OR b)
the app’s next sends accumulating the data up to the packet size, OR c) the
connection being broken due to ACK timeout.

> Oh… maybe: Nagle has been holding my writes and accumulating
> data. When an ACK arrives from the peer (causing case a to
> now be true) the data is sent. (This event is asynchronous
> to anything my application is doing.)

Yes. And if the ACK does not arrive, the connection will be broken by the
timer.

> 3. With SO_SNDBUF == 0 and Nagle on, if I pend lots of writes

This is possible with overlapped IO only.

> TCP would probably send the first one immediately, and then
> coalesce the remaining writes into a single packet (taking
> into account the max packet size of course.)

Yes. Nagle is for coalescing.

> I would not see completion for any of the writes until
> the data was acked from the peer.

Yes, that is the case with zero-copy.

> 4. With Nagle off, no coalescing is done even if I present
>    the stack with enough data so that it can’t be sent

I think that TCP will coalesce in this case. With Nagle off, it will not
delay sending to improve the chances of coalescing, but the coalescing itself
is still done.
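
To make that expectation concrete, here is a small sketch of what coalescing
queued data would mean: writes that are stuck behind a full window are treated
as one byte stream and cut at the max packet size, regardless of where the
individual writes began or ended. The function name and the numbers are
illustrative assumptions, and the behaviour it models is the one guessed at
above ("I think"), not something confirmed in this thread.

```python
def segment_queued_writes(write_sizes, mss):
    """Return packet sizes if the queued writes are coalesced into one stream.

    Illustrative model only; write_sizes are the byte counts of writes that
    could not be sent immediately, mss is the maximum packet payload.
    """
    total = sum(write_sizes)
    full_packets, tail = divmod(total, mss)
    return [mss] * full_packets + ([tail] if tail else [])


# Five 300-byte writes queued while the window was closed:
print(segment_queued_writes([300, 300, 300, 300, 300], 1460))  # [1460, 40]
```

Without coalescing, the same five writes would have produced five 300-byte
packets.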

Nagle ruins the performance of request-request-response traffic. On
request-response it is not as bad, until you want to have several
pending request-response exchanges over the same socket.

Maxim Shatskih, Windows DDK MVP

Thanks again for the great answers.

Maxim S. Shatskih wrote:

>> 3. With SO_SNDBUF == 0 and Nagle on, if I pend lots of writes
>
> This is possible with overlapped IO only.

This is exactly what I’m doing, which is why I’m curious
about the behavior :)

> Nagle ruins the performance on request-request-response traffic. On
> request-response, it is not this bad - until you will want to have several
> pending request-response talks over the same socket.

Yes… in my case I’ve got either a) traffic that has already been
coalesced by Nagle (proxy traffic), b) multiple
request-request-response traffic patterns, or c) a mixture of a
and b.

I do get a significant performance boost turning Nagle off… but I
wasn’t confident I had a good handle on the full significance of
doing so. Thanks again.

One more question:

Is there any way for usermode to query the max packet size on the
connection? How about the amount of available window?

I have a couple of issues I deal with in SSH. SSH multiplexes multiple
independent data streams across the single TCP connection, using
a packetized, windowed protocol. (SSH lingo calls them channels.)

However, I get into a state where I ping-pong between having no
channel window and having no TCP window, resulting in many
small packets being sent. Either I fragment the data (because of limited
channel window) or TCP fragments the data (because of limited
TCP window.)

So I’m looking for a good way to try and stay in sync with the
underlying TCP stack for window.

Thanks,

Joseph

> Is there any way for usermode to query the max packet size on the
> connection? How about the amount of available window?

Look at the getsockopt() docs. I don’t remember offhand.
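
As a starting point for that lookup: on BSD-style socket APIs the option for
the maximum segment size is TCP_MAXSEG. Whether the Windows stack discussed
here exposes it is not confirmed anywhere in this thread, so treat the sketch
below as an assumption to check against the getsockopt() documentation. It is
written in Python, the helper name is made up, and it returns None when the
option is missing or rejected. It does not attempt to query the available
window, for which the thread names no option.

```python
import socket


def query_max_segment_size(sock: socket.socket):
    """Return the stack's reported TCP max segment size, or None if unavailable.

    Hypothetical helper; TCP_MAXSEG support on a given platform is an assumption.
    """
    option = getattr(socket, "TCP_MAXSEG", None)
    if option is None:
        return None  # this platform's socket module does not define the option
    try:
        return sock.getsockopt(socket.IPPROTO_TCP, option)
    except OSError:
        return None  # the stack refused the query for this socket
```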

Maxim Shatskih, Windows DDK MVP