even more on: WinSock send() question

At 03:09 PM 07/13/2000 -0700, I wrote:

> Further research has revealed that, when the send() call returns with the
> error, WSAGetLastError returns 10060 (WSAETIMEDOUT). Quinn/Shute (p589)
> report that “this error is relevant to connect(), but not to send() or
> sendto()”. Apparently no one mentioned that to Microsoft, because we’re
> getting it from send() anyway.

MSDN reports the following relationship between send() and WSAETIMEDOUT:

“The connection has been dropped, because of a network failure or because
the system on the other end went down without notice.”

Since other sockets continue to exchange data from the same two machines
over the same wire, it’s safe to say that the network hasn’t “failed” and
that the system on the other end (in this case, the server) didn’t “go down
without notice”. Perhaps there’s some condition under which NT’s TCP/IP
stack orphans a connected socket, while all others continue to operate
normally? Any ideas out there?

Do you have Network Monitor set up to capture the data when this happens?
If not, I think it’s a good idea to set it up, since it might reveal more
information.

Also, when I was working on WinSock, I found the following newsgroup very
helpful: comp.os.ms-windows.programmer.tools.winsock

Hope this helps.
Puja

On 07/13/00, “Richard Hartman” wrote:
> At 02:17 PM 07/13/2000 -0700, you wrote:
> >I can’t figure out what would cause a TCP connection to simply “stall” like
> >that.
>
> Further research has revealed that, when the send() call returns with the
> error, WSAGetLastError returns 10060 (WSAETIMEDOUT). Quinn/Shute (p589)
> report that “this error is relevant to connect(), but not to send() or
> sendto()”. Apparently no one mentioned that to Microsoft, because we’re
> getting it from send() anyway.
>
> Hopefully this gives a clue to someone…?

At 06:32 PM 07/13/2000, you wrote:

> Do you have Network Monitor set up to capture the data when this happens?
> If not, I think it’s a good idea to set it up, since it might reveal more
> information.

The trouble is, we’re running multiple sockets between those machines with
as much data as possible on each connection. The connections can run for
hours and hours without any errors whatsoever, and when one socket finally
exhibits the problem the others continue to scream data at full speed. I’m
not sure how to trigger, and capture, meaningful data in that environment.

At 12:19 AM 07/14/2000 +0100, John Sullivan wrote:

> On Thursday 13 July 2000 you wrote:
>
> In the first case, the client’s stack has wedged (at least on that one
> socket). In the second, the server’s stack has wedged.

What does “wedged” mean?

> In the third case, either stack could be doing something truly bizarre,
> but more likely is that random packet loss on the wire is preventing the
> reliable transmission of data. (How many clients active at once? What
> volume of data for each one? What speed network? Two machines is ok,
> but get to about 50% saturation with several clients (10? 20?) and the
> performance dies as collisions take over.)
>
> If you’re using multiple sockets with blocking send(), you presumably
> have multiple threads? In which case are you being *very* careful not
> to access a socket from more than one thread at a time? MSDN warns
> against this.

Two machines, one client and one server. The client is running four copies
of a single-threaded test program which opens a single socket and uses
blocking calls on it. Zero chance a socket is being accessed by more than
one thread.

> One last possibility, are you absolutely sure that the server is
> reading the data to free up buffer space? If the server stops reading
> for any reason, the TCP window will be exhausted and the client will
> block indefinitely. (Log bytes in and out of send/recv calls at both
> ends. Comparing the last couple of totals during a wedge should tell
> you whether there’s more data than you expected “stuck in the pipe”.)
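
The byte-count logging suggested above could be sketched roughly as follows. This is only an illustration: the struct and function names (io_totals, note_send, note_recv, bytes_in_pipe, log_totals) are hypothetical and do not come from anyone's code in this thread. Each end would keep its own counters and log them after every send()/recv().

```c
#include <stdio.h>

/* Hypothetical per-connection counters; each end keeps its own copy. */
struct io_totals {
    unsigned long long sent;      /* sum of positive send() return values */
    unsigned long long received;  /* sum of positive recv() return values */
};

/* Call with the return value of every send(); errors (n <= 0) add nothing. */
void note_send(struct io_totals *t, int n)
{
    if (n > 0)
        t->sent += (unsigned long long)n;
}

/* Call with the return value of every recv(); errors (n <= 0) add nothing. */
void note_recv(struct io_totals *t, int n)
{
    if (n > 0)
        t->received += (unsigned long long)n;
}

/* Bytes the sender has handed to its stack that the receiver has not read. */
long long bytes_in_pipe(const struct io_totals *sender,
                        const struct io_totals *receiver)
{
    return (long long)sender->sent - (long long)receiver->received;
}

/* Write the running totals to stderr so both logs can be compared later. */
void log_totals(const char *label, const struct io_totals *t)
{
    fprintf(stderr, "%s: sent=%llu received=%llu\n",
            label, t->sent, t->received);
}
```

Comparing the client's "sent" total with the server's "received" total at the moment of a stall gives the amount of data sitting in the two stacks and on the wire.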

The test data is as follows: The client sends a “request”, in this case
around 22K in size, by composing it in a contiguous block in memory and then
using a compliance loop around a send() call to transmit it. The server
works on it for a few hundred milliseconds, then sends a “response” of a
couple KB. The client then starts again with another, identical request. The
same socket is used throughout, opened at client start and held open the
entire time.

I’ve altered the client to detect the send() error and report various
things, including how many bytes of the request have been sent and how many
remain to be sent. When the error occurs, zero bytes of the new request
have been sent. In other words, for the new request there have been zero
successful send() calls… the very first one reports the error. Note that
the client has just finished receiving the response from the previous
iteration. More significantly, the last successful send() will have occurred
just a few hundred milliseconds ago… and we know that send() was
successful because otherwise this client wouldn’t have received the response
it just finished receiving.

The error can take hundreds, and in some cases tens of thousands, of
iterations to show up. And when one of the four copies of the test client
has experienced the error, the other three continue to run just fine. The
fourth client can be restarted and it, too, will resume running without
problems.

Thanks!

Thank you for this thread of information. I run NT SP6a and occasionally
get a wedged ftp session. It doesn’t always wedge at the same place, but
often it’s the same file (usually large). Changing to a mirror site
usually allows a successful transfer. After ~12 minutes of being wedged
I normally receive a connection closed by peer message.

Thanks again for characterising the failure in a manner which should
encourage its resolution. It certainly puts my mind at rest!

Cheers

Don Sharp

Richard Hartman wrote:

> I’ve done a lot of WinSock programming, but this one has me stumped.
>
> A client opens multiple synchronous (i.e. blocking send and recv calls) TCP
> sockets to a remote server without incident. Data is transferred back and
> forth, successfully, for some time on all sockets. Then, for some reason,
> one of the client’s send calls “stalls”… no data is transferred, but the
> blocking send call doesn’t initially come back with an error. Eventually,
> the server’s code (which is specifically designed to do this) recognizes
> that the socket has been quiet for too long and forcefully closes it with
> shutdown/closesocket. The client receives that notification, and its send
> call returns with an error at that time. Meanwhile, the other sockets
> continue to function normally, and a replacement socket can be obtained and
> used without problems.
>
> I can’t figure out what would cause a TCP connection to simply “stall” like
> that. The connection between the two machines is known to be good because
> the other sockets continue to operate perfectly. The TCP/IP stacks on both
> machines don’t appear to be “damaged” because a replacement TCP connection
> can be obtained and used. I would think that one or the other had lost
> awareness of the connection, but if so I wouldn’t expect the client’s send
> call to return when the server closes the socket; that proves both ends are
> still aware of the connection.
>
> Any ideas gratefully accepted!


You are currently subscribed to ntdev as: xxxxx@dddandr.octacon.co.uk
To unsubscribe send a blank email to $subst(‘Email.Unsub’)

This may or may not be related, but I have seen this behavior after exactly
two hours, which is, coincidentally, the default time before keep-alives are
sent on a socket. As a result, I have disabled keep-alives on sockets and
have not seen the problem recur. I haven’t been able to reproduce the
problem reliably, so this may be a red herring …
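
For readers who want to try the same experiment, a minimal sketch of switching keep-alives off (or on) for one socket is shown below. The helper name set_keepalive is hypothetical; the sketch assumes the keep-alives in question were enabled per socket through SO_KEEPALIVE, which the message does not actually state, and on Windows it assumes WSAStartup has already been called.

```c
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
typedef int sock_t;
#endif

/* Hypothetical helper: enable (nonzero) or disable (zero) SO_KEEPALIVE on
 * socket s. Returns 0 on success, nonzero on failure. */
int set_keepalive(sock_t s, int enable)
{
    int flag = enable ? 1 : 0;

    /* The cast keeps the call valid for Winsock, whose setsockopt() takes
     * a const char * option value. */
    return setsockopt(s, SOL_SOCKET, SO_KEEPALIVE,
                      (const char *)&flag, sizeof flag);
}
```

Calling set_keepalive(s, 0) right after the socket is created or accepted turns the probes off for that connection only; it does not touch the system-wide keep-alive interval.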

On a related note, when using asynchronous sends on NT and Win 2K, if I send
with a buffer size of 4 KB or 32 KB, the send completes almost immediately.
If a send is done with 8 KB buffers (and 16 KB on Win 2K; not tested on NT),
on both Win NT and 2K the send pends for exactly 0.2 seconds, using either
overlapped I/O with Winsock 2 or asynchronous sends with Winsock 1.
Needless to say, this makes transfers very slow. This was tested by
transferring data between two sockets on the same PC, not across a LAN. If
anybody has any thoughts on this, I’d love to hear them.

Ed

----- Original Message -----
From: Don Sharp
To: NT Developers Interest List
Sent: Friday, July 14, 2000 4:26 AM
Subject: [ntdev] Re: WinSock send() question

> Thank you for this thread of information. I run NT SP6a and occasionally
> get a wedged ftp session. It doesn’t always wedge at the same place, but
> often it’s the same file (usually large). Changing to a mirror site
> usually allows a successful transfer. After ~12 minutes of being wedged
> I normally receive a connection closed by peer message.
>
> Thanks again for characterising the failure in a manner which should
> encourage its resolution. It certainly puts my mind at rest!
>
> Cheers
>
> Don Sharp

At 02:00 AM 07/14/2000 +0100, you wrote:

> By compliance loop, I presume you mean that you feed send() the whole
> buffer initially, but monitor the return value to deal with a partial
> send occurring, like:

Exactly. Just as with disk writes, I never presume that all of the data will
be transmitted. And debugging has shown this to be the case.
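
A loop of the kind described here might look like the sketch below. It is not the poster's client code: the helper name send_all and its sent_out parameter are hypothetical, and the block assumes a blocking, already-connected stream socket (and, on Windows, that WSAStartup has been called). It also reports how many bytes were accepted before a failure, the figure the client in this thread was altered to print.

```c
#include <stddef.h>
#include <limits.h>
#ifdef _WIN32
#include <winsock2.h>
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
typedef int sock_t;
#endif

/* Hypothetical helper: keep calling send() until all len bytes of buf have
 * been accepted by the stack. Returns 0 on success and -1 on the first
 * failing send(). If sent_out is not NULL it receives the number of bytes
 * accepted so far, so the caller can report sent and remaining counts. */
int send_all(sock_t s, const char *buf, size_t len, size_t *sent_out)
{
    size_t sent = 0;
    int rc = 0;

    while (sent < len) {
        size_t remaining = len - sent;
        /* Winsock's send() takes an int length, so clamp each request. */
        int chunk = remaining > (size_t)INT_MAX ? INT_MAX : (int)remaining;
        int n = (int)send(s, buf + sent, chunk, 0);

        if (n <= 0) {
            /* On Windows, call WSAGetLastError() here for the error code
             * (for example 10060, WSAETIMEDOUT); elsewhere, check errno. */
            rc = -1;
            break;
        }
        sent += (size_t)n;   /* a partial send just advances the offset */
    }

    if (sent_out != NULL)
        *sent_out = sent;
    return rc;
}
```

With a helper like this, a failure on the very first send() of a request shows up as a return of -1 with *sent_out equal to zero.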

> How does the server read loop work? Multi-threading with blocking
> recv()s? select() calls? WSAEventSelect() or WSAAsyncSelect()?

Async overlapped sockets using an I/O completion port. Docs state that you
are guaranteed a completion packet for each completed receive operation, so
it’s hard to imagine how the thread could “miss” a receive notification.

> Errm. How long are you waiting to time out the client? Surely by the
> time the timeout occurs, it will be *much* longer than a few hundred
> milliseconds since the last successful send. I find it can often take
> the windows stack several tens of seconds to recover from a bout of
> lost packets.

The server permits 120 seconds of inactivity before presuming the client has
died and proactively closing the connection with shutdown/closesocket.
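
The inactivity rule just described can be sketched as below. Everything except the 120-second figure is an assumption made for illustration: the names conn_state, conn_touch and conn_is_idle are hypothetical, and the real server (IOCP-based, per this thread) may track activity quite differently.

```c
#include <time.h>

/* Inactivity limit quoted in the thread. */
#define IDLE_LIMIT_SECONDS 120.0

/* Hypothetical per-connection bookkeeping. */
struct conn_state {
    time_t last_activity;   /* time of the most recent completed receive */
};

/* Record activity; call whenever data arrives on the connection. */
void conn_touch(struct conn_state *c, time_t now)
{
    c->last_activity = now;
}

/* Nonzero once the connection has been quiet for longer than the limit,
 * at which point the server would call shutdown() and closesocket(). */
int conn_is_idle(const struct conn_state *c, time_t now)
{
    return difftime(now, c->last_activity) > IDLE_LIMIT_SECONDS;
}
```

A periodic sweep over all connections calling conn_is_idle(c, time(NULL)) is one simple way to drive such a check.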

Do these additional details yield more clues?

Thanks!

RLH

At 08:25 AM 07/14/2000 -0400, you wrote:

> > but the
> > blocking send call doesn’t initially come back with an error.
>
> When does it “come back”? Does it come back immediately, or
> after a short period of time?

I have added code to the client test program to determine that, but haven’t
seen the error again since that time.

> > Eventually, the server’s code (which is specifically designed to do this)
> > recognizes that the socket has been quiet for too long and forcefully
> > closes it with shutdown/closesocket.
>
> Not that it matters, but how long is this “eventually” period?

120 seconds.

> > The client receives that notification, and its send call returns with
> > an error at that time.
>
> I expect you mean subsequent send calls? Or are you saying the
> initial send that failed only returns after this occurs?

The initial send. For testing purposes, the client’s request is composed
once in memory and reused for each iteration. It’s around 22K in size. The
entire contiguous request is passed to a blocking send() call, and when the
error occurs that first and only send() call returns with an error, and
WSAGetLastError reports WSAETIMEDOUT.

RLH

At 11:15 PM 07/14/2000 +0100, John Sullivan wrote:

> On Friday 14 July 2000 you wrote:
> > Async overlapped sockets using an I/O completion port. Docs state that you
> > are guaranteed a completion packet for each completed receive operation, so
> > it’s hard to imagine how the thread could “miss” a receive notification.
>
> (Personally, I never use overlapped IO on sockets. Always
> WSAEventSelect or blocking mode.)
>
> One thing which could cause missed events is attempting to share
> event objects: each event in an overlapped structure should be unique
> to that structure, and a single WSARecv() call.

IOCPs entirely eliminate the need for event objects. The only notification
you need, and the only one you receive, is the completion packet on the
IOCP. The event object handle field in the OVERLAPPED structure is set to
NULL at all times. (The docs explicitly state to do this when using sockets
with IOCPs.)

RLH

Regarding the 0.2-second delay:

I suspect that you have not disabled the TCP Nagle algorithm, which can be
done globally (I don’t recommend this) or per individual connection with
setsockopt() and the TCP_NODELAY option.
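
A minimal sketch of the per-connection form is given below. The helper name set_nodelay is hypothetical; the setsockopt() call with IPPROTO_TCP and TCP_NODELAY is the standard one. On Windows the sketch assumes WSAStartup has already been called.

```c
#ifdef _WIN32
#include <winsock2.h>
#include <ws2tcpip.h>
typedef SOCKET sock_t;
#else
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
typedef int sock_t;
#endif

/* Hypothetical helper: on != 0 sets TCP_NODELAY (Nagle disabled for this
 * socket); on == 0 clears it (Nagle re-enabled). Returns 0 on success. */
int set_nodelay(sock_t s, int on)
{
    int flag = on ? 1 : 0;

    /* The cast keeps the call valid for Winsock, whose setsockopt() takes
     * a const char * option value. */
    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY,
                      (const char *)&flag, sizeof flag);
}
```

Whether this removes the 0.2-second stall reported above is untested here; it is simply the option the suggestion refers to, applied to one socket rather than system-wide.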

----- Original Message -----
From: Ed Lau
To: NT Developers Interest List
Sent: Friday, July 14, 2000 2:08 PM
Subject: [ntdev] Re: WinSock send() question

> This may or may not be related, but I have seen this behavior after exactly
> two hours, which is, coincidentally, the default time before keep-alives are
> sent on a socket. As a result, I have disabled keep-alives on sockets and
> have not seen the problem recur. I haven’t been able to reproduce the
> problem reliably, so this may be a red herring …
>
> On a related note, when using asynchronous sends on NT and Win 2K, if I send
> with a buffer size of 4 KB or 32 KB, the send completes almost immediately.
> If a send is done with 8 KB buffers (and 16 KB on Win 2K; not tested on NT),
> on both Win NT and 2K the send pends for exactly 0.2 seconds, using either
> overlapped I/O with Winsock 2 or asynchronous sends with Winsock 1.
> Needless to say, this makes transfers very slow. This was tested by
> transferring data between two sockets on the same PC, not across a LAN. If
> anybody has any thoughts on this, I’d love to hear them.
>
> Ed

Back in December, I was having a similar problem, but at the
TDI level. I would notice that under load I would get a
TDI_DISCONNECT_ABORT; other traffic to the remote host would
continue along unabated. This sounds similar to what you are
seeing.

I posted some questions on this list, but unfortunately
never got an explanation for why this disconnect was occurring.
I ended up having to put in some fairly messy
reconnect handlers to restart the connection if the TDI aborts
it like this.

I would be very interested in finding out what (if anything)
you discover!


Peter Lawthers Phone: (858) 792-5549
Prestant Technology, Inc Fax: (858) 350-7630
13682 Nogales Dr. Email: xxxxx@prestant.com
Del Mar, CA 92014