How WinSock works

> > So does AFD. So, AFD is still an analog of sockfs in Windows.

Since when can you use ReadFile() and WriteFile() on sockets?

???

Well, OK, it looks like you need some education on what NT's WinSock is.

WinSock is a flexible, polymorphic implementation of the Berkeley sockets
API, implemented in both user and kernel mode, with the addition of
MS-specific calls which implement NT's overlapped IO on sockets.

The WinSock APIs live in wsock32.dll/ws2_32.dll, which in turn look in the
registry and load the proper provider DLL. This allows you to implement your
own address families fully in user mode, or as a mix of your own proprietary
user+kernel modules with your own interface between the user and kernel parts.

Actually, WinSock API calls boil down to the following: "get the provider's
function table by socket handle value", then "call the provider's function".

This is OK for all Berkeley and WSAxxx calls.

But note that ReadFile and WriteFile are always supported on a socket handle -
in NT, a SOCKET is just a kernel file handle. If we are speaking about send(),
then send() is in WinSock, and WinSock is free to implement any semantics for
it. But if we are speaking about ReadFile, then sorry - this API is not in
WinSock but in the standard NT API, which knows nothing about sockets, so the
only way to customize ReadFile is to customize its kernel part.

To support ReadFile on user-mode WinSock provider DLLs (user-mode address
families), there is an auxiliary driver called ws2ifsl.sys. To employ this
driver, the provider DLL must call the WinSock provider interface function
"create IFS handle" or some such. This call creates a pair of file handles on
ws2ifsl, creates the necessary thread pool for inverted calls, associates the
slave handle with the provider's function table, and returns the master
handle to the provider DLL. The provider DLL then returns this handle from
its WSPSocket.

When the app calls Read/WriteFile on this handle, the calls go directly (no
WinSock!) to ws2ifsl.sys in the kernel. This module transfers the call to the
slave end of its conceptual "pipe", and the thread pool in ws2_32.dll consumes
the call (yes, an inverted call) and executes it by calling some WSPXxx in the
provider DLL.

But this is not the typical scenario of a socket address family
implementation. The typical scenario is that the address family package has a
kernel part, which automatically guarantees that the socket handle will be a
file handle on this kernel part. Such packages use "register IFS handle"
instead of "create IFS handle": their WSPSocket path first does CreateFile on
their kernel part, and then "register IFS handle". The second step is needed
so that functions like send() are dispatched to this provider's WSPSend;
Read/WriteFile are automatically delivered to the kernel part.

Now note that many address families have a lot in common - buffering, listen
backlog, lingering closes, to name a few. So a common layer which implements
all of this was created, and this same layer also serves as the default
kernel-mode WinSock provider. This module is called AFD.SYS.

So, if the address family implementor needs a kernel part, then the existing
AFD.SYS can be reused as a framework. To reuse it, one must program to the
lower-edge interface of AFD, which is called TDI.

TDI is much more low-level than socket calls. For instance, TDI transports
usually (certainly on TCPIP) have no buffering at all. So a TDI_SEND
operation is kept pending until all ACKs arrive. The reason is that, while
the ACKs have not yet arrived, there is a possibility that a retransmit will
be needed. Now note that the transport does no buffering and no data copies,
so if it completed the original send request, it would no longer have the
data for the retransmit. So TDI_SEND on an unbuffered transport (the usual
way, TCPIP's too) pends until all ACKs arrive, and retransmits are served
from the same send request's buffer.

On receive and accept, TDI uses a two-phase interaction: the first phase is
ClientEventReceive/ClientEventConnect, signaling an incoming connection offer
or an incoming data portion; the second is the TDI_ACCEPT or TDI_RECEIVE
completion routine. On accept, this allows (and requires) the client to
create the new TDI endpoint (accept's target) itself, and then associate it
with this particular incoming connection offer. This allows the listen
backlog to be implemented above the transport, and also allows this "accept
to a specified pre-created socket" feature to be extended to user mode as the
overlapped AcceptEx API.

On receive, this allows the client to own the memory buffers for the received
data - no need to allocate them in the transport. It also allows you to first
receive the header, examine it, determine the size of the data portion that
will follow (as with an SMB WRITE transaction), and then get this data in a
nonblocking way with only one copy.

There are also other operation modes in TDI, such as chained receive.

If the provider of some address family (like IrDA) is implemented as a
kernel-mode TDI transport, then it automatically reuses the AFD.SYS layer,
which a) exposes it to user-mode WinSock and b) implements buffering/listen
backlog/lingering close.

The second part of this "default WinSock provider" is the user-mode provider
DLL, called MSAFD.DLL, which consists almost entirely of DeviceIoControl
calls to AFD.SYS.

Naturally, the default WinSock provider cannot support protocol-dependent
stuff like socket options other than SOL_SOCKET. To implement them, MSAFD
requires an address-family-specific helper DLL, which - for TCPIP - is given
as the WSHSMPLE sample in the DDK.

The differences with Linux:

  • Berkeley calls are syscalls in Linux, but are user-mode wrappers around
    DeviceIoControl in Windows: MSAFD.DLL is a wrapper (and, if you use only
    TCPIP, all of WinSock userland is a wrapper), and AFD.SYS is the kernel
    module where the calls arrive. It was also this way in SunOS 5.2/Solaris
    2.x with their /dev/nit - so AFD is the analog of the "nit" kernel module
    in SunOS, which is in turn the analog of "sockfs" in Linux.

  • In Linux and FreeBSD, kernel-mode clients talking directly to TCP without
    sockets - with their own listen backlog and buffering - are not permitted;
    they are permitted in Windows. Conversely, the socket API is absent from
    kernel mode in Windows altogether, though it can be implemented as a
    wrapper around TDI.

You cannot use select() in Windows on anything but sockets, since select() in
Windows is not a generic kernel notion but a socket-only notion, possibly
built around WSAEventSelect. You also cannot use select() with 3 NULLs as a
sleep.

But you can use Read/WriteFile on a socket.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com

> TDI is much more low-level than socket calls. For instance, TDI
> transports usually (certainly on TCPIP) have no buffering at all. So a
> TDI_SEND operation is kept pending until all ACKs arrive. The reason is
> that, while the ACKs have not yet arrived, there is a possibility that a
> retransmit will be needed. Now note that the transport does no buffering
> and no data copies, so if it completed the original send request, it
> would no longer have the data for the retransmit. So TDI_SEND on an
> unbuffered transport (the usual way, TCPIP's too) pends until all ACKs
> arrive, and retransmits are served from the same send request's buffer.

I call TdiBuildSend with TDI_SEND_NO_RESPONSE_EXPECTED | TDI_SEND_NON_BLOCKING
in InFlags, but it looks like it still waits for an acknowledgment of the send
from the remote node until timeout. Are these InFlags able to avoid the Nagle
algorithm or not?

Thanks,

Tao


> acknowledgment of the send from the remote node until timeout, is this
> InFlags able to avoid the Nagle algorithm or not?

I have my doubts. I will not be surprised if TCP ignores some of these flags.

Try switching Nagle off with IOCTL_TCP_SET_INFORMATION_EX.


Maxim Shatskih, Windows DDK MVP
StorageCraft Corporation
xxxxx@storagecraft.com
http://www.storagecraft.com